Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 164]
- cs.CV [Total: 159]
- cs.AI [Total: 53]
- cs.SD [Total: 16]
- cs.LG [Total: 220]
- cs.MA [Total: 4]
- cs.MM [Total: 1]
- eess.AS [Total: 10]
- eess.IV [Total: 12]
cs.CL
[1] Enhancing Biomedical Named Entity Recognition using GLiNER-BioMed with Targeted Dictionary-Based Post-processing for BioASQ 2025 task 6
Ritesh Mehta
Main category: cs.CL
TL;DR: Evaluated GLiNER-BioMed model on BioASQ BioNER task, introduced dictionary-based post-processing that improved F1 from 0.79 to 0.83 on dev set but failed to generalize to test set (0.77 vs 0.79 baseline).
Details
Motivation: Address challenges in biomedical NER, particularly distinguishing similar entity types like genes and chemicals, and improve information extraction from scientific literature.
Method: Used GLiNER-BioMed model on BioASQ dataset with a targeted dictionary-based post-processing strategy to fix common misclassifications (sketched after the abstract below). Also explored alternative methods including Conditional Random Fields.
Result: Post-processing improved micro F1-score from 0.79 to 0.83 on development set, but did not generalize to blind test set (0.77 vs baseline 0.79).
Conclusion: Dictionary-based refinement shows potential for pre-trained BioNER models but faces overfitting challenges to development data and requires robust generalization for real-world use.
Abstract: Biomedical Named Entity Recognition (BioNER), task 6 in BioASQ (a challenge in large-scale biomedical semantic indexing and question answering), is crucial for extracting information from scientific literature but faces hurdles such as distinguishing between similar entity types like genes and chemicals. This study evaluates the GLiNER-BioMed model on a BioASQ dataset and introduces a targeted dictionary-based post-processing strategy to address common misclassifications. While this post-processing approach demonstrated notable improvement on our development set, increasing the micro F1-score from a baseline of 0.79 to 0.83, this enhancement did not generalize to the blind test set, where the post-processed model achieved a micro F1-score of 0.77 compared to the baseline's 0.79. We also discuss insights gained from exploring alternative methodologies, including Conditional Random Fields. This work highlights the potential of dictionary-based refinement for pre-trained BioNER models but underscores the critical challenge of overfitting to development data and the necessity of ensuring robust generalization for real-world applicability.
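The dictionary-based refinement described above reduces, in essence, to overriding a predicted entity label whenever the span's surface form appears in a curated lexicon. A minimal sketch, assuming hypothetical gene/chemical lexicons and (span, label) predictions; the dictionary contents and label names are illustrative, not taken from the paper:

```python
# Minimal sketch of dictionary-based BioNER post-processing.
# GENE_DICT / CHEMICAL_DICT contents and label names are assumptions.
GENE_DICT = {"brca1", "tp53", "egfr"}        # known gene surface forms
CHEMICAL_DICT = {"aspirin", "dopamine"}      # known chemical surface forms

def refine(predictions):
    """Override predicted labels for spans found in a curated dictionary.

    predictions: list of (span_text, predicted_label) tuples.
    """
    refined = []
    for span, label in predictions:
        key = span.lower()
        if key in GENE_DICT:
            label = "GENE"          # fix gene/chemical confusions
        elif key in CHEMICAL_DICT:
            label = "CHEMICAL"
        refined.append((span, label))
    return refined

# A gene/chemical confusion in either direction gets corrected.
print(refine([("TP53", "CHEMICAL"), ("aspirin", "GENE")]))
# -> [('TP53', 'GENE'), ('aspirin', 'CHEMICAL')]
```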
[2] Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra
Main category: cs.CL
TL;DR: Test-Time Scaling (TTS) becomes vulnerable to producing unsafe outputs when candidate diversity is reduced, as demonstrated by the RefDiv attack method that stress tests TTS pipelines across multiple models and strategies.
Details
Motivation: To investigate the previously unrecognized failure mode in TTS where reduced candidate diversity significantly increases the likelihood of unsafe outputs, challenging the assumption that diverse candidate pools always enhance reliability.
Method: Developed RefDiv, a reference-guided diversity reduction protocol that serves as a diagnostic attack to stress test TTS pipelines across four open-source models and two TTS strategies (Monte Carlo Tree Search and Best-of-N); a diversity-measurement sketch follows the abstract below.
Result: Constraining diversity consistently increases unsafe output rates across all tested models and TTS strategies, with effects often stronger than adversarial prompts. Existing safety classifiers fail to detect RefDiv-generated adversarial inputs.
Conclusion: TTS has a general vulnerability to diversity reduction attacks, and current safety defenses offer limited protection, highlighting the need for more robust TTS strategies that are secure against diversity-targeted stress tests.
Abstract: Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across four open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently increases the rate at which TTS produces unsafe results. The effect is often stronger than that produced directly by prompts with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3 and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard and the OpenAI Moderation API) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode. Through this work, we hope to motivate future research on designing robust TTS strategies that are both effective and secure against diversity-targeted stress tests as illustrated by RefDiv.
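To make the diversity argument concrete, candidate-pool diversity in Best-of-N can be proxied by the mean pairwise cosine similarity of candidate embeddings; RefDiv, as described, drives this quantity up (diversity down). A minimal sketch of the measurement only, assuming precomputed embeddings; this is not the paper's actual protocol:

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity over N candidate embeddings (N x d).
    Higher similarity means a less diverse candidate pool."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    return float((sims.sum() - n) / (n * (n - 1)))  # mean of off-diagonal

rng = np.random.default_rng(0)
diverse = rng.normal(size=(8, 32))                        # unrelated candidates
collapsed = diverse[0] + 0.01 * rng.normal(size=(8, 32))  # near-duplicates
print(mean_pairwise_cosine(diverse))    # low similarity: a diverse pool
print(mean_pairwise_cosine(collapsed))  # near 1.0: diversity has collapsed
```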
[3] Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech
Yuxin Li, Eng Siong Chng, Cuntai Guan
Main category: cs.CL
TL;DR: HAREN-CTC is a novel architecture for speech-based depression detection that integrates multi-layer SSL features using cross-attention and CTC loss to handle sparse temporal supervision, achieving state-of-the-art performance.
Details
Motivation: Traditional speech-based depression detection methods struggle with extracting meaningful features and capturing sparse, heterogeneous depressive cues over time, often overfitting to specific datasets and failing to leverage the full hierarchical structure of SSL models.
Method: Proposes HAREN-CTC with two key modules: Hierarchical Adaptive Clustering that reorganizes SSL features into complementary embeddings, and Cross-Modal Fusion that models inter-layer dependencies through cross-attention (sketched after the abstract below), combined with CTC loss for alignment-aware training.
Result: Achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA datasets, outperforming prior methods in both standard data splits and generalization settings with five-fold cross-validation.
Conclusion: HAREN-CTC effectively addresses the limitations of existing SDD methods by leveraging multi-layer SSL features through hierarchical organization and cross-attention, demonstrating superior performance in detecting subtle depression signals from speech.
Abstract: Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.
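The cross-attention idea behind the fusion module, letting one feature stream attend over a stack of SSL layer features, can be sketched with standard multi-head attention. A hypothetical PyTorch sketch; the dimensions, layer count, and residual connection are assumptions, not HAREN-CTC's verified architecture:

```python
import torch
import torch.nn as nn

class LayerCrossAttention(nn.Module):
    """Fuse multi-layer SSL features by letting a query stream attend
    over the flattened stack of layer embeddings (illustrative only)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feats, layer_feats):
        # query_feats: (batch, time, dim), e.g. top-layer features
        # layer_feats: (batch, num_layers * time, dim), all layers flattened
        fused, _ = self.attn(query_feats, layer_feats, layer_feats)
        return fused + query_feats  # residual keeps the query stream intact

x = torch.randn(2, 50, 768)           # top-layer WavLM-style features
stack = torch.randn(2, 13 * 50, 768)  # 13 layers flattened over time
print(LayerCrossAttention()(x, stack).shape)  # torch.Size([2, 50, 768])
```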
[4] Systematic Diagnosis of Brittle Reasoning in Large Language Models
V. S. Raghu Parupudi
Main category: cs.CL
TL;DR: A framework to measure mathematical reasoning in AI models by analyzing step-by-step reasoning on GSM8K, identifying specific failure points through error categorization and unsupervised clustering of reasoning modes.
Details
Motivation: To understand the extent of mathematical comprehension in machine learning models beyond standard benchmarks by diagnosing specific failure points and reasoning patterns.
Method: Generate structured step-by-step reasoning from GPT-3.5-turbo on GSM8K dataset, then use GPT-4o-mini to categorize errors and perform unsupervised clustering of reasoning sentences to identify emergent reasoning modes (sketched after the abstract below).
Result: Reveals a nonhuman-like brittleness in AI reasoning - near-perfect accuracy on procedural modes like sequential calculation, but performance plummets on modes requiring combinatorial reasoning with restrictions.
Conclusion: Provides a granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable AI applications by quantifying distinct reasoning skills.
Abstract: A central question in artificial intelligence is the extent to which machine learning models comprehend mathematics. To address this, we propose a novel framework for measuring mathematical reasoning that moves beyond standard benchmarks to diagnose specific failure points. Our method first generates structured, step-by-step reasoning from gpt-3.5-turbo on the GSM8K dataset. We then use a more capable analyst model, gpt-4o-mini, to categorize errors and, crucially, perform an unsupervised clustering of every reasoning sentence to identify emergent “reasoning modes.” This analysis reveals a cognitive profile with a stark, nonhuman-like brittleness: while the model achieves near-perfect accuracy on procedural modes like sequential calculation, its performance on modes requiring combinatorial reasoning with restrictions plummets. By identifying and quantifying the reliability of these distinct reasoning skills, our work provides a more granular method to evaluate mathematical comprehension and offers a precise roadmap for developing new capabilities and more reliable future applications.
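The clustering step that surfaces "reasoning modes" amounts to embedding each reasoning sentence and running an unsupervised clusterer. A minimal scikit-learn sketch; TF-IDF features and the cluster count stand in for whatever representation and k the paper actually used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "First, multiply 12 by 4 to get 48.",
    "Then add 48 and 7 to get 55.",
    "Count the arrangements where no two adjacent seats match.",
    "Subtract the restricted cases from the total count.",
]

# Embed sentences (TF-IDF as a stand-in for a stronger encoder).
X = TfidfVectorizer().fit_transform(sentences)

# Cluster into candidate "reasoning modes"; k is a free choice here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for sentence, label in zip(sentences, labels):
    print(label, sentence)  # inspect which sentences share a mode
```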
[5] Mnemosyne: An Unsupervised, Human-Inspired Long-Term Memory Architecture for Edge-Based LLMs
Aneesh Jonelagadda, Christina Hahn, Haoze Zheng, Salvatore Penachio
Main category: cs.CL
TL;DR: Mnemosyne is an unsupervised, human-inspired long-term memory architecture for edge-based LLMs that uses graph-structured storage with filtering, committing/pruning mechanisms, and probabilistic recall with temporal decay. It outperforms baseline methods in healthcare dialogue evaluations.
Details
Motivation: Current LLM memory systems rely on brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices and struggle with longitudinal conversations where repetitive but temporally distinct dialogues occur.
Method: Uses graph-structured storage with modular substance and redundancy filters, memory committing and pruning mechanisms, probabilistic recall with temporal decay and refresh processes modeled after human memory (sketched after the abstract below), and creates a concentrated “core summary” from memory graph.
Result: Achieved 65.8% win rate in blind human evaluations vs 31.1% for baseline RAG, highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval, and second highest overall score of 54.6% beating Mem0 and OpenAI baselines.
Conclusion: Improved factual recall, enhanced temporal reasoning, and more natural user-facing responses are feasible with an edge-compatible and easily transferable unsupervised memory architecture.
Abstract: Long-term memory is essential for natural, realistic dialogue. However, current large language model (LLM) memory systems rely on either brute-force context expansion or static retrieval pipelines that fail on edge-constrained devices. We introduce Mnemosyne, an unsupervised, human-inspired long-term memory architecture designed for edge-based LLMs. Our approach uses graph-structured storage, modular substance and redundancy filters, memory committing and pruning mechanisms, and probabilistic recall with temporal decay and refresh processes modeled after human memory. Mnemosyne also introduces a concentrated “core summary” efficiently derived from a fixed-length subset of the memory graph to capture the user’s personality and other domain-specific long-term details such as, to take a healthcare application as an example, post-recovery ambitions and attitude towards care. Unlike existing retrieval-augmented methods, Mnemosyne is designed for use in longitudinal healthcare assistants, where naive retrieval is limited by repetitive and semantically similar but temporally distinct conversations. In experiments with longitudinal healthcare dialogues, Mnemosyne demonstrates the highest win rate of 65.8% in blind human evaluations of realism and long-term memory capability, compared to a baseline RAG win rate of 31.1%. Mnemosyne also achieves the current highest LoCoMo benchmark scores in temporal reasoning and single-hop retrieval compared to other techniques with the same backbone. Further, its average overall score of 54.6% was the second highest across all methods, beating commonly used Mem0 and OpenAI baselines among others. This demonstrates that improved factual recall, enhanced temporal reasoning, and much more natural user-facing responses are feasible with an edge-compatible and easily transferable unsupervised memory architecture.
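Probabilistic recall with temporal decay and refresh can be sketched as an exponentially decaying retrieval score whose clock resets on access, loosely mirroring human rehearsal. A minimal sketch; the half-life and scoring rule are assumptions, not Mnemosyne's actual parameters:

```python
import math
import time

class MemoryNode:
    def __init__(self, content, relevance):
        self.content = content
        self.relevance = relevance       # similarity to the current query
        self.last_access = time.time()

    def recall_score(self, half_life_s=86_400.0):
        """Relevance decayed exponentially by time since last access."""
        age = time.time() - self.last_access
        return self.relevance * math.exp(-math.log(2) * age / half_life_s)

    def refresh(self):
        """Accessing a memory resets its decay clock (rehearsal effect)."""
        self.last_access = time.time()

node = MemoryNode("patient prefers morning walks", relevance=0.9)
node.last_access -= 2 * 86_400           # pretend it is two days old
print(round(node.recall_score(), 3))     # ~0.225 after two half-lives
node.refresh()
print(round(node.recall_score(), 3))     # back near 0.9
```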
[6] Confidence, Not Perplexity: A Better Metric for the Creative Era of LLMs
V. S. Raghu Parupudi
Main category: cs.CL
TL;DR: The paper proposes Confidence Score (CS) as a less biased alternative to reference-free metrics like self-perplexity for evaluating creative text generation, showing it significantly reduces creativity bias while maintaining evaluative strengths.
Details
Motivation: Traditional reference-free metrics like self-perplexity are strongly biased against creative text generation, favoring more conventional responses over novel and creative ones.
Method: Proposes Confidence Score (CS) derived from a model’s output probability distribution (one possible form is sketched after the abstract below). Conducts experiments using gpt-4o-mini on 99 creative prompts to compare CS with fluency-based metrics.
Result: CS preferred novel responses 19% of the time vs. 0% for fluency-based metrics (statistically significant difference with 95% CI: [11.1%, 27.3%]). CS also effectively distinguished between easy, medium, and hard tasks with non-overlapping confidence intervals.
Conclusion: The Confidence Score mitigates the creativity bias of traditional metrics while retaining their core evaluative strengths, offering a more balanced assessment for modern LLMs.
Abstract: Reference-free metrics like self-perplexity are strongly biased against creative text generation. We propose the Confidence Score (CS), derived from a model’s output probability distribution, as a less biased alternative. Experiments on gpt-4o-mini show that while fluency-based metrics prefer novel responses in 0% of cases on 99 creative prompts, our CS does so 19% of the time, a statistically significant difference (95% CI for difference: [11.1%, 27.3%]). We also show that CS effectively distinguishes between easy, medium, and hard tasks, confirmed by non-overlapping confidence intervals. The Confidence Score thus mitigates the creativity bias of traditional metrics while retaining their core evaluative strengths, offering a more balanced assessment for modern LLMs.
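One concrete reading of a score "derived from a model's output probability distribution" is the mean probability the model assigned to its own generated tokens. The summary does not give the paper's exact formula, so the definition below is an assumption:

```python
import math

def confidence_score(token_logprobs):
    """Mean per-token probability of a generated sequence.

    token_logprobs: log-probabilities the model assigned to the tokens
    it actually emitted (returned by most LLM APIs).
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

print(confidence_score([-0.05, -0.10, -0.02]))  # confident answer, ~0.95
print(confidence_score([-1.2, -2.0, -0.9]))     # hesitant answer, ~0.28
```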
[7] Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation
Devleena Das, Rajeev Patwari, Ashish Sirasao
Main category: cs.CL
TL;DR: Recover-LoRA is a lightweight, dataset-agnostic method that uses synthetic data and logit distillation to recover accuracy in degraded language models through selective layer LoRA adapters.
Details
Motivation: To address functional degradations in language models caused by inference optimizations like quantization, pruning, and improper serialization, focusing on recovery from any weight degradation sources rather than just quantization.
Method: Uses synthetic data and logit distillation to learn LoRA adapters on selective layers that align degraded models with their full precision counterparts (a distillation-loss sketch follows the abstract below).
Result: Recovers model accuracies by 5-17% on both multi-head attention (MHA) and group-query attention (GQA) small language models across diverse evaluation datasets.
Conclusion: Recover-LoRA effectively recovers accuracy in degraded models across various attention architectures and datasets, providing a general solution for model deployment optimization issues.
Abstract: Inference optimizations such as quantization, pruning, format and datatype conversion, model export, and serialization can lead to functional degradations in language model task performance. While most efforts on performance recovery for deployment focus on robust quantization techniques, we focus on recovering model accuracies from any sources that degrade model weights, such as improper model serialization. In this work, we propose Recover-LoRA, a lightweight and dataset agnostic method to recover accuracy in degraded models. Recover-LoRA uses synthetic data and logit distillation to learn LoRA adapters on selective layers that facilitate aligning the degraded model to its full precision model. We investigate the utility of Recover-LoRA across a diverse set of small language models (SLMs), including models with varying attention architectures, multi-head attention (MHA) and group-query attention (GQA), as well as several evaluation datasets. Our results show that Recover-LoRA recovers model accuracies by 5-17% on MHA and GQA SLMs.
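The logit-distillation objective aligns a degraded, LoRA-adapted student with its full-precision teacher on synthetic inputs. A minimal PyTorch sketch of a standard KL distillation loss; the temperature, tensor shapes, and loss form are common defaults assumed here, not confirmed details of Recover-LoRA:

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student.

    Both tensors: (batch, seq_len, vocab). In Recover-LoRA-style training,
    only the LoRA adapter weights on selected layers would be trainable.
    """
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * t * t

student = torch.randn(4, 16, 32000, requires_grad=True)  # degraded model
teacher = torch.randn(4, 16, 32000)                       # full-precision ref
loss = logit_distillation_loss(student, teacher)
loss.backward()  # in practice, gradients reach only the adapter weights
```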
[8] Human Texts Are Outliers: Detecting LLM-generated Texts via Out-of-distribution Detection
Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, Zhiqiang Xu
Main category: cs.CL
TL;DR: The paper proposes reframing AI-generated text detection as an out-of-distribution (OOD) detection problem rather than binary classification, using one-class learning methods to achieve better generalization across domains and models.
Details
Motivation: Existing binary classification approaches for AI-generated text detection suffer from poor generalization across domains and models because they incorrectly assume human-written texts form a coherent distribution, when in reality human texts are highly diverse and cannot be effectively captured through limited sampling.
Method: The authors propose an OOD detection framework using one-class learning methods (DeepSVDD and HRN) and score-based learning techniques (energy-based method, sketched after the abstract below), treating machine-generated texts as in-distribution samples and human-written texts as distributional outliers.
Result: The OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on DeepFake dataset, and demonstrates robust performance across multilingual, attacked, and unseen-model/domain text settings.
Conclusion: Reframing AI-generated text detection as an OOD detection problem enables more robust and generalizable performance compared to traditional binary classification approaches, particularly when dealing with the inherent diversity of human-written texts.
Abstract: The rapid advancement of large language models (LLMs) such as ChatGPT, DeepSeek, and Claude has significantly increased the presence of AI-generated text in digital communication. This trend has heightened the need for reliable detection methods to distinguish between human-authored and machine-generated content. Existing approaches, both zero-shot methods and supervised classifiers, largely conceptualize this task as a binary classification problem, often leading to poor generalization across domains and models. In this paper, we argue that such a binary formulation fundamentally mischaracterizes the detection task by assuming a coherent representation of human-written texts. In reality, human texts do not constitute a unified distribution, and their diversity cannot be effectively captured through limited sampling. This causes previous classifiers to memorize observed OOD characteristics rather than learn the essence of 'non-ID' behavior, limiting generalization to unseen human-authored inputs. Based on this observation, we propose reframing the detection task as an out-of-distribution (OOD) detection problem, treating human-written texts as distributional outliers while machine-generated texts are in-distribution (ID) samples. To this end, we develop a detection framework using one-class learning methods, including DeepSVDD and HRN, and score-based learning techniques such as the energy-based method, enabling robust and generalizable performance. Extensive experiments across multiple datasets validate the effectiveness of our OOD-based approach. Specifically, the OOD-based method achieves 98.3% AUROC and AUPR with only 8.9% FPR95 on the DeepFake dataset. Moreover, we test our detection framework on multilingual, attacked, and unseen-model and -domain text settings, demonstrating the robustness and generalizability of our framework. Code, pretrained weights, and demo will be released.
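Of the scoring functions named, the energy-based one is the simplest to show: the energy of a logit vector is its negative log-sum-exp, and higher energy flags a likely outlier (here, human-written text). A minimal sketch; the threshold is an assumption that would be tuned on validation data:

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score E(x) = -T * logsumexp(logits / T).

    Under this paper's framing, machine-generated text (in-distribution)
    should score low energy; human text (the outlier class) high energy.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

logits = torch.tensor([[8.0, 0.5, 0.2],   # confident logits -> low energy
                       [0.4, 0.3, 0.5]])  # diffuse logits  -> high energy
scores = energy_score(logits)
threshold = -2.0  # assumed; chosen on held-out data in practice
print(scores, scores > threshold)  # True marks suspected human-written text
```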
[9] YpathRAG:A Retrieval-Augmented Generation Framework and Benchmark for Pathology
Deshui Yu, Yizhi Wang, Saihui Jin, Taojie Zhu, Fanyi Zeng, Wen Qian, Zirui Huang, Jingli Ouyang, Jiameng Li, Zhen Song, Tian Guan, Yonghong He
Main category: cs.CL
TL;DR: YpathRAG is a pathology-oriented RAG framework that uses dual-channel hybrid retrieval and LLM-based evidence judgment to improve factual reliability in pathology domains, achieving significant gains in retrieval quality and answer accuracy.
Details
Motivation: LLMs still hallucinate in high-barrier domains like pathology, and prior domain fine-tuning approaches don't expand knowledge boundaries or enforce evidence-grounded constraints.
Method: Built pathology vector database covering 28 subfields and 1.53M paragraphs; uses dual-channel hybrid retrieval (BGE-M3 dense + vocabulary-guided sparse retrieval) with LLM-based supportive-evidence judgment module (a rank-fusion sketch follows the abstract below).
Result: Recall@5 of 98.64% on YpathR benchmark (23pp gain over baseline); increased accuracies by 9.0% on average and up to 15.6% on YpathQA-M challenging questions for both general and medical LLMs.
Conclusion: YpathRAG demonstrates improved retrieval quality and factual reliability, providing scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
Abstract: Large language models (LLMs) excel on general tasks yet still hallucinate in high-barrier domains such as pathology. Prior work often relies on domain fine-tuning, which neither expands the knowledge boundary nor enforces evidence-grounded constraints. We therefore build a pathology vector database covering 28 subfields and 1.53 million paragraphs, and present YpathRAG, a pathology-oriented RAG framework with dual-channel hybrid retrieval (BGE-M3 dense retrieval coupled with vocabulary-guided sparse retrieval) and an LLM-based supportive-evidence judgment module that closes the retrieval-judgment-generation loop. We also release two evaluation benchmarks, YpathR and YpathQA-M. On YpathR, YpathRAG attains Recall@5 of 98.64%, a gain of 23 percentage points over the baseline; on YpathQA-M, a set of the 300 most challenging questions, it increases the accuracies of both general and medical LLMs by 9.0% on average and up to 15.6%. These results demonstrate improved retrieval quality and factual reliability, providing a scalable construction paradigm and interpretable evaluation for pathology-oriented RAG.
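Dual-channel hybrid retrieval ultimately needs a rule for merging the dense and sparse rankings. Reciprocal rank fusion is a common default for this; the sketch below uses it as a stand-in, since the summary does not specify YpathRAG's actual fusion rule:

```python
def reciprocal_rank_fusion(dense_ranking, sparse_ranking, k=60):
    """Fuse two rankings (lists of doc ids, best first) by RRF score."""
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["p17", "p03", "p42"]   # e.g. BGE-M3 nearest neighbours
sparse = ["p03", "p88", "p17"]  # e.g. vocabulary-guided sparse matches
print(reciprocal_rank_fusion(dense, sparse))
# ['p03', 'p17', 'p88', 'p42']: documents both channels agree on rise
```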
[10] Identifying & Interactively Refining Ambiguous User Goals for Data Visualization Code Generation
Mert İnan, Anthony Sicilia, Alex Xie, Saujas Vaduguru, Daniel Fried, Malihe Alikhani
Main category: cs.CL
TL;DR: This paper analyzes ambiguities in human-AI communication for data visualization code generation, develops a taxonomy of ambiguity types, proposes metrics to quantify them, and explores how multi-turn dialogue with pragmatic models can reduce ambiguity and improve code accuracy.
Details
Motivation: Ambiguities in natural language can lead to AI outputs that seem correct but fail to reflect the speaker's intent, particularly in data visualization where code generation is sensitive to linguistic ambiguities.
Method: Developed a taxonomy of ambiguity types in data visualization tasks, proposed metrics to quantify ambiguities, and evaluated three pragmatic dialogue models (Gricean Cooperativity, Discourse Representation Theory, Questions under Discussion) through simulated user studies using Matplotlib problems from the DS-1000 dataset.
Result: The proposed ambiguity metrics correlate better with human annotations than uncertainty baselines. Multi-turn pragmatic dialogues effectively reduce ambiguity and enhance code accuracy by better matching user goals.
Conclusion: Multi-turn dialogue exchanges with pragmatic models are valuable for reducing ambiguity in code generation tasks, particularly in data visualization, leading to more accurate outputs that better reflect user intent.
Abstract: Establishing shared goals is a fundamental step in human-AI communication. However, ambiguities can lead to outputs that seem correct but fail to reflect the speaker’s intent. In this paper, we explore this issue with a focus on the data visualization domain, where ambiguities in natural language impact the generation of code that visualizes data. The availability of multiple views of the context (e.g., the intended plot and the code rendering the plot) allows for a unique and comprehensive analysis of diverse ambiguity types. We develop a taxonomy of types of ambiguity that arise in this task and propose metrics to quantify them. Using Matplotlib problems from the DS-1000 dataset, we demonstrate that our ambiguity metrics better correlate with human annotations than uncertainty baselines. Our work also explores how multi-turn dialogue can reduce ambiguity and therefore improve code accuracy by better matching user goals. We evaluate three pragmatic models to inform our dialogue strategies: Gricean Cooperativity, Discourse Representation Theory, and Questions under Discussion. A simulated user study reveals how pragmatic dialogues reduce ambiguity and enhance code accuracy, highlighting the value of multi-turn exchanges in code generation.
[11] LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback
Raffaele Mura, Giorgio Piras, Kamilė Lukošiūtė, Maura Pintor, Amin Karbasi, Battista Biggio
Main category: cs.CL
TL;DR: LatentBreak is a white-box jailbreak attack that generates natural adversarial prompts with low perplexity to evade perplexity-based defenses, outperforming existing methods.
Details
Motivation: Existing jailbreak attacks can be detected by simple perplexity-based filtering, so there's a need for more sophisticated attacks that evade such defenses.
Method: Substitutes words in input prompts with semantically-equivalent alternatives by minimizing distance in latent space between adversarial prompts and harmless requests, avoiding high-perplexity suffixes or templates (sketched after the abstract below).
Result: LatentBreak generates shorter, low-perplexity prompts that outperform competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
Conclusion: LatentBreak successfully evades perplexity-based defenses while maintaining attack effectiveness, demonstrating the need for more robust safety mechanisms beyond simple perplexity filtering.
Abstract: Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.
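The core LatentBreak loop, as described, greedily swaps each word for the semantically equivalent candidate that pulls the prompt's latent representation closest to harmless requests. The sketch below uses a stand-in `embed` function and a toy synonym table; the encoder, synonym source, and distance metric are all assumptions:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a model's latent representation of a prompt."""
    rng = np.random.default_rng(sum(map(ord, text)))  # deterministic toy
    return rng.normal(size=16)

def latent_substitute(words, synonyms, harmless_center):
    """Greedily replace each word with the candidate minimizing latent
    distance between the full prompt and a harmless-request centroid."""
    for i, word in enumerate(words):
        candidates = [word] + synonyms.get(word, [])
        def dist(cand):
            trial = words[:i] + [cand] + words[i + 1:]
            return np.linalg.norm(embed(" ".join(trial)) - harmless_center)
        words[i] = min(candidates, key=dist)
    return " ".join(words)

prompt = "please explain the process".split()
syns = {"explain": ["describe", "clarify"], "process": ["procedure"]}
print(latent_substitute(prompt, syns, embed("how do plants grow")))
```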
[12] Toward a Safer Web: Multilingual Multi-Agent LLMs for Mitigating Adversarial Misinformation Attacks
Nouar Aldahoul, Yasir Zaki
Main category: cs.CL
TL;DR: This paper presents a multilingual LLM framework with retrieval-augmented generation to detect misinformation through language-switching, query inflation, and structural reformatting attacks, deployable as a web plugin.
Details
Motivation: The rapid spread of misinformation threatens public discourse and decision-making, with specific adversarial transformations like language-switching and structural reformatting not being systematically studied before.
Method: Developed a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms.
Result: The framework successfully detects misinformation across diverse adversarial attacks including language-switching across multiple languages, query length inflation, and structural reformatting into multiple-choice questions.
Conclusion: AI-driven misinformation detection is crucial for safeguarding online factual integrity against diverse attacks, and plugin-based deployment is feasible for real-world web applications.
Abstract: The rapid spread of misinformation on digital platforms threatens public discourse, emotional stability, and decision-making. While prior work has explored various adversarial attacks in misinformation detection, the specific transformations examined in this paper have not been systematically studied. In particular, we investigate language-switching across English, French, Spanish, Arabic, Hindi, and Chinese, followed by translation. We also study query length inflation preceding summarization and structural reformatting into multiple-choice questions. In this paper, we present a multilingual, multi-agent large language model framework with retrieval-augmented generation that can be deployed as a web plugin into online platforms. Our work underscores the importance of AI-driven misinformation detection in safeguarding online factual integrity against diverse attacks, while showcasing the feasibility of plugin-based deployment for real-world web applications.
[13] DDO: Dual-Decision Optimization for LLM-Based Medical Consultation via Multi-Agent Collaboration
Zhihao Jia, Mingyi Jia, Junwen Duan, Jianxin Wang
Main category: cs.CL
TL;DR: DDO is a novel LLM-based framework that decouples medical consultation into symptom inquiry and disease diagnosis sub-tasks, optimizing them with distinct objectives through multi-agent collaboration.
Details
Motivation: Existing LLM-based methods fail to capture the dual nature of medical consultation, which involves both sequential symptom inquiry and classification-based disease diagnosis, leading to ineffective symptom inquiry and unreliable diagnosis.
Method: Proposes DDO framework that performs dual-decision optimization by decoupling symptom inquiry and disease diagnosis into separate sub-tasks, optimizing them with distinct objectives through a collaborative multi-agent workflow.
Result: Experiments on three real-world medical consultation datasets show DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods.
Conclusion: DDO demonstrates effectiveness in medical consultation tasks by properly addressing the dual nature of the problem through decoupled optimization of symptom inquiry and disease diagnosis sub-tasks.
Abstract: Large Language Models (LLMs) demonstrate strong generalization and reasoning abilities, making them well-suited for complex decision-making tasks such as medical consultation (MC). However, existing LLM-based methods often fail to capture the dual nature of MC, which entails two distinct sub-tasks: symptom inquiry, a sequential decision-making process, and disease diagnosis, a classification problem. This mismatch often results in ineffective symptom inquiry and unreliable disease diagnosis. To address this, we propose DDO, a novel LLM-based framework that performs Dual-Decision Optimization by decoupling the two sub-tasks and optimizing them with distinct objectives through a collaborative multi-agent workflow. Experiments on three real-world MC datasets show that DDO consistently outperforms existing LLM-based approaches and achieves competitive performance with state-of-the-art generation-based methods, demonstrating its effectiveness in the MC task. The code is available at https://github.com/zh-jia/DDO.
[14] Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations
Yu Liu, Hanlei Shi, Haoxun Li, Yuqing Sun, Yuxuan Ding, Linlin Gong, Leyuan Qu, Taihao Li
Main category: cs.CL
TL;DR: The paper proposes a hotspot-centric approach for Emotion Recognition in Conversations (ERC) that focuses on detecting emotion-salient spans across modalities, using Hotspot-Gated Fusion and Mixture-of-Aligners to handle sparse evidence and modality misalignment.
Details
Motivation: ERC is challenging because discriminative evidence is sparse, localized, and often asynchronous across text, audio, and video modalities, requiring methods that can focus on salient regions while handling cross-modal misalignment.
Method: A unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion (sketched after the abstract below), aligns modalities using a routed Mixture-of-Aligners, and encodes conversational structure via cross-modal graph.
Result: Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of Hotspot-Gated Fusion and Mixture-of-Aligners components.
Conclusion: The hotspot-centric view offers a new perspective on modality fusion in ERC and can inform future multimodal learning approaches by focusing modeling on salient spans while mitigating misalignment issues.
Abstract: Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion, and aligns modalities using a routed Mixture-of-Aligners; a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.
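Hotspot-Gated Fusion, as summarized, blends local hotspot features with global utterance features. A common way to realize such gating is a learned sigmoid gate over the concatenated features; the sketch below shows that generic pattern, not the paper's verified layer:

```python
import torch
import torch.nn as nn

class HotspotGatedFusion(nn.Module):
    """Blend local hotspot features with global features via a learned
    gate (a generic gating pattern, illustrative only)."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, hotspot, global_feat):  # both: (batch, dim)
        g = self.gate(torch.cat([hotspot, global_feat], dim=-1))
        return g * hotspot + (1 - g) * global_feat  # per-dimension soft mix

fusion = HotspotGatedFusion()
out = fusion(torch.randn(4, 256), torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 256])
```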
[15] MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation
Weihua Zheng, Zhengyuan Liu, Tanmoy Chakraborty, Weiwen Xu, Xiaoxue Gao, Bryan Chen Zhengyu Tan, Bowei Zou, Chang Liu, Yujia Hu, Xing Xie, Xiaoyuan Yi, Jing Yao, Chaojun Wang, Long Li, Rui Liu, Huiyao Liu, Koji Inoue, Ryuichi Sumida, Tatsuya Kawahara, Fan Xu, Lingyu Ye, Wei Tian, Dongjun Kim, Jimin Jung, Jaehyung Seo, Nadya Yuki Wangsajaya, Pham Minh Duc, Ojasva Saxena, Palash Nandi, Xiyan Tao, Wiwik Karlina, Tuan Luong, Keertana Arun Vasan, Roy Ka-Wei Lee, Nancy F. Chen
Main category: cs.CL
TL;DR: MMA-ASIA is a comprehensive framework to evaluate LLMs’ cultural awareness in Asian contexts through a multilingual, multimodal benchmark covering 8 countries and 10 languages with 27,000 questions requiring cultural reasoning.
Details
Motivation: LLMs' multimodal understanding and reasoning often degrade outside Western, high-resource settings, highlighting the need to evaluate cultural awareness in diverse contexts.
Method: Human-curated multilingual multimodal benchmark with 27,000 questions, cross-modal alignment, five-dimensional evaluation protocol, Cultural Awareness Grounding Validation Module, and Vision-ablated Prefix Replay (VPR) method.
Result: The framework enables direct tests of cross-modal transfer and detects shortcut learning, providing insights into why models diverge across languages and modalities.
Conclusion: MMA-ASIA offers actionable insights for building culturally reliable multimodal LLMs by systematically evaluating cultural awareness disparities and cross-modal consistency.
Abstract: Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects “shortcut learning” by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.
[16] GraphGhost: Tracing Structures Behind Large Language Models
Xinnan Dai, Kai Guo, Chung-Hsiang Lo, Shenglai Zeng, Jiayuan Ding, Dongsheng Luo, Subhabrata Mukherjee, Jiliang Tang
Main category: cs.CL
TL;DR: GraphGhost is a unified framework that represents neuron activations and signal propagation in LLMs as graphs, enabling analysis of structural reasoning mechanisms through graph algorithms and interventions.
Details
Motivation: To understand the structural mechanisms underlying LLMs' reasoning capabilities, which remain under-explored despite their remarkable performance.
Method: Represent neuron activations and signal propagation as graphs, use graph algorithms like PageRank to characterize properties, and perform structural interventions on key neuron nodes (sketched after the abstract below).
Result: Reveals shared and model-specific reasoning behaviors across datasets, and shows that edits to key neuron nodes can trigger reasoning collapse affecting both logical flow and semantic understanding.
Conclusion: GraphGhost serves as a powerful tool for analyzing, intervening in, and understanding the structural foundations of reasoning in LLMs.
Abstract: Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, yet the structural mechanisms underlying these abilities remain underexplored. In this work, we introduce GraphGhost, a unified framework that represents neuron activations and their signal propagation as graphs, explaining how LLMs capture structural semantics from sequential inputs and generate outputs through structurally consistent mechanisms. This graph-based perspective enables us to employ graph algorithms such as PageRank to characterize the properties of LLMs, revealing both shared and model-specific reasoning behaviors across diverse datasets. We further identify the activated neurons within GraphGhost and evaluate them through structural interventions, showing that edits to key neuron nodes can trigger reasoning collapse, altering both logical flow and semantic understanding. Together, these contributions position GraphGhost as a powerful tool for analyzing, intervening in, and ultimately understanding the structural foundations of reasoning in LLMs.
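Once activations and their propagation are represented as a graph, off-the-shelf graph algorithms apply directly, which is the point of the framework. A toy sketch with networkx: PageRank ranks neurons by structural importance, and deleting a bridge node mimics the structural interventions described; the node names and weights are invented:

```python
import networkx as nx

# Toy activation graph: nodes are neurons, weighted edges are signal flow.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("n_emb", "n_attn1", 0.9), ("n_attn1", "n_mlp1", 0.7),
    ("n_emb", "n_mlp1", 0.2), ("n_mlp1", "n_out", 0.8),
])

# PageRank highlights neurons central to the propagation structure.
ranks = nx.pagerank(G, weight="weight")
print(max(ranks, key=ranks.get))  # the structurally dominant node

# Structural intervention: removing the bridge neuron severs the flow.
G.remove_node("n_mlp1")
print(nx.has_path(G, "n_emb", "n_out"))  # False: reasoning path collapsed
```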
[17] CLARITY: Clinical Assistant for Routing, Inference, and Triage
Vladimir Shaposhnikov, Aleksandr Nesterov, Ilia Kopanichuk, Ivan Bakulin, Egor Zhelvakov, Ruslan Abramov, Ekaterina Tsapieva, Iaroslav Bespalov, Dmitry V. Dylov, Ivan Oseledets
Main category: cs.CL
TL;DR: CLARITY is an AI clinical assistant platform that combines FSM-structured dialogue with LLM-powered agents for patient routing, consultations, and severity assessment, achieving human-level performance with faster consultation times.
Details
Motivation: To create an efficient AI-driven platform for patient-to-specialist routing and clinical consultations that can integrate with existing healthcare workflows and IT solutions.
Method: Hybrid architecture combining Finite State Machine for structured dialogue flows with collaborative agents using Large Language Models to analyze symptoms and prioritize referrals (an FSM sketch follows the abstract below). Built on modular microservices framework.
Result: Successfully integrated into national interhospital platform with 55,000+ user dialogues in 2 months. Validation on 2,500 expert-annotated dialogues showed CLARITY surpasses human-level performance in first-attempt routing precision and reduces consultation duration by up to 3 times.
Conclusion: CLARITY provides safe, efficient, and robust clinical assistance that can scale to meet healthcare demands while significantly improving routing accuracy and consultation efficiency compared to human performance.
Abstract: We present CLARITY (Clinical Assistant for Routing, Inference and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patient conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ Large Language Models (LLMs) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale national interhospital platform, with more than 55,000 content-rich user dialogues completed within the two months of deployment, 2,500 of which were expert-annotated for subsequent validation. The validation results show that CLARITY surpasses human-level performance in terms of first-attempt routing precision, while requiring up to 3 times shorter consultation duration than with a human.
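The FSM half of the hybrid architecture constrains the dialogue to legal state transitions while LLM agents do the reasoning inside each state. A toy sketch of that split; the states and transition table are invented for illustration, not CLARITY's actual dialogue graph:

```python
# Hypothetical consultation FSM: each state lists its legal successors.
TRANSITIONS = {
    "greeting": ["collect_symptoms"],
    "collect_symptoms": ["collect_symptoms", "assess_severity"],
    "assess_severity": ["route_specialist", "emergency"],
    "route_specialist": ["done"],
    "emergency": ["done"],
}

def step(state, next_state):
    """Advance the dialogue only along edges the FSM permits."""
    if next_state not in TRANSITIONS.get(state, []):
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state

state = "greeting"
for choice in ["collect_symptoms", "assess_severity", "route_specialist", "done"]:
    state = step(state, choice)  # an LLM agent would pick from the legal set
print(state)  # done
```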
[18] The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf
Main category: cs.CL
TL;DR: Comparative study shows full spoken conversation input yields highest performance for Spoken Dialog State Tracking with Speech-LLMs, with attention-pooling compression offering good trade-off.
Details
Motivation: To systematically evaluate different context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs and identify the most effective approach.
Method: Systematically compared three approaches: traditional multimodal context (text history + spoken current turn), full spoken history, and compressed spoken history using attention-pooling (sketched after the abstract below). Experiments conducted on SpokenWOZ corpus.
Result: Full spoken conversation as input achieved highest performance among similar-sized models, significantly surpassing prior methods. Attention-pooling compression maintained competitive accuracy with reduced context size.
Conclusion: Full spoken history provides superior context utilization for Spoken Dialog State Tracking, while attention-pooling compression offers an effective balance between performance and computational efficiency.
Abstract: This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs. We systematically evaluate traditional multimodal context (combining text history and spoken current turn), full spoken history, and compressed spoken history approaches. Our experiments on the SpokenWOZ corpus demonstrate that providing the full spoken conversation as input yields the highest performance among models of similar size, significantly surpassing prior methods. Furthermore, we show that attention-pooling-based compression of the spoken history offers a strong trade-off, maintaining competitive accuracy with reduced context size. Detailed analysis confirms that improvements stem from more effective context utilization.
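Attention-pooling compression of the spoken history can be sketched as a small set of learned queries attending over the full history encoding, returning a fixed-size summary regardless of dialogue length. The sizes below are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Compress a variable-length history (T frames) to K summary vectors."""
    def __init__(self, dim=512, k=16, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, history):  # history: (batch, T, dim)
        q = self.queries.unsqueeze(0).expand(history.size(0), -1, -1)
        pooled, _ = self.attn(q, history, history)
        return pooled  # (batch, K, dim): fixed size, whatever T was

history = torch.randn(2, 1200, 512)    # long spoken-history encoding
print(AttentionPool()(history).shape)  # torch.Size([2, 16, 512])
```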
[19] Gender Bias in Large Language Models for Healthcare: Assignment Consistency and Clinical Implications
Mingxuan Liu, Yuhe Ke, Wentao Zhu, Mayli Mertens, Yilin Ning, Jingchi Liao, Chuan Hong, Daniel Shu Wei Ting, Yifan Peng, Danielle S. Bitterman, Marcus Eng Hock Ong, Nan Liu
Main category: cs.CL
TL;DR: LLMs in healthcare show gender bias inconsistencies in judging patient gender relevance and necessity for diagnosis, despite relatively consistent diagnoses across different LLM gender assignments.
Details
Motivation: To investigate whether LLMs assuming human-like clinical roles may replicate or amplify gender-related biases that have historically influenced physician behaviors and patient outcomes.
Method: Used case studies from NEJM Challenge, assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs, and evaluated response consistency across LLM-gender assignments for diagnosis and judgments on patient gender relevance/necessity.
Result: Diagnoses were relatively consistent across LLM genders for most models, but all models showed substantial inconsistency in judging patient gender relevance and necessity, particularly for relevance judgments, with some models displaying systematic female-male disparity.
Conclusion: This underexplored bias could undermine LLM reliability in clinical practice, highlighting the need for routine checks of identity-assignment consistency to ensure equitable AI-supported clinical care.
Abstract: The integration of large language models (LLMs) into healthcare holds promise to enhance clinical decision-making, yet their susceptibility to biases remains a critical concern. Gender has long influenced physician behaviors and patient outcomes, raising concerns that LLMs assuming human-like roles, such as clinicians or medical educators, may replicate or amplify gender-related biases. Using case studies from the New England Journal of Medicine Challenge (NEJM), we assigned genders (female, male, or unspecified) to multiple open-source and proprietary LLMs. We evaluated their response consistency across LLM-gender assignments regarding both LLM-based diagnosis and models' judgments on the clinical relevance or necessity of patient gender. In our findings, diagnoses were relatively consistent across LLM genders for most models. However, for patient gender’s relevance and necessity in LLM-based diagnosis, all models demonstrated substantial inconsistency across LLM genders, particularly for relevance judgements. Some models even displayed a systematic female-male disparity in their interpretation of patient gender. These findings present an underexplored bias that could undermine the reliability of LLMs in clinical practice, underscoring the need for routine checks of identity-assignment consistency when interacting with LLMs to ensure reliable and equitable AI-supported clinical care.
[20] Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking
Mohammad Hossein Sameti, Sepehr Harfi Moridani, Ali Zarean, Hossein Sameti
Main category: cs.CL
TL;DR: Proposes an accent-invariant ASR framework that uses accent classification and spectrogram masking for data augmentation to improve robustness against accent variations in English and Persian speech.
Details
Motivation: Transformer-based ASR models are sensitive to accent and dialectal variations, leading to high word error rates in linguistically diverse languages like English and Persian.
Method: Train spectrogram-based classifier to capture accent cues, mask influential regions (sketched after the abstract below), and use masked spectrograms for data augmentation to enhance ASR robustness.
Result: Substantial WER reductions in both English and Persian settings using Whisper model, with new Persian dataset establishing first systematic benchmark for accent variation.
Conclusion: Effectively advances multilingual ASR systems resilient to accent and dialect diversity, providing foundation for future studies on low-resource, linguistically diverse languages.
Abstract: Pre-trained transformer-based models have significantly advanced automatic speech recognition (ASR), yet they remain sensitive to accent and dialectal variations, resulting in elevated word error rates (WER) in linguistically diverse languages such as English and Persian. To address this challenge, we propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. This enhances the robustness of ASR models against accent variability. We evaluate the method using both English and Persian speech. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR that fills a critical gap in multilingual speech research and provides a foundation for future studies on low-resource, linguistically diverse languages. Experimental results with the Whisper model demonstrate that our masking and augmentation strategy yields substantial WER reductions in both English and Persian settings, confirming the effectiveness of the approach. This research advances the development of multilingual ASR systems that are resilient to accent and dialect diversity. Code and dataset are publicly available at: https://github.com/MH-Sameti/Accent_invariant_ASR
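The augmentation step zeroes the spectrogram regions most influential for the accent classifier and reuses the masked spectrograms as training data. A minimal gradient-saliency sketch; the placeholder classifier, mask fraction, and saliency definition are assumptions, not the paper's confirmed setup:

```python
import torch
import torch.nn as nn

def saliency_mask(spec, classifier, target, mask_frac=0.1):
    """Zero the time-frequency bins with the largest gradient magnitude
    for the accent classifier's loss (illustrative augmentation step)."""
    spec = spec.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(classifier(spec), target)
    loss.backward()
    saliency = spec.grad.abs().flatten()
    k = int(mask_frac * saliency.numel())
    _, idx = saliency.topk(k)
    masked = spec.detach().clone().flatten()
    masked[idx] = 0.0  # hide the accent-salient regions
    return masked.view_as(spec)

# Placeholder accent classifier over flattened 80x200 log-mel spectrograms.
clf = nn.Sequential(nn.Flatten(), nn.Linear(80 * 200, 4))
spec = torch.randn(1, 80, 200)
augmented = saliency_mask(spec, clf, target=torch.tensor([2]))
print((augmented == 0).float().mean())  # ~0.1 of bins masked
```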
[21] Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech
Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper
Main category: cs.CL
TL;DR: Phonikud is a lightweight G2P system for Hebrew that generates fully-specified IPA transcriptions, enabling effective real-time TTS with superior speed-accuracy trade-offs.
Details
Motivation: Existing Hebrew TTS solutions ignore crucial phonetic features like stress and have underspecified transcriptions even with vowel marks, limiting their accuracy and effectiveness.
Method: Adapts an existing diacritization model with lightweight adaptors to create a G2P system that outputs fully-specified IPA transcriptions, and contributes the ILSpeech dataset with IPA annotations.
Result: Phonikud more accurately predicts phonemes from Hebrew text compared to prior methods, enabling training of effective real-time Hebrew TTS models with superior performance.
Conclusion: The introduced Phonikud system and ILSpeech dataset address Hebrew’s orthographic complexity and enable high-quality real-time TTS with fully-specified phonetic details.
Abstract: Real-time text-to-speech (TTS) for Modern Hebrew is challenging due to the language’s orthographic complexity. Existing solutions ignore crucial phonetic features such as stress that remain underspecified even when vowel marks are added. To address these limitations, we introduce Phonikud, a lightweight, open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified IPA transcriptions. Our approach adapts an existing diacritization model with lightweight adaptors, incurring negligible additional latency. We also contribute the ILSpeech dataset of transcribed Hebrew speech with IPA annotations, serving as a benchmark for Hebrew G2P, as training data for TTS systems, and enabling audio-to-IPA for evaluating TTS performance while capturing important phonetic details. Our results demonstrate that Phonikud G2P conversion more accurately predicts phonemes from Hebrew text compared to prior methods, and that this enables training of effective real-time Hebrew TTS models with superior speed-accuracy trade-offs. We release our code, data, and models at https://phonikud.github.io.
[22] Iterative LLM-Based Generation and Refinement of Distracting Conditions in Math Word Problems
Kaiqi Yang, Hang Li, Yucheng Chu, Zitao Liu, Mi Tian, Hui Liu
Main category: cs.CL
TL;DR: The paper presents an iterative LLM-based framework to automatically generate math word problems with meaningful distracting conditions while preserving original solutions, addressing limitations of existing datasets.
Details
Motivation: Existing MWP datasets lack distracting conditions, and when added manually, they often change solutions requiring intensive manual effort. Current datasets have low difficulty and out-of-context expressions that are easy to detect.
Method: An iterative framework using LLMs with specialized prompts to revise MWPs from multiple perspectives and cognitive levels, generating distracting conditions that don’t alter original solutions (sketched after the abstract below).
Result: The framework efficiently generates high-quality MWPs with meaningful distracting conditions while eliminating the need to produce new answers, substantially reducing manual effort.
Conclusion: The proposed framework provides an effective solution for creating credible benchmarking datasets with distracting conditions, maintaining data quality while minimizing manual intervention.
Abstract: Mathematical reasoning serves as a crucial testbed for evaluating the intelligence of large language models (LLMs), and math word problems (MWPs) represent one of the most widely used formats. Most existing MWP datasets contain only the necessary information, while problems with distracting or excessive conditions are often overlooked. Prior studies have shown that popular LLMs experience a dramatic performance drop when such distracting conditions are introduced. However, available datasets of MWPs with distracting conditions remain limited, and most exhibit low difficulty and out-of-context expressions. These shortcomings make the distracting conditions easy to detect and disregard, thereby reducing the credibility of benchmarking on these datasets. Moreover, when distracting conditions are added, the reasoning process and answers may change, requiring intensive manual effort to check and rewrite solutions. To address these issues, we design an iterative framework that leverages LLMs to generate distracting conditions automatically. We develop a set of prompts to revise MWPs from multiple perspectives and cognitive levels, encouraging the creation of meaningful distracting conditions as well as suggestions for further refinement. A key advantage of our framework is the preservation of shared solutions between the original and revised problems: the LLMs are explicitly guided to generate distractions that do not alter the original solution, thus eliminating the need to produce new answers. This framework is efficient and easy to deploy, substantially reducing the effort required to generate MWPs with distracting conditions while maintaining high data quality.
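The generate-and-refine loop can be sketched around a placeholder `llm(prompt) -> str` callable: propose a distracting condition, verify the answer is unchanged, and only then accept the revision. The prompts, the answer-string check, and the round limit are illustrative simplifications, not the paper's actual prompt set:

```python
def add_distractors(problem, answer, llm, max_rounds=3):
    """Iteratively add distracting conditions that must not change the
    original answer; `llm(prompt) -> str` is a placeholder client."""
    revised = problem
    for _ in range(max_rounds):
        candidate = llm(
            "Add a realistic but irrelevant condition to this math word "
            f"problem WITHOUT changing its solution:\n{revised}"
        )
        check = llm(f"Solve this; reply with the final answer only:\n{candidate}")
        if check.strip() == str(answer):
            revised = candidate  # solution preserved: accept, keep refining
        # otherwise discard the candidate and retry from the last good version
    return revised
```

Because accepted revisions are checked against the original answer, no new reference solutions need to be written, which is the manual-effort saving the paper emphasizes.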
[23] LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
Juan Miguel Navarro Carranza
Main category: cs.CL
TL;DR: This paper presents a method to test LLM generalization by evaluating them on paraphrased benchmark questions, revealing accuracy drops that suggest memorization and contamination issues.
Details
Motivation: To address concerns about benchmark score inflation in LLMs due to memorization of test items or near duplicates, and to probe true generalization capabilities.
Method: A simple protocol that re-evaluates models on paraphrased versions of benchmark questions, using Mistral-7B-Instruct and Qwen2.5-7B-Instruct on ARC-Easy and ARC-Challenge datasets with controlled decoding and robust paraphrase-cleaning.
Result: Paraphrasing induces a non-trivial accuracy drop between original and paraphrased items, consistent with concerns about contamination and brittle surface-form shortcuts.
Conclusion: The findings support prior concerns about benchmark contamination and highlight the need for more robust evaluation methods that test true generalization rather than memorization.
Abstract: Benchmark scores for Large Language Models (LLMs) can be inflated by memorization of test items or near duplicates. We present a simple protocol that probes generalization by re-evaluating models on paraphrased versions of benchmark questions. Using Mistral-7B-Instruct and Qwen2.5-7B-Instruct, we measure the accuracy gap between original and paraphrased items on ARC-Easy and ARC-Challenge. Our pipeline controls decoding, enforces a multiple-choice output format, and includes a robust paraphrase-cleaning step to preserve semantics. We find that paraphrasing induces a non-trivial accuracy drop (original vs. paraphrased), consistent with prior concerns about contamination and brittle surface-form shortcuts.
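A minimal sketch of the stress test itself, assuming hypothetical `paraphrase` and `answer` helpers in place of the paper's cleaning pipeline and constrained multiple-choice decoder:

```python
def paraphrase_accuracy_gap(items, paraphrase, answer):
    """items: dicts with 'question', 'choices' (list of strings), 'label' (letter)."""
    n = len(items)
    orig = para = 0
    for it in items:
        options = "\n".join(it["choices"])
        orig += answer(it["question"] + "\n" + options) == it["label"]
        para += answer(paraphrase(it["question"]) + "\n" + options) == it["label"]
    return orig / n - para / n  # a positive gap is consistent with memorization
```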
[24] JAI-1: A Thai-Centric Large Language Model
Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Pontakorn Trakuekul, Sumana Sumanakul, Natchanon Pollertlam
Main category: cs.CL
TL;DR: JAI-1 is a 75B parameter Thai-centric language model that uses an upscaling strategy to preserve original English knowledge while systematically integrating Thai language capabilities, outperforming Typhoon2-70B on Thai benchmarks.
Details
Motivation: Existing Thai models risk eroding pre-existing knowledge when adding Thai-specific information through additional training on existing models. There's a need for a method that preserves general intelligence while effectively integrating new linguistic capabilities.
Method: Upscaling strategy: starting from a smaller high-performing English LLM, expanding parameter space and using new capacity to systematically integrate Thai knowledge. Pre-trained on 1.5T tokens (300B+ Thai), followed by supervised fine-tuning and alignment tuning with 600K+ instruction examples.
Result: Superior performance compared to Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, JAI-Hall-Bench), validating the efficacy of the upscaling and knowledge-integration framework.
Conclusion: The upscaling approach successfully preserves original model knowledge while enabling scalable Thai language integration, creating a unique architecture distinct from other open-source models.
Abstract: This technical report introduces JAI-1, a Thai-centric language model with 75B parameters. Recent Thai models have primarily relied on existing open-source models, applying additional training without structural modifications to specialize in Thai. However, this approach risks eroding pre-existing knowledge in the model’s parameter space during the injection of Thai-specific information, as optimized parameters for general tasks may conflict with new linguistic requirements. In contrast, JAI-1 adopts an upscaling strategy: starting from a smaller, high-performing English open-source LLM, we expanded its parameter space and utilized the newly allocated capacity to systematically integrate Thai-language knowledge. This methodology not only preserves the original model’s general intelligence but also establishes a unique architecture distinct from other open-source models, enabling scalable future enhancements. During pre-training, JAI-1 was exposed to 1.5T tokens, including over 300B Thai language tokens. This was followed by post-training stages – supervised fine-tuning and alignment tuning – using more than 600K instruction-based examples. The final model demonstrated superior performance compared to Typhoon2-70B on Thai-centric benchmarks (IFEval-TH, MT-Bench-TH, and JAI-Hall-Bench), validating the efficacy of its upscaling and knowledge-integration framework.
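The abstract does not spell out the upscaling recipe. Purely as a hypothetical illustration, depth duplication (as used in SOLAR-style depth up-scaling) is one common way to expand a trained model's parameter space while starting the new capacity from trained weights; nothing here should be read as JAI-1's actual scheme.

```python
import copy

def upscale_depth(layers, n_extra):
    """Grow a trained stack of transformer blocks by duplicating the topmost
    blocks (assumes n_extra <= len(layers)); new capacity starts from
    trained weights rather than random initialization."""
    donors = layers[-n_extra:]
    return list(layers) + [copy.deepcopy(block) for block in donors]
```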
[25] From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents
Wen-Yu Chang, Tzu-Hung Huang, Chih-Ho Chen, Yun-Nung Chen
Main category: cs.CL
TL;DR: A sales-oriented dialogue agent that adapts based on user profiles (age, gender, occupation) shows occupation has the strongest influence on conversational intent. A lightweight occupation-conditioned strategy improves dialogue effectiveness.
Details
Motivation: Realistic user-simulator studies are essential for tuning effective conversation strategies in agentic dialogue models, particularly for sales-oriented applications.
Method: Introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences based on user profiles spanning age, gender, and occupation.
Result: Occupation produces the most pronounced differences in conversational intent compared to age and gender. The occupation-conditioned strategy results in shorter and more successful dialogues.
Conclusion: Rich simulator profiles are important, and simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.
Abstract: Amid the rapid rise of agentic dialogue models, realistic user-simulator studies are essential for tuning effective conversation strategies. This work investigates a sales-oriented agent that adapts its dialogue based on user profiles spanning age, gender, and occupation. While age and gender influence overall performance, occupation produces the most pronounced differences in conversational intent. Leveraging this insight, we introduce a lightweight, occupation-conditioned strategy that guides the agent to prioritize intents aligned with user preferences, resulting in shorter and more successful dialogues. Our findings highlight the importance of rich simulator profiles and demonstrate how simple persona-informed strategies can enhance the effectiveness of sales-oriented dialogue systems.
[26] Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories
Francesco Dente, Fabiano Dalpiaz, Paolo Papotti
Main category: cs.CL
TL;DR: Text2Stories introduces metrics for evaluating how well software requirements (user stories) align with stakeholder needs from interview transcripts, measuring correctness and completeness through automated text-to-story alignment.
Details
Motivation: Current evaluation of whether LLM-generated requirements faithfully reflect stakeholder needs remains largely manual, creating a need for automated alignment assessment.
Method: Segments transcripts into chunks and frames alignment as a matching problem between chunks and stories, using LLM-based matchers and embedding models for blocking.
Result: LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, outperforming embedding models alone, which still enable effective blocking.
Conclusion: Text2Stories provides scalable, source-faithful metrics that complement existing user-story quality criteria and enable comparison across different story sets.
Abstract: Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders’ needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
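The two metrics fall directly out of the chunk-story matching formulation; a minimal sketch, assuming a boolean `matches(chunk, story)` verdict from the LLM-based matcher:

```python
def text2stories_metrics(chunks, stories, matches):
    supported = sum(any(matches(c, s) for c in chunks) for s in stories)
    covered = sum(any(matches(c, s) for s in stories) for c in chunks)
    correctness = supported / len(stories)  # stories backed by the transcript
    completeness = covered / len(chunks)    # transcript covered by some story
    return correctness, completeness
```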
[27] PARSE: LLM Driven Schema Optimization for Reliable Entity Extraction
Anubhav Shrimal, Aryan Jain, Soumyajit Chowdhury, Promod Yenigalla
Main category: cs.CL
TL;DR: PARSE is a system that autonomously optimizes JSON schemas for LLM-based information extraction, improving accuracy by up to 64.7% and reducing errors by 92% while maintaining backward compatibility.
Details
Motivation: Current approaches treat JSON schemas as static contracts designed for humans, leading to suboptimal extraction performance, hallucinations, and unreliable agent behavior when schemas are ambiguous or incomplete.
Method: PARSE has two components: ARCHITECT autonomously optimizes JSON schemas for LLM consumption with backward compatibility through RELAY code generation, and SCOPE implements reflection-based extraction with combined static and LLM-based guardrails.
Result: Achieved up to 64.7% improvement in extraction accuracy on SWDE, with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.
Conclusion: PARSE demonstrates that JSON schemas can be systematically improved for LLM consumption, enabling more reliable structured information extraction from unstructured text for autonomous agent systems.
Abstract: Structured information extraction from unstructured text is critical for emerging Software 3.0 systems where LLM agents autonomously interact with APIs and tools. Recent approaches apply large language models directly to extraction tasks using existing JSON schemas, often with constrained decoding or reinforcement learning approaches to ensure syntactic validity, but treat JSON schemas as static contracts designed for human developers, leading to suboptimal extraction performance, frequent hallucinations, and unreliable agent behavior when schemas contain ambiguous or incomplete specifications. We recognize that JSON schemas are themselves a form of natural language understanding contract, encoding rules, relationships, and expectations about data structure that LLMs should be able to both interpret and systematically improve. Consequently, we develop PARSE (Parameter Automated Refinement and Schema Extraction), a novel system with two synergistic components: ARCHITECT, which autonomously optimizes JSON schemas for LLM consumption while maintaining backward compatibility through RELAY (an integrated code generation system), and SCOPE, which implements reflection-based extraction with combined static and LLM-based guardrails. We evaluate PARSE qualitatively and quantitatively on three datasets including Schema-Guided Dialogue (SGD), Structured Web Data Extraction (SWDE), and internal retail conversation data, and find that it achieves up to 64.7% improvement in extraction accuracy on SWDE with combined framework improvements reaching 10% across models, while reducing extraction errors by 92% within the first retry and maintaining practical latency.
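As a hedged illustration of what reflection-based extraction with combined guardrails can look like (the helpers `llm_extract` and `llm_verify` are hypothetical, not the paper's API):

```python
import json

def extract_with_guardrails(text, schema, llm_extract, llm_verify, max_retries=2):
    """Retry until static checks (valid JSON, required keys) and an LLM-based
    faithfulness check both pass; returns None if every attempt fails."""
    for _ in range(max_retries + 1):
        raw = llm_extract(text, schema)
        try:
            record = json.loads(raw)  # static guardrail: syntactic validity
        except json.JSONDecodeError:
            continue
        if all(k in record for k in schema.get("required", [])) and llm_verify(text, record):
            return record  # LLM guardrail: output grounded in the source text
    return None
```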
[28] Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B
Nisar Ahmed, Muhammad Imran Zaman, Gulshan Saleem, Ali Hassan
Main category: cs.CL
TL;DR: Evaluation-focused prompts inflate LLM performance metrics by encouraging visible reasoning and strict formatting, but don’t consistently improve actual capability. A/B testing shows evaluation framing increases CoT length and reduces answer compliance without reliable accuracy gains.
Details
Motivation: To investigate whether "evaluation scent" in LLM benchmarks inflates measured performance without corresponding real-world capability improvements, since real deployments require terse, contract-bound answers rather than rubric-scented prompts.
Method: Used GPT-OSS-20B model in six paired A/B scenarios varying framing (evaluation vs real-world) and reasoning depth. Tested deterministic math, code-fix, citation generation, incentive flips, CoT visibility, and multilingual headers with deterministic validators measuring accuracy, compliance, and other metrics.
Result: Evaluation framing reliably increased CoT length (hundreds to >1000 characters) and reduced answer-only compliance, with limited/inconsistent accuracy gains. Improved formatting but not substance. Incentive wording affected error composition - caution improved accuracy at high reasoning while competence yielded riskier outputs.
Conclusion: Benchmark gains may not reflect deployable capability. Recommendations include neutral phrasing, contract-aware grading, style-delta reporting, and multilingual dashboards to ensure measured improvements translate to real-world performance.
Abstract: Benchmarks for large language models (LLMs) often rely on rubric-scented prompts that request visible reasoning and strict formatting, whereas real deployments demand terse, contract-bound answers. We investigate whether such “evaluation scent” inflates measured performance without commensurate capability gains. Using a single open-weights model (GPT-OSS-20B), we run six paired A/B scenarios that hold task content and decoding fixed while varying framing (evaluation-oriented vs. real-world) and reasoning depth (Medium/High): deterministic math, strict code-fix, citation generation, incentive flips (caution vs. competence), CoT visibility, and multilingual (Urdu) headers. Deterministic validators compute accuracy, answer-only compliance, hedging/refusals, chain-of-thought (CoT) length, and schema compliance, with pre-registered deltas and composite indices. Across scenarios, evaluation framing reliably inflates CoT (hundreds to >1000 characters) and reduces answer-only compliance, with limited or inconsistent accuracy gains. In structured outputs, it improves wrappers (e.g., fenced blocks, enumerated lists) but not regex-validated substance. Incentive wording reweights error composition: praising caution modestly improves accuracy at high reasoning and reduces wrong-but-confident errors, whereas praising competence yields terser but riskier outputs. Urdu rubric headers reproduce these signatures and can decrease accuracy at higher reasoning depth, indicating multilingual parity risks. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts; versioned DOI) and practical guidance: neutral phrasing or dual-framing checks, contract-aware grading, style-delta reporting, confidence governance, and multilingual dashboards to ensure that benchmark gains reflect deployable capability.
[29] From What to Why: Thought-Space Recommendation with Small Language Models
Prosenjit Biswas, Pervez Shaik, Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar
Main category: cs.CL
TL;DR: PULSE is a framework that uses SLM-generated rationales as direct learning signals to create a common understanding across domains called Thought Space, outperforming traditional recommendation models.
Details
Motivation: LLMs have high inference costs for real-world deployment, while SLMs' reasoning capabilities for recommendation are underexplored. Existing systems fail to fully leverage natural language rationales as learning signals.
Method: PULSE treats SLM-generated rationales as first-class learning signals, supervising them with interaction histories to jointly model user actions and their semantic drivers, creating a Thought Space across domains.
Result: PULSE outperforms leading ID, Collaborative Filtering, and LLM-based sequential recommendation models across multiple benchmark datasets, and shows superior transferability in cross-domain recommendation.
Conclusion: The novel approach of treating rationales as direct learning signals yields more robust and generalizable embeddings, demonstrating strong performance in both recommendation and downstream reasoning tasks.
Abstract: Large Language Models (LLMs) have advanced recommendation capabilities through enhanced reasoning, but pose significant challenges for real-world deployment due to high inference costs. Conversely, while Small Language Models (SLMs) offer an efficient alternative, their reasoning capabilities for recommendation remain underexplored. Existing systems often use natural language rationales merely as unsupervised descriptive text, failing to harness their full potential as learning signals. In this work our main idea is to create a common understanding of users and items across multiple domains, called Thought Space, with SLMs instead of using LLMs’ distilled knowledge. To that end we propose PULSE (Preference Understanding by Latent Semantic Embeddings), a framework that treats SLM-generated rationales as direct learning signals, supervising them with interaction histories to jointly model user actions (what) and their semantic drivers (why). Existing methods consider only interactions such as sequences and embeddings, whereas PULSE treats rationales as first-class signals; this novel design yields embeddings that are more robust and generalizable. Extensive experiments demonstrate that PULSE outperforms leading ID, Collaborative Filtering (CF), and LLM-based sequential recommendation models across multiple benchmark datasets. Furthermore, PULSE exhibits superior transferability in cross-domain recommendation and demonstrates strong performance on downstream tasks such as reasoning-oriented question answering. Our code is available at https://anonymous.4open.science/r/Thinking_PULSE-0FC5/README.md.
[30] ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Jingbiao Mei, Mingsheng Sun, Jinghong Chen, Pengda Qin, Yuhong Li, Da Chen, Bill Byrne
Main category: cs.CL
TL;DR: ExPO-HM is a new approach for hateful meme detection that combines explanation generation with detection, achieving state-of-the-art performance by addressing limitations of previous methods through policy optimization and reasoning quality metrics.
Details
Motivation: Current hateful meme detection systems provide only binary predictions without explanations, failing to support real-world moderation needs. Recent explain-then-detect approaches underperform simple baselines due to missing policy-relevant cues and insufficient reward signals.
Method: ExPO-HM uses SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality, inspired by human annotator training processes.
Result: Achieves state-of-the-art performance across three hateful meme benchmarks with up to 15% and 17% F1 improvement over GRPO and DPO baselines, excelling in binary detection, fine-grained classification, and reasoning quality.
Conclusion: ExPO-HM successfully moves hateful meme detection from simple binary alarms to explanation-driven detection, providing accurate, interpretable, and actionable moderation support.
Abstract: Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
[31] Next Semantic Scale Prediction via Hierarchical Diffusion Language Models
Cai Zhou, Chenyu Wang, Dinghuai Zhang, Shangyuan Tong, Yifei Wang, Stephen Bates, Tommi Jaakkola
Main category: cs.CL
TL;DR: HDLM is a novel discrete diffusion model for language modeling that uses hierarchical vocabulary mapping from detailed to coarse-grained semantics, enabling progressive semantic refinement during generation.
Details
Motivation: To develop a more effective discrete diffusion model for language modeling that can handle semantic hierarchies and provide better text generation quality than existing approaches.
Method: Uses hierarchical vocabulary with surjective mapping between low-level detailed tokens and high-level coarse tokens. Forward process perturbs tokens to higher-level ancestors, reverse process progressively predicts more detailed semantics. Includes closed-form ELBO derivation and practical training techniques.
Result: Extensive experiments show HDLM achieves consistently lower validation and generative perplexity than baseline models, demonstrating superior text generation performance.
Conclusion: HDLM provides an effective hierarchical diffusion framework for language modeling that outperforms existing methods and offers a general time-varying semantic scale prediction approach.
Abstract: In this paper we introduce Hierarchical Diffusion Language Models (HDLM) – a novel family of discrete diffusion models for language modeling. HDLM builds on a hierarchical vocabulary where low-level tokens with detailed semantics are surjectively mapped to high-level tokens with coarse-grained meanings. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantics. Taken together, HDLM provides a general time-varying next semantic scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO), and show that HDLM can be implemented in a flexible manner while including the existing MDLM as a special case. We also propose practical training techniques based on the insights. Extensive text generation experiments validate the effectiveness of HDLM, which demonstrates consistently lower validation and generative perplexity than baselines.
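A toy sketch of the forward process on a two-level hierarchy; the `PARENT` map and the schedule `alpha(t)` are illustrative stand-ins for the paper's vocabulary hierarchy and scheduler:

```python
import random

# Fine tokens map surjectively onto coarse semantic classes.
PARENT = {"poodle": "DOG", "beagle": "DOG", "tabby": "CAT", "siamese": "CAT"}

def forward_perturb(tokens, t, alpha=lambda t: 1.0 - t):
    """Independently replace each token by its coarser ancestor with
    probability 1 - alpha(t): larger t gives a more abstract sequence."""
    return [tok if random.random() < alpha(t) else PARENT[tok] for tok in tokens]

print(forward_perturb(["poodle", "tabby"], t=0.8))  # mostly coarse classes
```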
[32] Upfront Chain-of-Thought: A Cooperative Framework for Chain-of-Thought Compression
Chengzhengxu Li, Xiaoming Liu, Zhaohan Zhang, Shaochu Zhang, Shengchao Liu, Guoxin Ma, Yu Lan, Chao Shen
Main category: cs.CL
TL;DR: UCoT is an efficient reasoning framework that uses upfront thought embeddings to compress Chain-of-Thought reasoning, reducing token usage by 50% while improving performance.
Details
Motivation: Long Chain-of-Thought reasoning suffers from high computational costs and latency in LLMs. Existing CoT compression methods either require manual prompt engineering or sacrifice reasoning details in compressed datasets.
Method: A cooperative workflow with a small compressor model that generates upfront thought embeddings, and a large executor model that uses these embeddings to produce short reasoning. Two-stage training: first stage trains compressor, second stage optimizes executor with reward mechanism.
Result: Reduces token usage on GSM8K dataset by 50% while achieving 3.08% higher performance than state-of-the-art methods. Maintains powerful reasoning ability while significantly shortening CoT length.
Conclusion: UCoT provides an effective automated approach for CoT compression that improves reasoning efficiency without sacrificing performance, addressing key limitations of previous methods.
Abstract: Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), while long CoT suffers from high computational costs and significant latency owing to the autoregressive nature of generative LLMs. CoT compression aims to improve efficiency in the reasoning process by reducing output length. Previous works pursue reasoning efficiency through either laborious discrete prompt design or the construction of external compressed CoT datasets that sacrifice key reasoning details. In this work, we propose Upfront CoT (UCoT): an efficient reasoning framework with upfront thought embedding to automate CoT compression. UCoT is a cooperative workflow involving a small model (compressor) and a large model (executor). The first stage of UCoT trains the compressor to generate upfront thought embeddings rich in reasoning information for the executor, avoiding the drawbacks of manually designed prompts. The second stage optimizes the executor to derive the correct answer with short reasoning from the upfront thought embeddings, using a reward mechanism. Extensive experiments show that UCoT maintains the powerful reasoning ability of the executor while significantly reducing the length of the CoT. It is worth mentioning that when applying UCoT to the Qwen2.5-7B-Instruct model, token usage on the GSM8K dataset is reduced by 50%, while the performance is 3.08% higher than that of the state-of-the-art (SOTA) method. The code and dataset are in the supplementary material.
[33] Formalizing Style in Personal Narratives
Gustave Cortal, Alain Finkel
Main category: cs.CL
TL;DR: A framework for analyzing style in personal narratives by formalizing linguistic choices using functional linguistics, computer science methods, and psychological observations, demonstrated through dream narrative analysis.
Details
Motivation: There is a lack of formal framework for systematically analyzing stylistic choices in personal narratives, which are fundamental to conveying subjective experiences.
Method: Integrates functional linguistics (language as meaningful choices), computer science (automatic extraction of sequential patterns), and psychology. Uses language models to extract linguistic features like processes, participants, and circumstances.
Result: Applied to hundreds of dream narratives, including a PTSD war veteran case study. Analysis revealed distinctive patterns, particularly dominance of verbal processes over mental ones, showing relationship between linguistic choices and psychological states.
Conclusion: The framework successfully formalizes style in personal narratives and demonstrates how linguistic patterns can reveal psychological states, providing a systematic approach for analyzing subjective experiences through language.
Abstract: Personal narratives are stories authors construct to make meaning of their experiences. Style, the distinctive way authors use language to express themselves, is fundamental to how these narratives convey subjective experiences. Yet there is a lack of a formal framework for systematically analyzing these stylistic choices. We present a novel approach that formalizes style in personal narratives as patterns in the linguistic choices authors make when communicating subjective experiences. Our framework integrates three domains: functional linguistics establishes language as a system of meaningful choices, computer science provides methods for automatically extracting and analyzing sequential patterns, and these patterns are linked to psychological observations. Using language models, we automatically extract linguistic features such as processes, participants, and circumstances. We apply our framework to hundreds of dream narratives, including a case study on a war veteran with post-traumatic stress disorder. Analysis of his narratives uncovers distinctive patterns, particularly how verbal processes dominate over mental ones, illustrating the relationship between linguistic choices and psychological states.
[34] A Novel Framework for Augmenting Rating Scale Tests with LLM-Scored Text Data
Joe Watson, Ivan O’Conner, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell
Main category: cs.CL
TL;DR: This study introduces a novel framework that combines LLM-scored text with traditional rating scales to create augmented psychological tests, demonstrating improved measurement precision and accuracy for depression assessment.
Details
Motivation: Traditional psychological assessments rely on structured rating scales that miss the rich nuance of natural language, while recent LLM advances offer potential to harness qualitative data for enhanced measurement.
Method: Developed a framework combining LLM-scored text and traditional rating-scale items, tested on real-world sample (n=693) and synthetic dataset (n=3,000), using empirical selection of informative LLM scoring instructions based on item information calculations.
Result: Augmented tests achieved statistically significant improvements in measurement precision and accuracy, with LLM items providing information gain equivalent to adding 6.3-16.0 items to the original 19-item test.
Conclusion: The framework represents a conceptual shift in automated scoring that bypasses typical bottlenecks, providing a scalable approach to enhance traditional psychometric measures using transcribed text, with potential utility in clinical health and beyond.
Abstract: Psychological assessments typically rely on structured rating scales, which cannot incorporate the rich nuance of a respondent’s natural language. This study leverages recent LLM advances to harness qualitative data within a novel conceptual framework, combining LLM-scored text and traditional rating-scale items to create an augmented test. We demonstrate this approach using depression as a case study, developing and assessing the framework on a real-world sample of upper secondary students (n=693) and corresponding synthetic dataset (n=3,000). On held-out test sets, augmented tests achieved statistically significant improvements in measurement precision and accuracy. The information gain from the LLM items was equivalent to adding between 6.3 (real data) and 16.0 (synthetic data) items to the original 19-item test. Our approach marks a conceptual shift in automated scoring that bypasses its typical bottlenecks: instead of relying on pre-labelled data or complex expert-created rubrics, we empirically select the most informative LLM scoring instructions based on calculations of item information. This framework provides a scalable approach for leveraging the growing stream of transcribed text to enhance traditional psychometric measures, and we discuss its potential utility in clinical health and beyond.
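The abstract does not name the IRT model behind the item-information calculations; under a standard two-parameter logistic (2PL) assumption, selecting the most informative LLM scoring instruction might look like this (parameters are made up):

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a**2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Rank candidate LLM scoring instructions by information near theta = 0.
candidates = {"instruction_A": (1.8, -0.3), "instruction_B": (0.9, 0.5)}  # (a, b)
best = max(candidates, key=lambda k: item_information(0.0, *candidates[k]))
print(best)  # -> instruction_A
```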
[35] dInfer: An Efficient Inference Framework for Diffusion Language Models
Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
Main category: cs.CL
TL;DR: dInfer is an efficient inference framework for diffusion-based large language models that achieves 10x speedup over prior systems and 2-3x speedup over optimized autoregressive models while maintaining output quality.
Details
Motivation: Diffusion-based LLMs offer inherent parallelism but lack standardized and efficient inference frameworks, limiting their widespread adoption despite increasing open-source availability.
Method: dInfer decomposes inference into four modular components (model, diffusion iteration manager, decoding strategy, KV-cache manager) with novel algorithms and system-level optimizations for each component.
Result: Achieves over 1,100 tokens/sec on HumanEval and averages 800+ tokens/sec across six benchmarks on 8×H800 GPUs, with 10x speedup over Fast-dLLM and 2-3x speedup over optimized AR models like QWen2.5-3B.
Conclusion: dInfer provides an efficient and extensible inference framework for diffusion LLMs that significantly outperforms existing systems while maintaining model performance, enabling broader adoption of diffusion-based language models.
Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. More and more open-source dLLMs are emerging, yet their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, and KV-cache manager) and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8×H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even compared with QWen2.5-3B, an AR model with a comparable number of activated parameters and comparable performance, highly optimized with the latest vLLM inference engine, dInfer still delivers a 2-3× speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
[36] Scaling Laws for Code: A More Data-Hungry Regime
Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che
Main category: cs.CL
TL;DR: This paper conducts the first large-scale empirical study of scaling laws for Code LLMs, revealing that code requires higher data-to-parameter ratios than natural language and follows the more expressive Farseer law.
Details
Motivation: Current scaling laws are primarily analyzed on natural language, but fundamental differences like strict syntax between code and NL make it unclear if these laws apply directly to code LLMs.
Method: Conducted 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B, fitting both Chinchilla law and Farseer law, plus additional experiments on code-NL mixtures.
Result: Code LLMs scale effectively with model size, but code is more data-hungry requiring higher data-to-parameter ratios than NL. The Farseer law offers greater accuracy than Chinchilla law for code.
Conclusion: Code follows different scaling patterns than natural language, with the Farseer law being more suitable. NL data benefits resource-constrained scenarios but becomes detrimental at higher compute budgets.
Abstract: Code Large Language Models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given fundamental differences like strict syntax between code and NL, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
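As a sketch of what fitting the Chinchilla form involves, the snippet below fits L(N, D) = E + A/N^alpha + B/D^beta to synthetic runs spanning the paper's size and token ranges; the generating constants are the published Chinchilla estimates, used here only to produce toy targets:

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla(x, E, A, alpha, B, beta):
    """Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) runs over 0.2B-3.8B params and 2B-128B tokens.
N = np.array([0.2e9, 0.5e9, 1.0e9, 3.8e9] * 3)
D = np.repeat([2e9, 16e9, 128e9], 4)
loss = chinchilla((N, D), 1.69, 406.4, 0.34, 410.7, 0.28)

params, _ = curve_fit(chinchilla, (N, D), loss,
                      p0=[1.5, 300.0, 0.30, 300.0, 0.25], maxfev=50000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], params)))
```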
[37] Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning
Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley
Main category: cs.CL
TL;DR: This paper proposes a formal framework to evaluate LLMs’ capability in legal case-based reasoning, revealing a paradox where models perform well on surface-level tasks but fail at hierarchical and integrated reasoning despite using more computational resources.
Details
Motivation: To investigate LLMs' proficiency in complex legal reasoning involving case analogies and distinctions, which is crucial for legal practice but not well understood.
Method: A three-stage reasoning framework that decomposes legal case analysis: modeling cases with factual predicates (factors), organizing them hierarchically, and defining verifiable rules for identifying distinctions and evaluating their significance.
Result: LLMs achieve high accuracy on surface-level reasoning (Task 1) but performance degrades significantly on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Models consistently use more computational resources for incorrect responses.
Conclusion: LLMs have fundamental limitations in complex legal reasoning, with “thinking longer” not equating to “thinking smarter,” highlighting the need for addressing these limitations for trustworthy legal AI.
Abstract: Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three-stage reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than correct ones, suggesting that “thinking longer” does not always mean “thinking smarter.” Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
[38] How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Xianzhen Luo, Jinyang Huang, Wenzhen Zheng, Qingfu Zhu, Mingzheng Xu, Yiheng Xu, Yuantao Fan, Libo Qin, Wanxiang Che
Main category: cs.CL
TL;DR: The paper introduces TC-Bench, a compact and diverse benchmark for evaluating LLM-generated test cases by formalizing benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix to address issues like computational costs and score inflation in existing benchmarks.
Details
Motivation: Existing benchmarks for evaluating LLM-generated test cases suffer from high computational costs, score inflation, and bias towards trivial bugs over rare critical faults, creating a need for a more efficient and representative evaluation framework.
Method: The authors formalize benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix, propose WrongSelect algorithm to select maximally diverse wrong codes, and apply this framework to millions of competitive programming submissions to construct TC-Bench.
Result: TC-Bench is a compact, diverse, and inflation-resistant benchmark where even the most advanced test case generation methods achieve only ~60% exclusion rates, exposing a significant gap in their diagnostic power.
Conclusion: The proposed framework successfully addresses limitations of existing benchmarks by providing a mathematically grounded approach to construct minimal yet comprehensive evaluation benchmarks for LLM-generated test cases, revealing substantial room for improvement in current methods.
Abstract: Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. In this work, we ask two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
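A toy rendering of the binary-matrix view: `gf2_rank` computes the number of independent error patterns, and the greedy farthest-point step is an illustrative stand-in for WrongSelect, not the paper's algorithm.

```python
import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) via Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(rank, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(M.shape[0]):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]
        rank += 1
    return rank

def greedy_diverse_rows(M, k):
    """Farthest-point heuristic: pick k mutually distant rows (Hamming distance)."""
    chosen = [0]
    while len(chosen) < k:
        dists = [min(int(np.sum(M[i] != M[j])) for j in chosen) for i in range(len(M))]
        chosen.append(int(np.argmax(dists)))
    return chosen

# Rows = wrong codes, columns = test cases; M[i, j] = 1 iff test j catches code i.
M = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]], dtype=np.uint8)
k = gf2_rank(M)  # minimal number of independent error patterns
print(k, greedy_diverse_rows(M, k))
```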
[39] How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta
Main category: cs.CL
TL;DR: Micro-benchmarking with small subsets of benchmarks often fails to rank models consistently compared to full benchmarks, and random sampling can be competitive with existing methods when using 250+ examples.
Details
Motivation: To address the time and cost constraints of language model development by evaluating on small benchmark subsets, but questioning whether micro-benchmarks can reliably rank models like full benchmarks.
Method: Introduced a meta-evaluation measure that analyzes how well micro-benchmarks rank models based on performance differences on full benchmarks, testing various micro-benchmark sizes and methods.
Result: Found that no micro-benchmarking method can consistently rank model pairs with small performance differences (3.5 points on MMLU-Pro, 4 points on BIG-bench Hard), and 250+ examples are often needed where random sampling becomes competitive.
Conclusion: Micro-benchmarking requires careful size selection (often 250+ examples) for reliable model rankings, and existing methods may not outperform random sampling at larger sizes, providing guidance for balancing evaluation efficiency and reliability.
Abstract: Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark. This approach can determine which model pairs can be ranked correctly by a micro-benchmark, allowing for a finer-grained analysis of the trade-off between micro-benchmark size and reliability. Prior work has suggested selecting as few as 10 examples; we find that no micro-benchmarking method can consistently rank model pairs 3.5 points of accuracy apart on MMLU-Pro or 4 points apart on BIG-bench Hard. In order to consistently rank model pairs with relatively similar performances, we show that often as many as 250 examples must be selected, at which point random sampling is competitive with existing micro-benchmarking methods. When comparing only 8B instruction-tuned models on MMLU-Pro micro-benchmarks with 25 examples, we find that more than half of pairwise comparisons are not likely to be preserved. Our work provides actionable guidance for both micro-benchmark users and developers in navigating the trade-off between evaluation efficiency and reliability.
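A minimal sketch of the underlying meta-evaluation question: how often does a random size-k subset rank two models the same way the full benchmark does? All inputs are illustrative per-example correctness vectors.

```python
import random

def rank_preservation_rate(scores_a, scores_b, k, trials=2000, seed=0):
    """Fraction of random size-k subsets that preserve the full-benchmark ordering."""
    rng = random.Random(seed)
    n = len(scores_a)
    full_order = sum(scores_a) > sum(scores_b)
    hits = 0
    for _ in range(trials):
        idx = rng.sample(range(n), k)
        sub_order = sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx)
        hits += (sub_order == full_order)
    return hits / trials
```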
[40] Coordinates from Context: Using LLMs to Ground Complex Location References
Tessa Masis, Brendan O’Connor
Main category: cs.CL
TL;DR: This paper proposes an LLM-based approach for geocoding compositional location references, showing that fine-tuned smaller LLMs can match the performance of larger off-the-shelf models.
Details
Motivation: Geocoding is essential for analyzing unstructured text, and the paper focuses on the challenging task of geocoding compositional location references where multiple location components need to be reasoned about together.
Method: The authors evaluate LLMs’ geospatial knowledge and reasoning skills, then propose an LLM-based strategy specifically designed for geocoding compositional location references.
Result: The proposed approach improves performance for the geocoding task, and a relatively small fine-tuned LLM achieves comparable performance with much larger off-the-shelf models.
Conclusion: Fine-tuned smaller LLMs can effectively handle compositional location reference geocoding, providing a practical and efficient alternative to larger models while maintaining performance.
Abstract: Geocoding is the task of linking a location reference to an actual geographic location and is essential for many downstream analyses of unstructured text. In this paper, we explore the challenging setting of geocoding compositional location references. Building on recent work demonstrating LLMs’ abilities to reason over geospatial data, we evaluate LLMs’ geospatial knowledge versus reasoning skills relevant to our task. Based on these insights, we propose an LLM-based strategy for geocoding compositional location references. We show that our approach improves performance for the task and that a relatively small fine-tuned LLM can achieve comparable performance with much larger off-the-shelf models.
[41] Measuring Moral LLM Responses in Multilingual Capacities
Kimaya Basu, Savi Kolari, Allison Yu
Main category: cs.CL
TL;DR: Evaluation of frontier and open-source LLMs across five dimensions in multilingual contexts shows GPT-5 performs best overall, while other models display inconsistency across languages and categories, particularly in Consent & Autonomy and Harm Prevention & Safety.
Details
Motivation: With widespread LLM usage across countries and languages, there's a growing need to understand and guardrail their multilingual responses across different dimensions.
Method: Evaluated responses of frontier and leading open-source models across five dimensions in low and high-resource languages using a five-point grading rubric and a judge LLM.
Result: GPT-5 performed best overall, with highest scores in Consent & Autonomy (3.56) and Harm Prevention & Safety (4.73), while Gemini 2.5 Pro scored lowest (1.39 and 1.98 respectively). Other models showed inconsistency across languages and categories.
Conclusion: Findings emphasize the need for further testing on how linguistic shifts impact LLM responses across categories and improvement in these areas.
Abstract: With LLM usage becoming widespread across countries, languages, and humanity more broadly, the need to understand and guardrail their multilingual responses increases. Large-scale datasets for testing and benchmarking have been created to evaluate and facilitate LLM responses across multiple dimensions. In this study, we evaluate the responses of frontier and leading open-source models in five dimensions across low and high-resource languages to measure LLM accuracy and consistency across multilingual contexts. We evaluate the responses using a five-point grading rubric and a judge LLM. Our study shows that GPT-5 performed the best on average in each category, while other models displayed more inconsistency across language and category. Most notably, in the Consent & Autonomy and Harm Prevention & Safety categories, GPT-5 scored the highest with averages of 3.56 and 4.73, while Gemini 2.5 Pro scored the lowest with averages of 1.39 and 1.98, respectively. These findings emphasize the need for further testing on how linguistic shifts impact LLM responses across various categories and for improvement in these areas.
[42] Learning What to Remember: Adaptive Probabilistic Memory Retention for Memory-Efficient Language Models
S M Rafiuddin, Muntaha Nujat Khan
Main category: cs.CL
TL;DR: Adaptive Retention is a probabilistic token selection method that learns which representations to keep under a global budget, reducing quadratic attention complexity while maintaining performance.
Details
Motivation: Transformer attention scales quadratically with sequence length (O(n²)), which limits practical use of long-context applications.
Method: Uses Bernoulli gates trained via Hard-Concrete/variational relaxation for probabilistic layer-wise token selection, with top-M rule at inference for strict budget enforcement.
Result: Keeping only 30-50% of tokens preserves ≥95% of full-model performance, reduces peak memory by ~35-45%, and improves throughput by up to ~1.8x across classification, QA, and summarization tasks.
Conclusion: This architecture-agnostic approach provides practical long-context efficiency without modifying base attention mechanisms or task heads.
Abstract: Transformer attention scales quadratically with sequence length O(n^2), limiting long-context use. We propose Adaptive Retention, a probabilistic, layer-wise token selection mechanism that learns which representations to keep under a strict global budget M. Retention is modeled with Bernoulli gates trained via a Hard-Concrete/variational relaxation and enforced with a simple top-M rule at inference, making the method differentiable and drop-in for standard encoders. Across classification, extractive QA, and long-document summarization, keeping only 30-50% of tokens preserves >= 95% of full-model performance while cutting peak memory by ~35-45% and improving throughput by up to ~1.8x. This architecture-agnostic approach delivers practical long-context efficiency without modifying base attention or task heads.
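The abstract names the ingredients (a Hard-Concrete relaxation during training, a top-M rule at inference), so a sketch following the standard Louizos-style gate is shown below; the wiring and hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch

class TokenGate(torch.nn.Module):
    """Per-token Hard-Concrete retention gate with a top-M rule at inference."""

    def __init__(self, hidden, gamma=-0.1, zeta=1.1, beta=0.66):
        super().__init__()
        self.scorer = torch.nn.Linear(hidden, 1)  # per-token retention logit
        self.gamma, self.zeta, self.beta = gamma, zeta, beta

    def forward(self, h, budget_m):
        log_alpha = self.scorer(h).squeeze(-1)  # (batch, seq)
        if self.training:
            # Stochastic relaxed gate: stretch a Concrete sample beyond [0, 1],
            # then clip, so exact zeros and ones occur with nonzero probability.
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
            z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)
        else:
            # Deterministic top-M: keep exactly budget_m tokens per sequence.
            z = torch.zeros_like(log_alpha)
            z.scatter_(-1, log_alpha.topk(budget_m, dim=-1).indices, 1.0)
        return h * z.unsqueeze(-1), z
```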
[43] Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective
Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang
Main category: cs.CL
TL;DR: CCMOR is a Chinese commonsense multi-hop reasoning benchmark that evaluates LLMs’ ability to combine Chinese-specific factual knowledge with multi-step logical reasoning, revealing limitations in long-tail knowledge processing.
Details
Motivation: To address the understudied evaluation of LLMs in general Chinese-language contexts and assess their ability to integrate Chinese-specific factual knowledge with multi-step reasoning.
Method: Constructed a domain-balanced seed set from existing QA datasets, developed an LLM-powered pipeline to generate multi-hop questions based on factual unit chains, and implemented human-in-the-loop verification by domain experts.
Result: Evaluation of state-of-the-art LLMs revealed persistent limitations in processing long-tail knowledge and executing knowledge-intensive reasoning. Retrieval-augmented generation significantly improved performance by mitigating knowledge gaps.
Conclusion: The CCMOR benchmark effectively exposes LLMs’ limitations in Chinese commonsense reasoning, and retrieval augmentation provides a promising solution to address knowledge gaps in multi-hop reasoning tasks.
Abstract: While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs’ ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs, demonstrating persistent limitations in LLMs’ ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.
[44] MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
Siddeshwar Raghavan, Tanwi Mallick
Main category: cs.CL
TL;DR: MOSAIC is a multi-agent LLM framework for scientific coding tasks that uses specialized agents in a student-teacher paradigm to handle rigorous, domain-specific scientific workflows without requiring test cases.
Details
Motivation: Scientific coding differs from general-purpose coding by requiring rigorous algorithms, deep domain knowledge, domain-specific reasoning, and algorithm iteration without I/O test cases. Many scientific problems involve sequential subproblems that need to be solved stepwise.
Method: Training-free multi-agent framework with specialized agents for self-reflection, rationale creation, coding, and debugging within a student-teacher paradigm. Uses Consolidated Context Window (CCW) to mitigate LLM hallucinations and facilitates stepwise problem decomposition with targeted error correction.
Result: MOSAIC outperforms existing approaches on scientific coding benchmarks in terms of accuracy, robustness, and interpretability.
Conclusion: The specialized agentic framework effectively addresses the unique challenges of scientific code generation by enabling stepwise problem solving, error correction, and mitigating hallucinations in complex scientific tasks with chained subproblems.
Abstract: We present MOSAIC, a multi-agent Large Language Model (LLM) framework for solving challenging scientific coding tasks. Unlike general-purpose coding, scientific workflows require algorithms that are rigorous, interconnected with deep domain knowledge, and incorporate domain-specific reasoning, as well as algorithm iteration without requiring I/O test cases. Many scientific problems also require a sequence of subproblems to be solved, leading to the final desired result. MOSAIC is designed as a training-free framework with specially designed agents to self-reflect, create the rationale, code, and debug within a student-teacher paradigm to address the challenges of scientific code generation. This design facilitates stepwise problem decomposition, targeted error correction, and, when combined with our Consolidated Context Window (CCW), mitigates LLM hallucinations when solving complex scientific tasks involving chained subproblems. We evaluate MOSAIC on scientific coding benchmarks and demonstrate that our specialized agentic framework outperforms existing approaches in terms of accuracy, robustness, and interpretability.
[45] The Model’s Language Matters: A Comparative Privacy Analysis of LLMs
Abhishek K. Mishra, Antoine Boutet, Lucas Magnana
Main category: cs.CL
TL;DR: This paper investigates how language structure affects privacy leakage in multilingual LLMs, showing that privacy vulnerability scales with linguistic redundancy and tokenization granularity across English, Spanish, French, and Italian.
Details
Motivation: LLMs are increasingly deployed across multilingual applications handling sensitive data, but their scale and linguistic variability introduce major privacy risks that have been mostly evaluated only for English.Method: Quantified six linguistic indicators and evaluated three attack vectors (extraction, counterfactual memorization, membership inference) on LLMs trained on English, Spanish, French, and Italian medical corpora.
Result: Italian exhibited the strongest privacy leakage, English showed higher membership separability, while French and Spanish displayed greater resilience due to higher morphological complexity.
Conclusion: Language structure significantly affects privacy leakage in LLMs, underscoring the need for language-aware privacy-preserving mechanisms in multilingual deployments.
Abstract: Large Language Models (LLMs) are increasingly deployed across multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. While such risks have mostly been evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows higher membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.
[46] Search-on-Graph: Iterative Informed Navigation for Large Language Model Reasoning on Knowledge Graphs
Jia Ao Sun, Hao Yu, Fabrizio Gotti, Fengran Mo, Yihong Wu, Yuchen Hui, Jian-Yun Nie
Main category: cs.CL
TL;DR: Search-on-Graph (SoG) is a framework that enables LLMs to perform iterative graph navigation on knowledge graphs using a single Search function, achieving state-of-the-art performance on KGQA benchmarks without fine-tuning.
Details
Motivation: LLMs struggle with knowledge-intensive multi-hop questions due to missing long-tail facts, hallucinations, and outdated knowledge, while existing KGQA methods face trade-offs between query brittleness, noise from large subgraphs, and exponential search spaces in complex agent frameworks.Method: SoG follows an “observe-then-navigate” principle where LLMs iteratively examine actual available relations from current entities before deciding next hops, using a single Search function that adapts to different KG schemas and handles high-degree nodes through adaptive filtering.
Result: SoG achieves state-of-the-art performance across six KGQA benchmarks on Freebase and Wikidata, with particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) and consistent improvements on Freebase benchmarks.
Conclusion: The Search-on-Graph framework provides an effective solution for LLM-based knowledge graph question answering by enabling iterative informed graph navigation without the limitations of existing methods.
Abstract: Large language models (LLMs) have demonstrated impressive reasoning abilities yet remain unreliable on knowledge-intensive, multi-hop questions – they miss long-tail facts, hallucinate when uncertain, and their internal knowledge lags behind real-world change. Knowledge graphs (KGs) offer a structured source of relational evidence, but existing KGQA methods face fundamental trade-offs: compiling complete SPARQL queries without knowing available relations proves brittle, retrieving large subgraphs introduces noise, and complex agent frameworks with parallel exploration exponentially expand search spaces. To address these limitations, we propose Search-on-Graph (SoG), a simple yet effective framework that enables LLMs to perform iterative informed graph navigation using a single, carefully designed Search function. Rather than pre-planning paths or retrieving large subgraphs, SoG follows an “observe-then-navigate” principle: at each step, the LLM examines actual available relations from the current entity before deciding on the next hop. This approach further adapts seamlessly to different KG schemas and handles high-degree nodes through adaptive filtering. Across six KGQA benchmarks spanning Freebase and Wikidata, SoG achieves state-of-the-art performance without fine-tuning. We demonstrate particularly strong gains on Wikidata benchmarks (+16% improvement over previous best methods) alongside consistent improvements on Freebase benchmarks.
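A minimal sketch of the observe-then-navigate loop under stated assumptions: `kg_search` and `llm_pick` are hypothetical stand-ins, and the hub-node handling and stopping rule shown here are illustrative rather than the paper's exact Search function.

```python
# Sketch of an observe-then-navigate loop; `kg_search` and `llm_pick` are
# hypothetical stand-ins for the paper's single Search function and prompts.

def kg_search(entity: str) -> dict[str, list[str]]:
    """Return {relation: [neighbor entities]} for `entity` from the KG."""
    raise NotImplementedError

def llm_pick(question: str, entity: str, relations: list[str]) -> str | None:
    """Ask the LLM to choose the next relation to follow, or None to stop."""
    raise NotImplementedError

def search_on_graph(question: str, start: str, max_hops: int = 4) -> str:
    entity = start
    for _ in range(max_hops):
        edges = kg_search(entity)       # observe the actually available relations
        if len(edges) > 50:             # high-degree node: a real system would
            edges = dict(list(edges.items())[:50])  # filter adaptively, not truncate
        rel = llm_pick(question, entity, list(edges))
        if rel is None:                 # the LLM judges the answer is reached
            break
        entity = edges[rel][0]          # take one informed hop
    return entity
```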
[47] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
Main category: cs.CL
TL;DR: PE-CoA is a framework using five conversation patterns to create effective multi-turn jailbreaks, achieving state-of-the-art performance across 12 LLMs and 10 harm categories, revealing pattern-specific vulnerabilities and shared failure modes across model families.
Details
Motivation: Existing multi-turn jailbreaking methods rely on heuristic strategies with limited insight into model weaknesses, and the relationship between conversation patterns and vulnerabilities across harm categories is poorly understood.Method: Proposed Pattern Enhanced Chain of Attack (PE-CoA) framework with five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue.
Result: Achieved state-of-the-art performance across twelve LLMs spanning ten harm categories, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics where robustness to one pattern doesn’t generalize to others, and model families share similar failure modes.
Conclusion: Findings highlight limitations of current safety training and indicate the need for pattern-aware defenses to address the discovered vulnerabilities.
Abstract: Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories (like malware generation, harassment, or fraud) through distinct conversational approaches (educational discussions, personal experiences, hypothetical scenarios). Existing multi-turn jailbreaking methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct effective multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles where robustness to one conversational pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available at: https://github.com/Ragib-Amin-Nihal/PE-CoA
[48] Quality Estimation Reranking for Document-Level Translation
Krzysztof Mrozinski, Minji Kang, Ahmed Khota, Vincent Michael Sutanto, Giovanni Gatti De Giacomo
Main category: cs.CL
TL;DR: Document-level quality estimation reranking improves machine translation by selecting the best translations from multiple candidates, achieving significant BLEURT score gains (+2.00 to +5.09) across different models.
Details
Motivation: While QE reranking is effective at sentence level, its application to document-level translation remains underexplored despite the growing importance of document-level MT.Method: Evaluated QE reranking on document-level translation using various learned metrics (SLIDE) and LLM-based metrics (GEMBA-DA), testing with different numbers of candidates (2-32) across decoder-only LLMs and encoder-decoder NMT models.
Result: Best learned metric SLIDE improved BLEURT-20 scores by +2.00 with 2 candidates and +5.09 with 32 candidates. LLM-based GEMBA-DA achieved gains of +1.63 and +4.30 respectively. Gains were smaller with longer inputs but still positive (+2.34 with SLIDE on longest documents).
Conclusion: Document-level QE reranking provides practical value with minimal runtime overhead, demonstrating effectiveness across different translation models and document lengths.
Abstract: Quality estimation (QE) reranking is a form of quality-aware decoding which aims to improve machine translation (MT) by scoring and selecting the best candidate from a pool of generated translations. While known to be effective at the sentence level, its application to the increasingly prominent domain of document-level translation remains underexplored. In this work, we evaluate QE reranking performance on document-level (rather than the typical sentence-level) translation, using various learned and large language model (LLM)-based QE metrics. We find that with our best learned metric, SLIDE, BLEURT-20 scores improve by +2.00 with only two candidates, and by +5.09 with 32, across both decoder-only LLM models and encoder-decoder neural machine translation (NMT) models. Using the best LLM-based metric, GEMBA-DA, gains of +1.63 and +4.30 are achieved under the same conditions. Although gains shrink with longer inputs, reranking with 32 candidates yields improvements of +2.34 (SLIDE) and +1.40 (GEMBA-DA) on our longest documents (512-1024 source tokens). These findings demonstrate the practical value of document-level QE, with minimal runtime overhead given suitable translation models and hardware.
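Reranking itself is simple to state; the sketch below shows best-of-N selection with a document-level QE metric, where `translate` and `qe_score` are placeholder callables (a learned metric such as SLIDE, or an LLM-based one such as GEMBA-DA, would back `qe_score`).

```python
# Minimal sketch of document-level QE reranking; `translate` and `qe_score`
# are placeholders, not the paper's implementation.

def translate(src_doc: str, n: int) -> list[str]:
    """Sample n candidate translations of the whole document."""
    raise NotImplementedError

def qe_score(src_doc: str, hyp_doc: str) -> float:
    """Reference-free quality estimate for a document-level hypothesis."""
    raise NotImplementedError

def qe_rerank(src_doc: str, n_candidates: int = 32) -> str:
    candidates = translate(src_doc, n_candidates)
    # Pick the candidate the QE metric scores highest; the runtime overhead is
    # n_candidates extra passes of the (usually much smaller) QE model.
    return max(candidates, key=lambda hyp: qe_score(src_doc, hyp))
```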
[49] FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Jian-Yun Nie
Main category: cs.CL
TL;DR: FinAuditing is the first benchmark for evaluating LLMs on financial auditing tasks using real XBRL filings, testing semantic, relational, and numerical consistency across structured financial documents.
Details
Motivation: GAAP complexity and XBRL's hierarchical structure make financial auditing hard to automate, and LLMs' ability to reason over structured financial documents remains unexplored.Method: Built FinAuditing benchmark from real US-GAAP XBRL filings with three subtasks: FinSM (semantic consistency), FinRE (relational consistency), FinMR (numerical consistency), plus unified evaluation framework with retrieval, classification, and reasoning metrics.
Result: Zero-shot experiments on 13 LLMs show inconsistent performance across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures.
Conclusion: Current LLMs have systematic limitations in taxonomy-grounded financial reasoning, and FinAuditing provides foundation for developing trustworthy, structure-aware financial intelligence systems.
Abstract: The complexity of the Generally Accepted Accounting Principles (GAAP) and the hierarchical structure of eXtensible Business Reporting Language (XBRL) filings make financial auditing increasingly difficult to automate and verify. While large language models (LLMs) have demonstrated strong capabilities in unstructured text understanding, their ability to reason over structured, interdependent, and taxonomy-driven financial documents remains largely unexplored. To fill this gap, we introduce FinAuditing, the first taxonomy-aligned, structure-aware, multi-document benchmark for evaluating LLMs on financial auditing tasks. Built from real US-GAAP-compliant XBRL filings, FinAuditing defines three complementary subtasks, FinSM for semantic consistency, FinRE for relational consistency, and FinMR for numerical consistency, each targeting a distinct aspect of structured auditing reasoning. We further propose a unified evaluation framework integrating retrieval, classification, and reasoning metrics across these subtasks. Extensive zero-shot experiments on 13 state-of-the-art LLMs reveal that current models perform inconsistently across semantic, relational, and mathematical dimensions, with accuracy drops of up to 60-90% when reasoning over hierarchical multi-document structures. Our findings expose the systematic limitations of modern LLMs in taxonomy-grounded financial reasoning and establish FinAuditing as a foundation for developing trustworthy, structure-aware, and regulation-aligned financial intelligence systems. The benchmark dataset is available at Hugging Face.
[50] Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR
Haomin Zhuang, Yujun Zhou, Taicheng Guo, Yue Huang, Fangxu Liu, Kai Song, Xiangliang Zhang
Main category: cs.CL
TL;DR: The paper introduces a multi-temperature sampling method that applies different temperature settings to reasoning tokens (higher temperature) and knowledge tokens (lower temperature) to enhance exploration while maintaining factual correctness in LLM reasoning tasks.
Details
Motivation: Prior approaches restrict updates but don't explicitly facilitate exploration during token generation. The authors aim to promote exploratory behavior during sampling by distinguishing between reasoning and knowledge tokens.Method: Apply distinct temperature settings for different token types: higher temperatures for reasoning tokens to encourage exploration, and lower temperatures for knowledge tokens to maintain factual correctness. Investigate various multi-temperature scheduling strategies in reinforcement learning contexts.
Result: Empirical evaluations on several reasoning benchmarks demonstrate that the approach significantly enhances the reasoning performance of LLMs.
Conclusion: The multi-temperature sampling method effectively improves LLM reasoning by explicitly promoting exploration during token generation while preserving factual accuracy through differentiated temperature control.
Abstract: Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.
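A toy sketch of the core sampling idea, assuming entropy-thresholded token typing: high-entropy steps are treated as reasoning tokens and sampled with a higher temperature, low-entropy steps as knowledge tokens with a lower one. The threshold and temperature values are illustrative, and the paper's scheduling strategies are not shown.

```python
import torch

def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution."""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

def sample_multi_temperature(
    logits: torch.Tensor,
    tau_reason: float = 1.2,         # higher temperature: explore on reasoning tokens
    tau_know: float = 0.6,           # lower temperature: stay factual on knowledge tokens
    entropy_threshold: float = 2.0,  # assumed cutoff; the paper studies schedules
) -> int:
    # Classify the step by predictive entropy: high-entropy steps count as
    # reasoning tokens, low-entropy steps as knowledge tokens.
    tau = tau_reason if entropy(logits) > entropy_threshold else tau_know
    probs = torch.softmax(logits / tau, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy usage with random logits standing in for a real model's output.
next_id = sample_multi_temperature(torch.randn(32_000))
```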
[51] A Unified Biomedical Named Entity Recognition Framework with Large Language Models
Tengxiao Lv, Ling Luo, Juntao Li, Yanhua Wang, Yuchen Pan, Chao Liu, Yanan Wang, Yan Jiang, Huiyi Lv, Yuanyuan Sun, Jian Wang, Hongfei Lin
Main category: cs.CL
TL;DR: A unified BioNER framework using LLMs that handles nested entities and cross-lingual generalization through symbolic tagging, bilingual joint fine-tuning, and contrastive learning-based entity selection.
Details
Motivation: Existing BioNER methods struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization, limiting their effectiveness in medical information extraction.Method: Reformulate BioNER as text generation with symbolic tagging for flat/nested entities, perform bilingual joint fine-tuning across Chinese/English datasets, and use contrastive learning-based entity selector with boundary-sensitive samples.
Result: Achieves state-of-the-art performance on four benchmark datasets and robust zero-shot generalization across languages on two unseen corpora.
Conclusion: The proposed framework effectively addresses key BioNER challenges and demonstrates strong cross-lingual generalization capabilities.
Abstract: Accurate recognition of biomedical named entities is critical for medical information extraction and knowledge discovery. However, existing methods often struggle with nested entities, entity boundary ambiguity, and cross-lingual generalization. In this paper, we propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs). We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities with explicit boundary annotation. To enhance multilingual and multi-task generalization, we perform bilingual joint fine-tuning across multiple Chinese and English datasets. Additionally, we introduce a contrastive learning-based entity selector that filters incorrect or spurious predictions by leveraging boundary-sensitive positive and negative samples. Experimental results on four benchmark datasets and two unseen corpora show that our method achieves state-of-the-art performance and robust zero-shot generalization across languages. The source codes are freely available at https://github.com/dreamer-tx/LLMNER.
[52] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors
Xin Liu, RunSong Zhao, PengCheng Huang, XinYu Liu, JunYi Xiao, ChunYang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, JingBo Zhu
Main category: cs.CL
TL;DR: SAC is a novel context compression method that directly selects anchor tokens from original context and aggregates contextual information into their KV representations, eliminating the need for autoencoding training and outperforming existing methods.
Details
Motivation: Current context compression methods rely on autoencoding tasks which create a mismatch between reconstruction optimization and actual downstream tasks, weakening features beneficial for real-world usage.Method: SAC selects anchor tokens from original context and aggregates contextual information into their KV representations using anchor embeddings and bidirectional attention modification, eliminating autoencoding training.
Result: SAC consistently outperforms existing context compression methods across various compression ratios, achieving 1 EM improvement at 5x compression over strong baselines on MRQA, with increasing advantages at higher compression ratios.
Conclusion: SAC provides a more effective approach to context compression by directly leveraging contextual tokens rather than relying on autoencoding tasks, demonstrating superior performance across different compression scenarios.
Abstract: Context compression presents a promising approach for accelerating large language model (LLM) inference by compressing long contexts into compact representations. Current context compression methods predominantly rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, compression via autoencoding tasks creates a fundamental mismatch: the models are optimized for a reconstruction objective that diverges from actual downstream tasks, thereby weakening the features most beneficial for real-world usage. We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding-task-based compression to an architecture equipped with this compression capability a priori. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. By deriving representations directly from the contextual tokens, SAC eliminates the need for autoencoding training. To ensure compression performance while directly leveraging anchor tokens, SAC incorporates two key designs: (1) anchor embeddings that enable the compressor to identify critical tokens, and (2) bidirectional attention modification that allows anchor tokens to capture information from the entire context. Experimental results demonstrate that SAC consistently outperforms existing context compression methods across various compression ratios. On out-of-distribution evaluation using MRQA, SAC achieves 1 EM improvement at 5x compression over strong baselines, with increasing advantages at higher compression ratios.
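The bidirectional attention modification can be pictured as a mask change; the sketch below is one plausible reading, assuming anchor tokens get full (non-causal) visibility over the context while ordinary tokens stay causal. The paper's exact masking scheme and anchor-embedding machinery may differ.

```python
import torch

def sac_attention_mask(seq_len: int, anchor_idx: list[int]) -> torch.Tensor:
    """Boolean mask (True = may attend) sketching the attention change:
    ordinary tokens keep causal attention, while anchor tokens attend
    bidirectionally so their KV states can absorb the whole context.
    This masking scheme is an assumption, not the paper's exact design."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    for i in anchor_idx:
        mask[i, :] = True  # anchor row sees every context position
    return mask

# Example: 8-token context with anchors at positions 3 and 7; downstream,
# only the anchors' key-value pairs would be kept as the compressed context.
print(sac_attention_mask(8, anchor_idx=[3, 7]).int())
```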
[53] Artificial Impressions: Evaluating Large Language Model Behavior Through the Lens of Trait Impressions
Nicholas Deas, Kathleen McKeown
Main category: cs.CL
TL;DR: LLMs develop artificial impressions resembling human stereotypes, which can be linearly decoded from hidden representations and predict response quality and hedging behavior.
Details
Motivation: To understand how LLMs form internal representations that resemble human impressions and stereotypes based on language, and how these artificial impressions affect model behavior.Method: Used linear probes on generated prompts to predict impressions according to the Stereotype Content Model (SCM), analyzing the relationship between impressions and downstream model behavior as well as prompt features.
Result: LLMs inconsistently report impressions when prompted directly, but impressions are consistently linearly decodable from hidden representations. Artificial impressions predict response quality and hedging use, and are influenced by content, style, and dialect features in prompts.
Conclusion: LLMs develop internal representations that function like human impressions, which can be systematically studied and have measurable effects on model outputs, revealing how language features shape artificial stereotypes in AI systems.
Abstract: We introduce and study artificial impressions–patterns in LLMs’ internal representations of prompts that resemble human impressions and stereotypes based on language. We fit linear probes on generated prompts to predict impressions according to the two-dimensional Stereotype Content Model (SCM). Using these probes, we study the relationship between impressions and downstream model behavior as well as prompt features that may inform such impressions. We find that LLMs inconsistently report impressions when prompted, but also that impressions are more consistently linearly decodable from their hidden representations. Additionally, we show that artificial impressions of prompts are predictive of the quality and use of hedging in model responses. We also investigate how particular content, stylistic, and dialectal features in prompts impact LLM impressions.
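The probing setup is standard; a minimal sketch with random stand-in data is below, assuming hidden states as features and one SCM dimension (e.g., warmth) as the target, fit here as a regression probe for simplicity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy stand-ins: in the paper, X would be LLM hidden states for prompts and
# y the SCM annotations; random data here only illustrates the shapes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4096))   # one hidden-state vector per prompt
y = rng.normal(size=500)           # one SCM dimension (e.g., warmth)

probe = Ridge(alpha=1.0)           # linear probe: tests decodability, not causality
r2 = cross_val_score(probe, X, y, cv=5, scoring="r2").mean()
print(f"cross-validated R^2 of linear decodability: {r2:.3f}")
```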
[54] SOP-Maze: Evaluating Large Language Models on Complicated Business Standard Operating Procedures
Jiaming Wang, Zhe Tang, Yilin Jin, Peng Ding, Xiaoyu Li, Xuezhi Cao
Main category: cs.CL
TL;DR: SOP-Maze is a benchmark for evaluating LLMs on complex standard operating procedures from real business scenarios, revealing key failure modes in procedural reasoning.
Details
Motivation: Existing benchmarks don't adequately test LLMs' ability to handle complex business SOPs with multiple decision points and procedural logic.Method: Created SOP-Maze benchmark with 397 tasks from 23 real-world SOP scenarios, categorized into Lateral Root System (wide-option selection) and Heart Root System (deep logical reasoning).
Result: Most state-of-the-art models struggle with SOP-Maze, showing three main error types: route blindness, conversational fragility, and calculation errors.
Conclusion: LLMs need significant improvement in handling complex procedural reasoning, and SOP-Maze provides a systematic framework for evaluating these capabilities.
Abstract: As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and Heart Root System (HRS), which emphasizes deep logical reasoning with complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. The systematic study explores LLM performance across SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work on https://github.com/ADoublLEN/SOP-Maze.
[55] A Human Behavioral Baseline for Collective Governance in Software Projects
Mobina Noori, Mahasweta Chakraborti, Amy X Zhang, Seth Frey
Main category: cs.CL
TL;DR: This paper analyzes how open source communities evolve their governance structures by examining version-controlled governance documents from 710 projects, finding that governance expands by adding more roles and actions while maintaining stable rule composition.
Details
Motivation: To understand how open source communities describe participation and control through governance documents, and to establish a baseline for evaluating future AI-mediated workflows’ impact on authority distribution.Method: Used a corpus of 710 projects with paired snapshots, parsed text into actors, rules, actions, and objects, then grouped them and measured change using entropy for evenness, richness for diversity, and Jensen Shannon divergence for drift.
Result: Projects define more roles and actions over time distributed more evenly, while rule composition remains stable, indicating governance grows by expanding and balancing participation categories without major shifts in prescriptive force.
Conclusion: The analysis provides a reproducible baseline for evaluating whether future AI-mediated workflows concentrate or redistribute authority in open source governance.
Abstract: We study how open source communities describe participation and control through version controlled governance documents. Using a corpus of 710 projects with paired snapshots, we parse text into actors, rules, actions, and objects, then group them and measure change with entropy for evenness, richness for diversity, and Jensen Shannon divergence for drift. Projects define more roles and more actions over time, and these are distributed more evenly, while the composition of rules remains stable. These findings indicate that governance grows by expanding and balancing categories of participation without major shifts in prescriptive force. The analysis provides a reproducible baseline for evaluating whether future AI mediated workflows concentrate or redistribute authority.
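All three measurements are standard; the sketch below computes them for toy role-frequency snapshots, assuming raw category counts per snapshot (note that scipy returns the Jensen-Shannon distance, so it is squared to obtain the divergence).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def evenness(counts: np.ndarray) -> float:
    """Shannon entropy of category frequencies (higher = more even)."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy role-frequency snapshots for one project (earlier vs. later).
early = np.array([40, 10, 2, 0], dtype=float)   # counts per role category
late = np.array([35, 20, 10, 6], dtype=float)

richness_early, richness_late = (early > 0).sum(), (late > 0).sum()
# scipy's jensenshannon is the JS *distance*; square it for the divergence.
drift = jensenshannon(early / early.sum(), late / late.sum(), base=2) ** 2

print(f"richness: {richness_early} -> {richness_late}")
print(f"evenness: {evenness(early):.2f} -> {evenness(late):.2f}")
print(f"Jensen Shannon divergence (drift): {drift:.3f}")
```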
[56] Creation of the Chinese Adaptive Policy Communication Corpus
Bolun Sun, Charles Chang, Yuen Yuen Ang, Pingxu Hao, Ruotong Mu, Yuchen Xu, Zhengxin Zhang
Main category: cs.CL
TL;DR: CAPC-CG is the first open dataset of Chinese policy directives annotated with a five-color taxonomy for clear and ambiguous language, covering national laws and regulations from 1949-2023 with 3.3 million paragraph units.
Details
Motivation: To create the first comprehensive annotated dataset for Chinese policy communication research, supporting downstream NLP tasks and multilingual policy analysis.Method: Developed a corpus spanning 1949-2023 containing national laws, administrative regulations, and ministerial rules, segmented into paragraphs and annotated using a two-round labeling framework with expert and trained coders.
Result: Achieved high inter-annotator agreement (Fleiss’s kappa K=0.86), released comprehensive metadata and annotation codebook, and provided baseline classification results with large language models.
Conclusion: CAPC-CG enables supervised modeling and supports downstream NLP research in policy communication, with potential for multilingual applications.
Abstract: We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang’s theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China’s top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss’s kappa of K = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
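Fleiss's kappa for a unit-by-annotator label matrix can be computed directly with statsmodels; the labels below are random stand-ins for the five-color taxonomy, not the released gold annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy agreement check: 10 paragraph units x 3 coders, labels 0-4 standing in
# for the five-color taxonomy; real data would come from the gold-standard set.
rng = np.random.default_rng(1)
labels = rng.integers(0, 5, size=(10, 3))   # rows: units, cols: annotators

table, _ = aggregate_raters(labels)          # counts per unit per category
print(f"Fleiss's kappa: {fleiss_kappa(table):.2f}")
```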
[57] MASA: LLM-Driven Multi-Agent Systems for Autoformalization
Lan Zhang, Marco Valentino, André Freitas
Main category: cs.CL
TL;DR: MASA is a multi-agent framework using LLMs for autoformalization - converting natural language to formal representations through collaborative agents.
Details
Motivation: Autoformalization bridges natural language and formal reasoning, but existing approaches lack modularity and flexibility to adapt to rapidly evolving AI capabilities.Method: Multi-agent system architecture with emphasis on modularity, flexibility and extensibility, using collaborative LLM agents that can integrate with theorem provers.
Result: Demonstrated effectiveness through real-world mathematical definitions and formal mathematics datasets, showing enhanced efficiency and reliability.
Conclusion: Multi-agent systems combining LLMs and theorem provers have strong potential for improving autoformalization, providing valuable tools for researchers and practitioners.
Abstract: Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.
[58] DARO: Difficulty-Aware Reweighting Policy Optimization
Jingyu Zhou, Lu Ma, Hao Liang, Chengyu Shen, Bin Cui, Wentao Zhang
Main category: cs.CL
TL;DR: DARO is a new RLVR method that dynamically adjusts loss contributions based on difficulty levels, outperforming existing methods like GRPO in math reasoning tasks.
Details
Motivation: Existing RLVR methods like GRPO and its variants use static weighting schemes tied to sample difficulty, which prevents adaptation to the model's evolving capabilities and creates loss scale issues.Method: Difficulty-Aware Reweighting Policy Optimization (DARO) dynamically adjusts the loss contribution of each difficulty group based on the model’s current learning state.
Result: DARO outperforms four leading baselines across six math benchmarks on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B, achieving significantly faster convergence and superior final performance.
Conclusion: Dynamic difficulty-aware reweighting is crucial for effective RLVR training, addressing the limitations of static weighting schemes in existing methods.
Abstract: Recent advances in large language models (LLMs) have shown that reasoning ability can be significantly enhanced through Reinforcement Learning with Verifiable Rewards (RLVR). Group Relative Policy Optimization (GRPO) has emerged as the de facto approach for RLVR, inspiring numerous variants. However, our mathematical analysis reveals that these methods are fundamentally weighted variations of GRPO. We provide a unified view, demonstrating that their reliance on static or overly simplistic weighting schemes tied to sample difficulty prevents adaptation to a model’s evolving capabilities. This creates a significant loss scale issue, where training disproportionately focuses on certain difficulty levels at the expense of others, hindering overall performance. To address these limitations, we introduce Difficulty-Aware Reweighting Policy Optimization (DARO), a method that dynamically adjusts the loss contribution of each difficulty group based on the model’s learning state. Extensive experiments on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and Llama3.1-8B show that DARO outperforms four leading baselines across six math benchmarks, achieving significantly faster convergence and superior final performance.
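A hypothetical sketch of difficulty-aware reweighting: difficulty groups near the model's current frontier (pass rate around 0.5) get upweighted, while mastered or unsolvable groups are downweighted. DARO's actual rule tracks the model's learning state over training and is not reproduced here; the function below only illustrates the general mechanism.

```python
import numpy as np

def difficulty_weights(pass_rates: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Illustrative reweighting (an assumption, not DARO's update rule):
    upweight difficulty groups where the model's rollout pass rate is near
    0.5, i.e., where the learning signal is richest."""
    frontier = pass_rates * (1.0 - pass_rates)   # peaks at pass rate 0.5
    w = np.exp(frontier / temperature)
    return w / w.sum()                           # normalized loss weights per group

# Per-group rollout pass rates at some training step: easy -> hard.
print(difficulty_weights(np.array([0.95, 0.6, 0.4, 0.05])))
```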
[59] Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models
Yutao Mou, Xiaoling Zhou, Yuxiao Luo, Shikun Zhang, Wei Ye
Main category: cs.CL
TL;DR: LoRA-based refusal training enables performance-preserving safety alignment using only safety data, serving as cost-efficient safety patches.
Details
Motivation: Safety alignment is challenging to enhance without degrading general performance, and current approaches require expensive searches for optimal data proportions.Method: LoRA-based refusal training that decouples safety into a low-rank subspace orthogonal to the model’s intrinsic transformation space.
Result: Demonstrates that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches.
Conclusion: LoRA effectively enables safety enhancements without interfering with inherent model capabilities through orthogonal subspace separation.
Abstract: Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model’s intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.
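Setting up such a LoRA "safety patch" is standard peft usage; the sketch below is illustrative (the base model id, rank, and target modules are assumptions, not the paper's configuration), with the refusal-only fine-tuning step left as a comment.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative base model; the paper's experimental models may differ.
base = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # low-rank updates on attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank subspace trains

# ...fine-tune on refusal demonstrations only; the frozen base weights keep
# general capability, and the adapter can be merged or hot-swapped as a patch.
```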
[60] LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction
Shengmin Piao, Jieun Lee, Sanghyun Park
Main category: cs.CL
TL;DR: LitE-SQL is a lightweight Text-to-SQL framework that achieves comparable performance to LLM-based methods using significantly fewer parameters, addressing privacy and deployment concerns.
Details
Motivation: To address concerns about deployment feasibility and data privacy in LLM-based Text-to-SQL methods by developing a lightweight alternative.Method: Two-component framework: Schema Retriever using vector database for efficient schema linking, and SQL Generator fine-tuned in two stages (supervised fine-tuning + execution-guided reinforcement) for self-correction.
Result: Achieved 72.10% execution accuracy on BIRD and 88.45% on Spider 1.0, comparable to LLM-based methods while using 2x to 30x fewer parameters.
Conclusion: High-quality Text-to-SQL generation is feasible with lightweight models, offering practical solutions for privacy-sensitive and resource-constrained environments.
Abstract: The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raises concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages (supervised fine-tuning followed by execution-guided reinforcement), enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.
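The Schema Retriever idea, embedding schema elements once and retrieving against the question, can be sketched with an off-the-shelf embedder; the model name, toy schema, and top-k below are assumptions, not LitE-SQL's actual components.

```python
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf embedder standing in for LitE-SQL's retriever model.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

schema_docs = [
    "table customers: id, name, city",
    "table orders: id, customer_id, total, order_date",
    "table products: id, name, price",
]
# Schema embeddings are computed once and stored in a vector database.
schema_emb = embedder.encode(schema_docs, convert_to_tensor=True)

question = "What is the total spent by customers from Berlin?"
hits = util.semantic_search(
    embedder.encode(question, convert_to_tensor=True), schema_emb, top_k=2
)[0]
linked = [schema_docs[h["corpus_id"]] for h in hits]  # passed to the SQL generator
print(linked)
```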
[61] Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
Keno Harada, Lui Yoshida, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Main category: cs.CL
TL;DR: This paper proposes an iterative rubric refinement approach for LLM-based Automated Essay Scoring (AES) that improves scoring accuracy by having models refine rubrics based on their own scoring rationales and discrepancies with human scores.
Details
Motivation: LLM performance is highly sensitive to prompts, and there's potential to enhance AES by optimizing the scoring rubrics used by LLMs rather than just the prompts themselves.Method: The approach prompts LLMs to iteratively refine scoring rubrics by reflecting on their own scoring rationales and observed discrepancies with human scores on sample essays.
Result: Experiments on TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct showed Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47 respectively. Even with simple initial rubrics, the approach achieved comparable or better QWK than using detailed human-authored rubrics.
Conclusion: Iterative rubric refinement is crucial for LLM-based AES to enhance alignment with human evaluations, demonstrating that optimized rubrics can significantly improve scoring accuracy.
Abstract: The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models’ own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.
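A minimal sketch of the reflect-and-revise loop, assuming placeholder `llm` and `score_essays` callables; QWK is computed with scikit-learn's quadratic-weighted kappa, and the revision prompt is illustrative rather than the paper's.

```python
from sklearn.metrics import cohen_kappa_score

def llm(prompt: str) -> str:
    """Placeholder for a chat-model call."""
    raise NotImplementedError

def score_essays(rubric: str, essays: list[str]) -> list[int]:
    """Have the LLM score each essay under the current rubric."""
    raise NotImplementedError

def refine_rubric(rubric: str, essays: list[str], human: list[int],
                  rounds: int = 5) -> str:
    for _ in range(rounds):
        preds = score_essays(rubric, essays)
        # Quadratic Weighted Kappa against human scores on the sample essays.
        qwk = cohen_kappa_score(human, preds, weights="quadratic")
        diffs = [(e, h, p) for e, h, p in zip(essays, human, preds) if h != p]
        rubric = llm(
            f"Current rubric:\n{rubric}\nQWK vs. humans: {qwk:.2f}\n"
            f"Sample disagreements: {diffs[:3]}\n"
            "Reflect on the scoring rationale and revise the rubric."
        )
    return rubric
```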
[62] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language
Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir
Main category: cs.CL
TL;DR: Fine-tuning multilingual transformer models on a novel Bangla-transliterated Chakma corpus significantly improves performance for this low-resource Indo-Aryan language, achieving up to 73.54% token accuracy and low perplexity of 2.90.
Details
Motivation: Chakma is an under-resourced Indo-Aryan language with limited available data, making it underrepresented in language models. The research aims to address this gap by creating resources for Chakma language processing.Method: Created a novel corpus of Bangla-transliterated Chakma from Chakma literature, validated by native speakers. Fine-tuned six encoder-based multilingual transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, IndicBERT) on masked language modeling tasks.
Result: Fine-tuned multilingual models outperformed pre-trained counterparts, achieving up to 73.54% token accuracy and perplexity as low as 2.90. Analysis showed data quality impacts performance and revealed limitations of OCR pipelines for morphologically rich Indic scripts.
Conclusion: Bangla-transliterated Chakma is effective for transfer learning in Chakma language. The research demonstrates successful adaptation of multilingual models to low-resource languages and releases a manually validated dataset to encourage further research.
Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based multilingual and regional transformer models (mBERT, XLM-RoBERTa, DistilBERT, DeBERTaV3, BanglaBERT, and IndicBERT) on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be highly effective for transfer learning to the Chakma language, and we release our manually validated monolingual dataset to encourage further research on multilingual language modeling for low-resource languages.
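The fine-tuning recipe is standard masked language modeling with Hugging Face; the sketch below uses XLM-RoBERTa and a placeholder corpus line, so the base model choice and hyperparameters are illustrative only.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# One of the six evaluated encoders; any of them slots in the same way.
model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder line standing in for the Bangla-transliterated Chakma corpus.
corpus = Dataset.from_dict({"text": ["<transliterated Chakma sentence>"]})
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True),
                       batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chakma-mlm", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=collator,  # dynamic masking; perplexity is evaluated on MLM
)
trainer.train()
```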
[63] Large Language Models Do NOT Really Know What They Don’t Know
Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
Main category: cs.CL
TL;DR: LLMs encode factuality signals in their internal representations but can still produce factual errors. The study shows that hallucinations based on subject knowledge use the same recall process as correct responses, making them indistinguishable, while other hallucinations are detectable.
Details
Motivation: To investigate whether LLMs' internal computations can reliably distinguish between factual and hallucinated outputs, given that LLMs encode factuality signals but also produce errors through shortcuts.Method: Conducted a mechanistic analysis comparing two types of hallucinations based on their reliance on subject information, examining internal recall processes and hidden-state geometries.
Result: Hallucinations associated with subject knowledge use the same internal recall process as correct responses, leading to overlapping hidden-state geometries. Hallucinations detached from subject knowledge produce distinct, clustered representations that are detectable.
Conclusion: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that ‘LLMs don’t really know what they don’t know’.
Abstract: Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations, such as hidden states, attention weights, or token probabilities, implying that LLMs may “know what they don’t know”. However, LLMs can also produce factual errors by relying on shortcuts or spurious associations. These errors are driven by the same training objective that encourages correct predictions, raising the question of whether internal computations can reliably distinguish between factual and hallucinated outputs. In this work, we conduct a mechanistic analysis of how LLMs internally process factual queries by comparing two types of hallucinations based on their reliance on subject information. We find that when hallucinations are associated with subject knowledge, LLMs employ the same internal recall process as for correct responses, leading to overlapping and indistinguishable hidden-state geometries. In contrast, hallucinations detached from subject knowledge produce distinct, clustered representations that make them detectable. These findings reveal a fundamental limitation: LLMs do not encode truthfulness in their internal states but only patterns of knowledge recall, demonstrating that “LLMs don’t really know what they don’t know”.
[64] Alif: Advancing Urdu Large Language Models via Multilingual Synthetic Data Distillation
Muhammad Ali Shafique, Kanwal Mehreen, Muhammad Arham, Maaz Amjad, Sabur Butt, Hamza Farooq
Main category: cs.CL
TL;DR: Alif-1.0-8B-Instruct is a multilingual Urdu-English LLM trained on a high-quality synthetic dataset using modified self-instruct technique, achieving superior performance on Urdu-specific tasks with under $100 training budget.
Details
Motivation: Address challenges in developing LLMs for low-resource languages like Urdu, including scarcity of high-quality datasets, multilingual inconsistencies, safety concerns, and poor quality translations in existing approaches.Method: Modified self-instruct technique with unique prompts and seed values for each task, creating Urdu-Instruct dataset with Urdu-native chain-of-thought reasoning, bilingual translation, cultural relevance, and ethical safety alignments. Built upon pretrained Llama-3.1-8B.
Result: Outperformed Llama-3.1-8B-Instruct for Urdu-specific tasks and leading multilingual LLMs (Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, Cohere-Aya-Expanse-8B) with under $100 training budget.
Conclusion: High-performance low-resource language LLMs can be developed efficiently and culturally aligned using modified self-instruct approach, demonstrating cost-effective solution for underrepresented languages.
Abstract: Developing high-performing large language models (LLMs) for low-resource languages such as Urdu presents several challenges. These challenges include the scarcity of high-quality datasets, multilingual inconsistencies, and safety concerns. Existing multilingual LLMs often address these issues by translating large volumes of available data. However, such translations often lack quality and cultural nuance while also incurring significant costs for data curation and training. To address these issues, we propose Alif-1.0-8B-Instruct, a multilingual Urdu-English model that tackles these challenges with a unique approach. We train the model on a high-quality, multilingual synthetic dataset (Urdu-Instruct), developed using a modified self-instruct technique. By using unique prompts and seed values for each task along with a global task pool, this dataset incorporates Urdu-native chain-of-thought based reasoning, bilingual translation, cultural relevance, and ethical safety alignments. This technique significantly enhances the comprehension of the Alif-1.0-8B-Instruct model for Urdu-specific tasks. As a result, Alif-1.0-8B-Instruct, built upon the pretrained Llama-3.1-8B, demonstrates superior performance compared to Llama-3.1-8B-Instruct for Urdu-specific tasks. It also outperformed leading multilingual LLMs, including Mistral-7B-Instruct-v0.3, Qwen-2.5-7B-Instruct, and Cohere-Aya-Expanse-8B, all within a training budget of under $100. Our results demonstrate that high-performing LLMs for low-resource languages can be developed efficiently and in a culturally aligned manner using our modified self-instruct approach. All datasets, models, and code are publicly available at: https://github.com/traversaal-ai/alif-urdu-llm.
[65] ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
Chung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng
Main category: cs.CL
TL;DR: ReFIne is a training framework that improves reasoning models’ trustworthiness through better interpretability, faithfulness, and reliability, achieving significant gains across all three dimensions.
Details
Motivation: Current long chain-of-thought reasoning systems prioritize accuracy and token efficiency but overlook critical trustworthiness aspects like interpretability, faithfulness, and reliability.Method: Proposed ReFIne framework combining supervised fine-tuning with GRPO to generate structured reasoning traces with planning, explicit disclosure of decisive information, and self-assessment of derivation soundness and confidence.
Result: ReFIne models showed +44.0% improvement in interpretability, +18.8% in faithfulness, and +42.4% in reliability across mathematical benchmarks, while maintaining structured reasoning traces.
Conclusion: Reasoning models should be optimized not only for accuracy but also for broader trustworthiness dimensions, with ReFIne demonstrating effective improvements in interpretability, faithfulness, and reliability.
Abstract: Recent advances in long chain-of-thought (CoT) reasoning have largely prioritized answer accuracy and token efficiency, while overlooking aspects critical to trustworthiness. We argue that usable reasoning systems must be trustworthy, characterized by three properties: interpretability, faithfulness, and reliability. To this end, we propose ReFIne, a new training framework that integrates supervised fine-tuning with GRPO to encourage models to: (i) improve interpretability by producing structured, tag-based traces with high-level planning that are easier for humans to follow; (ii) enhance faithfulness by explicitly disclosing the decisive information guiding each solution, with consistent cross-section references; and (iii) promote reliability by providing self-assessments of both the derivation’s soundness and the confidence of the final answer. We apply ReFIne to the Qwen3 models at multiple scales (1.7B/4B/8B) and evaluate across mathematical benchmarks of varying difficulty. Our experimental results show that ReFIne models generate clearer and better-structured reasoning traces (interpretability +44.0%), more faithfully expose their underlying decision process (faithfulness +18.8%), and offer informative confidence estimates (reliability +42.4%). These findings highlight an overlooked but important direction: reasoning models should be optimized not only for accuracy, but also for broader dimensions of trustworthiness. Our code is available at: https://github.com/Trustworthy-ML-Lab/Training_Trustworthy_LRM_with_Refine
[66] FrameEOL: Semantic Frame Induction using Causal Language Models
Chihiro Yano, Kosuke Yamada, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
Main category: cs.CL
TL;DR: Proposes FrameEOL, a CLM-based method for semantic frame induction using prompt-based embeddings with in-context learning and deep metric learning, outperforming existing methods.
Details
Motivation: CLMs like GPT and Llama succeed in language tasks but haven't been applied to semantic frame induction, while MLMs have shown good performance in this area.Method: FrameEOL uses prompt-based embeddings that output frame names as labels, leveraging in-context learning and deep metric learning to create embeddings suitable for clustering.
Result: Outperforms existing frame induction methods on English and Japanese FrameNet datasets, with Japanese achieving comparable performance to MLM-based methods using only 5 ICL examples.
Conclusion: CLM-based approaches are effective for semantic frame induction, particularly for languages with limited frame resources like Japanese.
Abstract: Semantic frame induction is the task of clustering frame-evoking words according to the semantic frames they evoke. In recent years, leveraging embeddings of frame-evoking words that are obtained using masked language models (MLMs) such as BERT has led to high-performance semantic frame induction. Although causal language models (CLMs) such as the GPT and Llama series succeed in a wide range of language comprehension tasks and can engage in dialogue as if they understood frames, they have not yet been applied to semantic frame induction. We propose a new method for semantic frame induction based on CLMs. Specifically, we introduce FrameEOL, a prompt-based method for obtaining Frame Embeddings that outputs One frame-name as a Label representing the given situation. To obtain embeddings more suitable for frame induction, we leverage in-context learning (ICL) and deep metric learning (DML). Frame induction is then performed by clustering the resulting embeddings. Experimental results on the English and Japanese FrameNet datasets demonstrate that the proposed methods outperform existing frame induction methods. In particular, for Japanese, which lacks extensive frame resources, the CLM-based method using only 5 ICL examples achieved comparable performance to the MLM-based method fine-tuned with DML.
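A sketch of the prompt-based embedding step, assuming a generic instruction-tuned CLM and a simplified prompt: the frame embedding is read off the last token's final hidden state, and induction reduces to clustering. The paper's exact prompt, ICL examples, and DML fine-tuning are omitted.

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative CLM; the paper's models and prompt wording may differ.
name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

def frame_embedding(sentence: str, verb: str) -> torch.Tensor:
    # Prompt the model to name the evoked frame in one word; the hidden state
    # at the prompt's last token then serves as the frame embedding.
    prompt = f'In "{sentence}", the frame evoked by "{verb}" in one word is:'
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[-1][0, -1]

sents = [("She bought a car.", "bought"), ("He purchased a house.", "purchased")]
embs = torch.stack([frame_embedding(s, v) for s, v in sents]).float().numpy()
# Frame induction = clustering the embeddings (both verbs evoke one frame here).
print(KMeans(n_clusters=1, n_init="auto").fit_predict(embs))
```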
[67] When Retrieval Succeeds and Fails: Rethinking Retrieval-Augmented Generation for LLMs
Yongjie Wang, Yue Yu, Kaisong Song, Jun Lin, Zhiqi Shen
Main category: cs.CL
TL;DR: A comprehensive review of Retrieval-Augmented Generation (RAG) that examines its objectives, components, challenges, and applications where it enhances LLM performance.
Details
Motivation: LLMs struggle with rapidly evolving information and domain-specific queries due to training on static corpora, and traditional RAG frameworks have diminishing advantages as LLMs advance.Method: The paper presents a systematic review analyzing RAG’s objectives, core components, key challenges, and critical weaknesses that limit its effectiveness.
Result: The review identifies specific applications where LLMs alone perform inadequately but RAG combined with LLMs substantially enhances their effectiveness.
Conclusion: This work aims to encourage researchers to reconsider RAG’s role and inspire development of next-generation RAG systems to better complement advancing LLM capabilities.
Abstract: Large Language Models (LLMs) have enabled a wide range of applications through their powerful capabilities in language understanding and generation. However, as LLMs are trained on static corpora, they face difficulties in addressing rapidly evolving information or domain-specific queries. Retrieval-Augmented Generation (RAG) was developed to overcome this limitation by integrating LLMs with external retrieval mechanisms, allowing them to access up-to-date and contextually relevant knowledge. However, as LLMs themselves continue to advance in scale and capability, the relative advantages of traditional RAG frameworks have become less pronounced, and RAG itself less necessary in some settings. Here, we present a comprehensive review of RAG, beginning with its overarching objectives and core components. We then analyze the key challenges within RAG, highlighting critical weaknesses that may limit its effectiveness. Finally, we showcase applications where LLMs alone perform inadequately, but where RAG, when combined with LLMs, can substantially enhance their effectiveness. We hope this work will encourage researchers to reconsider the role of RAG and inspire the development of next-generation RAG systems.
[68] DITING: A Multi-Agent Evaluation Framework for Benchmarking Web Novel Translation
Enze Zhang, Jiaying Wang, Mengxi Xiao, Jifei Liu, Ziyan Kuang, Rui Dong, Youzhong Dong, Sophia Ananiadou, Min Peng, Qianqian Xie
Main category: cs.CL
TL;DR: DITING is a comprehensive evaluation framework for web novel translation that assesses narrative and cultural fidelity across six dimensions, with AgentEval achieving the highest correlation with human judgments among automatic metrics.
Details
Motivation: Existing benchmarks for machine translation rely on surface-level metrics that fail to capture the distinctive traits of web novels, leaving LLM effectiveness in this domain unclear.Method: Introduced DITING framework with 18K expert-annotated Chinese-English sentence pairs, proposed AgentEval (reasoning-driven multi-agent evaluation), and developed MetricAlign for metric comparison with 300 annotated sentence pairs.
Result: Chinese-trained LLMs surpass larger foreign counterparts, and DeepSeek-V3 delivers the most faithful and stylistically coherent translations among fourteen evaluated models.
Conclusion: Establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research in this domain.
Abstract: Large language models (LLMs) have substantially advanced machine translation (MT), yet their effectiveness in translating web novels remains unclear. Existing benchmarks rely on surface-level metrics that fail to capture the distinctive traits of this genre. To address these gaps, we introduce DITING, the first comprehensive evaluation framework for web novel translation, assessing narrative and cultural fidelity across six dimensions: idiom translation, lexical ambiguity, terminology localization, tense consistency, zero-pronoun resolution, and cultural safety, supported by over 18K expert-annotated Chinese-English sentence pairs. We further propose AgentEval, a reasoning-driven multi-agent evaluation framework that simulates expert deliberation to assess translation quality beyond lexical overlap, achieving the highest correlation with human judgments among seven tested automatic metrics. To enable metric comparison, we develop MetricAlign, a meta-evaluation dataset of 300 sentence pairs annotated with error labels and scalar quality scores. Comprehensive evaluation of fourteen open, closed, and commercial models reveals that Chinese-trained LLMs surpass larger foreign counterparts, and that DeepSeek-V3 delivers the most faithful and stylistically coherent translations. Our work establishes a new paradigm for exploring LLM-based web novel translation and provides public resources to advance future research.
[69] Augmenting Dialog with Think-Aloud Utterances for Modeling Individual Personality Traits by LLM
Seiya Ishikura, Hiroaki Yamada, Tatsuya Hiraoka, Takenobu Tokunaga
Main category: cs.CL
TL;DR: Augmenting dialog data with think-aloud utterances improves LLM’s ability to mimic human personality traits, particularly Agreeableness and Neuroticism in the Big Five framework.
Details
Motivation: To enhance LLM-based personality modeling in text chat by incorporating speakers' internal thought processes through think-aloud utterances.Method: Training persona LLMs with TAU-augmented dialog data, where TAUs verbalize speakers’ thoughts before actual utterances, and evaluating personality alignment using the Big Five framework.
Result: LLMs trained with TAU-augmented data showed better alignment with speakers’ Agreeableness and Neuroticism traits compared to those trained with original dialog data. TAU quality significantly impacts performance.
Conclusion: Think-aloud utterance augmentation effectively improves personality modeling in LLMs, with the quality of augmentation being crucial for optimal performance.
Abstract: This study proposes augmenting dialog data with think-aloud utterances (TAUs) for modeling individual personalities in text chat with LLMs. A TAU is a verbalization of a speaker’s thought before articulating the utterance. We expect “persona LLMs” trained with TAU-augmented data to mimic the speaker’s personality traits better. We tested whether the trained persona LLMs capture human personality with respect to the Big Five, a framework characterizing human personality traits along five aspects. The results showed that LLMs trained with TAU-augmented data align more closely with the speakers’ Agreeableness and Neuroticism than those trained with original dialog data. We also found that the quality of TAU augmentation impacts the persona LLMs’ performance.
[70] Stronger Re-identification Attacks through Reasoning and Aggregation
Lucas Georges Gabriel Charpentier, Pierre Lison
Main category: cs.CL
TL;DR: This paper presents two strategies to improve re-identification attacks on de-identified text: using multiple ordering sequences for PII re-identification and leveraging reasoning models with extensive background knowledge.
Details
Motivation: Current methods for measuring the effectiveness of text de-identification techniques are limited. Re-identification attacks can better assess de-identification robustness, but existing approaches need improvement to create stronger adversarial evaluations.Method: Two complementary strategies: (1) Using multiple ordering sequences for PII re-identification and aggregating predictions across orderings, and (2) Employing reasoning models that leverage extensive background knowledge to boost re-identification performance.
Result: The proposed strategies lead to improved re-identification results, with reasoning models particularly effective when adversaries have access to extensive background knowledge.
Conclusion: These enhanced re-identification attacks provide better assessment of de-identification method robustness and highlight the importance of considering sophisticated adversaries with background knowledge when evaluating privacy protection techniques.
Abstract: Text de-identification techniques are often used to mask personally identifiable information (PII) from documents. Their ability to conceal the identity of the individuals mentioned in a text is, however, hard to measure. Recent work has shown how the robustness of de-identification methods could be assessed by attempting the reverse process of re-identification, based on an automated adversary using its background knowledge to uncover the PIIs that have been masked. This paper presents two complementary strategies to build stronger re-identification attacks. We first show that (1) the order in which the PII spans are re-identified matters, and that aggregating predictions across multiple orderings leads to improved results. We also find that (2) reasoning models can boost the re-identification performance, especially when the adversary is assumed to have access to extensive background knowledge.
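To make the ordering-aggregation idea concrete, here is a minimal Python sketch; the predict_span adversary and its candidate lists are hypothetical stand-ins for the paper's LLM-based attacker, and majority voting is one simple way to aggregate predictions across orderings:

```python
import random
from collections import Counter

def predict_span(span_id, revealed, knowledge):
    """Hypothetical stand-in for the LLM adversary: guess one masked span
    given previously re-identified spans and background knowledge. A real
    attack would prompt an LLM with `revealed` as context."""
    return max(knowledge[span_id], key=knowledge[span_id].get)

def reidentify_with_aggregation(span_ids, knowledge, n_orderings=5, seed=0):
    """Re-identify spans under several random orderings, then majority-vote."""
    rng = random.Random(seed)
    votes = {s: Counter() for s in span_ids}
    for _ in range(n_orderings):
        order = span_ids[:]
        rng.shuffle(order)          # the order of re-identification matters
        revealed = {}
        for span in order:
            guess = predict_span(span, revealed, knowledge)
            revealed[span] = guess  # later guesses can condition on this
            votes[span][guess] += 1
    return {s: votes[s].most_common(1)[0][0] for s in span_ids}
```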
[71] LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning
Changjiang Gao, Zixian Huang, Jingyang Gong, Shujian Huang, Lei Li, Fei Yuan
Main category: cs.CL
TL;DR: Qwen3-XPlus models use layer-selective tuning on parallel data to enhance translation while maintaining reasoning capabilities, achieving significant improvements in multilingual tasks.
Details
Motivation: General LLMs excel at reasoning but struggle with translation tasks, while translation-enhanced LLMs lose reasoning capabilities. This work aims to bridge this gap.Method: A novel translation-enhanced recipe that starts with instruct models and applies layer-selective tuning only on parallel data, using small parallel datasets for training.
Result: Qwen3-XPlus achieves 15+ spBLEU and 40+ xComet in low-resource languages like Swahili, with 1+ point average improvement on 7 multilingual tasks while maintaining reasoning proficiency comparable to Qwen3 instruct model.
Conclusion: This approach offers a promising method for multilingual enhancement that reduces complexity and improves accessibility for diverse languages.
Abstract: General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies layer-selective tuning only on parallel data. Following this pipeline, we introduce the Qwen3-XPlus models, which demonstrate significant improvements in translation performance across both high- and low-resource languages, achieving 15+ spBLEU and 40+ xComet in low-resource languages, like Swahili. Interestingly, training only with small parallel datasets, Qwen3-XPlus achieves an average improvement of 1+ points on 7 multilingual tasks while maintaining proficiency comparable to the Qwen3 instruct model in 15 popular reasoning datasets. This work offers a promising approach to multilingual enhancement, significantly reducing complexity and enhancing accessibility for a wider range of languages. The code and model are publicly available.
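Layer-selective tuning boils down to freezing everything except a chosen subset of transformer blocks before fine-tuning on parallel data. A minimal PyTorch sketch follows; the checkpoint, the choice of which layers to unfreeze, and the attribute path model.model.layers are illustrative assumptions, not the paper's recipe:

```python
import torch
from transformers import AutoModelForCausalLM

def freeze_except_layers(model, trainable_layers):
    """Freeze all parameters, then re-enable gradients only for the
    transformer blocks whose indices appear in `trainable_layers`."""
    for p in model.parameters():
        p.requires_grad = False
    for idx in trainable_layers:
        for p in model.model.layers[idx].parameters():
            p.requires_grad = True

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
# Hypothetical choice: tune only a few middle layers on parallel data.
freeze_except_layers(model, trainable_layers=[10, 11, 12])
optim = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```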
[72] DICE: Structured Reasoning in LLMs through SLM-Guided Chain-of-Thought Correction
Yiqi Li, Yusheng Liao, Zhe Chen, Yanfeng Wang, Yu Wang
Main category: cs.CL
TL;DR: DICE is a lightweight framework that uses small language models (SLMs) to refine LLM outputs through chain-of-thought correction, improving format accuracy by 35.4% and content correctness by 29.4% while preserving LLM reasoning capabilities.
Details
Motivation: LLMs often prioritize reasoning over strict output format adherence, and fine-tuning them is impractical due to high computational costs and limited parameter access.Method: DICE decouples the process by having LLMs generate natural language responses, then using trained SLMs to analyze and refine outputs to meet structured specifications via chain-of-thought correction. It constructs structured CoT datasets through a two-stage method and applies dual-tuning for SLMs.
Result: Experiments show DICE improves format accuracy by 35.4% and content correctness by 29.4%, achieving state-of-the-art performance over competitive baselines.
Conclusion: DICE effectively preserves LLMs’ broad knowledge and reasoning capabilities while ensuring outputs conform to user demands through lightweight SLM-based refinement.
Abstract: When performing reasoning tasks with user-specific requirements, such as strict output formats, large language models (LLMs) often prioritize reasoning over adherence to detailed instructions. Fine-tuning LLMs on supervised datasets to address this is impractical due to high computational costs and limited parameter access. To tackle this, we propose DICE, a lightweight framework that guides small language models (SLMs) to refine LLMs’ outputs through chain-of-thought (CoT) correction. DICE decouples the process by first prompting LLMs to generate natural language responses, then using trained SLMs to analyze and refine these outputs to meet structured output specifications. This framework preserves LLMs’ broad knowledge and reasoning capabilities while ensuring the outputs conform to user demands. Specifically, DICE first constructs structured CoT adaptation datasets via a two-stage method and subsequently applies a dual-tuning strategy to fine-tune SLMs for generating structured outputs in an analyze-then-answer pattern. Experiments demonstrate that DICE improves the average format accuracy and content correctness of LLM outputs by 35.4% and 29.4%, respectively, achieving state-of-the-art (SOTA) performance over other competitive baselines.
[73] IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data
Tao Feng, Lizhen Qu, Niket Tandon, Gholamreza Haffari
Main category: cs.CL
TL;DR: IRIS is a novel framework for real-time causal discovery that combines statistical algorithms and LLM-based methods to discover both known and novel causal relations from initial variables without requiring pre-existing datasets.
Details
Motivation: Traditional causal discovery methods face challenges including expensive data collection, redundant computation for known relations, and unrealistic assumptions. LLM-based methods fail to uncover novel relations.Method: IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations using a hybrid approach combining statistical algorithms and LLM-based methods. It includes a missing variable proposal component to expand causal graphs.
Result: The framework enables real-time causal discovery starting from only a set of initial variables without requiring pre-existing datasets.
Conclusion: IRIS addresses limitations of both traditional statistical methods and recent LLM-based approaches by providing a comprehensive solution for discovering both known and novel causal relations in real-time.
Abstract: Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
[74] CrisiText: A dataset of warning messages for LLM training in emergency communication
Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, Marco Guerini
Main category: cs.CL
TL;DR: CrisiText is the first large-scale dataset for generating warning messages across 13 crisis scenarios, containing over 400,000 messages. The paper compares various NLG approaches including supervised fine-tuning, preference alignment, and few-shot learning.
Details
Motivation: Current NLP techniques in crisis situations are limited to classification tasks, overlooking the significant potential of timely warning message generation using NLG architectures to assist civilians during emergencies.Method: Created a dataset from existing crisis descriptions by generating event chains paired with warning messages following expert guidelines. Compared supervised fine-tuning with preference alignment, zero-shot, and few-shot approaches, and evaluated out-of-distribution performance and automatic post-editing.
Result: The paper presents CrisiText dataset with 400,000+ warning messages across 13 crisis types and 18,000 crisis situations, with each message accompanied by three suboptimal warning types for comparative study.
Conclusion: The research demonstrates the feasibility and importance of using NLG for crisis warning message generation, providing a comprehensive dataset and evaluation framework for future research in this critical domain.
Abstract: Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts’ written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.
[75] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao
Main category: cs.CL
TL;DR: DSPO is a new RL algorithm that trains LLMs to actively search external knowledge through multi-turn reasoning without supervised data, achieving significant performance improvements over previous methods.
Details
Motivation: Current approaches for enhancing LLMs with external knowledge search either rely on prompting or suffer from performance limitations and instability when applying RL to complex interactive tasks.Method: Dynamic-filter Sequence-level Policy Optimization (DSPO) - an RL algorithm with sequence-level optimization and dynamic sample filtering that trains models purely through RL to interleave multi-turn search and reasoning.
Result: DSPO-trained 7B model improves over comparable previous work by 34.1% across multiple QA benchmarks, outperforms 14B model from previous work by nearly 9% relative in complex multihop QA like HotpotQA, while maintaining exceptional training stability.
Conclusion: DSPO effectively unlocks LLMs’ true agentic potential for complex knowledge-seeking tasks through robust RL training without requiring supervised demonstration data.
Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model’s innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our DSPO-trained 7B model improves over a comparable previous work by 34.1%, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly 9% relative, maintaining exceptional training stability.
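The abstract names two ingredients: sequence-level optimization and dynamic sample filtering. The sketch below shows one plausible reading, with token log-probabilities summed into a single per-sequence importance ratio before clipping, and groups whose rewards carry no learning signal filtered out; the exact DSPO objective and filtering rule are not given in the summary, so treat both as assumptions:

```python
import torch

def sequence_level_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Schematic sequence-level PPO-style objective.
    logp_new/logp_old: [batch, seq_len] token log-probs; mask: valid tokens;
    advantages: [batch] per-sequence (e.g. group-normalized) advantages."""
    seq_logp_new = (logp_new * mask).sum(-1)
    seq_logp_old = (logp_old * mask).sum(-1)
    ratio = torch.exp(seq_logp_new - seq_logp_old)  # one ratio per sequence
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def dynamic_filter(rewards, groups):
    """Assumed filtering rule: drop groups whose rewards are all equal
    (zero advantage everywhere), a common trick in GRPO-style training."""
    keep = torch.ones_like(rewards, dtype=torch.bool)
    for g in groups.unique():
        idx = groups == g
        if rewards[idx].max() == rewards[idx].min():
            keep[idx] = False
    return keep
```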
[76] Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
Main category: cs.CL
TL;DR: Proposes Self-Critique, a novel method for detecting data contamination in RL post-training of LLMs by identifying policy collapse and entropy reduction, achieving up to 30% AUC improvement over baselines.
Details
Motivation: Addresses the critical research gap in detecting data contamination during RL post-training phase, which has become increasingly important for LLM reasoning but lacks specialized detection methods.Method: Self-Critique method probes for policy collapse and entropy reduction in LLM outputs after RL training, using the observation that output entropy distribution collapses into sparse modes. Also introduces RL-MIA benchmark for evaluation.
Result: Significantly outperforms baseline methods across multiple models and contamination tasks, achieving up to 30% AUC improvement. Makes detection possible where existing methods perform close to random guessing.
Conclusion: Provides the first systematic study and effective solution for RL-phase data contamination detection, addressing a critical vulnerability in LLM evaluation reliability.
Abstract: Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples may inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model’s convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
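The key observable here is entropy collapse after RL. A minimal sketch of probing it: generate greedily while recording the entropy of each next-token distribution, then flag samples whose mean entropy falls below a threshold. The decision rule and threshold below are assumptions, not the paper's Self-Critique procedure:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_entropy(model, input_ids, max_new_tokens=64):
    """Generate greedily and average the entropy of the next-token
    distribution at each step; collapsed (near-deterministic) outputs
    yield low values."""
    entropies = []
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
        entropies.append(ent)
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=-1)
    return torch.stack(entropies).mean()

def flag_contaminated(entropy, threshold=0.5):
    """Assumed decision rule: mark a benchmark sample as possibly seen
    during RL post-training when its output entropy is below a tuned
    threshold."""
    return entropy.item() < threshold
```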
[77] CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation
Kaiwen Wei, Xiao Liu, Jie Zhang, Zijian Wang, Ruida Liu, Yuming Yang, Xin Xiao, Xiao Sun, Haoyang Zeng, Changzai Pan, Yidan Zhang, Jiang Zhong, Peijin Wang, Yingchao Feng
Main category: cs.CL
TL;DR: CFVBench is a new multimodal benchmark with 5,360 QA pairs from 599 videos, focusing on fine-grained multimodal understanding in complex formats like charts, news, and tutorials. Current MLLMs struggle with fine details, so the authors propose Adaptive Visual Refinement (AVR) to improve performance.
Details
Motivation: Existing multimodal benchmarks are limited in modality coverage and format diversity, often focusing on single-modality tasks or coarse-grained understanding. There's a need for benchmarks that test fine-grained multimodal comprehension across diverse, complex formats.Method: Created CFVBench benchmark with 5,360 QA pairs from 599 videos spanning chart-heavy reports, news broadcasts, and software tutorials. Proposed Adaptive Visual Refinement (AVR) framework that adaptively increases frame sampling density and selectively invokes external tools when needed.
Result: Systematic evaluation of 7 retrieval methods and 14 MLLMs revealed current models (including GPT5 and Gemini) struggle with fine-grained multimodal details. AVR consistently enhanced fine-grained comprehension and improved performance across all evaluated MLLMs.
Conclusion: Fine-grained multimodal understanding remains a critical bottleneck for current MLLMs. The proposed AVR framework effectively addresses this limitation by adaptively refining visual processing, demonstrating significant performance improvements across diverse multimodal tasks.
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely-used MLLMs, revealing a critical bottleneck: current models (even GPT5 or Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.
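AVR's control loop can be pictured as retry-with-denser-sampling. In the sketch below, mllm, video.sample, and tool are hypothetical interfaces, and the fps schedule and confidence floor are illustrative choices rather than the paper's settings:

```python
def answer_with_avr(video, question, mllm, fps_schedule=(0.5, 1, 2, 4),
                    confidence_floor=0.7, tool=None):
    """Schematic Adaptive Visual Refinement loop: retry with denser frame
    sampling until the model reports enough confidence."""
    for fps in fps_schedule:
        frames = video.sample(fps=fps)       # hypothetical frame sampler
        answer, confidence = mllm(frames, question)
        if confidence >= confidence_floor:
            return answer
    # Last resort: selectively invoke an external tool (e.g. OCR) if given.
    if tool is not None:
        answer, _ = mllm(tool(frames), question)
    return answer
```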
[78] Inflated Excellence or True Performance? Rethinking Medical Diagnostic Benchmarks with Dynamic Evaluation
Xiangxu Zhang, Lei Li, Yanyun Zhou, Xiao Zhou, Yingying Zhang, Xian Wu
Main category: cs.CL
TL;DR: DyReMe is a dynamic benchmark for medical diagnostics that generates fresh, consultation-like cases with distractors and varied expression styles, evaluating LLMs on accuracy, veracity, helpfulness, and consistency to better reflect real clinical practice.
Details
Motivation: Current LLM evaluations in medical diagnostics rely on static benchmarks from public medical exams, which overestimate performance and ignore the ambiguity and variability of real-world clinical practice. Dynamic evaluations have been limited to superficial perturbations and narrow focus on accuracy.Method: Proposed DyReMe benchmark generates fresh consultation-like cases with distractors (differential diagnoses, misdiagnosis factors) and varied expression styles. Evaluates LLMs on four dimensions: accuracy, veracity, helpfulness, and consistency.
Result: Experiments show DyReMe provides more challenging and realistic assessments, revealing significant misalignments between state-of-the-art LLM performance and real clinical practice requirements.
Conclusion: There is an urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics, as current benchmarks fail to capture the complexity of real-world clinical practice.
Abstract: Medical diagnostics is a high-stakes and complex domain that is critical to patient care. However, current evaluations of large language models (LLMs) are fundamentally misaligned with real-world clinical practice. Most of them rely on static benchmarks derived from public medical exam items, which tend to overestimate model performance and ignore the difference between textbook cases and the ambiguous, varying conditions in the real world. Recent efforts toward dynamic evaluation offer a promising alternative, but their improvements are limited to superficial perturbations and a narrow focus on accuracy. To address these gaps, we propose DyReMe, a dynamic benchmark for medical diagnostics that better reflects real clinical practice. Unlike static exam-style questions, DyReMe generates fresh, consultation-like cases that introduce distractors such as differential diagnoses and common misdiagnosis factors. It also varies expression styles to mimic diverse real-world query habits. Beyond accuracy, DyReMe evaluates LLMs on three additional clinically relevant dimensions: veracity, helpfulness, and consistency. Our experiments demonstrate that this dynamic approach yields more challenging and realistic assessments, revealing significant misalignments between the performance of state-of-the-art LLMs and real clinical practice. These findings highlight the urgent need for evaluation frameworks that better reflect the demands of trustworthy medical diagnostics.
[79] CLARity: Reasoning Consistency Alone Can Teach Reinforced Experts
Jiuheng Lin, Cong Jiang, Zirui Wu, Jiarui Sun, Yansong Feng
Main category: cs.CL
TL;DR: CLARity is a cost-effective RL framework that improves reasoning quality in expert LLMs using small general-purpose LLMs, addressing the limitations of standard outcome-based RL on MCQs.
Details
Motivation: Training expert LLMs in data-scarce domains using MCQs with standard RL often degrades reasoning quality like logical consistency, while existing solutions like large-scale PRMs are too expensive.Method: CLARity integrates consistency-aware reward mechanism with 2-stage refine-then-monitor pipeline and dynamic data reformulation strategy to enhance reasoning consistency and exploit limited data effectively.
Result: CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines, with human evaluations confirming improvements in coherence and professionalism.
Conclusion: CLARity offers a generalizable solution enabling smaller models to effectively guide expert models through reasoning consistency, providing cost-effective RL for data-scarce domains.
Abstract: Training expert LLMs in domains with scarce data is difficult, often relying on multiple-choice questions (MCQs). However, standard outcome-based reinforcement learning (RL) on MCQs is risky. While it may improve accuracy, we observe it often degrades reasoning quality such as logical consistency. Existing solutions to supervise reasoning, such as large-scale Process Reward Models (PRMs), are prohibitively expensive. To address this, we propose CLARity, a cost-effective RL framework that enhances reasoning quality using only a small, general-purpose LLM. CLARity integrates a consistency-aware reward mechanism with a 2-stage refine-then-monitor training pipeline to enhance reasoning consistency, and a dynamic data reformulation strategy to better exploit limited data. Experiments demonstrate that CLARity improves response consistency by 16.5% and accuracy by 7.5% over baselines. Human evaluations further confirm holistic improvements in coherence and professionalism. Thus, CLARity offers a generalizable solution that enables smaller models to effectively guide expert models through reasoning consistency. Our code is open-sourced at: https://github.com/Infinite-set/CLARity
[80] One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
Kohei Oda, Po-Min Chuang, Kiyoaki Shirai, Natthawut Kertkeidkachorn
Main category: cs.CL
TL;DR: DualCSE is a sentence embedding method that uses two vectors per sentence to capture both explicit and implicit semantics, improving downstream task performance.
Details
Motivation: Conventional sentence embedding methods struggle with implicit semantics because they assign only a single vector per sentence, limiting their ability to capture nuanced meanings.Method: Proposes DualCSE which assigns two embeddings per sentence: one for explicit semantics and one for implicit semantics, coexisting in shared space for flexible semantic selection.
Result: Experimental results show DualCSE effectively encodes both explicit and implicit meanings and improves downstream task performance.
Conclusion: DualCSE overcomes limitations of single-vector embeddings by using dual embeddings to better capture sentence semantics, benefiting applications like information retrieval and text classification.
Abstract: Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of the downstream task.
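One way to realize two embeddings per sentence is a shared encoder with two projection heads trained contrastively. The sketch below assumes a BERT-style backbone and a standard in-batch InfoNCE loss; the paper's actual architecture and training objective may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadEncoder(nn.Module):
    """One encoder, two projection heads: one vector for explicit
    semantics, one for implicit semantics, living in a shared space."""
    def __init__(self, backbone, hidden=768, dim=256):
        super().__init__()
        self.backbone = backbone                 # e.g. a BERT-style encoder
        self.explicit_head = nn.Linear(hidden, dim)
        self.implicit_head = nn.Linear(hidden, dim)

    def forward(self, **inputs):
        h = self.backbone(**inputs).last_hidden_state[:, 0]  # [CLS] token
        return self.explicit_head(h), self.implicit_head(h)

def info_nce(anchor, positive, temperature=0.05):
    """Standard in-batch contrastive loss (an assumption here): the i-th
    anchor should match the i-th positive among all batch candidates."""
    sims = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), -1)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(sims / temperature, labels)
```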
[81] MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
Main category: cs.CL
TL;DR: MaP framework combines checkpoint merging and Pass@k metric to address instability in LLM evaluation during pre-training, providing smoother performance curves and more reliable model assessment.
Details
Motivation: Current LLM evaluation during pre-training suffers from significant instability that obscures true learning dynamics, making it difficult to reliably assess model progress.Method: Introduces MaP framework with two components: checkpoint merging to smooth parameter space by averaging recent model weights, and Pass@k metric for robust statistical estimation of model capability.
Result: Extensive experiments show MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings compared to standard evaluation methods.
Conclusion: MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research by addressing both parameter and evaluation instability.
Abstract: Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: Parameter Instability from training stochasticity and Evaluation Instability from noisy measurement protocols. To counteract both sources of noise, we introduce MaP, a dual-pronged framework that synergistically integrates checkpoint Merging and the Pass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
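Both components are easy to state precisely: checkpoint merging averages recent weights elementwise, and Pass@k has a standard unbiased estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. A short sketch; the merge window and the value of k are left to the caller:

```python
import math
import torch

def merge_checkpoints(state_dicts):
    """Elementwise average of the last few checkpoints' weights."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(0)
    return merged

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k)/C(n, k), where n samples
    were drawn and c of them were correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```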
[82] ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation
Zhitian Hou, Kun Zeng
Main category: cs.CL
TL;DR: ShiZhi is the first LLM specifically designed for court view generation, achieving 58.5 BLEU-1 on court view generation and 86.1% accuracy on charge prediction using a Chinese dataset of 110K+ cases.
Details
Motivation: Criminal Court View Generation is challenging due to case diversity and complexity, and directly generating from raw facts limits performance.Method: Constructed CCVG dataset with 110K+ Chinese cases containing fact descriptions and court views, then trained a specialized LLM on this domain-specific data.
Result: Achieved 58.5 BLEU-1 on court view generation and 86.1% accuracy with 92.5% macro F1 on charge prediction.
Conclusion: Even small LLMs can generate reasonable and legally coherent court views when trained on high-quality domain-specific data.
Abstract: Criminal Court View Generation (CVG) is a fundamental task in legal artificial intelligence, aiming to automatically generate the “Court View” section of a legal case document. Generating court views is challenging due to the diversity and complexity of case facts, and directly generating from raw facts may limit performance. In this paper, we present ShiZhi, the first large language model (LLM) specifically designed for court view generation. We construct a Chinese Court View Generation dataset, CCVG, of more than 110K cases, each containing fact descriptions paired with corresponding court views. Based on this dataset, ShiZhi achieves 58.5 BLEU-1 on court view generation and 86.1% accuracy with 92.5% macro F1 on charge prediction. Experimental results demonstrate that even a small LLM can generate reasonable and legally coherent court views when trained on high-quality domain-specific data. Our model and dataset are available at https://github.com/ZhitianHou/ShiZhi.
[83] Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference
Jianuo Huang, Yaojie Zhang, Yicun Yang, Benhao Huang, Biqing Qi, Dongrui Liu, Linfeng Zhang
Main category: cs.CL
TL;DR: MaskKV is a training-free cache eviction framework for diffusion large language models that reduces memory costs while maintaining performance through mask-query guided scoring and adaptive cache budgeting.
Details
Motivation: Diffusion LLMs face high memory costs from bidirectional attention cache, limiting long-context handling. Existing ARM cache strategies don't work well for dLLMs due to their unique mask token characteristics.Method: Two key innovations: 1) mask-query guided scoring using attention weights to identify less critical prompt tokens, 2) adaptive cache budgeting that reduces allocation in intermediate layers and focuses on prompt-preferring heads.
Result: On LLaDA with MaskKV, compressing KV cache to only 256 pairs (less than 5% of tokens) retains 94% of full-cache performance on LongBench and achieves up to 31x acceleration at 32k prompt length.
Conclusion: MaskKV effectively addresses dLLM cache memory issues while maintaining performance, enabling efficient long-context processing under resource-limited settings.
Abstract: Diffusion large language models (dLLMs) present a promising alternative to dominant autoregressive models (ARMs) through their ability to decode in parallel, albeit at the expense of substantial computation and memory costs. Specifically, the cache mechanism for bidirectional attention in dLLMs demands a large memory footprint, restricting their ability to handle long contexts under resource-limited settings. Existing cache eviction strategies are designed for ARMs and ignore the unique characteristics of dLLMs, thus leading to unsatisfactory performance. To address these challenges, we introduce MaskKV, a training-free cache eviction framework tailored to dLLMs, focusing on the effect of mask tokens in dLLMs. MaskKV is built on two key innovations: (1) a mask-query guided scoring mechanism that leverages attention weights to identify and evict less critical prompt tokens for each head; (2) an adaptive cache budgeting strategy that improves efficiency by reducing allocation in intermediate layers and concentrating resources on prompt-preferring heads. On LLaDA with MaskKV, compressing the KV cache to only 256 pairs (less than 5% of tokens) retains 94% of the full-cache performance on LongBench and achieves up to 31x acceleration at 32k prompt length. The code is publicly available at: https://github.com/jianuo-huang/MaskKV
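A schematic version of mask-query guided eviction: average each head's attention from mask-token queries to prompt keys, and keep only the top-scoring prompt positions per head. The tensor layout and the middle-layer budget rule below are illustrative assumptions, not MaskKV's exact procedure:

```python
import torch

def mask_query_eviction(attn, budget):
    """attn: [heads, n_mask_queries, n_prompt_tokens] attention weights
    from mask-token queries to prompt keys. Keep, per head, the `budget`
    prompt positions with the highest average mask-query attention."""
    scores = attn.mean(dim=1)                   # [heads, n_prompt]
    keep = scores.topk(budget, dim=-1).indices  # [heads, budget]
    return keep.sort(dim=-1).values             # preserve original order

def layer_budget(layer_idx, n_layers, base=256, mid_scale=0.5):
    """Hypothetical adaptive budgeting: allocate fewer KV pairs to the
    intermediate layers, more elsewhere."""
    is_middle = 0.25 < layer_idx / n_layers < 0.75
    return int(base * (mid_scale if is_middle else 1.0))
```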
[84] Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda
Main category: cs.CL
TL;DR: CRV is a white-box method that verifies reasoning correctness by analyzing structural patterns in attribution graphs of CoT steps, revealing domain-specific computational fingerprints of errors and enabling targeted interventions.
Details
Motivation: Current CoT verification methods (black-box and gray-box) provide limited insight into why computations fail, lacking causal understanding of reasoning errors.Method: Train a classifier on structural features of attribution graphs from correct vs incorrect CoT steps, treating them as execution traces of latent reasoning circuits.
Result: Structural signatures of error are highly predictive, domain-specific, and enable successful correction of faulty reasoning through targeted interventions on transcoder features.
Conclusion: White-box analysis of computational processes enables moving from simple error detection to causal understanding of LLM reasoning, establishing viability of reasoning verification via computational graphs.
Abstract: Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails. We introduce a white-box method: Circuit-based Reasoning Verification (CRV). We hypothesize that attribution graphs of correct CoT steps, viewed as execution traces of the model’s latent reasoning circuits, possess distinct structural fingerprints from those of incorrect steps. By training a classifier on structural features of these graphs, we show that these traces contain a powerful signal of reasoning errors. Our white-box approach yields novel scientific insights unattainable by other methods. (1) We demonstrate that structural signatures of error are highly predictive, establishing the viability of verifying reasoning directly via its computational graph. (2) We find these signatures to be highly domain-specific, revealing that failures in different reasoning tasks manifest as distinct computational patterns. (3) We provide evidence that these signatures are not merely correlational; by using our analysis to guide targeted interventions on individual transcoder features, we successfully correct the model’s faulty reasoning. Our work shows that, by scrutinizing a model’s computational process, we can move from simple error detection to a deeper, causal understanding of LLM reasoning.
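CRV's classifier operates on structural features of attribution graphs. The sketch below uses a handful of generic graph statistics and an off-the-shelf gradient-boosting classifier as stand-ins; the features the paper actually extracts are not reproduced here:

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def graph_features(g: nx.DiGraph) -> np.ndarray:
    """Illustrative structural features of one attribution graph."""
    n = g.number_of_nodes()
    return np.array([
        n,
        g.number_of_edges(),
        nx.density(g),
        np.mean([d for _, d in g.out_degree()]) if n else 0.0,
        nx.number_weakly_connected_components(g),
    ])

def train_step_verifier(graphs, labels):
    """Fit a classifier on graph features: 1 = correct CoT step, 0 = not."""
    X = np.stack([graph_features(g) for g in graphs])
    return GradientBoostingClassifier().fit(X, labels)
```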
[85] FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference
Yu-Chen Lu, Chong-Yan Chen, Chi-Chih Chang, Yu-Fang Hu, Kai-Chiang Wu
Main category: cs.CL
TL;DR: FLRC is a fine-grained low-rank compression method that optimizes rank allocation per layer and uses progressive decoding to maintain LLM performance while reducing computational demands.
Details
Motivation: Large language models have high parameter counts that limit deployment on resource-constrained hardware, and uniform compression across layers causes significant performance degradation with poor decoding quality.Method: Proposes Fine-grained Low-Rank Compressor (FLRC) that determines optimal rank allocation for each layer and incorporates progressive low-rank decoding to maintain text generation quality.
Result: Achieves up to 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework.
Conclusion: FLRC provides an effective solution for compressing LLMs while maintaining performance, enabling better deployment on resource-constrained hardware.
Abstract: Although large language models (LLMs) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.
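The core operation in low-rank compression is replacing a weight matrix with two thin factors via truncated SVD; FLRC's contribution is deciding the rank per layer, which is not reproduced here, so the sketch takes the rank as an input:

```python
import torch

def low_rank_factorize(weight, rank):
    """Replace weight [out, in] with factors A [out, r] and B [r, in]
    via truncated SVD, so weight @ x is approximated by A @ (B @ x)."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    A = U[:, :rank] * S[:rank]  # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

# Per-layer ranks would come from FLRC's allocation procedure; here we
# just demonstrate the factorization on one random matrix.
W = torch.randn(4096, 4096)
A, B = low_rank_factorize(W, rank=512)
relative_error = torch.norm(W - A @ B) / torch.norm(W)
```

Folding the singular values into the left factor keeps A @ B equal to the rank-r truncation of W, so the two thin matrices can directly replace the original linear layer.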
[86] LLP: LLM-based Product Pricing in E-commerce
Hairu Wang, Sheng You, Qiheng Zhang, Xike Xie, Shuguang Han, Yuchen Wu, Fei Huang, Jufeng Chen
Main category: cs.CL
TL;DR: LLP is a novel LLM-based framework for second-hand product pricing that uses retrieval-augmented generation and two-stage optimization to provide accurate price suggestions, significantly outperforming existing methods.
Details
Motivation: Individual sellers on C2C platforms struggle with pricing second-hand products efficiently, and existing static regression models have poor generalization and fail to capture market dynamics.Method: LLP retrieves similar products, uses LLMs to understand pricing information from text, applies SFT and GRPO optimization on a bidirectional reasoning dataset, and employs confidence-based filtering for unreliable suggestions.
Result: LLP substantially surpasses existing methods, generalizes well to unseen categories, and when deployed on Xianyu, raised static adoption rate from 40% to 72% at 30% coverage, maintaining 47% SAR at 90% recall.
Conclusion: The LLM-based generative framework with retrieval and optimization effectively addresses second-hand product pricing challenges and demonstrates strong practical performance in real-world deployment.
Abstract: Unlike Business-to-Consumer e-commerce platforms (e.g., Amazon), inexperienced individual sellers on Consumer-to-Consumer platforms (e.g., eBay) often face significant challenges in setting prices for their second-hand products efficiently. Therefore, numerous studies have been proposed for automating price prediction. However, most of them are based on static regression models, which suffer from poor generalization performance and fail to capture market dynamics (e.g., the price of a used iPhone decreases over time). Inspired by recent breakthroughs in Large Language Models (LLMs), we introduce LLP, the first LLM-based generative framework for second-hand product pricing. LLP first retrieves similar products to better align with the dynamic market change. Afterwards, it leverages the LLMs’ nuanced understanding of key pricing information in free-form text to generate accurate price suggestions. To strengthen the LLMs’ domain reasoning over retrieved products, we apply a two-stage optimization, supervised fine-tuning (SFT) followed by group relative policy optimization (GRPO), on a dataset built via bidirectional reasoning. Moreover, LLP employs a confidence-based filtering mechanism to reject unreliable price suggestions. Extensive experiments demonstrate that LLP substantially surpasses existing methods while generalizing well to unseen categories. We have successfully deployed LLP on Xianyu (China’s largest second-hand e-commerce platform), significantly outperforming the previous pricing method. Under the same 30% product coverage, it raises the static adoption rate (SAR) from 40% to 72%, and maintains a strong SAR of 47% even at 90% recall.
[87] ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
Francesco Maria Molfese, Luca Moroni, Ciro Porcaro, Simone Conia, Roberto Navigli
Main category: cs.CL
TL;DR: ReTraceQA benchmark introduces process-level evaluation for commonsense reasoning, revealing that SLMs often give correct answers despite flawed reasoning, showing current metrics overestimate SLM capabilities.
Details
Motivation: Current evaluation practices for Small Language Models focus only on final answer accuracy, neglecting the validity of reasoning processes, leading to potential overestimation of model capabilities.Method: Created ReTraceQA benchmark with expert-annotated dataset to evaluate reasoning processes, using strong LLMs as automated judges for reasoning-aware evaluation instead of answer-only metrics.
Result: 14-24% of instances show SLMs provide correct answers despite flawed reasoning. When using reasoning-aware evaluation, SLM performance drops significantly by up to 25% across all models and datasets.
Conclusion: Process-level evaluation reveals SLMs’ reasoning capabilities are weaker than suggested by answer-only metrics, highlighting the need for reasoning-aware benchmarks like ReTraceQA.
Abstract: While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we introduce ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.
[88] Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
Main category: cs.CL
TL;DR: ThinkLogit enables large models to perform long chain-of-thought reasoning without additional training by using logit arithmetic and a smaller reasoning model as guider, with ThinkLogit-DPO further boosting performance through preference optimization.
Details
Motivation: Large reasoning models currently require additional training for long chain-of-thought reasoning with strategies like backtracking and self-correction, which is costly and inefficient.Method: ThinkLogit uses logit arithmetic to tune a large non-reasoning model using a smaller reasoning model as guider. ThinkLogit-DPO enhances this by training the guider with preference optimization over correct/incorrect reasoning pairs.
Result: Achieved 24.5% and 29.1% relative improvement in average accuracy over five reasoning benchmarks using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B (21x smaller). Effective across different model families and orthogonal to post-training methods.
Conclusion: ThinkLogit provides a practical path to unlock long reasoning in large-scale models without costly post-training, offering significant performance improvements through decoding-time guidance from smaller models.
Abstract: Large reasoning models exhibit long chain-of-thought reasoning with strategies such as backtracking and self-correction, though recent studies suggest that these abilities typically require additional training. We first investigate whether such behaviors can be elicited without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to tune a target large non-reasoning model for long reasoning using a substantially smaller reasoning model as the guider. We then show that we can further boost its performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in average accuracy by 24.5% and 29.1%, respectively, over five reasoning benchmarks using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Moreover, we find that ThinkLogit remains effective when the guider and target come from different model families. It is also orthogonal to post-training methods for small models, as guiders improved through supervised distillation or reinforcement learning can be directly plugged in to yield stronger large models, offering a practical path to unlock long reasoning in large-scale models without costly post-training.
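The summary does not spell out the combination rule, so the sketch below follows proxy-tuning-style logit arithmetic: shift the target model's next-token logits by the difference between a reasoning-tuned guider and its non-reasoning base counterpart. The base-guider term, the weight alpha, and the shared-vocabulary requirement are assumptions:

```python
import torch

@torch.no_grad()
def guided_next_token(target, guider_expert, guider_base, ids, alpha=1.0):
    """One assumed decoding step of logit arithmetic: steer a large
    non-reasoning target model with a small reasoning guider."""
    z_t = target(ids).logits[:, -1, :]        # large non-reasoning model
    z_e = guider_expert(ids).logits[:, -1, :]  # small reasoning model
    z_b = guider_base(ids).logits[:, -1, :]    # small non-reasoning base
    # Requires all three models to share one tokenizer/vocabulary.
    combined = z_t + alpha * (z_e - z_b)
    return combined.argmax(-1, keepdim=True)
```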
[89] NL2GenSym: Natural Language to Generative Symbolic Rules for SOAR Cognitive Architecture via Large Language Models
Fang Yuan, Junjie Zeng, Yue Hu, Zhengqiu Zhu, Quanjun Yin, Yuxiang Xie
Main category: cs.CL
TL;DR: NL2GenSym is a framework that uses LLMs to automatically generate symbolic rules for SOAR cognitive architecture from natural language, addressing the manual rule coding bottleneck.
Details
Motivation: SOAR's practical adoption is hindered by laborious manual rule coding, while LLMs show potential for efficient rule generation but lack experimental validation.Method: Proposes NL2GenSym with Execution-Grounded Generator-Critic mechanism: LLM-based Generator creates rules from natural language using RAG-accessed knowledge base, rules are executed in SOAR for validation, and LLM-based Critic iteratively refines rules based on execution feedback.
Result: Achieves over 86% success rate in generating rules from natural language, generates novel heuristic rules that reduce average decision cycles to 1.98x optimal solution (1/1000 of baseline), and enables smaller models to outperform larger counterparts.
Conclusion: NL2GenSym effectively bridges the gap between LLMs and SOAR, enabling autonomous generation of generative symbolic rules from natural language with high success rates and performance improvements.
Abstract: SOAR, a classic symbol-based cognitive architecture, has been fostering the development of general, human-like intelligent agents. Nevertheless, its practical adoption is hindered by the laborious manual rule coding. Emerging Large Language Models (LLMs) present immense potential for efficient rule generation. However, there is a critical gap: current research predominantly focuses on conceptual frameworks and lacks robust experimental validation. To bridge this gap, we propose Natural Language to Generative Symbolic Rules (NL2GenSym), a novel framework that integrates LLMs with SOAR to autonomously produce generative symbolic rules from natural language. Specifically, our framework introduces a novel Execution-Grounded Generator-Critic mechanism. The LLM-based Generator, guided by a Retrieval-Augmented Generation-accessed self-evolving domain knowledge base, proposes rules from natural language. Subsequently, these rules are immediately executed within the SOAR environment to rigorously validate their correctness. Based on this execution-grounded feedback, a reflective LLM-based Critic drives the iterative refinement of these rules. Experiments on our specialized Water Jug Problem (WJP) dataset, utilizing both Gemini and Qwen series models, validate the efficacy of our framework. It achieves a success rate over 86% in generating rules from natural language. Crucially, the framework also generates novel heuristic rules, reducing average decision cycles for solving the WJP to 1.98 times the optimal solution and 1/1000 of baseline methods. Additionally, our initial experiments show that NL2GenSym enables smaller-parameter models to achieve better performance than larger counterparts.
[90] Understanding the Effects of Domain Finetuning on LLMs
Eshaan Tanwar, Deepak Nathani, William Yang Wang, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: This paper presents the first systematic study of domain-specific fine-tuning in large medical language models, revealing that fine-tuning modifies only a small subset of the representational subspace while preserving the pre-trained model’s representation.
Details
Motivation: To understand how fine-tuning reshapes the parametric space of domain-specialized LLMs, which are under-explored compared to auto-regressive or general-purpose instruct models.Method: Proposed tuning vectors, a novel framework inspired by task vectors, to explicitly capture directional parameter shifts induced by fine-tuning. Analyzed directional alignment in MLP layers and attention heads.
Result: Fine-tuning modifies only a small subset of representational subspace; tuning vectors enhance instruction-following and generation quality; combining vectors across domains improves generalization; vectors primarily write new directional information into MLP layers while amplifying existing directions in attention heads.
Conclusion: The findings offer new insights into LLM adaptation and provide a general, interpretable framework for analyzing specialization in large language models.
Abstract: Large Language Models (LLMs) fine-tuned for specific domains exhibit strong performance; however, the underlying mechanisms by which this fine-tuning reshapes their parametric space are not well understood. Prior works primarily focus on auto-regressive or general-purpose instruct models, leaving domain-specialised LLMs under-explored. We present the first systematic study of domain-specific fine-tuning in large medical language models. Our analysis reveals that fine-tuning modifies only a small subset of the representational subspace, essentially preserving the pre-trained model’s representation. To interpret these changes in subspaces, we propose tuning vectors, a novel framework inspired by task vectors, which explicitly capture the directional parameter shifts induced by fine-tuning. We demonstrate that these vectors are critical for enhancing both instruction-following and generation quality. Furthermore, combining tuning vectors across different domains yields improved generalisation. Upon closer inspection of directional alignment, we find these vectors primarily write new directional information into the MLP layers of the model, while amplifying existing directions in attention heads. Our findings offer new insights into LLM adaptation and provide a general, interpretable framework for analysing specialisation in large language models.
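Tuning vectors follow the task-vector recipe: subtract base weights from fine-tuned weights, then add (possibly combined) vectors back with a scaling factor. A minimal sketch over state dicts; the uniform scale and plain summation across domains are simplifying assumptions, not the paper's exact procedure:

```python
import torch

def tuning_vector(base_sd, finetuned_sd):
    """Directional parameter shift induced by fine-tuning, per tensor."""
    return {k: finetuned_sd[k] - base_sd[k] for k in base_sd}

def apply_vectors(base_sd, vectors, scale=1.0):
    """Add one or more tuning vectors back onto the base weights;
    combining vectors from different domains is a simple sum here."""
    out = {k: v.clone() for k, v in base_sd.items()}
    for vec in vectors:
        for k in out:
            out[k] += scale * vec[k]
    return out

# Usage: derive a medical tuning vector, then graft it onto the base model.
# med_vec = tuning_vector(base.state_dict(), med_finetuned.state_dict())
# base.load_state_dict(apply_vectors(base.state_dict(), [med_vec]))
```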
[91] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood
Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv
Main category: cs.CL
TL;DR: TEPO is a novel token-level framework that addresses entropy collapse in GRPO methods by using Markov Likelihood to link group-level rewards with tokens, achieving state-of-the-art performance on mathematical reasoning tasks with improved training stability.
Details
Motivation: GRPO and related entropy-regularization methods face challenges from sparse token rewards in chain-of-thought reasoning, leading to entropy collapse or model collapse due to undifferentiated token-level entropy adjustments.
Method: TEPO incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation, providing a more differentiated approach to token-level entropy regularization.
Result: TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy), sets new state-of-the-art on mathematical reasoning tasks, and significantly enhances training stability.
Conclusion: The proposed TEPO framework effectively addresses the limitations of current entropy-regularization methods in GRPO, providing a more stable and performant approach for mathematical reasoning in LLMs.
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT) reasoning. Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
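As a rough illustration of how a group-level reward can reach individual tokens through the sequence (Markov) likelihood, the sketch below length-normalizes the summed token log-probs of each sampled completion and scales them by a GRPO-style group-relative advantage. This is one plausible reading of the abstract, not the paper's exact objective.

```python
# Hedged sketch of a TEPO-style objective; the actual aggregation in the
# paper may differ from this GRPO-like formulation.
import torch

def tepo_style_loss(token_logps, mask, rewards):
    """token_logps: [G, T] per-token log-probs of G sampled completions;
    mask: [G, T] with 1 for real tokens; rewards: [G] group-level rewards."""
    # Group-relative advantage, as in GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Markov (sequence) likelihood: summed token log-probs per sample.
    seq_logp = (token_logps * mask).sum(dim=1)
    # Every token of a sequence shares the group-level signal via seq_logp,
    # rather than receiving an undifferentiated per-token entropy adjustment.
    return -(adv * seq_logp / mask.sum(dim=1)).mean()
```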
[92] Beyond Single-Granularity Prompts: A Multi-Scale Chain-of-Thought Prompt Learning for Graph
Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Xinyan Huang, Weigang Lu
Main category: cs.CL
TL;DR: MSGCOT is a multi-scale graph prompt-tuning framework that captures hierarchical structural features and uses progressive coarse-to-fine reasoning, outperforming single-granularity methods especially in few-shot scenarios.
Details
Motivation: Existing graph prompt-tuning methods use single-granularity prompts, ignoring the multi-scale structural information in graphs, which limits prompt semantic diversity.
Method: Uses a lightweight low-rank coarsening network to capture multi-scale structural features as hierarchical basis vectors, then dynamically integrates multi-scale information in a progressive coarse-to-fine prompt chain mimicking human cognition.
Result: Outperforms state-of-the-art single-granularity graph prompt-tuning methods on eight benchmark datasets, with particularly superior performance in few-shot scenarios.
Conclusion: Integrating multi-scale information into graph prompts through hierarchical feature capture and progressive reasoning significantly improves graph prompt-tuning performance.
Abstract: The “pre-train, prompt” paradigm, designed to bridge the gap between pre-training tasks and downstream objectives, has been extended from the NLP domain to the graph domain and has achieved remarkable progress. Current mainstream graph prompt-tuning methods modify input or output features using learnable prompt vectors. However, existing approaches are confined to a single granularity (e.g., node-level or subgraph-level) during prompt generation, overlooking the inherently multi-scale structural information in graph data, which limits the diversity of prompt semantics. To address this issue, we pioneer the integration of multi-scale information into graph prompts and propose a Multi-Scale Graph Chain-of-Thought (MSGCOT) prompting framework. Specifically, we design a lightweight, low-rank coarsening network to efficiently capture multi-scale structural features as hierarchical basis vectors for prompt generation. Subsequently, mimicking human cognition from coarse to fine granularity, we dynamically integrate multi-scale information at each reasoning step, forming a progressive coarse-to-fine prompt chain. Extensive experiments on eight benchmark datasets demonstrate that MSGCOT outperforms state-of-the-art single-granularity graph prompt-tuning methods, particularly in few-shot scenarios.
[93] Active Model Selection for Large Language Models
Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel
Main category: cs.CL
TL;DR: LLM SELECTOR is the first framework for active model selection of LLMs that reduces annotation costs by up to 59.62% when selecting the best LLM for a task.
Details
Motivation: Prior evaluation approaches require fully annotated datasets, which is expensive and inefficient for model selection.
Method: The framework adaptively selects a small set of informative queries to annotate and uses a judge-based oracle annotation model to further reduce costs.
Result: Experiments on 6 benchmarks with 151 LLMs show significant annotation cost reduction while maintaining selection accuracy.
Conclusion: LLM SELECTOR provides an efficient and cost-effective approach for selecting the best LLM for any given task with limited annotations.
Abstract: We introduce LLM SELECTOR, the first framework for active model selection of Large Language Models (LLMs). Unlike prior evaluation and benchmarking approaches that rely on fully annotated datasets, LLM SELECTOR efficiently identifies the best LLM with limited annotations. In particular, for any given task, LLM SELECTOR adaptively selects a small set of queries to annotate that are most informative about the best model for the task. To further reduce annotation cost, we leverage a judge-based oracle annotation model. Through extensive experiments on 6 benchmarks with 151 LLMs, we show that LLM SELECTOR reduces annotation costs by up to 59.62% when selecting the best and near-best LLM for the task.
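The following is a minimal sketch of an adaptive annotation loop in the spirit described above, assuming a simple disagreement-based acquisition rule; the actual informativeness criterion and judge-based oracle in LLM SELECTOR are more elaborate.

```python
# Toy active model selection: label the queries where candidate models
# disagree most, then pick the model with the best empirical accuracy.
import numpy as np

def select_best_model(preds, annotate, budget):
    """preds: [n_models, n_queries] array of cached answers (strings);
    annotate(q) -> gold answer (e.g., from a judge-based oracle)."""
    n_models, n_queries = preds.shape
    correct = np.zeros(n_models)
    labeled = set()
    for _ in range(budget):
        # Acquisition: the unlabeled query with the most distinct answers.
        disagreement = [len(set(preds[:, q])) if q not in labeled else -1
                        for q in range(n_queries)]
        q = int(np.argmax(disagreement))
        labeled.add(q)
        gold = annotate(q)
        correct += (preds[:, q] == gold)
    return int(np.argmax(correct))  # index of the estimated best model
```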
[94] On the Representations of Entities in Auto-regressive Large Language Models
Victor Morand, Josiane Mothe, Benjamin Piwowarski
Main category: cs.CL
TL;DR: The paper introduces entity mention reconstruction as a framework to study how LLMs internally represent entities, proposing the Entity Lens method that can generate multi-token entity mentions from hidden states.
Details
Motivation: Named entities are fundamental knowledge building blocks in text, but it's unclear how LLMs internally represent them. Prior research focused on explicit relationships rather than entity representations themselves.
Method: Proposed entity mention reconstruction framework using task vectors to generate multi-token mentions from LLM hidden states. Introduced Entity Lens, extending logit-lens to predict multi-token entity mentions.
Result: The method consistently generates multi-token entity mentions from various entity representations. Results show LLMs develop entity-specific mechanisms to represent and manipulate multi-token entities, including unseen ones.
Conclusion: LLMs develop specialized mechanisms for entity representation that can handle multi-token entities, even those not seen during training, providing new insights into how models encode factual knowledge.
Abstract: Named entities are fundamental building blocks of knowledge in text, grounding factual information and structuring relationships within language. Despite their importance, it remains unclear how Large Language Models (LLMs) internally represent entities. Prior research has primarily examined explicit relationships, but little is known about entity representations themselves. We introduce entity mention reconstruction as a novel framework for studying how LLMs encode and manipulate entities. We investigate whether entity mentions can be generated from internal representations, how multi-token entities are encoded beyond last-token embeddings, and whether these representations capture relational knowledge. Our proposed method, leveraging task vectors, allows us to consistently generate multi-token mentions from various entity representations derived from the LLMs’ hidden states. We thus introduce the Entity Lens, extending the logit-lens to predict multi-token mentions. Our results provide new evidence that LLMs develop entity-specific mechanisms to represent and manipulate multi-token entities, including those unseen during training. Our code is available at https://github.com/VictorMorand/EntityRepresentations .
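The Entity Lens generalizes the logit lens to multi-token outputs; the sketch below shows only the basic logit-lens readout on GPT-2 plus a naive greedy loop to emit several tokens from one hidden state, without the task-vector conditioning the paper uses.

```python
# Logit-lens-style readout extended naively to multiple tokens; a simplified
# stand-in for the Entity Lens, not the authors' method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def lens_decode(hidden, steps=4):
    """Greedily read tokens out of a single hidden-state vector."""
    out, h = [], hidden
    for _ in range(steps):
        logits = model.lm_head(model.transformer.ln_f(h))  # project to vocab
        tok_id = int(logits.argmax(-1))
        out.append(tok_id)
        # Naively continue from the predicted token's embedding.
        h = model.transformer.wte(torch.tensor([tok_id])).squeeze(0)
    return tok.decode(out)

# e.g., read out of the residual stream at a mention's last token:
# hs = model(**tok("Paris is", return_tensors="pt"),
#            output_hidden_states=True).hidden_states
# print(lens_decode(hs[6][0, -1]))
```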
[95] KORMo: Korean Open Reasoning Model for Everyone
Minjun Kim, Hyeonseok Lim, Hangyeol Yoo, Inho Won, Seungwoo Song, Minkyung Cho, Junhun Yuk, Changsu Choi, Dongjae Shin, Huige Lee, Hoyun Song, Alice Oh, Kyungtae Lim
Main category: cs.CL
TL;DR: This paper presents KORMo-10B, the first large-scale open bilingual LLM for Korean, trained primarily on synthetic data (68.74% of Korean portion), achieving performance comparable to multilingual baselines without model collapse.
Details
Motivation: To address the lack of large-scale open bilingual LLMs for non-English languages like Korean, and to investigate whether synthetic data can effectively support large-scale pretraining without degradation.
Method: Trained a 10.8B-parameter model from scratch on Korean-English corpus with carefully curated synthetic data featuring balanced linguistic coverage and diverse instruction styles, using bilingual instruction tuning.
Result: The model achieved comparable performance to contemporary open-weight multilingual baselines across reasoning, knowledge, and instruction-following benchmarks, demonstrating that synthetic data can sustain long-horizon pretraining without model collapse.
Conclusion: Synthetic data-driven fully open models are viable for low-resource settings, with bilingual instruction tuning enabling near-native Korean reasoning and discourse coherence, establishing a reproducible framework for future multilingual LLM research.
Abstract: This work presents the first large-scale investigation into constructing a fully open bilingual large language model (LLM) for a non-English language, specifically Korean, trained predominantly on synthetic data. We introduce KORMo-10B, a 10.8B-parameter model trained from scratch on a Korean-English corpus in which 68.74% of the Korean portion is synthetic. Through systematic experimentation, we demonstrate that synthetic data, when carefully curated with balanced linguistic coverage and diverse instruction styles, does not cause instability or degradation during large-scale pretraining. Furthermore, the model achieves performance comparable to that of contemporary open-weight multilingual baselines across a wide range of reasoning, knowledge, and instruction-following benchmarks. Our experiments reveal two key findings: (1) synthetic data can reliably sustain long-horizon pretraining without model collapse, and (2) bilingual instruction tuning enables near-native reasoning and discourse coherence in Korean. By fully releasing all components including data, code, training recipes, and logs, this work establishes a transparent framework for developing synthetic data-driven fully open models (FOMs) in low-resource settings and sets a reproducible precedent for future multilingual LLM research.
[96] Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives
Xixi Wang, Jordanka Kovaceva, Miguel Costa, Shuai Wang, Francisco Camara Pereira, Robert Thomson
Main category: cs.CL
TL;DR: Fine-tuned compact open-source language models outperform large closed LLMs like GPT-4o in extracting complex crash information from unstructured narratives, while addressing privacy concerns and domain knowledge gaps.
Details
Motivation: Large-scale analysis of free-text crash narratives is challenging due to lack of standardized tools, privacy concerns with closed LLMs, and poor performance on inference-heavy tasks like crash type identification with nearly 100 categories.
Method: Applied fine-tuning techniques with Low-Rank Adaptation (LoRA) to compact open-source pre-trained language models (BERT) to inject domain-specific knowledge for crash narrative analysis.
Result: Fine-tuned compact models outperformed strong closed LLMs like GPT-4o on the Crash Investigation Sampling System dataset, captured richer narrative details, and even corrected mislabeled annotations, while requiring minimal training resources.
Conclusion: Compact open-source PLMs can effectively support reasoning-intensive extraction from crash narratives, offering privacy-preserving, domain-aware alternatives to large closed LLMs for traffic safety analysis.
Abstract: Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement, as there are no documented tools that can batch-process the unstructured, non-standardized text content written by various authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across various natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks such as Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data, and these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) identifying the Crash Type for each vehicle involved in the crash event, both from real-world crash narratives. To bridge domain gaps, we apply fine-tuning with Low-Rank Adaptation (LoRA) to inject task-specific knowledge into BERT and other compact PLMs. Experiments on the authoritative real-world Crash Investigation Sampling System (CISS) dataset demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs can capture richer narrative details and even correct some mislabeled annotations in the dataset.
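For concreteness, a minimal LoRA setup of the kind the paper describes might look as follows with Hugging Face peft; the hyperparameters and label count are illustrative, not the paper's.

```python
# Sketch: LoRA fine-tuning of a compact BERT classifier for crash narratives.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=98)  # e.g., ~100 Crash Type categories

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```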
[97] Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Ines Altemir Marinas, Anastasiia Kucherenko, Alexander Sternfeld, Andrei Kucharavy
Main category: cs.CL
TL;DR: This paper presents a full-text indexing pipeline for the Apertus LLM training data using Elasticsearch on arm64 infrastructure, enabling large-scale data indexing and LLM safety applications.
Details
Motivation: Despite the proliferation of open-weight LLMs, access to LLM training data remains limited and inscrutable due to its massive scale, potentially containing critical data scraped from the internet that needs better accessibility and safety oversight.
Method: Developed a full-text indexing pipeline using Elasticsearch parallel indices deployed on the Alps infrastructure, a highly energy-efficient arm64 supercluster, to index 8.6T tokens out of 15.2T used for Apertus LLM training.
Result: Successfully indexed 8.6T tokens, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Demonstrated Elasticsearch can be ported to arm64 infrastructure and full-text indexing at modern LLM scale is feasible.
Conclusion: The approach enables previously inaccessible jailbreak-agnostic LLM safety and provides a scalable solution for large-scale data indexing, facilitating the transition towards greener computation in the AI research community.
Abstract: The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
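A bare-bones version of the indexing step with the Elasticsearch Python client is sketched below; the index name, mapping, and document loader are illustrative placeholders, and the paper's parallel-index deployment on Alps is not reproduced here.

```python
# Minimal full-text indexing sketch with the Elasticsearch Python client.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="pretrain-shard-00", mappings={
    "properties": {"text": {"type": "text"}, "source": {"type": "keyword"}}
})

def actions(docs):
    for i, doc in enumerate(docs):
        yield {"_index": "pretrain-shard-00", "_id": i,
               "text": doc["text"], "source": doc["source"]}

# docs = load_pretraining_shard(...)  # hypothetical loader
# helpers.bulk(es, actions(docs))
# hits = es.search(index="pretrain-shard-00",
#                  query={"match": {"text": "some training passage"}})
```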
[98] Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic
Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki
Main category: cs.CL
TL;DR: LLMs show good recursive reasoning but struggle with compositional generalization in logical inference. A hybrid symbolic-neural architecture is proposed to overcome these limitations.
Details
Motivation: To investigate and improve the generalization capabilities of neural models in logical reasoning, particularly distinguishing between compositionality and recursiveness.
Method: Used syllogistic reasoning as benchmark, analyzed LLM performance, and proposed a hybrid architecture combining symbolic reasoning with neural computation.
Result: LLMs demonstrate reasonable proficiency in recursiveness but struggle significantly with compositionality. The hybrid model achieves robust inference with high efficiency even with small neural components.
Conclusion: Hybrid models combining symbolic and neural approaches effectively address generalization barriers in neural reasoning systems, offering a promising direction for reliable logical provers.
Abstract: Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications like logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through iterative application of inference rules. In the literature, these two aspects are often conflated under the umbrella term of generalization. To sharpen this distinction, we investigated the logical generalization capabilities of pre-trained large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. Though simple, this fragment provides a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings reveal a significant disparity: while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture integrating symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference: neural components accelerate processing, while symbolic reasoning ensures completeness. Our experiments show that high efficiency is preserved even with relatively small neural components. This analysis provides a rationale for hybrid models and highlights their potential to effectively address key generalization barriers in neural reasoning systems.
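To illustrate the symbolic half of such a hybrid, here is a toy forward-chaining closure over two classic syllogistic rules; in the proposed architecture a neural component would propose and accelerate inferences, while a complete symbolic procedure of this kind guarantees no valid conclusion is missed. The encoding is ours, not the authors'.

```python
# Toy symbolic prover for a fragment of syllogistic logic (Barbara, Darii).
def syllogistic_closure(facts):
    """facts: set of ('all'|'some', X, Y) statements; returns all derivable facts."""
    derived, changed = set(facts), True
    while changed:
        changed = False
        for (q1, a, b) in list(derived):
            for (q2, c, d) in list(derived):
                new = None
                if q1 == "all" and q2 == "all" and b == c:
                    new = ("all", a, d)    # Barbara: All A are B, All B are D => All A are D
                elif q1 == "some" and q2 == "all" and b == c:
                    new = ("some", a, d)   # Darii: Some A are B, All B are D => Some A are D
                if new and new not in derived:
                    derived.add(new)
                    changed = True
    return derived

# syllogistic_closure({("all", "greeks", "men"), ("all", "men", "mortal")})
# -> contains ("all", "greeks", "mortal")
```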
[99] Multimodal Policy Internalization for Conversational Agents
Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
Main category: cs.CL
TL;DR: Multimodal Policy Internalization (MPI) is a new task that internalizes complex multimodal policies into model parameters, eliminating the need for lengthy policy prompts during inference. The paper introduces TriMPI, a three-stage training framework that achieves improved policy adherence, generalization, and robustness.
Details
Motivation: Current LLM-based conversational agents use complex, lengthy policy prompts that are difficult to faithfully follow and impose high computational costs. With the rise of multimodal agents, there's a need for effective policy internalization methods that can handle multimodal behaviors, which existing approaches don't address.
Method: Proposes TriMPI, a three-stage training framework: 1) Policy knowledge injection via continual pretraining, 2) Supervised finetuning, and 3) PolicyRollout - a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. Built two datasets for synthetic and real-world decision-making tasks.
Result: TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. The framework successfully internalizes multimodal policies without requiring policy prompts during inference.
Conclusion: This work introduces the first comprehensive approach to multimodal policy internalization, providing datasets, training recipes, and evaluations to advance research in this emerging area. The method enables stronger policy-following capabilities for multimodal agents while reducing computational overhead.
Abstract: Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
[100] StatEval: A Comprehensive Benchmark for Large Language Models in Statistics
Yuchen Lu, Run Yang, Yichen Zhang, Shuguang Yu, Runpeng Dai, Ziwei Wang, Jiayi Xiang, Wenxin E, Siran Gao, Xinyao Ruan, Yirui Huang, Chenjing Xi, Haibo Hu, Yueming Fu, Qinglan Yu, Xiaobing Wei, Jiani Gu, Rui Sun, Jiaxuan Jia, Fan Zhou
Main category: cs.CL
TL;DR: StatEval is the first comprehensive benchmark for evaluating LLMs on statistical reasoning, covering foundational problems and research-level proofs, revealing current models’ limitations in statistical tasks.
Details
Motivation: Statistics as a distinct discipline remains underexplored in LLM benchmarking despite advances in mathematical and logical reasoning, creating a gap in evaluating statistical intelligence.
Method: Created StatEval with 13,817 foundational problems and 2,374 research-level proof tasks using a scalable multi-agent pipeline with human-in-the-loop validation for automated problem extraction and quality control.
Result: Closed-source models like GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly worse, highlighting unique challenges in statistical reasoning.
Conclusion: StatEval serves as a rigorous benchmark to advance statistical intelligence in LLMs, revealing current limitations and providing tools for future development.
Abstract: Large language models (LLMs) have demonstrated remarkable advances in mathematical and logical reasoning, yet statistics, as a distinct and integrative discipline, remains underexplored in benchmarking efforts. To address this gap, we introduce \textbf{StatEval}, the first comprehensive benchmark dedicated to statistics, spanning both breadth and depth across difficulty levels. StatEval consists of 13,817 foundational problems covering undergraduate and graduate curricula, together with 2,374 research-level proof tasks extracted from leading journals. To construct the benchmark, we design a scalable multi-agent pipeline with human-in-the-loop validation that automates large-scale problem extraction, rewriting, and quality control while ensuring academic rigor. We further propose a robust evaluation framework tailored to both computational and proof-based tasks, enabling fine-grained assessment of reasoning ability. Experimental results reveal that closed-source models such as GPT5-mini achieve below 57% on research-level problems, with open-source models performing significantly lower. These findings highlight the unique challenges of statistical reasoning and the limitations of current LLMs. We expect StatEval to serve as a rigorous benchmark for advancing statistical intelligence in large language models. All data and code are available on our web platform: https://stateval.github.io/.
[101] Can We Reliably Rank Model Performance across Domains without Labeled Data?
Veronica Rammouz, Aaron Gonzalez, Carlos Cruzportillo, Adrian Tan, Nicole Beebe, Anthony Rios
Main category: cs.CL
TL;DR: This paper analyzes when unlabeled performance estimation methods produce reliable rankings across domains, finding that LLM-based error predictors outperform drift-based and zero-shot baselines, with reliability depending on performance differences and alignment with true failure patterns.
Details
Motivation: Understanding when NLP model performance estimates without labels produce reliable rankings across domains is important for model generalization assessment, as prior work lacks clarity on when these estimates are trustworthy.
Method: Two-step evaluation with four base classifiers and LLMs as error predictors, tested on GeoOLID and Amazon Reviews datasets across 15 domains, comparing LLM-based error predictors against drift-based and zero-shot baselines.
Result: LLM-based error predictors produce stronger and more consistent rank correlations with true accuracy than baselines. Ranking reliability increases with larger performance differences across domains and when error predictions align with base model’s true failure patterns.
Conclusion: The study clarifies when performance estimation methods can be trusted and provides guidance for cross-domain model evaluation, showing LLM-based approaches are most reliable under specific conditions.
Abstract: Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model’s predictions align with the base model’s true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
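The reliability question above reduces to a rank-correlation check: do label-free estimates order domains the same way true accuracy does? A minimal version, with made-up numbers purely for illustration:

```python
# Spearman rank correlation between estimated and true per-domain accuracy.
from scipy.stats import spearmanr

true_acc = [0.81, 0.74, 0.69, 0.88, 0.77]   # per-domain accuracy (labeled)
estimated = [0.78, 0.70, 0.72, 0.90, 0.75]  # label-free estimates, e.g. from
                                            # an LLM error predictor
rho, pval = spearmanr(true_acc, estimated)
print(f"rank correlation with true accuracy: {rho:.2f} (p={pval:.3f})")
```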
[102] Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
Xiao Yu, Baolin Peng, Michel Galley, Hao Cheng, Qianhui Wu, Janardhan Kulkarni, Suman Nath, Zhou Yu, Jianfeng Gao
Main category: cs.CL
TL;DR: Dyna-Mind is a two-stage training framework that teaches AI agents to perform mental simulation (“vicarious trial and error”) before acting, improving their performance in long-horizon interactive tasks like web navigation and computer use.
Details
Motivation: Current reasoning models excel at math and coding but perform poorly in long-horizon interactive tasks, lacking the human cognitive ability to mentally simulate alternative futures before acting.
Method: Two-stage framework: 1) ReSim trains agents to generate structured reasoning traces from expanded search trees using real environment interactions; 2) Dyna-GRPO uses online reinforcement learning with outcome rewards and intermediate states as feedback.
Result: Experiments on Sokoban, ALFWorld, and AndroidWorld show ReSim effectively infuses simulation ability into agents, and Dyna-GRPO learns better policies for long-horizon planning tasks.
Conclusion: Mental simulation is crucial for enabling AI agents to reason, plan, and act effectively in complex interactive environments, bridging the gap between their mathematical/coding abilities and real-world task performance.
Abstract: Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone use. Inspired by literature on human cognition, we argue that current AI agents need “vicarious trial and error” - the capacity to mentally simulate alternative futures before acting - in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent’s reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent’s simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act effectively in ever more challenging environments.
[103] Mitigating Overthinking through Reasoning Shaping
Feifan Song, Shaohang Wei, Bofei Gao, Yejie Wang, Wen Luo, Wei Li, Linli Yao, Weimin Xiong, Liang Chen, Tianyu Liu, Houfeng Wang
Main category: cs.CL
TL;DR: GRSP is a step-level penalization method that reduces overthinking in large reasoning models by using segment-level supervision instead of token-level, achieving better efficiency-accuracy balance.
Details
Motivation: Existing token-level penalization methods in RLVR reduce computational cost but harm model performance due to oversimplified supervision, creating a need for better granularity in supervision.
Method: Proposed Group Relative Segment Penalization (GRSP) - a step-level method that uses length-aware weighting across reasoning segment clusters based on correlation between segments, token consumption, and performance.
Result: GRSP achieves superior token efficiency without heavily compromising accuracy, especially on harder problems. It also stabilizes RL training and scales effectively across model sizes.
Conclusion: Granularity of supervision is crucial for balancing efficiency and accuracy in reasoning models, and step-level segment penalization (GRSP) provides an effective solution to reduce overthinking while maintaining performance.
Abstract: Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often exhibit overthinking: excessive, meandering reasoning that inflates computational cost. Prior penalization designs in RLVR manage to reduce token consumption but often harm model performance, a failure that arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method to regularize reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with especially clear advantages on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
[104] Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
Yihong Liu, Raoyuan Zhao, Lena Altinger, Hinrich Schütze, Michael A. Hedderich
Main category: cs.CL
TL;DR: MulTypo introduces a multilingual typo generation algorithm to evaluate LLM robustness across languages, finding that typos consistently degrade performance especially in generative and reasoning tasks, with language-dependent robustness patterns.
Details
Motivation: Most LLM benchmarks assume clean input, leaving robustness to real-world typos across languages underexplored, despite LLMs being deployed in multilingual applications with user inputs containing natural errors.
Method: Developed MulTypo algorithm that simulates human-like errors using language-specific keyboard layouts and typing behavior, then evaluated 18 open-source LLMs across three model families and five downstream tasks including language inference, multi-choice QA, mathematical reasoning, and machine translation.
Result: Typos consistently degrade LLM performance, particularly in generative tasks and reasoning tasks, while natural language inference is more robust. Instruction tuning improves clean-input performance but increases brittleness under noise. High-resource languages are more robust than low-resource ones, and translation from English is more robust than into English.
Conclusion: The findings highlight the need for noise-aware training and multilingual robustness evaluation in LLM development, as current models show significant vulnerability to typographical errors across different languages and task types.
Abstract: Large language models (LLMs) are increasingly deployed in multilingual, real-world applications where user inputs naturally introduce typographical errors (typos). Yet most benchmarks assume clean input, leaving the robustness of LLMs to typos across languages largely underexplored. To address this gap, we introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior. We evaluate 18 open-source LLMs across three model families and five downstream tasks spanning language inference, multi-choice question answering, mathematical reasoning, and machine translation. Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning, while the natural language inference task is comparatively more robust. Instruction tuning improves clean-input performance but may increase brittleness under noise. We also observe language-dependent robustness: high-resource languages are generally more robust than low-resource ones, and translation from English is more robust than translation into English. Our findings underscore the need for noise-aware training and multilingual robustness evaluation. We make our code and data publicly available.
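A stripped-down typo simulator in the spirit of MulTypo is shown below; the real algorithm models language-specific layouts and typing behavior, whereas this sketch hard-codes a few QWERTY neighborhoods and three error types.

```python
# Keyboard-adjacency typo injection (illustrative QWERTY subset only).
import random

QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "t": "rfgy", "s": "awedxz",
}

def add_typos(text, rate=0.05, seed=0):
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.lower() in QWERTY_NEIGHBORS and rng.random() < rate:
            kind = rng.choice(["substitute", "delete", "duplicate"])
            if kind == "substitute":
                out.append(rng.choice(QWERTY_NEIGHBORS[ch.lower()]))
            elif kind == "duplicate":
                out.append(ch + ch)
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)

print(add_typos("the model translates this sentence", rate=0.3))
```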
[105] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chengyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
Main category: cs.CL
TL;DR: SPG is a new policy gradient method for diffusion LLMs that uses both upper and lower bounds of log-likelihood to reduce bias, outperforming existing RL methods on multiple reasoning tasks.
Details
Motivation: Aligning diffusion LLMs with human preferences via RL is challenging because their intractable log-likelihood prevents direct application of standard policy gradient methods, and existing surrogate methods like ELBO introduce significant bias.
Method: Proposed Sandwiched Policy Gradient (SPG) that leverages both upper and lower bounds of the true log-likelihood, providing better approximation than one-sided methods like ELBO.
Result: SPG significantly outperforms baselines, improving accuracy by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku over state-of-the-art RL methods for dLLMs.
Conclusion: SPG effectively addresses the policy gradient bias problem in diffusion LLMs through its dual-bound approach, demonstrating substantial performance gains across multiple reasoning benchmarks.
Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
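Schematically, the sandwiching idea can be written as interpolating a lower-bound and an upper-bound estimate of the intractable log-likelihood inside a policy-gradient surrogate. The bound estimators below are placeholders; how SPG actually constructs and combines its bounds is specified in the paper, not here.

```python
# Schematic sandwiched surrogate; bound estimators are assumed inputs.
import torch

def spg_surrogate(lower_bound_logp, upper_bound_logp, advantages, w=0.5):
    """All arguments are [G] tensors over a group of sampled completions."""
    # Mixing the two bounds sandwiches the true log-likelihood, reducing the
    # one-sided bias of ELBO-only policy gradients.
    logp_hat = w * lower_bound_logp + (1.0 - w) * upper_bound_logp
    return -(advantages.detach() * logp_hat).mean()
```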
[106] Beyond Surface Reasoning: Unveiling the True Long Chain-of-Thought Capacity of Diffusion Large Language Models
Qiguang Chen, Hanjing Li, Libo Qin, Dengyun Peng, Jinhao Liu, Jiangyi Wang, Chengyue Wu, Xie Chen, Yantao Du, Wanxiang Che
Main category: cs.CL
TL;DR: DLLMs face Parallel-Sequential Contradiction (PSC) where parallel decoding conflicts with sequential reasoning needs, causing inefficiencies and limitations in complex reasoning tasks.
Details
Motivation: To understand why Diffusion LLMs struggle with complex reasoning despite their parallel decoding advantages, and identify the fundamental conflict between parallel processing and sequential reasoning requirements.
Method: Behavioral analysis of DLLMs in reasoning tasks, introduction of three scaling dimensions (parallel, diffusion, sequential), and empirical evaluation of scaling effects.
Result: DLLMs show genuine parallelism only for simple outputs, revert to autoregressive behavior in complex tasks, and PSC limits reasoning depth and self-reflection. Parallel scaling helps but diffusion/sequential scaling are constrained by PSC.
Conclusion: PSC is a fundamental limitation in DLLMs that requires specific mitigations like parallel-oriented prompting, diffusion early stopping, and parallel scaling to improve effectiveness and efficiency.
Abstract: Recently, Diffusion Large Language Models (DLLMs) have offered high throughput and effective sequential reasoning, making them a competitive alternative to autoregressive LLMs (ALLMs). However, parallel decoding, which enables simultaneous token updates, conflicts with the causal order often required for rigorous reasoning. We first identify this conflict as the core Parallel-Sequential Contradiction (PSC). Behavioral analyses in both simple and complex reasoning tasks show that DLLMs exhibit genuine parallelism only for directly decidable outputs. As task difficulty increases, they revert to autoregressive-like behavior, a limitation exacerbated by autoregressive prompting, which nearly doubles the number of decoding steps with remasking without improving quality. Moreover, PSC restricts DLLMs’ self-reflection, reasoning depth, and exploratory breadth. To further characterize PSC, we introduce three scaling dimensions for DLLMs: parallel, diffusion, and sequential. Empirically, while parallel scaling yields consistent improvements, diffusion and sequential scaling are constrained by PSC. Based on these findings, we propose several practical mitigations, parallel-oriented prompting, diffusion early stopping, and parallel scaling, to reduce PSC-induced ineffectiveness and inefficiencies.
[107] Hierarchical Indexing with Knowledge Enrichment for Multilingual Video Corpus Retrieval
Yu Wang, Tianhao Tan, Yifei Wang
Main category: cs.CL
TL;DR: A multi-stage framework for multilingual medical video retrieval that uses hierarchical chunking, knowledge graph enrichment, and coarse-to-fine tree search with LLM re-ranking to efficiently answer complex multi-hop questions across languages.
Details
Motivation: Existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching, making multilingual medical video retrieval challenging.
Method: Video subtitles are divided into semantically coherent chunks, enriched with KG facts, and organized into a hierarchical tree. A multilingual encoder generates node embeddings, and queries use coarse-to-fine tree search with LLM re-ranking of top chunks.
Result: State-of-the-art performance on the mVCR test set, with ablation studies confirming the complementary contributions of KG enrichment, hierarchical indexing, and LLM re-ranking.
Conclusion: The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections, avoiding exhaustive cross-encoder scoring while preserving chunk-level precision.
Abstract: Retrieving relevant instructional videos from multilingual medical archives is crucial for answering complex, multi-hop questions across language boundaries. However, existing systems either compress hour-long videos into coarse embeddings or incur prohibitive costs for fine-grained matching. We tackle the Multilingual Video Corpus Retrieval (mVCR) task in the NLPCC-2025 M4IVQA challenge with a multi-stage framework that integrates multilingual semantics, domain terminology, and efficient long-form processing. Video subtitles are divided into semantically coherent chunks, enriched with concise knowledge-graph (KG) facts, and organized into a hierarchical tree whose node embeddings are generated by a language-agnostic multilingual encoder. At query time, the same encoder embeds the input question; a coarse-to-fine tree search prunes irrelevant branches, and only the top-ranked chunks are re-scored by a lightweight large language model (LLM). This design avoids exhaustive cross-encoder scoring while preserving chunk-level precision. Experiments on the mVCR test set demonstrate state-of-the-art performance, and ablation studies confirm the complementary contributions of KG enrichment, hierarchical indexing, and targeted LLM re-ranking. The proposed method offers an accurate and scalable solution for multilingual retrieval in specialized medical video collections.
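The coarse-to-fine stage can be pictured as a beam-pruned descent over the chunk hierarchy, as in the sketch below; the node layout and the final LLM re-scoring hook are illustrative.

```python
# Coarse-to-fine search over a hierarchical chunk tree.
import numpy as np

def tree_search(query_vec, node, beam=2, leaves=None):
    """node: {'emb': np.ndarray, 'children': list, 'chunk': str | None}."""
    if leaves is None:
        leaves = []
    if not node["children"]:                  # leaf = subtitle chunk (+ KG facts)
        leaves.append(node)
        return leaves
    ranked = sorted(node["children"],
                    key=lambda c: -float(query_vec @ c["emb"]))
    for child in ranked[:beam]:               # prune low-scoring branches
        tree_search(query_vec, child, beam, leaves)
    return leaves

# top_chunks = tree_search(encode(question), root)[:10]   # hypothetical encoder
# answer = llm_rerank(question, top_chunks)               # lightweight LLM re-scoring
```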
[108] Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation
Sondos Mahmoud Bsharat, Zhiqiang Shen
Main category: cs.CL
TL;DR: P-TTS is a test-time data augmentation method that uses only 90 reasoning examples with systematic prompt variations to enhance LLM reasoning through finetuning, achieving significant accuracy gains on mathematical reasoning benchmarks with minimal data requirements.
Details
Motivation: Current LLM reasoning requires large curated datasets which are laborious and resource-intensive to create. P-TTS aims to enhance reasoning capabilities with minimal annotation overhead.
Method: Leverages 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts, then finetunes Qwen-2.5 models on this data.
Result: P-TTS-7B and 32B models outperform competitive baselines with absolute accuracy gains up to +30.00% on AIME'24 and +26.63% on AIME'25, with comparable or better performance on MATH500 and GPQA-Diamond. Also enhances zero-shot generalization on out-of-domain benchmarks.
Conclusion: Test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, offering a practical, low-cost way to elicit LLM reasoning in resource-constrained domains.
Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through finetuning. Rather than collecting thousands or even millions of examples, P-TTS leverages a small pool of only 90 manually selected reasoning instances and systematically varies exemplar augmentation through principled instruction prompting intensities at test time to synthesize diverse reasoning trajectory contexts. We then finetune Qwen-2.5 models of various sizes on P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME 2024 and 2025, MATH500, and GPQA-Diamond), our P-TTS-7B and 32B models outperform prior competitive baselines like S1 and S1.1 (1K-shot), achieving absolute accuracy gains of +26.66% and +30.00% on AIME'24 (7B), and +13.34% and +6.67% on AIME'25 (7B); P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24, and +26.63% and +3.33% on AIME'25 (vs. S1 and S1.1, respectively), with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS enhances zero-shot generalization accuracy on the out-of-domain reasoning benchmarks Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead, and further unlocking the reasoning potential and capabilities of LLMs. Prompting Test-Time Scaling offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
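The augmentation itself is easy to picture: cross a small seed pool with instruction prompts of increasing intensity, as in the sketch below. The seed example and templates are invented for illustration; the paper's 90 curated instances and exact prompt intensities differ.

```python
# P-TTS-style augmentation: seed pool x instruction-prompting intensities.
SEED_POOL = [
    {"problem": "If 3x + 5 = 20, find x.", "solution": "x = 5"},
]  # stand-in for the ~90 curated instances

INTENSITIES = [
    "Solve the problem.",
    "Solve the problem step by step, stating each rule you use.",
    "Solve the problem with an exhaustive derivation, double-checking every "
    "step and verifying the final answer.",
]

def p_tts_augment(pool, intensities):
    for ex in pool:
        for instr in intensities:
            yield {"prompt": f"{instr}\n\nProblem: {ex['problem']}",
                   "target": ex["solution"]}

train_set = list(p_tts_augment(SEED_POOL, INTENSITIES))  # |pool| x |intensities|
```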
[109] A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages
Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich
Main category: cs.CL
TL;DR: This paper presents the first comprehensive study of multilingual Chain-of-Thought reasoning, evaluating performance, consistency, and faithfulness across languages.
Details
Motivation: While large reasoning models increasingly use step-by-step CoT reasoning, the intermediate thinking traces themselves remain underexplored in multilingual settings, with most research focusing only on final-answer accuracy.
Method: The study evaluates three dimensions: (1) language compliance, accuracy, and consistency when models are instructed to think in target languages; (2) crosslingual consistency by interchanging thinking traces between languages; (3) faithfulness using perturbation-based techniques like truncation and error injection.
Result: The research reveals strong language preferences, divergent performance across languages, substantial variation in thinking trace quality depending on prompt language, and varying degrees of model reliance on traces across different languages.
Conclusion: Multilingual CoT reasoning shows significant language-dependent variations in performance, consistency, and faithfulness, highlighting the need for more comprehensive evaluation beyond just final-answer accuracy in multilingual settings.
Abstract: Large reasoning models (LRMs) increasingly rely on step-by-step Chain-of-Thought (CoT) reasoning to improve task performance, particularly in high-resource languages such as English. While recent work has examined final-answer accuracy in multilingual settings, the thinking traces themselves, i.e., the intermediate steps that lead to the final answer, remain underexplored. In this paper, we present the first comprehensive study of multilingual CoT reasoning, evaluating three key dimensions: performance, consistency, and faithfulness. We begin by measuring language compliance, answer accuracy, and answer consistency when LRMs are explicitly instructed or prompt-hacked to think in a target language, revealing strong language preferences and divergent performance across languages. Next, we assess crosslingual consistency of thinking traces by interchanging them between languages. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language. Finally, we adapt perturbation-based techniques – i.e., truncation and error injection – to probe the faithfulness of thinking traces across languages, showing that models rely on traces to varying degrees. We release our code and data to support future research.
[110] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives
Daniel Brubaker, William Sheffield, Junyi Jessy Li, Kanishka Misra
Main category: cs.CL
TL;DR: This paper introduces WUGNECTIVES, a dataset to test if discourse connectives can help language models learn about novel entities, finding that tuning models for reasoning improves performance but concessive connectives remain challenging.
Details
Motivation: To investigate whether discourse connectives can inform language models about world knowledge, flipping the traditional approach of using world knowledge to predict connectives.
Method: Created WUGNECTIVES dataset with 8,880 stimuli evaluating LMs’ inferences about novel entities in connective-linked contexts, tested 17 different LMs at various scales and training regimens.
Result: Tuning LMs for reasoning behavior yields significant improvements on most connectives, but all models struggle with concessive connectives, showing large performance variation across connective types.
Conclusion: The study enables more nuanced investigations into how language cues are captured by LMs and highlights the functional role of discourse connectives in model learning.
Abstract: World knowledge has been particularly crucial for predicting the discourse connective that marks the discourse relation between two arguments, a task at which language models (LMs) are generally successful. We flip this premise in our work and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. Investigating 17 different LMs across various scales and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was large variation in LMs’ overall performance across connective types, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at https://github.com/sheffwb/wugnectives.
[111] AutoPR: Let’s Automate Your Academic Promotion!
Qiguang Chen, Zheng Yan, Mingda Yang, Libo Qin, Yixin Yuan, Hanjing Li, Jinhao Liu, Yiyan Ji, Dengyun Peng, Jiannan Guan, Mengkang Hu, Yantao Du, Wanxiang Che
Main category: cs.CL
TL;DR: AutoPR automates research paper promotion by transforming papers into engaging public content, with PRAgent framework showing 604% increase in watch time and 438% rise in likes compared to direct LLM approaches.
Details
Motivation: As research volume grows, scholars rely on social platforms for discovery and authors invest significant effort in promotion. AutoPR aims to streamline this process and reduce human effort in scholarly communication.
Method: PRAgent multi-agent framework with three stages: multimodal content extraction, collaborative synthesis for polished outputs, and platform-specific adaptation for optimal reach. Also introduced PRBench benchmark with 512 articles linked to promotional posts.
Result: PRAgent achieved 604% increase in total watch time, 438% rise in likes, and at least 2.9x boost in overall engagement compared to direct LLM pipelines. Platform modeling and targeted promotion were key contributors to performance gains.
Conclusion: AutoPR is established as a tractable, measurable research problem with PRAgent providing a roadmap for scalable automated scholarly communication that significantly improves engagement metrics.
Abstract: As the volume of peer-reviewed research surges, scholars increasingly rely on social platforms for discovery, while authors invest considerable effort in promoting their work to ensure visibility and citations. To streamline this process and reduce the reliance on human effort, we introduce Automatic Promotion (AutoPR), a novel task that transforms research papers into accurate, engaging, and timely public content. To enable rigorous evaluation, we release PRBench, a multimodal benchmark that links 512 peer-reviewed articles to high-quality promotional posts, assessing systems along three axes: Fidelity (accuracy and tone), Engagement (audience targeting and appeal), and Alignment (timing and channel optimization). We also introduce PRAgent, a multi-agent framework that automates AutoPR in three stages: content extraction with multimodal preparation, collaborative synthesis for polished outputs, and platform-specific adaptation to optimize norms, tone, and tagging for maximum reach. When compared to direct LLM pipelines on PRBench, PRAgent demonstrates substantial improvements, including a 604% increase in total watch time, a 438% rise in likes, and at least a 2.9x boost in overall engagement. Ablation studies show that platform modeling and targeted promotion contribute the most to these gains. Our results position AutoPR as a tractable, measurable research problem and provide a roadmap for scalable, impactful automated scholarly communication.
[112] Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
Donghang Wu, Haoyang Zhang, Jun Chen, Xiangyu Zhang, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
Main category: cs.CL
TL;DR: Mind-Paced Speaking (MPS) enables real-time spoken language models to perform Chain-of-Thought reasoning while speaking by using a dual-brain approach that separates reasoning from speech generation, achieving high accuracy with low latency.
Details
Motivation: Real-time Spoken Language Models struggle with Chain-of-Thought reasoning due to high latency from sequential generation, and enabling them to think while speaking like humans is needed for better real-time interaction.
Method: Proposes a brain-inspired dual-brain framework with a ‘Formulation Brain’ for high-level reasoning that paces and guides an ‘Articulation Brain’ for fluent speech generation, eliminating mode-switching.
Result: MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to full CoT pre-computation while drastically reducing latency. Achieves 92.8% accuracy on Spoken-MQA and 82.5 score on URO-Bench.
Conclusion: The work effectively bridges the gap between high-quality reasoning and real-time interaction in spoken language models.
Abstract: Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a “Formulation Brain” for high-level reasoning to pace and guide a separate “Articulation Brain” for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. Our work effectively bridges the gap between high-quality reasoning and real-time interaction.
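A minimal sketch of the think-while-speaking control flow the summary describes, assuming two separately prompted models; the function names and the chunking policy are illustrative, not MPS itself.

```python
# A minimal think-while-speaking control loop. The two step functions stand in
# for the "Formulation Brain" and "Articulation Brain"; plug in real models.

def formulation_step(question: str, thoughts: str) -> str:
    """Advance the hidden chain of thought by one step (plug in a reasoning LLM)."""
    raise NotImplementedError

def articulation_step(question: str, thoughts: str, spoken: str) -> str:
    """Emit the next spoken chunk, paced by the reasoning available so far;
    return "" once the answer is complete (plug in a speech-facing LLM)."""
    raise NotImplementedError

def mind_paced_speaking(question: str, max_steps: int = 16) -> str:
    thoughts, spoken = "", ""
    for _ in range(max_steps):
        thoughts += formulation_step(question, thoughts)       # reasoning runs ahead
        chunk = articulation_step(question, thoughts, spoken)  # speech follows its pace
        if not chunk:
            break
        spoken += chunk
    return spoken
```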
[113] Privacy-Preserving Parameter-Efficient Fine-Tuning for Large Language Model Services
Yansong Li, Zhixing Tan, Paula Branco, Yang Liu
Main category: cs.CL
TL;DR: RAPT is a privacy-preserving PEFT framework that uses local differential privacy to protect user data during LLM customization while maintaining competitive performance.
Details
Motivation: Private data used for LLM customization requires robust privacy protection to ensure data security, maintain user trust, and comply with regulatory standards.
Method: Uses local differential privacy to privatize data locally, combined with a novel privatized token reconstruction task trained jointly with downstream tasks to improve representation learning.
Result: Achieves competitive performance across tasks while providing privacy guarantees against adversaries.
Conclusion: RAPT provides an effective privacy-preserving solution for LLM customization that balances privacy protection with model performance.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) provides a practical way for users to customize Large Language Models (LLMs) with their private data in LLM service scenarios. However, the inherently sensitive nature of private data demands robust privacy preservation measures during the customization of LLM services to ensure data security, maintain user trust, and comply with stringent regulatory standards. Based on PEFT, we propose Privacy-Preserving Parameter-Efficient Fine-Tuning (RAPT), a framework that offers privacy protection for LLM services. RAPT adopts a local privacy approach, enabling users to privatize their data locally using a text-to-text local differential privacy mechanism. Since PEFT performs poorly when directly trained on privatized data, we introduce a novel privatized token reconstruction task that is trained jointly with the downstream task, allowing LLMs to learn better task-dependent representations. Despite the simplicity of our framework, experiments show that RAPT achieves competitive performance across tasks while providing privacy guarantees against adversaries.
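As a concrete illustration of text-to-text local differential privacy, one common mechanism (not necessarily RAPT's exact one) perturbs each token's embedding with noise and snaps it back to the nearest vocabulary token before the text leaves the user's device:

```python
import numpy as np

# A common text-to-text LDP mechanism, shown as a sketch only: RAPT's exact
# mechanism and its privatized-token-reconstruction objective are not
# reproduced here.

def privatize_tokens(token_ids, emb, epsilon=10.0, rng=None):
    """emb: (vocab, dim) embedding matrix; larger epsilon = less noise, more utility."""
    rng = rng or np.random.default_rng()
    out = []
    for t in token_ids:
        noise = rng.laplace(scale=1.0 / epsilon, size=emb.shape[1])
        noisy = emb[t] + noise
        # the nearest neighbor in embedding space becomes the released token
        out.append(int(np.argmin(np.linalg.norm(emb - noisy, axis=1))))
    return out
```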
[114] Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning Code LLMs
Zichao Hu, Junyi Jessy Li, Arjun Guha, Joydeep Biswas
Main category: cs.CL
TL;DR: ROBO-INSTRUCT is a method for generating task-program pairs for robot programming by synthesizing task-specific simulation environments on the fly during program execution and using LLM-aided post-processing to refine instructions.
Details
Motivation: Collecting datasets of task-program pairs specific to each robot is time-consuming and expensive, and existing methods like SELF-INSTRUCT and EVOL-INSTRUCT cannot provide programs that correctly abide by physical-world and robot constraints.
Method: ROBO-INSTRUCT synthesizes task-specific simulation environments during program execution by inferring entity properties and enforcing constraints based on how entities are used in the task program, and integrates LLM-aided post-processing to refine instructions.
Result: Fine-tuned models using ROBO-INSTRUCT outperform all baseline methods and match or surpass the performance of several larger and proprietary models.
Conclusion: ROBO-INSTRUCT effectively addresses the challenge of generating constrained robot programs by dynamically creating simulation environments and refining instructions, enabling better performance than existing methods.
Abstract: Code LLMs have shown promising results with converting tasks in natural language to programs that can be executed by service robots. We are interested in finetuning small, specialized LLMs for this purpose, but collecting datasets of task-program pairs specific to each robot is time-consuming and expensive. While approaches such as SELF-INSTRUCT and EVOL-INSTRUCT are capable of generating novel tasks given a few examples, they are unable to provide the corresponding programs that correctly abide by physical-world and robot constraints using the provided programming interface. Using a simulator is a natural potential solution for checking such constraints, but building simulation environments that can handle arbitrary tasks and their necessary objects and locations is challenging. To address these challenges, we introduce ROBO-INSTRUCT, which synthesizes task-specific simulation environments on the fly during program execution, by opportunistically inferring entity properties and enforcing corresponding constraints based on how the entities are used in the task program. Additionally, ROBO-INSTRUCT integrates an LLM-aided post-processing procedure to refine instructions for better alignment with robot programs. We demonstrate the effectiveness of ROBO-INSTRUCT across multiple LLMs, showing that our fine-tuned models outperform all baseline methods and even match or surpass the performance of several larger and proprietary models.
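The core trick, inferring entity properties from how a program uses them and rejecting contradictory programs, can be sketched in a few lines; the robot API names below are hypothetical:

```python
# Illustrative sketch of on-the-fly consistency checking in the spirit of
# ROBO-INSTRUCT: entity types are inferred lazily from API usage, and a
# contradiction rejects the candidate program. The API names are hypothetical.

class LazySim:
    def __init__(self):
        self.kind = {}  # entity -> inferred type ("location" or "object")

    def _infer(self, entity, kind):
        if self.kind.setdefault(entity, kind) != kind:
            raise ValueError(f"{entity!r} used as both {self.kind[entity]} and {kind}")

    def go_to(self, place):    # using something as a navigation target
        self._infer(place, "location")

    def pick_up(self, thing):  # using something as a manipulable object
        self._infer(thing, "object")

sim = LazySim()
sim.go_to("kitchen")
sim.pick_up("mug")
try:
    sim.pick_up("kitchen")  # "kitchen" was already inferred to be a location
except ValueError as e:
    print("rejected program:", e)
```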
[115] On the Reliability of Large Language Models for Causal Discovery
Tao Feng, Lizhen Qu, Niket Tandon, Zhuang Li, Xiaoxi Kang, Gholamreza Haffari
Main category: cs.CL
TL;DR: LLMs are effective at recognizing frequent causal relations from pre-training data but struggle with generalization to new/rare relations, are sensitive to incorrect causal data, and rely heavily on contextual information.
Details
Motivation: To investigate how LLMs perform in causal discovery tasks, particularly examining the role of memorization, impact of incorrect training data, and contextual influences on their causal reasoning abilities.
Method: Used open-source LLMs (OLMo and BLOOM) with accessible pre-training corpora to examine three research questions: memorization effects, impact of incorrect causal relations, and contextual influences on causal understanding.
Result: LLMs show strong performance for frequently occurring causal relations but limited generalization to new/rare relations. Incorrect causal relations in training data significantly reduce confidence in correct relations. Contextual information critically affects causal relation discernment.
Conclusion: While LLMs demonstrate capability in causal discovery for familiar patterns, their performance is constrained by training data quality, frequency of patterns, and contextual dependencies, highlighting limitations in true causal reasoning generalization.
Abstract: This study investigates the efficacy of Large Language Models (LLMs) in causal discovery. Using newly available open-source LLMs, OLMo and BLOOM, which provide access to their pre-training corpora, we investigate how LLMs address causal discovery through three research questions. We examine: (i) the impact of memorization for accurate causal relation prediction, (ii) the influence of incorrect causal relations in pre-training data, and (iii) the contextual nuances that influence LLMs’ understanding of causal relations. Our findings indicate that while LLMs are effective in recognizing causal relations that occur frequently in pre-training data, their ability to generalize to new or rare causal relations is limited. Moreover, the presence of incorrect causal relations significantly undermines the confidence of LLMs in the corresponding correct causal relations, and contextual information critically affects LLMs’ ability to discern causal connections between random variables.
[116] Augmenting Compliance-Guaranteed Customer Service Chatbots: Context-Aware Knowledge Expansion with Large Language Models
Mengze Hong, Chen Jason Zhang, Di Jiang, Yuanqin He
Main category: cs.CL
TL;DR: The paper introduces Similar Question Generation (SQG) task for LLMs to enhance retrieval-based chatbots by generating semantically similar questions with varied expressions, improving knowledge base coverage while maintaining accuracy and compliance.
Details
Motivation: Retrieval-based chatbots need diverse question variations to handle customer inquiries effectively, but manually creating similar questions is costly. SQG provides a cost-effective way to augment knowledge bases while maintaining semantic meaning and regulatory compliance.
Method: Proposed context-aware approaches for the SQG task, including optimization techniques for in-context prompt construction and optimal subset selection of similar questions under budget constraints.
Result: Achieved 92% user satisfaction rate in deployed chatbot system, representing an 18% improvement over unaugmented baseline. Both quantitative and human evaluations validated method effectiveness.
Conclusion: SQG demonstrates practical benefits for enhancing retrieval-based chatbots, highlighting LLMs’ potential as support tools for non-generative systems to ensure hallucination-free and compliance-guaranteed applications.
Abstract: Retrieval-based chatbots leverage human-verified Q&A knowledge to deliver accurate, verifiable responses, making them ideal for customer-centric applications where compliance with regulatory and operational standards is critical. To effectively handle diverse customer inquiries, augmenting the knowledge base with “similar questions” that retain semantic meaning while incorporating varied expressions is a cost-effective strategy. In this paper, we introduce the Similar Question Generation (SQG) task for LLM training and inference, proposing context-aware approaches to enable comprehensive semantic exploration and enhanced alignment with source question-answer relationships. We formulate optimization techniques for constructing in-context prompts and selecting an optimal subset of similar questions to expand chatbot knowledge under budget constraints. Both quantitative and human evaluations validate the effectiveness of these methods, achieving a 92% user satisfaction rate in a deployed chatbot system, reflecting an 18% improvement over the unaugmented baseline. These findings highlight the practical benefits of SQG and emphasize the potential of LLMs, not as direct chatbot interfaces, but in supporting non-generative systems for hallucination-free, compliance-guaranteed applications.
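A hedged sketch of the budgeted selection step: one standard way to pick a diverse subset of generated similar questions is greedy facility-location selection over their embeddings. The paper's exact objective may differ; this only illustrates the shape of the problem.

```python
import numpy as np

# Greedy facility-location selection: pick `budget` candidates that together
# best cover the candidate pool in embedding space. Rows of `cand_embs` are
# assumed L2-normalized so the dot product is cosine similarity.

def select_similar_questions(cand_embs: np.ndarray, budget: int) -> list[int]:
    sim = cand_embs @ cand_embs.T
    chosen, covered = [], np.zeros(len(cand_embs))
    for _ in range(min(budget, len(cand_embs))):
        # marginal gain of adding each candidate to the current selection
        gains = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        gains[chosen] = -np.inf
        best = int(np.argmax(gains))
        chosen.append(best)
        covered = np.maximum(covered, sim[best])
    return chosen
```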
[117] Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park, Seungwon Shin
Main category: cs.CL
TL;DR: Byte-level BPE tokenization creates incomplete tokens that are vulnerable to exploitation through improbable bigrams, causing significant hallucination in language models.
Details
Motivation: Recent studies show tokenizers can be exploited to elicit unwanted model behaviors, particularly with incomplete tokens from byte-level BPE tokenization that are fragile when paired with unfamiliar tokens.
Method: Introduce improbable bigrams - out-of-distribution combinations of incomplete tokens designed to exploit their dependency on adjacent tokens.
Result: Improbable bigrams cause significantly higher hallucination rates, with 90% reduction in hallucination when alternative tokenization is used in Llama3.1.
Conclusion: Byte-level BPE tokenizers introduce vulnerabilities and blind spots in language models that can be exploited through tokenization artifacts.
Abstract: Tokenization is a crucial step that bridges human-readable text with model-readable discrete tokens. However, recent studies have revealed that tokenizers can be exploited to elicit unwanted model behaviors. In this work, we investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. We hypothesize that such tokens are heavily reliant on their adjacent tokens and are fragile when paired with unfamiliar tokens. To demonstrate this vulnerability, we introduce improbable bigrams: out-of-distribution combinations of incomplete tokens designed to exploit their dependency. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors. Surprisingly, the same phrases have drastically lower rates of hallucination (90% reduction in Llama3.1) when an alternative tokenization is used. We caution against the potential vulnerabilities introduced by byte-level BPE tokenizers, which may introduce blind spots to language models.
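A sketch of how such bigrams could be constructed from a byte-level BPE vocabulary; how you obtain the id-to-bytes mapping depends on the tokenizer library, so `vocab` is left abstract here:

```python
import random

# Find "incomplete" tokens (raw bytes that do not decode as UTF-8 on their
# own), then pair them into improbable bigrams. `vocab` maps token id -> raw
# bytes; building it is tokenizer-library specific.

def decodes(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def incomplete_tokens(vocab: dict[int, bytes]) -> list[int]:
    return [tid for tid, raw in vocab.items() if not decodes(raw)]

def improbable_bigrams(vocab: dict[int, bytes], n: int = 100, seed: int = 0):
    ids = incomplete_tokens(vocab)
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        a, b = rng.choice(ids), rng.choice(ids)
        # keep pairs whose concatenation *is* a legal string: the text itself is
        # valid, but the model has likely never seen it split across these tokens
        if a != b and decodes(vocab[a] + vocab[b]):
            out.append((a, b))
    return out
```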
[118] Medchain: Bridging the Gap Between LLM Agents and Clinical Practice with Interactive Sequence
Jie Liu, Wenxuan Wang, Zizhan Ma, Guolin Huang, Yihang SU, Kao-Jung Chang, Wenting Chen, Haoliang Li, Linlin Shen, Michael Lyu
Main category: cs.CL
TL;DR: MedChain is a comprehensive clinical dataset and AI system designed to address the limitations of current LLM-based agents in clinical decision making by incorporating real-world features like personalization, interactivity, and sequentiality.
Details
Motivation: Current LLM-based agents perform poorly in real-world clinical decision making due to lack of comprehensive testing datasets that mirror actual medical practice, creating a gap between theoretical knowledge and practical application.
Method: Created MedChain dataset with 12,163 clinical cases covering five key clinical workflow stages, and developed MedChain-Agent system with feedback mechanism and MCase-RAG module to learn from previous cases and adapt responses.
Result: MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches.
Conclusion: The proposed MedChain dataset and MedChain-Agent system effectively bridge the gap between theoretical medical knowledge and practical clinical decision making, showing superior performance in real-world scenarios.
Abstract: Clinical decision making (CDM) is a complex, dynamic process crucial to healthcare delivery, yet it remains a significant challenge for artificial intelligence systems. While Large Language Model (LLM)-based agents have been tested on general medical knowledge using licensing exams and knowledge question-answering tasks, their performance in CDM in real-world scenarios is limited due to the lack of comprehensive testing datasets that mirror actual medical practice. To address this gap, we present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. MedChain distinguishes itself from existing benchmarks with three key features of real-world clinical practice: personalization, interactivity, and sequentiality. Further, to tackle real-world CDM challenges, we also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses. MedChain-Agent demonstrates remarkable adaptability in gathering information dynamically and handling sequential clinical tasks, significantly outperforming existing approaches.
[119] NLP-ADBench: NLP Anomaly Detection Benchmark
Yuangang Li, Jiaqi Li, Zhuo Xiao, Tiankai Yang, Yi Nian, Xiyang Hu, Yue Zhao
Main category: cs.CL
TL;DR: NLP-ADBench is a comprehensive benchmark for NLP anomaly detection with 8 datasets and 19 algorithms, showing that two-step methods with transformer embeddings outperform end-to-end approaches.
Details
Motivation: Anomaly detection is understudied in NLP, limiting effectiveness in detecting harmful content, phishing, and spam reviews.
Method: Created NLP-ADBench benchmark with 8 curated datasets and 19 algorithms (3 end-to-end, 16 two-step approaches using BERT and OpenAI embeddings).
Result: No single model dominates across all datasets; two-step methods with transformer embeddings consistently outperform end-to-end approaches; OpenAI embeddings beat BERT embeddings.
Conclusion: NLP-ADBench provides a unified framework for NLP anomaly detection and supports future research, highlighting the need for automated model selection.
Abstract: Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
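The winning two-step recipe is easy to reproduce in miniature. TF-IDF stands in below for the BERT/OpenAI embeddings used in the benchmark, purely to keep the sketch self-contained:

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Two-step NLP anomaly detection in miniature: (1) embed texts, (2) run a
# classical detector on the embeddings. Swap TF-IDF for BERT/OpenAI embeddings
# to match the benchmark's setup.

texts = ["normal message", "another ordinary note", "FREE $$$ CLICK NOW!!!"]
X = TfidfVectorizer().fit_transform(texts).toarray()

detector = IsolationForest(random_state=0).fit(X)
scores = -detector.score_samples(X)  # higher = more anomalous
print(sorted(zip(scores, texts), reverse=True)[0][1])
```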
[120] AD-LLM: Benchmarking Large Language Models for Anomaly Detection
Tiankai Yang, Yi Nian, Shawn Li, Ruiyao Xu, Yuangang Li, Jiaqi Li, Zhuo Xiao, Xiyang Hu, Ryan Rossi, Kaize Ding, Xia Hu, Yue Zhao
Main category: cs.CL
TL;DR: AD-LLM is the first benchmark evaluating LLMs for NLP anomaly detection, testing zero-shot detection, data augmentation, and model selection capabilities.
Details
Motivation: LLMs have transformed many NLP tasks but their potential in anomaly detection remains understudied, despite AD's importance in fraud detection, medical diagnosis, and industrial monitoring.
Method: Created AD-LLM benchmark with three tasks: zero-shot detection using pre-trained knowledge, data augmentation via synthetic data generation, and model selection for unsupervised AD models.
Result: LLMs perform well in zero-shot anomaly detection, carefully designed augmentation methods are effective, but explaining model selection for specific datasets remains challenging.
Conclusion: The study outlines six future research directions for using LLMs in anomaly detection, highlighting both promising capabilities and current limitations.
Abstract: Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs’ pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
[121] RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation
Shi-Qi Yan, Quan Liu, Zhen-Hua Ling
Main category: cs.CL
TL;DR: RPO is a lightweight alignment method that adaptively leverages multi-source knowledge based on retrieval relevance, solving knowledge conflicts in RAG systems without requiring additional retrieval quality assessment procedures.
Details
Motivation: RAG systems struggle with knowledge conflicts when external retrieved knowledge differs from internal memorization, and existing methods require additional procedures to assess retrieval quality.
Method: Introduces Retrieval Preference Optimization (RPO), which derives an implicit representation of retrieval relevance and incorporates it into the reward model, integrating retrieval evaluation and response generation into a single model.
Result: RPO outperforms RAG by 4-10% in accuracy on four datasets without any extra components, demonstrating robust generalization.
Conclusion: RPO is the only RAG-dedicated alignment approach that quantifies awareness of retrieval relevance in training, overcoming mathematical obstacles and providing an effective solution to knowledge conflicts.
Abstract: While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce the Retrieval Preference Optimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, solving the problem that previous methods necessitate the additional procedure to assess the retrieval quality. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization.
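To make the idea concrete, here is one way a retrieval-relevance signal could be folded into a DPO-style preference loss. This is a hedged illustration of the general mechanism, not the paper's actual objective:

```python
import torch
import torch.nn.functional as F

# One plausible relevance-weighted preference loss (illustrative only; RPO's
# derivation differs). logp_*/ref_* are summed token log-probs of the chosen
# and rejected responses under the policy and reference models.

def relevance_weighted_dpo(logp_chosen, logp_rejected,
                           ref_chosen, ref_rejected,
                           relevance, beta=0.1):
    """relevance in [0, 1]: implicit estimate that the retrieved context helps."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # scale the preference margin by how much the model should trust retrieval
    return -F.logsigmoid(beta * relevance * margin).mean()

loss = relevance_weighted_dpo(
    torch.tensor([-12.0]), torch.tensor([-15.0]),
    torch.tensor([-13.0]), torch.tensor([-14.0]),
    relevance=torch.tensor([0.8]))
print(loss)
```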
[122] AnyEdit: Edit Any Knowledge Encoded in Language Models
Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Guojun Ma, Mingyang Wan, Xiang Wang, Xiangnan He, Tat-seng Chua
Main category: cs.CL
TL;DR: AnyEdit is a new autoregressive editing paradigm that decomposes long-form knowledge into sequential chunks and iteratively edits key tokens, overcoming limitations of current methods that struggle with diverse formats like poetry, code, and math.
Details
Motivation: Current LLM editing methods fail with long-form knowledge in diverse formats due to their reliance on editing a single token's hidden state, creating an 'efficacy barrier' that limits practical knowledge updates.
Method: Proposes the AnyEdit framework, which decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, grounded in the Chain Rule of Mutual Information.
Result: Outperforms strong baselines by 21.5% on benchmarks including UnKEBench, AKEW, and new EditEverything dataset for long-form diverse-formatted knowledge.
Conclusion: AnyEdit serves as plug-and-play framework enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing LLM knowledge editing scope and practicality.
Abstract: Large language models (LLMs) often produce incorrect or outdated information, necessitating efficient and precise knowledge updates. Current model editing methods, however, struggle with long-form knowledge in diverse formats, such as poetry, code snippets, and mathematical derivations. These limitations arise from their reliance on editing a single token’s hidden state, a limitation we term “efficacy barrier”. To solve this, we propose AnyEdit, a new autoregressive editing paradigm. It decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, ensuring consistent and accurate outputs. Theoretically, we ground AnyEdit in the Chain Rule of Mutual Information, showing its ability to update any knowledge within LLMs. Empirically, it outperforms strong baselines by 21.5% on benchmarks including UnKEBench, AKEW, and our new EditEverything dataset for long-form diverse-formatted knowledge. Additionally, AnyEdit serves as a plug-and-play framework, enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing the scope and practicality of LLM knowledge editing.
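The editing loop itself is simple to sketch. `edit_once` and `locate_key_token` below are hypothetical stand-ins for an existing single-token editor (e.g., a ROME/MEMIT-style update) and the paper's key-token selection:

```python
# Skeleton of the chunked autoregressive editing loop described in the summary.
# `edit_once` and `locate_key_token` are placeholders, not AnyEdit's internals.

def chunk(text: str, size: int = 64) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def any_edit(model, prompt: str, target: str, edit_once, locate_key_token):
    context = prompt
    for piece in chunk(target):
        key_tok = locate_key_token(piece)          # token whose state gets edited
        model = edit_once(model, context, piece, key_tok)
        context += piece                           # later chunks condition on earlier edits
    return model
```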
[123] LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design
Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang, Meng Li
Main category: cs.CL
TL;DR: LightMamba is a co-designed quantization algorithm and FPGA accelerator for efficient Mamba inference, achieving 4.65-6.06x higher energy efficiency than GPU baselines.
Details
Motivation: Mamba achieves linear computation complexity compared to Transformers but is hard to accelerate due to scattered activation outliers and complex computation dependency, making existing LLM accelerators inefficient.
Method: Proposes FPGA-friendly post-training quantization with rotation-assisted quantization and power-of-two SSM quantization to reduce computation to 4-bit, plus an FPGA accelerator with partial unrolling, computation reordering, and fine-grained tiling/fusion.
Result: Achieves 4.65x to 6.06x higher energy efficiency over GPU baseline on Xilinx Versal VCK190 FPGA, and 93 tokens/s (1.43x GPU baseline) on Alveo U280 FPGA.
Conclusion: LightMamba successfully enables efficient Mamba inference through co-designed quantization and FPGA acceleration, demonstrating significant energy efficiency improvements over GPU implementations.
Abstract: State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43x that of the GPU baseline. Our code is available at https://github.com/PKU-SEC-Lab/LightMamba.
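Power-of-two quantization, one of the two quantizers the paper combines, snaps each value to a signed power of two so that multiplications reduce to bit-shifts in hardware. A minimal version, with an illustrative clipping range:

```python
import numpy as np

# Power-of-two quantization: round each magnitude to the nearest 2^k within a
# clipped exponent range. The range [-8, 0] here is illustrative, not the
# paper's configuration.

def quantize_pow2(x: np.ndarray, min_exp: int = -8, max_exp: int = 0) -> np.ndarray:
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 2.0**min_exp, 2.0**max_exp)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * 2.0**exp

x = np.array([0.3, -0.07, 0.9])
print(quantize_pow2(x))  # -> [ 0.25  -0.0625  1. ]
```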
[124] What Are They Filtering Out? An Experimental Benchmark of Filtering Strategies for Harm Reduction in Pretraining Datasets
Marco Antonio Stranisci, Christian Hardmeier
Main category: cs.CL
TL;DR: This paper presents a benchmark study of data filtering strategies for harm reduction in LLMs, revealing that while these strategies reduce harmful content, they also increase the underrepresentation of vulnerable groups in datasets.
Details
Motivation: There is a lack of research on the impact of data filtering strategies on vulnerable groups, and their effectiveness has not been systematically addressed, despite their crucial role in developing safe LLMs.
Method: The authors reviewed 55 technical reports of English LMs and LLMs to identify existing filtering strategies and implemented an experimental setting to test their impact on groups vulnerable to discrimination.
Result: The study shows that data filtering strategies have a positive impact in reducing harmful content, but this comes with the side effect of increasing the underrepresentation of groups vulnerable to discrimination in datasets.
Conclusion: Data filtering strategies for harm reduction in LLMs create a trade-off where reducing harmful content inadvertently leads to increased underrepresentation of vulnerable groups, highlighting the need for more balanced approaches.
Abstract: Data filtering strategies are a crucial component in developing safe Large Language Models (LLMs), since they support the removal of harmful content from pretraining datasets. However, there is a lack of research on the actual impact of these strategies on groups vulnerable to discrimination, and their effectiveness has not yet been systematically addressed. In this paper we present a benchmark study of data filtering strategies for harm reduction, aimed at providing a systematic evaluation of these approaches. We review 55 technical reports of English LMs and LLMs to identify the filtering strategies existing in the literature and implement an experimental setting to test their impact on groups vulnerable to discrimination. Our results show that the positive impact that these strategies have in reducing harmful content from documents has the side effect of increasing the underrepresentation of groups vulnerable to discrimination in datasets. WARNING: this paper may contain racist, sexist, violent, and generally offensive content.
[125] How a Bilingual LM Becomes Bilingual: Tracing Internal Representations with Sparse Autoencoders
Tatsuro Inaba, Go Kamoda, Kentaro Inui, Masaru Isonuma, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Main category: cs.CL
TL;DR: Bilingual language models first learn languages separately then form bilingual alignments in mid layers, with stronger effects in larger models.
Details
Motivation: To understand how bilingual language models develop complex internal representations and bilingual capabilities.
Method: Used sparse autoencoders to analyze internal representations, examining training steps, layers, and model sizes. Also employed a novel method integrating decomposed representations from fully trained models into mid-training models.
Result: Models first learn languages separately, then gradually form bilingual alignments in mid layers. Bilingual tendency is stronger in larger models. Bilingual representations play critical role in model performance.
Conclusion: Provides insights into how language models acquire bilingual capabilities through progressive learning and alignment formation.
Abstract: This study explores how bilingual language models develop complex internal representations. We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes. Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the mid layers. We also found that this bilingual tendency is stronger in larger models. Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model. Our results provide insights into how language models acquire bilingual capabilities.
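For reference, a standard sparse autoencoder of the kind used to decompose LM activations looks like the following; this is a generic sketch, and the paper's exact architecture and training details may differ:

```python
import torch
import torch.nn as nn

# Generic sparse autoencoder over residual-stream activations: an overcomplete
# ReLU dictionary trained with reconstruction loss plus an L1 sparsity penalty.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))  # sparse feature activations
        return self.dec(z), z

sae = SparseAutoencoder(d_model=768, d_dict=8 * 768)
h = torch.randn(4, 768)              # stand-in for LM activations
recon, z = sae(h)
loss = ((recon - h) ** 2).mean() + 1e-3 * z.abs().mean()  # recon + L1 sparsity
```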
[126] Understanding and Improving Information Preservation in Prompt Compression for LLMs
Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez
Main category: cs.CL
TL;DR: The paper proposes a holistic evaluation framework for prompt compression methods, analyzing downstream task performance, grounding, and information preservation. It identifies limitations in current methods and improves one soft prompting approach, achieving significant performance gains.
Details
Motivation: Prompt length in information-intensive tasks grows fast, leading to computational overhead, performance degradation, and biases from irrelevant information. Existing prompt compression methods need comprehensive evaluation to understand their effectiveness.
Method: A holistic evaluation framework focusing on compression ratio, downstream task performance, grounding in the input context, and information preservation. Analyzes state-of-the-art soft and hard compression methods and improves one soft prompting method by controlling compression granularity.
Result: Improved soft prompting method achieved up to +23% downstream performance, +8 BERTScore points in grounding, and 2.7x more entities preserved. Found best trade-off with soft prompting combined with sequence-level training.
Conclusion: The proposed evaluation framework reveals limitations in current prompt compression methods. Soft prompting with controlled granularity and sequence-level training provides the optimal effectiveness/compression rate trade-off.
Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Using our framework, we analyze state-of-the-art soft and hard compression methods and show that some fail to preserve key details from the original prompt, limiting performance on complex tasks. By identifying these limitations, we are able to improve one soft prompting method by controlling compression granularity, achieving up to +23% in downstream performance, +8 BERTScore points in grounding, and 2.7x more entities preserved in compression. Ultimately, we find that the best effectiveness/compression rate trade-off is achieved with soft prompting combined with sequence-level training. The code is available at https://github.com/amazon-science/information-preservation-in-prompt-compression.
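The information-preservation axis can be made concrete with a simple entity-survival metric. The sketch below assumes spaCy's small English model is installed, and is only one plausible instantiation of the framework's idea:

```python
import spacy

# What fraction of entities in the original prompt survive compression?
# Requires: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def entity_preservation(original: str, compressed: str) -> float:
    orig_ents = {e.text.lower() for e in nlp(original).ents}
    if not orig_ents:
        return 1.0
    kept = {e for e in orig_ents if e in compressed.lower()}
    return len(kept) / len(orig_ents)

print(entity_preservation(
    "Marie Curie won the Nobel Prize in 1911 in Stockholm.",
    "Curie won a Nobel Prize in 1911."))
```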
[127] RAISE: Reinforced Adaptive Instruction Selection For Large Language Models
Qingsong Lv, Yangning Li, Zihua Lan, Zishan Xu, Jiwei Tang, Tingwei Lu, Yinghui Li, Wenhao Jiang, Hong-Gee Kim, Hai-Tao Zheng, Philip S. Yu
Main category: cs.CL
TL;DR: RAISE is a dynamic, task-objective-driven instruction selection framework that uses reinforcement learning to select instructions during fine-tuning based on their expected impact on model performance, achieving superior results with only 1% of training steps.
Details
Motivation: Current instruction selection methods use fixed heuristic metrics and only select data before training, leading to insufficient optimization and difficulty in adapting to specific tasks.
Method: Models dynamic instruction selection as a sequential decision-making process using reinforcement learning, selecting instructions at each training step based on expected performance impact.
Result: Achieves superior performance compared to other instruction selection methods while updating only 1% of training steps compared to full-data training.
Conclusion: RAISE provides an interpretable and task-specific optimization approach that significantly improves instruction fine-tuning efficiency and effectiveness.
Abstract: In the instruction fine-tuning of large language models (LLMs), it is widely recognized that a few high-quality instructions are superior to a large number of low-quality instructions. At present, many instruction selection methods have been proposed, but most of these methods select instructions based on heuristic quality metrics, and only consider data selection before training. These designs lead to insufficient optimization of instruction fine-tuning, and fixed heuristic indicators are often difficult to optimize for specific tasks. Therefore, we design a dynamic, task-objective-driven instruction selection framework RAISE (Reinforced Adaptive Instruction SElection), which incorporates the entire instruction fine-tuning process into optimization, selecting instructions at each step based on the expected impact of each instruction on model performance improvement. Our approach is well interpretable and has strong task-specific optimization capabilities. By modeling dynamic instruction selection as a sequential decision-making process, we use RL to train our selection strategy. Extensive experiments and result analysis prove the superiority of our method compared with other instruction selection methods. Notably, RAISE achieves superior performance by updating only 1% of the training steps compared to full-data training, demonstrating its efficiency and effectiveness.
[128] ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos
Patrick Giedemann, Pius von Däniken, Jan Deriu, Alvaro Rodrigo, Anselmo Peñas, Mark Cieliebak
Main category: cs.CL
TL;DR: ViClaim is a multilingual video transcript dataset for claim detection, showing strong cross-validation performance but challenges in generalization to unseen topics.
Details
Motivation: There is a growing need for tools to analyze claims in video content due to its influence in communication and misinformation, with existing efforts focusing mainly on written text.
Method: Created ViClaim dataset with 1,798 annotated video transcripts across 3 languages and 6 topics, using a custom annotation tool and testing with multilingual language models.
Result: Models achieved strong cross-validation performance (macro F1 up to 0.896) but struggled with generalization to unseen topics, especially for distinct domains.
Conclusion: Claim detection in video transcripts is complex, and ViClaim provides a foundation for advancing misinformation detection in video-based communication.
Abstract: The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
[129] Science Hierarchography: Hierarchical Organization of Science Literature
Muhan Gao, Jash Shah, Weiqi Wang, Kuan-Hao Huang, Daniel Khashabi
Main category: cs.CL
TL;DR: The paper proposes SCIENCE HIERARCHOGRAPHY, a method to organize scientific literature into hierarchical structures spanning multiple abstraction levels, using a hybrid approach combining embedding-based clustering with LLM-based prompting.
Details
Motivation: Scientific knowledge is growing rapidly, making it difficult to track progress and conceptual links across disciplines. Existing tools like citation networks and search engines lack the abstraction needed to represent the density and structure of activity across subfields.
Method: A hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. This contrasts with LLM-heavy methods like iterative tree construction.
Result: The method achieves superior quality-speed trade-offs compared to LLM-heavy approaches. The hierarchies capture different dimensions of research contributions and reflect the interdisciplinary nature of modern science. Evaluation shows improved interpretability and effective navigation for LLM-based agents to locate target papers.
Conclusion: SCIENCE HIERARCHOGRAPHY provides an alternative pathway for exploring scientific literature beyond traditional search methods, offering insights into well-explored and under-explored fields through hierarchical organization.
Abstract: Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction – from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography
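The hybrid recipe in miniature: cheap embedding clustering proposes groups and an LLM names them, recursing to build deeper levels. `embed` and `call_llm` are placeholders for any embedding model and chat client, and the cluster count is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# One level of a hierarchy: cluster paper embeddings, then have an LLM label
# each cluster. Recurse on each cluster's members to build deeper levels.

def build_level(papers, embed, call_llm, k=5):
    X = np.array([embed(p) for p in papers])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    level = {}
    for c in range(k):
        members = [p for p, l in zip(papers, labels) if l == c]
        name = call_llm("One short topic label for these paper titles:\n"
                        + "\n".join(members[:20]))
        level[name] = members
    return level
```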
[130] Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Main category: cs.CL
TL;DR: PaperCoder is a multi-agent LLM framework that automatically converts machine learning papers into functional code repositories through planning, analysis, and generation stages using specialized collaborating agents.
Details
Motivation: Machine learning research often lacks available code implementations, making reproduction and building upon prior work slow and labor-intensive, despite LLMs' capabilities in understanding scientific documents and generating high-quality code.
Method: Three-stage framework: 1) Planning - creates roadmap, system architecture with diagrams, file dependencies, and configuration files; 2) Analysis - interprets implementation-specific details; 3) Generation - produces modular, dependency-aware code using specialized collaborating agents.
Result: PaperCoder effectively creates high-quality, faithful implementations, consistently outperforming strong baselines on the PaperBench benchmark by substantial margins, as validated by both model-based and human evaluations including paper authors.
Conclusion: The multi-agent LLM framework successfully automates code generation from machine learning papers, demonstrating significant improvements over existing approaches and providing a practical solution to the reproducibility problem in ML research.
Abstract: Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
[131] System Prompt Optimization with Meta-Learning
Yumin Choi, Jinheon Baek, Sung Ju Hwang
Main category: cs.CL
TL;DR: This paper introduces bilevel system prompt optimization for LLMs, using meta-learning to create system prompts that are robust across diverse user prompts and transferable to unseen tasks.
Details
Motivation: Existing prompt optimization focuses only on task-specific user prompts, overlooking system prompts that could be optimized once and applied across different tasks and domains.
Method: A meta-learning framework that meta-learns system prompts by optimizing them over various user prompts across multiple datasets, with iterative updates to ensure synergy between system and user prompts.
Result: Experiments on 14 unseen datasets across 5 domains show the optimized system prompts generalize effectively to diverse user prompts and enable rapid adaptation to unseen tasks with fewer optimization steps.
Conclusion: The proposed bilevel system prompt optimization successfully creates transferable system prompts that improve performance and reduce adaptation effort for new tasks.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
[132] Beyond Demonstrations: Dynamic Vector Construction from Latent Representations
Wang Cai, Hsiu-Yuan Huang, Zhixiang Wang, Yunfang Wu
Main category: cs.CL
TL;DR: DyVec is a Dynamic Vector method that extracts robust latent representations using Exhaustive Query Rotation, adaptively segments them based on task complexity, and learns optimal injection positions via REINFORCE optimization, outperforming few-shot ICL and prior ICV methods.
Details
Motivation: Existing ICV methods are sensitive to ICL-specific factors, use coarse/fragmented representations, and rely on heuristic injection positions, limiting their applicability.
Method: DyVec uses Exhaustive Query Rotation to extract robust semantically aggregated latent representations, Dynamic Latent Segmentation to adaptively partition representations based on task complexity, and REINFORCE-based optimization to learn optimal injection positions.
Result: DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines in experiments, demonstrating effectiveness of dynamic segmentation and injection of semantically aggregated representations.
Conclusion: DyVec provides a lightweight and data-efficient solution for inference-time task adaptation through robust representation extraction and adaptive injection strategies.
Abstract: In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experimental results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.
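Mechanically, ICV-style methods inject a vector into a layer's hidden states at inference time, which in PyTorch can be done with a forward hook. Which layers and positions to use is precisely what DyVec learns, so the fixed position below is only illustrative:

```python
import torch

# Generic sketch of injecting a task vector into a transformer layer's hidden
# states via a forward hook, the mechanism ICV-style methods build on.

def make_injector(vector: torch.Tensor, position: int = -1, alpha: float = 1.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position, :] += alpha * vector.to(hidden.dtype)  # in-place add
        return output
    return hook

# Usage with a Hugging Face-style decoder (the exact layer path varies by model):
#   handle = model.model.layers[12].register_forward_hook(make_injector(v))
#   ...generate...
#   handle.remove()
```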
[133] FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Yang Ren, Lingfei Qian, Xueqing Peng, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie
Main category: cs.CL
TL;DR: FinTagging is a comprehensive benchmark for XBRL tagging that evaluates LLMs’ ability to extract financial facts through numerical reasoning and taxonomy alignment, with two subtasks: FinNI for numeric identification and FinCL for concept linking.
Details
Motivation: Current XBRL tagging benchmarks are oversimplified, treating tagging as flat classification over small concept subsets, ignoring hierarchical taxonomy semantics and structured output requirements, preventing fair evaluation of LLMs under realistic reporting conditions.
Method: Introduces FinTagging benchmark with two subtasks: FinNI extracts numerical entities and their types from XBRL reports, and FinCL maps each entity to corresponding concepts in the full US-GAAP taxonomy, producing structured financial fact representations.
Result: LLMs generalize well in numeric identification (FinNI) but struggle with fine-grained concept linking (FinCL), revealing limitations in structure-aware reasoning for accurate financial disclosure.
Conclusion: The benchmark reveals current LLM limitations in structure-aware financial reasoning, particularly in fine-grained concept linking, highlighting the need for improved models for accurate financial disclosure analysis.
Abstract: Accurately understanding numbers from financial reports is fundamental to how markets, regulators, algorithms, and normal people read the economy and the world, yet even with XBRL (eXtensible Business Reporting Language) designed to tag every figure with standardized accounting concepts, mapping thousands of facts to over 10,000 U.S. GAAP concepts remains costly, inconsistent, and error-prone. Existing benchmarks define tagging as flat, single-step, extreme classification over small subsets of US-GAAP concepts, overlooking both the taxonomy’s hierarchical semantics and the structured nature of real tagging, where each fact must be represented as a contextualized multi-field output. These simplifications prevent fair evaluation of large language models (LLMs) under realistic reporting conditions. To address these gaps, we introduce FinTagging, the first comprehensive benchmark for structure-aware and full-scope XBRL tagging, designed to evaluate LLMs’ ability to extract and align financial facts through numerical reasoning and taxonomy alignment across text and tables. We define two subtasks: FinNI for numeric identification, which extracts numerical entities and their types from XBRL reports, and FinCL for concept linking, which maps each extracted entity to the corresponding concept in the full US-GAAP taxonomy. Together, these subtasks produce a structured representation of each financial fact. We evaluate diverse LLMs under zero-shot settings and analyze their performance across both subtasks and overall tagging accuracy. Results show that LLMs generalize well in numeric identification but struggle with fine-grained concept linking, revealing current limitations in structure-aware reasoning for accurate financial disclosure. All code and datasets are available on GitHub and Hugging Face.
[134] Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Zhongzhen Huang, Linjie Mu, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang
Main category: cs.CL
TL;DR: MedE² is a two-stage post-training pipeline that enhances multimodal reasoning for medical applications by first fine-tuning with text-only reasoning demonstrations, then further training with multimodal medical cases.
Details
Motivation: Multimodal reasoning models have shown success in mathematics and science but remain underexplored in medical domains, where effective clinical decision-making depends on iterative, multimodal reasoning across diverse evidence sources.
Method: Two-stage post-training pipeline: Stage-I uses 2,000 text-only data samples with orchestrated reasoning demonstrations to elicit reasoning behaviors; Stage-II uses 1,500 curated multimodal medical cases to align model outputs with multimodal medical reasoning preferences.
Result: Models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks, with additional validation on larger models and inference-time scaling confirming robustness and practical utility.
Conclusion: MedE² effectively improves reasoning performance of medical multimodal models, demonstrating efficacy and reliability for medical applications.
Abstract: Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE², a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model’s reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE² in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE² consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
[135] Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang
Main category: cs.CL
TL;DR: VISE is the first benchmark for evaluating sycophantic behavior in Video-LLMs, where models align with user input even when it contradicts visual evidence. It introduces linguistic perspectives to the video domain and proposes training-free mitigation strategies.
Details
Motivation: Video-LLMs are increasingly used in real-world applications requiring grounded multimodal reasoning, but their tendency towards sycophancy (aligning with user input against visual evidence) undermines trustworthiness. Current research lacks systematic benchmarks for video-language domain sycophancy.
Method: Proposed VISE benchmark evaluates sycophantic behavior across diverse question formats, prompt biases, and visual reasoning tasks. Introduces linguistic perspectives on sycophancy to the video domain for fine-grained analysis. Also proposes two training-free mitigation strategies: interpretable key-frame selection for visual grounding and inference-time intervention on neural representations.
Result: VISE enables systematic evaluation of sycophantic behavior in Video-LLMs across multiple sycophancy types and interaction patterns. The proposed mitigation strategies reveal potential paths for reducing sycophantic bias without requiring retraining.
Conclusion: VISE fills a critical gap in evaluating Video-LLM sycophancy and provides foundational tools for understanding and mitigating this behavior. The benchmark and mitigation strategies contribute to improving factual consistency and reliability of Video-LLMs in real-world applications.
Abstract: As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE is the first to bring linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two potential training-free mitigation strategies, revealing potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://github.com/William030422/Video-Sycophancy.
[136] Learning to Reason Across Parallel Samples for LLM Reasoning
Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi
Main category: cs.CL
TL;DR: The paper proposes Sample Set Aggregator (SSA), a compact LLM trained with reinforcement learning to aggregate multiple samples from base models for improved reasoning accuracy, achieving significant gains over majority voting and outperforming larger reward models.
Details
Motivation: Scaling test-time compute by sampling multiple answers and aggregating them (e.g., via majority voting or verifiers) brings performance gains for LLMs, but current aggregation methods are suboptimal.
Method: Train a compact LLM called SSA that takes concatenated sequences of multiple samples as input and outputs the final answer, optimized for accuracy using reinforcement learning.
Result: SSA improves over naive majority voting by 8% pass@5 on MATH, and a 3B SSA surpasses model-based re-ranking with a much larger 72B reward model. Shows strong generalization across sample set sizes, base models, and tasks.
Conclusion: Separating answer generation from answer aggregation enables efficient use of black box models and provides consistent performance gains across various reasoning tasks.
Abstract: Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such multiple-sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on five reasoning datasets demonstrate both the efficacy and efficiency of SSA. Notably, SSA improves over naive majority voting by 8% pass@5 on MATH. Furthermore, our 3B SSA surpasses model-based re-ranking with a much larger 72B process reward model. Our analysis also shows promising generalization ability of SSA across sample set sizes, base model families and scales, and tasks. By separating the LLMs that generate answers from the LLM that analyzes and aggregates the sampled answers, our approach can work easily and efficiently with the outputs of premier black-box models.
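The two aggregation styles contrasted here are easy to make concrete. Below is a minimal sketch, not the authors' code, of majority voting versus the SSA-style input format in which sampled solutions are concatenated into a single sequence for a separate aggregator model; the aggregator itself (an RL-trained compact LLM in the paper) is assumed and not shown.

```python
# Majority voting vs. concatenating a sample set for an aggregator model.
# Illustrative only; the RL-trained SSA model itself is not shown here.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Baseline: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

def ssa_style_input(question: str, samples: list[str]) -> str:
    """Concatenate all sampled solutions into one sequence for the aggregator."""
    parts = [f"Question: {question}"]
    parts += [f"Candidate {i + 1}:\n{s}" for i, s in enumerate(samples)]
    parts.append("Final answer:")
    return "\n\n".join(parts)

samples = ["x = 4", "x = 4", "x = 3", "x = 4", "x = 5"]
print(majority_vote(samples))                     # -> "x = 4"
print(ssa_style_input("Solve 2x = 8.", samples))  # prompt the aggregator would read
```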
[137] RedDebate: Safer Responses through Multi-Agent Red Teaming Debates
Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu
Main category: cs.CL
TL;DR: RedDebate is a multi-agent debate framework that enables LLMs to identify and mitigate unsafe behaviors through automated red-teaming and collaborative argumentation.
Details
Motivation: Existing AI safety approaches rely on costly human evaluation or single-model assessment, which are constrained by scalability and prone to oversight failures.
Method: Uses collaborative argumentation among multiple LLMs across diverse debate scenarios, with distinct long-term memory modules that preserve safety-relevant insights from debates for continuous refinement.
Result: Empirical evaluation shows RedDebate substantially reduces unsafe outputs, with memory integration providing further significant reductions in unsafe behaviors.
Conclusion: RedDebate is the first fully automated framework unifying multi-agent debate and red-teaming to progressively enhance LLM safety without human intervention.
Abstract: We introduce RedDebate, a novel multi-agent debate framework that provides the foundation for Large Language Models (LLMs) to identify and mitigate their unsafe behaviours. Existing AI safety approaches often rely on costly human evaluation or isolated single-model assessment, both constrained by scalability and prone to oversight failures. RedDebate employs collaborative argumentation among multiple LLMs across diverse debate scenarios, enabling them to critically evaluate one another’s reasoning and systematically uncover unsafe failure modes through fully automated red-teaming. We further integrate distinct long-term memory modules that preserve safety-relevant insights from debate interactions and leverage them during subsequent inference, facilitating continuous refinement of model behaviour. Empirical evaluation on safety benchmarks across a diverse set of models demonstrates that RedDebate substantially reduces unsafe outputs. While debate alone allows LLMs to refine their behaviour, the addition of memory yields further significant reductions. To the best of our knowledge, RedDebate is the first fully automated framework to unify multi-agent debate and red-teaming to progressively enhance LLM safety without human intervention.
[138] Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
Main category: cs.CL
TL;DR: FiSCo is a statistical framework for detecting subtle semantic biases in LLM long-form responses by comparing claim-level entailment across demographic groups, outperforming existing methods.
Details
Motivation: LLMs generate biased responses that undermine reliability, and current evaluation methods fail to address biases in long-form responses and LLM output variability.
Method: Decompose model outputs into semantically distinct claims, use entailment checks for semantic comparison, and apply statistical hypothesis testing to compare inter- and intra-group similarities.
Result: FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics on synthetic and human-annotated datasets.
Conclusion: FiSCo provides a robust framework for detecting subtle group-level fairness issues in LLMs through fine-grained semantic analysis at the claim level.
Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Comparison), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
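To make the statistical framing concrete, here is a minimal sketch, assuming a claim-level entailment scorer, of comparing intra-group against inter-group response similarity with a hypothesis test. The `entail_score` stub is a hypothetical stand-in for a real NLI model, not the paper's implementation.

```python
# Compare intra-group vs. inter-group similarity of claim sets with a t-test.
from itertools import combinations, product
from scipy import stats

def entail_score(claims_a: set[str], claims_b: set[str]) -> float:
    # Placeholder similarity: Jaccard overlap of claims. A real system would
    # use bidirectional NLI entailment between decomposed claims.
    return len(claims_a & claims_b) / max(len(claims_a | claims_b), 1)

def group_fairness_test(group_a: list[set[str]], group_b: list[set[str]]):
    intra = [entail_score(x, y) for x, y in combinations(group_a, 2)]
    intra += [entail_score(x, y) for x, y in combinations(group_b, 2)]
    inter = [entail_score(x, y) for x, y in product(group_a, group_b)]
    # Welch's t-test: a significant intra-vs-inter gap suggests group-level bias.
    return stats.ttest_ind(intra, inter, equal_var=False)

a = [{"loan approved", "low risk"}, {"loan approved", "stable income"}]
b = [{"loan denied", "high risk"}, {"loan denied", "needs collateral"}]
print(group_fairness_test(a, b))
```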
[139] Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study
Yingji Zhang, Marco Valentino, Danilo S. Carvalho, André Freitas
Main category: cs.CL
TL;DR: The paper proposes embedding explicit reasoning rules in Transformer-based language models using Language VAEs, showing that rules can be disentangled in the encoder’s parametric space and that prior knowledge injection improves rule retrieval.
Details
Motivation: Current Transformer LMs rely on memorization rather than rule-based inference for NLI tasks, limiting generalization, interpretability, and controllability.
Method: A complete pipeline using Language VAEs with three rule-based reasoning tasks, theoretical framework, and end-to-end architecture that disentangles reasoning rules in encoder space and injects prior knowledge via Query-Key-Value mechanism.
Result: Rules form distinct clusters in feature space; prior knowledge injection improves rule retrieval; mathematical reasoning shows performance plateaus with increased samples; ffn layers better preserve rule separation than attention layers.
Conclusion: Explicit reasoning rules can be successfully embedded in LMs via VAEs, enabling better rule-based inference and providing insights into model architecture choices for reasoning tasks.
Abstract: Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiments illustrate the following findings. Disentangled reasoning: under explicit signal supervision, reasoning rules - viewed as functional mappings - can be disentangled within the encoder’s parametric space, resulting in distinct clustering of rules in the output feature space. Prior knowledge injection: injecting reasoning information into the Query enables the model to more effectively retrieve the stored Value from memory based on the Key, offering a simple method for integrating prior knowledge into decoder-only language models. Performance bottleneck: in mathematical reasoning tasks using Qwen2.5 (0.5B), increasing the sample count does not improve performance beyond a certain point; moreover, FFN layers are better than attention layers at preserving the separation of reasoning rules in the model’s parameters.
[140] Lizard: An Efficient Linearization Framework for Large Language Models
Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
Main category: cs.CL
TL;DR: Lizard is a linearization framework that transforms Transformer LLMs into subquadratic architectures to overcome computational bottlenecks from quadratic attention complexity and KV cache growth.
Details
Motivation: Transformers face severe computational and memory bottlenecks with long sequences due to quadratic softmax attention complexity and growing KV cache that makes inference memory-bound by context length.
Method: Introduces subquadratic attention mechanism approximating softmax attention, augments architecture with compact learnable modules for adaptive memory control and length generalization, and uses hardware-aware algorithm to solve numerical instability in gated attention.
Result: Achieves near-lossless recovery of teacher model’s performance, outperforms previous methods by 9.4-24.5 points on 5-shot MMLU benchmark, and demonstrates superior associative recall.
Conclusion: Lizard effectively addresses Transformer limitations by providing subquadratic attention while preserving model quality, enabling efficient long-sequence processing.
Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model’s performance, significantly outperforming previous methods by 9.4 to 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
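For readers unfamiliar with linearization, the sketch below shows the generic kernel trick such frameworks build on: replacing quadratic softmax attention with a positive feature map so attention can be computed in linear time. This is only the core idea, not Lizard's actual mechanism, which adds learnable modules and gating.

```python
# Generic (non-causal) linear attention: softmax(QK^T)V is O(n^2), but with a
# feature map phi, (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V) costs O(n).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # phi(x) = elu(x) + 1 keeps features positive (as in Katharopoulos et al.).
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)                   # (d, e) summary
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(2, 128, 64)     # batch, sequence, head dim
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```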
[141] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
Morris Alper, Moran Yanuka, Raja Giryes, Gašper Beguš
Main category: cs.CL
TL;DR: ConlangCrafter is an LLM-powered pipeline for automated constructed language creation that decomposes language design into modular stages and uses metalinguistic reasoning with randomness and self-refinement to generate diverse, coherent conlangs.
Details
Motivation: To leverage modern LLMs as computational creativity aids for end-to-end constructed language creation, bridging the gap between conlangs' artistic/philosophical roles and foundation models' generative capabilities.
Method: Multi-hop pipeline that decomposes language design into phonology, morphology, syntax, lexicon generation, and translation stages, using LLMs’ metalinguistic reasoning with injected randomness and self-refinement feedback for consistency.
Result: The system produces coherent and varied constructed languages without human linguistic expertise, demonstrating strong performance on consistency and typological diversity metrics.
Conclusion: ConlangCrafter successfully automates constructed language creation using LLMs, enabling generation of diverse and consistent conlangs through modular design and self-refinement mechanisms.
Abstract: Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages - phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs’ metalinguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring consistency and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.
[142] LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients
Egor Fadeev, Dzhambulat Mollaev, Aleksei Shestov, Omar Zoloev, Artem Sakhno, Dmitry Korolev, Ivan Kireev, Andrey Savchenko, Maksim Makarenko
Main category: cs.CL
TL;DR: LATTE is a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs, reducing computational costs while improving event sequence representations for financial applications.
Details
Motivation: Learning client embeddings from historical communication sequences is crucial for financial applications, but directly using LLMs on long event sequences is computationally expensive and impractical in real-world pipelines.
Method: Propose LATTE framework that uses contrastive learning to align raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by LLM, and used as supervision via contrastive loss.
Result: Significantly reduces inference cost and input size compared to conventional LLM processing of complete sequences. Outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets.
Conclusion: The method remains deployable in latency-sensitive environments while achieving superior performance compared to existing approaches.
Abstract: Learning client embeddings from sequences of their historical communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via a contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional LLM processing of the complete sequence. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
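The alignment objective can be illustrated with a standard CLIP-style InfoNCE loss. The sketch below assumes precomputed frozen-LLM embeddings of the prompt summaries and is an approximation of the setup, not the paper's code.

```python
# Contrastive alignment: pull each client's event-sequence embedding toward
# the frozen-LLM embedding of that client's textual summary, push away others.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(event_emb, llm_text_emb, temperature=0.07):
    # event_emb: (B, d) from a trainable event-sequence encoder
    # llm_text_emb: (B, d) precomputed by a frozen LLM over summary prompts
    event_emb = F.normalize(event_emb, dim=-1)
    llm_text_emb = F.normalize(llm_text_emb, dim=-1)
    logits = event_emb @ llm_text_emb.T / temperature
    targets = torch.arange(event_emb.size(0))  # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```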
[143] LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu
Main category: cs.CL
TL;DR: LaTeXTrans is a multi-agent system that translates LaTeX documents while preserving format and structure through specialized agents for parsing, translation, validation, and reconstruction.
Details
Motivation: Modern MT systems struggle with LaTeX documents that mix natural language with domain-specific syntax (equations, tables, figures, cross-references), requiring preservation of semantic integrity and compilability.
Method: Uses six specialized agents: Parser (decomposes LaTeX via placeholders), Translator, Validator, Summarizer, Terminology Extractor (collaborative translation), and Generator (reconstructs translated LaTeX).
Result: Outperforms mainstream MT systems in both translation accuracy and structural fidelity for LaTeX documents.
Conclusion: Provides an effective practical solution for translating LaTeX-formatted documents while maintaining format preservation and structural fidelity.
Abstract: Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents. The code of LaTeXTrans is available at https://github.com/NiuTrans/LaTeXTrans.
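The Parser's placeholder substitution can be illustrated in a few lines. This sketch protects only inline math, \ref, and \cite, far less than the real agent handles, and the placeholder format is an assumption.

```python
# Mask LaTeX constructs behind opaque tokens before translation, then restore.
import re

PROTECTED = re.compile(r"(\$[^$]+\$|\\ref\{[^}]*\}|\\cite\{[^}]*\})")

def protect(latex: str):
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return f"<PH{len(saved) - 1}>"
    return PROTECTED.sub(stash, latex), saved

def restore(text: str, saved: list[str]) -> str:
    for i, chunk in enumerate(saved):
        text = text.replace(f"<PH{i}>", chunk)
    return text

src = r"As shown in \ref{fig:1}, the loss $L = \sum_i e_i^2$ decreases."
masked, saved = protect(src)
print(masked)                  # translator sees natural language plus <PH*> tokens
print(restore(masked, saved))  # original LaTeX reconstructed after translation
```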
[144] Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations
Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Main category: cs.CL
TL;DR: LLMs show systematic bias in text evaluation based on perceived model identity, with false attribution causing up to 50 percentage point shifts in preferences and 12 percentage point changes in quality ratings.
Details
Motivation: To investigate the validity of LLMs as text evaluators and examine systematic bias in self- and cross-model evaluations across ChatGPT, Gemini, and Claude.
Method: Controlled experiment where blog posts from each model were evaluated by all three models under four attribution conditions (no attribution, true attribution, two false-attribution scenarios), using both preference voting and granular quality ratings across Coherence, Informativeness, and Conciseness.
Result: Pronounced asymmetries in model judgments: Claude label consistently elevated scores, Gemini label systematically depressed them. False attribution frequently reversed preference rankings with significant score shifts. Gemini showed severe self-deprecation, Claude demonstrated intensified self-preference.
Conclusion: Perceived model identity substantially distorts LLM evaluations independent of content quality, challenging the reliability of LLM-as-judge paradigms and highlighting the need for blind evaluation protocols and diverse multi-model validation frameworks.
Abstract: Large language models (LLMs) are increasingly deployed as evaluators of text quality, yet the validity of their judgments remains underexplored. This study investigates systematic bias in self- and cross-model evaluations across three prominent LLMs: ChatGPT, Gemini, and Claude. We designed a controlled experiment in which blog posts authored by each model were evaluated by all three models under four labeling conditions: no attribution, true attribution, and two false-attribution scenarios. Evaluations employed both holistic preference voting and granular quality ratings across three dimensions (Coherence, Informativeness, and Conciseness), with all scores normalized to percentages for direct comparison. Our findings reveal pronounced asymmetries in model judgments: the “Claude” label consistently elevated scores regardless of actual authorship, while the “Gemini” label systematically depressed them. False attribution frequently reversed preference rankings, producing shifts of up to 50 percentage points in voting outcomes and up to 12 percentage points in quality ratings. Notably, Gemini exhibited severe self-deprecation under true labels, while Claude demonstrated intensified self-preference. These results demonstrate that perceived model identity can substantially distort both high-level judgments and fine-grained quality assessments, independent of content quality. Our findings challenge the reliability of LLM-as-judge paradigms and underscore the critical need for blind evaluation protocols and diverse multi-model validation frameworks to ensure fairness and validity in automated text evaluation and LLM benchmarking.
[145] Can Large Language Models Master Complex Card Games?
Wei Wang, Fuqing Bie, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Main category: cs.CL
TL;DR: LLMs can master complex card games through supervised fine-tuning, achieving performance comparable to strong game AIs while maintaining some general capabilities when trained with mixed data.
Details
Motivation: To explore whether large language models (LLMs) can achieve success in complex games similar to specialized AI systems like AlphaGo, given LLMs' remarkable capabilities across various tasks.
Method: Systematically assess LLMs across eight diverse card games using supervised fine-tuning on high-quality gameplay data, evaluating performance and general capability retention.
Result: LLMs approach strong game AI performance through fine-tuning, achieve proficiency in multiple games simultaneously (with performance augmentation for similar games and conflicts for dissimilar ones), and experience mitigated general capability decline when trained with mixed instruction data.
Conclusion: LLMs demonstrate strong learning ability and versatility in mastering complex card games, showing potential for game AI applications while highlighting the importance of balanced training data to preserve general capabilities.
Abstract: Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models’ ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can achieve a certain level of proficiency in multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs. The code is available at https://github.com/THUDM/LLM4CardGame
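The mitigation in finding (3), mixing general instruction data into the gameplay fine-tuning set, can be sketched as a simple data-mixing step; the 20% ratio below is illustrative, not the paper's.

```python
# Blend gameplay SFT data with general instruction data to limit forgetting.
import random

def mix_training_data(game_data, general_data, general_ratio=0.2, seed=0):
    rng = random.Random(seed)
    n_general = int(len(game_data) * general_ratio / (1 - general_ratio))
    mixed = list(game_data) + rng.sample(list(general_data), n_general)
    rng.shuffle(mixed)
    return mixed

games = [f"game_example_{i}" for i in range(800)]
general = [f"instruction_example_{i}" for i in range(1000)]
print(len(mix_training_data(games, general)))  # 1000 = 800 game + 200 general
```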
[146] COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens
Eugene Kwek, Wenpeng Yin
Main category: cs.CL
TL;DR: COMPACT is a joint pruning method that combines vocabulary pruning and FFN channel pruning to efficiently compress language models while maintaining standard transformer architecture and strong performance.
Details
Motivation: To make LLMs more efficient for edge deployment and interactive applications by addressing limitations of existing pruning methods that break transformer layouts or cause accuracy drops, especially on small language models.
Method: Jointly prunes rare vocabulary to shrink embedding/LM head layers and prunes FFN intermediate channels using common-token-weighted activations, aligning importance with post-pruning token distribution.
Result: State-of-the-art downstream performance across Qwen, LLaMA, and Gemma families (0.5B-70B) with substantial reductions in parameters, GPU memory, and latency.
Conclusion: COMPACT provides deployment-friendly pruning that maintains standard transformer architecture while achieving competitive efficiency gains across model scales.
Abstract: Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising technique, but existing pruning methods are limited: width pruning often breaks the standard transformer layout, requiring custom inference code, while depth pruning can cause abrupt accuracy drops. Also, while many pruning approaches are effective for LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/LM head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT inherits the strengths of both depth and width pruning: deployment-friendliness (it keeps a standard transformer architecture), scale-adaptivity (the vocabulary vs. FFN pruning trade-off can be tuned), competitive pruning times, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance, with substantial reductions in parameters, GPU memory, and latency.
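A rough sketch of the common-token-weighted importance idea follows: score FFN intermediate channels by activation magnitude weighted by token frequency, so channels that mainly serve soon-to-be-pruned rare vocabulary count less. Shapes and the exact weighting are assumptions for illustration.

```python
# Frequency-weighted channel importance for FFN pruning (illustrative).
import torch

def channel_importance(activations: torch.Tensor, token_freqs: torch.Tensor):
    # activations: (num_tokens, ffn_dim) hidden states at the FFN bottleneck
    # token_freqs: (num_tokens,) corpus frequency of each position's token
    weights = token_freqs / token_freqs.sum()
    return (activations.abs() * weights[:, None]).sum(dim=0)  # (ffn_dim,)

acts = torch.randn(1000, 4096)
freqs = torch.rand(1000)
scores = channel_importance(acts, freqs)
keep = scores.topk(int(0.75 * scores.numel())).indices  # keep top 75% of channels
print(keep.shape)
```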
[147] Beyond I’m Sorry, I Can’t: Dissecting Large Language Model Refusal
Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee
Main category: cs.CL
TL;DR: Researchers use sparse autoencoders to identify and manipulate features in LLMs that cause refusal behavior on harmful prompts, creating jailbreaks and revealing mechanistic insights into safety behaviors.
Details
Motivation: To understand the internal causes of refusal behavior in instruction-tuned LLMs on harmful prompts, as this safety mechanism remains poorly understood despite its importance.
Method: A three-stage pipeline: (1) Find refusal-mediating direction and collect SAE features, (2) Greedy filtering to minimal feature set, (3) Use factorization machine to capture nonlinear interactions among active features.
Result: Identified jailbreak-critical features that causally influence refusal behavior, discovered redundant features that activate when primary refusal features are suppressed, and demonstrated ability to flip models from refusal to compliance.
Conclusion: The approach enables fine-grained auditing and targeted intervention in safety behaviors by manipulating interpretable latent spaces, revealing the mechanistic basis of refusal in LLMs.
Abstract: Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
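The kind of causal intervention this line of work builds on can be sketched as projecting a refusal-mediating direction out of residual-stream activations. The SAE feature search and factorization-machine stages are omitted here, and `refusal_dir` is assumed to come from that analysis.

```python
# Remove the component of the residual stream along a refusal direction.
import torch

def ablate_direction(resid: torch.Tensor, refusal_dir: torch.Tensor):
    # resid: (..., d_model); refusal_dir: (d_model,)
    d = refusal_dir / refusal_dir.norm()
    return resid - (resid @ d)[..., None] * d

resid = torch.randn(4, 16, 512)       # batch, sequence, d_model
direction = torch.randn(512)
out = ablate_direction(resid, direction)
print((out @ (direction / direction.norm())).abs().max())  # ~0: component removed
```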
[148] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang
Main category: cs.CL
TL;DR: ZeroRepo introduces Repository Planning Graph (RPG) for structured repository generation, achieving 3.9x larger code output and significant improvements in test coverage and accuracy compared to baselines.
Details
Motivation: Current natural language planning approaches for repository generation produce unclear specifications, misaligned components, and brittle designs due to ambiguity and lack of structure.
Method: ZeroRepo uses a graph-driven framework with three stages: proposal-level planning, implementation-level construction, and graph-guided code generation with test validation, based on the Repository Planning Graph (RPG) representation.
Result: On RepoCraft benchmark (6 projects, 1,052 tasks), ZeroRepo generated 36K code lines and 445K tokens (3.9x larger than Claude Code), achieving 81.5% coverage and 69.7% test accuracy (27.3 and 35.8 point improvements over Claude Code).
Conclusion: RPG effectively models complex dependencies, enables sophisticated planning with near-linear scaling, and improves repository understanding, accelerating localization in code generation tasks.
Abstract: Large language models excel at generating individual functions or single files of code, yet generating complete repositories from scratch remains a fundamental challenge. This capability is key to building coherent software systems from high-level specifications and realizing the full potential of automated code generation. The process requires planning at two levels: deciding what features and modules to build (proposal stage) and defining their implementation details (implementation stage). Current approaches rely on natural language planning, which often produces unclear specifications, misaligned components, and brittle designs due to its inherent ambiguity and lack of structure. To address these limitations, we introduce the Repository Planning Graph (RPG), a structured representation that encodes capabilities, file structures, data flows, and functions in a unified graph. By replacing free-form natural language with an explicit blueprint, RPG enables consistent long-horizon planning for repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework that operates in three stages: proposal-level planning, implementation-level construction, and graph-guided code generation with test validation. To evaluate, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces nearly 36K lines of code and 445K code tokens, on average 3.9$\times$ larger than the strongest baseline (Claude Code) and 68$\times$ larger than other baselines. It achieves 81.5% coverage and 69.7% test accuracy, improving over Claude Code by 27.3 and 35.8 points. Further analysis shows that RPG models complex dependencies, enables more sophisticated planning through near-linear scaling, and improves agent understanding of repositories, thus accelerating localization.
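The kind of structured blueprint RPG substitutes for free-form planning can be sketched as a small typed graph over capabilities, files, functions, and data flows; the schema below is illustrative, not the paper's.

```python
# A toy planning-graph node with typed children and data-flow edges.
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    name: str                       # capability, file, or function
    kind: str                       # "capability" | "file" | "function"
    children: list["PlanNode"] = field(default_factory=list)
    data_flows_to: list[str] = field(default_factory=list)

root = PlanNode("csv-analytics", "capability", children=[
    PlanNode("io/loader.py", "file", children=[
        PlanNode("load_csv", "function", data_flows_to=["compute_stats"])]),
    PlanNode("stats/core.py", "file", children=[
        PlanNode("compute_stats", "function")]),
])
print(root.children[0].children[0].data_flows_to)  # ['compute_stats']
```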
[149] Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system
Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, Eiji Aramaki
Main category: cs.CL
TL;DR: Study adapts HealthBench medical benchmark to Japanese context, revealing performance gaps in models due to rubric mismatches and cultural misalignment, highlighting need for localized evaluation framework.
Details
Motivation: Address scarcity of Japanese medical evaluation resources and limitations of translated multiple-choice questions for safe development of medical LLMs in Japan.
Method: Used machine-translated HealthBench scenarios to evaluate GPT-4.1 and LLM-jp-3.1 models, plus LLM-as-a-Judge approach to identify contextual gaps in clinical guidelines, healthcare systems and cultural norms.
Result: GPT-4.1 showed a modest performance drop due to rubric mismatches, the Japanese-native model failed significantly on clinical completeness, and a significant proportion of rubric criteria require localization despite most scenarios being applicable.
Conclusion: Direct benchmark translation has limitations, urgent need for context-aware localized adaptation (J-HealthBench) for reliable and safe evaluation of medical LLMs in Japan.
Abstract: This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. Although robust evaluation frameworks are essential for the safe development of medical LLMs, resources in Japanese are scarce and often consist of translated multiple-choice questions. Our research addresses this issue in two ways. First, we establish a performance baseline by applying a machine-translated version of HealthBench’s 5,000 scenarios to evaluate two models: a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we use an LLM-as-a-Judge approach to systematically classify the benchmark’s scenarios and rubric criteria. This allows us to identify ‘contextual gaps’ where the content is misaligned with Japan’s clinical guidelines, healthcare systems or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches, as well as a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification shows that, despite most scenarios being applicable, a significant proportion of the rubric criteria require localisation. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localised adaptation, a “J-HealthBench”, to ensure the reliable and safe evaluation of medical LLMs in Japan.
[150] CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan
Main category: cs.CL
TL;DR: CFDLLMBench is a benchmark suite for evaluating LLMs’ capabilities in automating Computational Fluid Dynamics (CFD) experiments, testing graduate-level knowledge, numerical reasoning, and workflow implementation.
Details
Motivation: LLMs show strong NLP performance but their utility in automating complex physical system experiments like CFD remains underexplored, despite CFD being a critical computational science tool.
Method: Developed CFDLLMBench with three components: CFDQuery (CFD knowledge), CFDCodeBench (numerical/physical reasoning), and FoamBench (workflow implementation), using real-world CFD practices with rigorous evaluation framework.
Result: The benchmark provides reproducible results quantifying LLM performance across code executability, solution accuracy, and numerical convergence behavior.
Conclusion: CFDLLMBench establishes a foundation for developing and evaluating LLM-driven automation of numerical experiments for complex physical systems.
Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments on complex physical systems – a critical and labor-intensive component – remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components – CFDQuery, CFDCodeBench, and FoamBench – designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
[151] InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models
Wenjun Wang, Shuo Cai, Congkai Xie, Mingfa Feng, Yiming Zhang, Zhen Li, Kejing Yang, Ming Li, Jiannong Cao, Hongxia Yang
Main category: cs.CL
TL;DR: An end-to-end FP8 training recipe for LLMs that achieves near-lossless performance with significant efficiency gains over BF16 baseline.
Details
Motivation: High computational cost of training LLMs is a major barrier to innovation, and while FP8 offers theoretical efficiency gains, there's a lack of comprehensive open-source training recipes.
Method: Fine-grained, hybrid-granularity quantization strategy for FP8 training that integrates continual pre-training and supervised fine-tuning.
Result: Recipe is stable and essentially lossless, achieving performance on par with BF16 baseline while reducing training time by 22%, peak memory usage by 14%, and increasing throughput by 19%.
Conclusion: FP8 is established as a practical and robust alternative to BF16, with code release to democratize large-scale model training.
Abstract: The immense computational cost of training Large Language Models (LLMs) presents a major barrier to innovation. While FP8 training offers a promising solution with significant theoretical efficiency gains, its widespread adoption has been hindered by the lack of a comprehensive, open-source training recipe. To bridge this gap, we introduce an end-to-end FP8 training recipe that seamlessly integrates continual pre-training and supervised fine-tuning. Our methodology employs a fine-grained, hybrid-granularity quantization strategy to maintain numerical fidelity while maximizing computational efficiency. Through extensive experiments, including the continued pre-training of models on a 160B-token corpus, we demonstrate that our recipe is not only remarkably stable but also essentially lossless, achieving performance on par with the BF16 baseline across a suite of reasoning benchmarks. Crucially, this is achieved with substantial efficiency improvements, including up to a 22% reduction in training time, a 14% decrease in peak memory usage, and a 19% increase in throughput. Our results establish FP8 as a practical and robust alternative to BF16, and we will release the accompanying code to further democratize large-scale model training.
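As background, the coarsest form of FP8 quantization that such recipes refine can be sketched as per-tensor scaling into the E4M3 range; the paper's fine-grained, hybrid-granularity strategy is more involved. The sketch assumes a recent PyTorch build with float8 dtypes.

```python
# Per-tensor FP8 (E4M3) quantize/dequantize round trip.
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.float32) * scale

w = torch.randn(1024, 1024)
w8, s = quantize_fp8(w)
print((dequantize_fp8(w8, s) - w).abs().mean())  # mean quantization error
```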
[152] RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection
Daocheng Fu, Jianbiao Mei, Licheng Wen, Xuemeng Yang, Cheng Yang, Rong Wu, Tao Hu, Siqi Li, Yufan Shen, Xinyu Cai, Pinlong Cai, Botian Shi, Yong Liu, Yu Qiao
Main category: cs.CL
TL;DR: RE-Searcher is a search agent that combines goal-oriented planning and self-reflection to improve LLM-powered search robustness in complex environments.
Details
Motivation: LLMs face challenges in real-world deployment due to knowledge cutoff, hallucination, and limited interaction modalities. While external search tools help, they expose agents to complex search environments where small query variations can lead to unproductive reasoning and error amplification.
Method: RE-Searcher explicitly articulates concrete search goals during search and reflects on whether retrieved evidence satisfies those goals, combining goal-oriented planning with self-reflection.
Result: Extensive experiments show improved search accuracy and state-of-the-art results. Perturbation studies demonstrate substantial resilience to noisy or misleading external signals, mitigating search process fragility.
Conclusion: The findings offer practical guidance for integrating LLM-powered agents into complex interactive environments and enabling more autonomous decision-making through robust search mechanisms.
Abstract: Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
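The plan-then-reflect loop the abstract describes can be sketched as follows; `llm` and `search` are hypothetical stand-ins for a model call and a retrieval tool, and the real agent's prompting is richer.

```python
# Goal-oriented search with self-reflection: state a goal, retrieve, check.
def re_searcher_loop(question, llm, search, max_steps=3):
    evidence = []
    for _ in range(max_steps):
        goal = llm(f"State a concrete search goal for: {question}\n"
                   f"Evidence so far: {evidence}")
        evidence.extend(search(goal))
        verdict = llm(f"Does this evidence satisfy the goal '{goal}'? "
                      f"Answer YES or NO.\nEvidence: {evidence}")
        if verdict.strip().upper().startswith("YES"):
            break
    return llm(f"Answer '{question}' using only: {evidence}")

# Trivial stand-ins just to show the control flow executes:
answer = re_searcher_loop(
    "Who wrote Dune?",
    llm=lambda p: "YES" if "YES or NO" in p else "Frank Herbert",
    search=lambda goal: [f"search result for: {goal}"],
)
print(answer)
```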
[153] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi
Main category: cs.CL
TL;DR: The paper identifies typicality bias in preference data as a fundamental cause of mode collapse in LLM alignment, and proposes Verbalized Sampling (VS) - a training-free prompting strategy that significantly improves diversity across creative tasks without sacrificing accuracy.
Details
Motivation: To address mode collapse in post-training alignment of LLMs, which prior work attributed to algorithmic limitations, by identifying a data-level driver: typicality bias in preference data where annotators systematically favor familiar text.
Method: Formalized typicality bias theoretically, verified it empirically on preference datasets, and introduced Verbalized Sampling - a prompting strategy that asks models to verbalize probability distributions over multiple responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities").
Result: VS significantly improved performance across creative writing (1.6-2.1x diversity increase), dialogue simulation, open-ended QA, and synthetic data generation without sacrificing factual accuracy and safety. More capable models benefited more from VS.
Conclusion: The work provides a data-centric perspective on mode collapse and a practical inference-time remedy (Verbalized Sampling) that helps unlock pre-trained generative diversity in LLMs.
Abstract: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., “Generate 5 jokes about coffee and their corresponding probabilities”). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
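Since the abstract gives the prompt format directly, a VS-style prompt is easy to sketch; the exact wording beyond the quoted coffee-jokes example is an assumption.

```python
# Build a Verbalized Sampling prompt asking for k responses with probabilities.
def verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    return (
        f"Generate {k} responses to the task below, sampled from the full "
        f"distribution of responses you could give, and state each response's "
        f"probability.\n\nTask: {task}"
    )

print(verbalized_sampling_prompt("Write a joke about coffee."))
```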
[154] Machine Learning for Detection and Analysis of Novel LLM Jailbreaks
John Hawkins, Aditya Pramar, Rodney Beard, Rohitash Chandra
Main category: cs.CL
TL;DR: The study analyzes machine learning models’ ability to distinguish jailbreak prompts from genuine uses in LLMs, finding that fine-tuned BERT models perform best at identifying both known and novel jailbreak strategies.
Details
Motivation: LLMs are vulnerable to jailbreak prompts that circumvent safety guardrails, creating a need for effective detection methods to identify malicious input manipulation.
Method: Evaluated different ML models for jailbreak detection, with focus on fine-tuning BERT models end-to-end using current datasets, and analyzed distinguishing keywords through visualization.
Result: Fine-tuned BERT models achieved the best performance in distinguishing jailbreak prompts from genuine uses, including identifying previously unseen jailbreak strategies.
Conclusion: Explicit reflexivity in prompt structure can signal jailbreak intention, and fine-tuned BERT models are currently the most effective approach for jailbreak detection.
Abstract: Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer’s policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that, on current datasets, the best performance is achieved by fine-tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end to identify jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.
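The winning approach is a standard supervised fine-tune. Here is a minimal sketch of one training step for a binary jailbreak classifier with Hugging Face Transformers; dataset handling, batching, and evaluation are omitted, and the two example prompts are invented.

```python
# One supervised step of end-to-end BERT fine-tuning for jailbreak detection.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # 0 = genuine, 1 = jailbreak
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

prompts = ["Ignore all previous instructions and ...",
           "What is the capital of France?"]
labels = torch.tensor([1, 0])
batch = tok(prompts, padding=True, truncation=True, return_tensors="pt")

out = model(**batch, labels=labels)      # loss is computed internally
out.loss.backward()
optim.step()
print(out.logits.softmax(dim=-1))        # per-class probabilities
```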
[155] Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Zain, Momina Hafeez, Grigori Sidorov
Main category: cs.CL
TL;DR: Proposes QLoRA-based fine-tuning framework for offensive language detection in Roman Urdu-English code mixed text, achieving best results with Meta LLaMA 3 8B (91.45 F1 score).
Details
Motivation: Address challenges in NLP for code mixed languages like Roman Urdu due to unstated grammar, inconsistent spelling, and scarce labeled data for offensive language detection.
Method: Translate Roman Urdu-English dataset to English using Google Translate, then fine-tune multiple LLMs (Meta LLaMA 3 8B, Mistral 7B, LLaMA 2 7B, ModernBERT, RoBERTa) with QLoRA for memory-efficient adaptation on manually annotated dataset.
Result: Meta LLaMA 3 8B achieved highest F1 score of 91.45, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines.
Conclusion: QLoRA enables effective fine-tuning of high-performing models for low-resource code mixed offensive language detection, demonstrating LLM potential and advancing scalable Roman Urdu moderation.
Abstract: The use of derogatory terms in languages that employ code mixing, such as Roman Urdu, presents challenges for Natural Language Processing systems due to unstated grammar, inconsistent spelling, and a scarcity of labeled data. In this work, we propose a QLoRA-based fine-tuning framework to improve offensive language detection in Roman Urdu-English text. We translated the Roman Urdu-English code-mixed dataset into English using Google Translate to leverage English LLMs, while acknowledging that this translation reduces direct engagement with code-mixing features. Our focus is on classification performance using English-translated, low-resource inputs. We fine-tuned several transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory-efficient adaptation. Models were trained and evaluated on a manually annotated Roman Urdu dataset for offensive vs. non-offensive content. Of all tested models, the highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral 7B at 89.66, surpassing traditional transformer baselines. These results demonstrate the efficacy of QLoRA in fine-tuning high-performing models for low-resource environments such as code-mixed offensive language detection, and confirm the potential of LLMs for this task. This work advances a scalable approach to Roman Urdu moderation and paves the way for future multilingual offensive detection systems based on LLMs.
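For reference, a typical QLoRA setup with the Hugging Face peft/bitsandbytes stack looks like the sketch below; the model name and hyperparameters are illustrative rather than the paper's exact configuration, and a CUDA environment with bitsandbytes is assumed.

```python
# Load a base model in 4-bit NF4 and attach trainable LoRA adapters (QLoRA).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small adapter weights train
```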
[156] Scalable multilingual PII annotation for responsible AI in LLMs
Bharti Meena, Joanna Skubisz, Harshit Rajgarhia, Nand Dave, Kiran Ganesh, Shivali Dalmia, Abhishek Mukherji, Vasudevan Sundarababu
Main category: cs.CL
TL;DR: A scalable multilingual framework for PII annotation across 13 underrepresented locales, improving annotation quality through human-in-the-loop methodology and analytics-driven pipelines.
Details
Motivation: Ensure reliable handling of Personally Identifiable Information (PII) by LLMs across diverse regulatory contexts as adoption increases.
Method: Phased human-in-the-loop annotation methodology combining linguistic expertise with quality assurance, using inter-annotator agreement metrics and root-cause analysis.
Result: Substantial improvements in recall and false positive rates across pilot, training, and production phases, creating high-fidelity datasets for supervised LLM fine-tuning.
Conclusion: Iterative, analytics-driven pipelines enhance both annotation quality and downstream model reliability, addressing common annotator challenges in multilingual PII labeling.
Abstract: As Large Language Models (LLMs) gain wider adoption, ensuring their reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales, covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates from pilot, training, and production phases. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, resulting in high-fidelity datasets suitable for supervised LLM fine-tuning. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.
[157] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap
Main category: cs.CL
TL;DR: EVALUESTEER is a benchmark to measure LLMs’ and reward models’ steerability toward diverse user value and style preferences, revealing current models struggle with complex user profiles.
Details
Motivation: To address the need for pluralistic AI systems that accommodate diverse global user preferences and values, and fill the gap in existing datasets for controlled evaluation of reward model steering.
Method: Synthetically generated 165,888 preference pairs systematically varying 4 value dimensions (traditional, secular-rational, survival, self-expression) and 4 style dimensions (verbosity, readability, confidence, warmth), then evaluated 6 LLMs and RMs under 11 prompting conditions and 6 preference comparison scenarios.
Result: Best models achieved <75% accuracy when given full user profiles, compared to >99% accuracy when only relevant preferences were provided, showing limitations in identifying and adapting to relevant user profile information.
Conclusion: EVALUESTEER highlights current RMs’ limitations in handling complex user profiles and provides a challenging testbed for developing models that can be steered toward diverse human values and preferences.
Abstract: As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs’ and reward models’ (RMs) steerability towards users’ value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs – systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user’s preferences. We evaluate six open-source and proprietary LLMs and RMs under eleven systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user’s full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.
[158] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?
Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Main category: cs.CL
TL;DR: This paper investigates jailbreaking in large language models by analyzing internal representations, focusing on how hidden layers respond to jailbreak vs benign prompts in GPT-J and Mamba2 models.
Details
Motivation: Jailbreaking LLMs is a pressing concern as adversarial users exploit models through engineered prompts to elicit restricted outputs, and existing defenses are insufficient against evolving attack techniques.
Method: Analyzed internal representations of LLMs by examining how hidden layers respond to jailbreak versus benign prompts, specifically studying GPT-J and Mamba2 models.
Result: Found distinct layer-wise behaviors in how models process jailbreak versus benign prompts, revealing differences in internal model dynamics.
Conclusion: The findings suggest promising directions for leveraging internal model dynamics to develop more robust jailbreak detection and defense mechanisms.
Abstract: Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.
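To make the probing setup concrete, here is a minimal sketch, assuming access to GPT-J through Hugging Face transformers, that extracts mean-pooled hidden states per layer and compares a benign prompt against a jailbreak-style one; the prompts and the cosine-distance metric are illustrative, not the paper's protocol.

```python
# Sketch: compare per-layer hidden-state statistics for a benign vs. a
# jailbreak-style prompt. Prompts and the distance metric are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "EleutherAI/gpt-j-6b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

def layer_means(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # one mean-pooled vector per layer: shape (n_layers + 1, hidden_dim)
    return torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])

benign = layer_means("Explain how photosynthesis works.")
attack = layer_means("Ignore all previous instructions and ...")

# cosine distance per layer; strongly divergent layers hint at detectable signatures
dist = 1 - torch.nn.functional.cosine_similarity(benign, attack, dim=-1)
print(dist)
```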
[159] Online Rubrics Elicitation from Pairwise Comparisons
MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, Afra Feyza Akyürek
Main category: cs.CL
TL;DR: OnlineRubrics is a method that dynamically curates evaluation criteria through pairwise comparisons during LLM training, improving performance by up to 8% over static rubrics across multiple benchmarks.
Details
Motivation: Static rubrics in LLM training are vulnerable to reward-hacking and fail to capture emergent desiderata during training, limiting their effectiveness for open-ended long-form answers.
Method: Online Rubrics Elicitation (OnlineRubrics) dynamically curates evaluation criteria through pairwise comparisons of responses from current and reference policies, enabling continuous error identification and mitigation.
Result: The approach yields consistent improvements of up to 8% over training with static rubrics across AlpacaEval, GPQA, ArenaHard, and expert question validation sets.
Conclusion: Dynamic rubric elicitation through pairwise comparisons enables better LLM training by capturing emergent criteria and mitigating reward-hacking, with identified themes including transparency, practicality, organization, and reasoning.
Abstract: Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
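The loop below is only a schematic of online rubric elicitation under stated assumptions: judge_llm is a hypothetical stub standing in for the LLM call that compares responses from the current and reference policies and proposes new criteria.

```python
# Schematic of the online elicitation loop; judge_llm is a hypothetical helper.
def judge_llm(prompt, current_resp, reference_resp, rubric):
    """Stub: return criteria that distinguish the two responses."""
    return ["cites evidence for each claim"]  # placeholder output

def online_rubric_update(prompts, policy, reference, rubric):
    for p in prompts:
        cur, ref = policy(p), reference(p)
        for criterion in judge_llm(p, cur, ref, rubric):
            if criterion not in rubric:      # keep only newly emergent criteria
                rubric.append(criterion)
    return rubric

rubric = online_rubric_update(
    ["Summarize the trial results."],
    policy=lambda p: "current-policy answer",
    reference=lambda p: "reference-policy answer",
    rubric=["answers the question directly"],
)
print(rubric)
```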
[160] Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation
Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
Main category: cs.CL
TL;DR: HaystackCraft is a new benchmark that extends needle-in-a-haystack tests to evaluate LLM robustness against noisy contexts from biased retrieval and agentic workflows, revealing persistent challenges in long-context reasoning.
Details
Motivation: Existing NIAH benchmarks overlook how noisy contexts arise from real-world factors like biased retrieval and agentic workflows, failing to test model robustness in practical scenarios.
Method: Built HaystackCraft on English Wikipedia hyperlink network with multi-hop questions, evaluating heterogeneous retrieval strategies and extending to dynamic agentic settings where models refine queries and reflect on reasoning.
Result: Experiments with 15 models show: (1) stronger retrievers create more challenging distractors but graph-based reranking helps; (2) advanced models suffer cascading failures from self-generated distractors and struggle with early stopping in agentic tests.
Conclusion: HaystackCraft reveals persistent challenges in agentic long-context reasoning and serves as a valuable testbed for future model development.
Abstract: Modern long-context large language models (LLMs) perform well on synthetic “needle-in-a-haystack” (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors – distraction from heterogeneous biased retrievers and cascading errors in agentic workflows – to test models’ long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasoning, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to stop early. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
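A toy version of haystack construction in this spirit: retriever-ranked distractor passages with the needle inserted at a controllable depth. Passage contents and the depth parameter are placeholders, not the benchmark's actual construction.

```python
# Toy haystack assembly: place a needle among ranked distractors.
def build_haystack(needle, distractors, depth=0.5):
    """Insert the needle at a relative depth within ranked distractor passages."""
    pos = int(len(distractors) * depth)
    passages = distractors[:pos] + [needle] + distractors[pos:]
    return "\n\n".join(passages)

ranked = [f"[distractor {i} from a retriever]" for i in range(8)]
context = build_haystack("[needle: the fact the question asks about]", ranked)
```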
[161] Populism Meets AI: Advancing Populism Research with LLMs
Eduardo Ryô Tamaki, Yujin J. Jung, Julia Chatterley, Grant Mitchell, Semir Dzebo, Cristóbal Sandoval, Levente Littvay, Kirk A. Hawkins
Main category: cs.CL
TL;DR: A rubric- and anchor-guided chain-of-thought prompting approach enables LLMs to measure populist ideational content with accuracy comparable to expert human coders, overcoming limitations of traditional textual analysis methods.
Details
Motivation: Traditional textual analysis methods for measuring populism are costly, time-consuming, and difficult to scale across languages, contexts, and large corpora, creating a need for more efficient approaches.
Method: Used a rubric- and anchor-guided chain-of-thought prompting approach that mirrors human coder training, leveraging the Global Populism Database and prompting LLMs with adapted documentation used to train human coders.
Result: The domain-specific prompting strategy enabled LLMs to achieve classification accuracy on par with expert human coders, demonstrating ability to navigate nuanced, context-sensitive aspects of populism.
Conclusion: LLMs with specialized prompting strategies can effectively measure populist ideational content at scale, providing a viable alternative to traditional human coding approaches.
Abstract: Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field’s foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time-consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric- and anchor-guided chain-of-thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders’ speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model’s reasoning. We then test multiple proprietary and open-weight models by replicating scores in the GPD. Our findings reveal that this domain-specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context-sensitive aspects of populism.
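A hypothetical rubric-and-anchor CoT prompt in the spirit of this approach; the rubric wording and anchor excerpts below are invented, not taken from the GPD coder documentation.

```python
# Illustrative rubric-and-anchor CoT prompt construction.
RUBRIC = """Score the speech from 0 (not populist) to 2 (extremely populist).
Populism frames politics as a moral struggle between a virtuous people
and a corrupt elite."""

ANCHORS = [  # (invented excerpt, anchor score)
    ("'The corrupt elites have betrayed our people at every turn...'", 2),
    ("'We propose a 3% adjustment to the municipal budget...'", 0),
]

def build_prompt(speech):
    anchor_text = "\n".join(f"Example (score {s}): {t}" for t, s in ANCHORS)
    return (f"{RUBRIC}\n\n{anchor_text}\n\n"
            f"Speech: {speech}\n"
            "Reason step by step about the people/elite framing, "
            "then output a final score.")
```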
[162] Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics
Rasika Muralidharan, Haewoon Kwak, Jisun An
Main category: cs.CL
TL;DR: This paper examines team dynamics in LLM-powered multi-agent systems, finding flat structures outperform hierarchical ones and diversity has nuanced effects, with agents showing overconfidence but appreciation for collaboration.
Details
Motivation: To explore team dynamics in LLM-powered multi-agent systems, inspired by human team science principles, as current research focuses less on team aspects despite growing interest in MAS with LLM agents.
Method: Proposed a multi-agent framework to examine team science aspects (structure, diversity, interaction dynamics) and evaluated team performance across four reasoning tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate.
Result: Flat teams performed better than hierarchical ones, diversity had nuanced impact, agents were overconfident about team performance, and post-task reflections showed appreciation for collaboration but challenges in integration and conversational coordination.
Conclusion: Team structure significantly impacts performance in LLM-powered multi-agent systems, with flat organizations outperforming hierarchical ones, while collaboration dynamics reveal both benefits and coordination challenges that need addressing.
Abstract: Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.
[163] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Main category: cs.CL
TL;DR: Tiny vision-language models (<=2B parameters) perform poorly as automated judges in chart comprehension tasks. The paper proposes multi-criteria prompting and domain-adaptive transfer learning to create ChartJudge, a specialized 2B-parameter model that effectively transfers knowledge across datasets.
Details
Motivation: Large Vision-Language Models (7B parameters) work well as automated judges for chart comprehension, but tiny models (<=2B) perform poorly, limiting their use in resource-constrained settings where cost efficiency is crucial.
Method: Two approaches: (1) multi-criteria prompting that combines separate evaluation criteria into single queries, and (2) domain-adaptive transfer learning by fine-tuning a 2B-parameter LVLM on synthetic judgments from chart datasets to create ChartJudge.
Result: Multi-criteria prompting revealed robustness gaps causing huge performance drops in 7B models including specialized judges like LLaVA-Critic. ChartJudge effectively transferred knowledge across datasets and became more specialized. Fine-grained analysis provided insights into trade-offs between model size, prompt design, and transferability.
Conclusion: The approaches enable scalable, low-cost evaluation for chart reasoning tasks by balancing model size, prompt design, and transfer learning, making tiny LVLMs viable for automated chart comprehension judgments in resource-constrained environments.
Abstract: Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks.
[164] ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code
Jian Xie, Zhendong Chu, Aoxiao Zhong, Kai Zhang, Mingzhe Han, Xing Fan, Jialie Shen, Qingsong Wen
Main category: cs.CL
TL;DR: ARM2 is a unified model that adaptively balances reasoning performance and efficiency across multiple formats through reinforcement learning with length-aware optimization, reducing token usage by over 70% while maintaining performance.
Details
Motivation: Large Reasoning Models suffer from 'over-thinking', generating unnecessarily long reasoning on simple tasks. Existing solutions like length penalties or routing mechanisms are heuristic and task-specific, lacking a general framework for adaptive reasoning.
Method: ARM2 uses a reinforcement learning framework augmented with length-aware optimization. It integrates vision understanding for multimodal applications and incorporates executable code into reasoning to reduce token costs while preserving performance.
Result: ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. Extensive analyses validate its effectiveness and design soundness.
Conclusion: ARM2 provides a general framework for adaptive reasoning that effectively balances performance and efficiency across multiple formats, significantly reducing computational costs while maintaining reasoning quality.
Abstract: Large Reasoning Models (LRMs) often suffer from the “over-thinking” problem, generating unnecessarily long reasoning on simple tasks. Some strategies have been proposed to mitigate this issue, such as length penalties or routing mechanisms, but they are typically heuristic and task-specific, lacking a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal settings. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
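One plausible form of length-aware reward shaping, sketched under the assumption of a simple token budget; ARM2's actual objective is not specified in the abstract, so the function and constants below are illustrative.

```python
# Illustrative length-aware reward: task reward minus a per-token
# penalty beyond a budget. Not ARM2's exact formulation.
def length_aware_reward(correct, n_tokens, budget=512, penalty=0.001):
    task_reward = 1.0 if correct else 0.0
    overflow = max(0, n_tokens - budget)
    return task_reward - penalty * overflow

print(length_aware_reward(True, 2000))  # 1.0 - 0.001 * 1488 = -0.488
```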
cs.CV
[165] Zero-shot image privacy classification with Vision-Language Models
Alina Elena Baia, Alessio Xompero, Andrea Cavallaro
Main category: cs.CV
TL;DR: VLMs underperform specialized models in image privacy prediction despite being more resource-intensive, though they show higher robustness to image perturbations.
Details
Motivation: To address the lack of systematic evaluation comparing large Vision-Language Models (VLMs) against specialized models for image privacy prediction, and to establish a fair benchmark.
Method: Established a zero-shot benchmark for image privacy classification, evaluated top-3 open-source VLMs using task-aligned prompts, and compared their performance, efficiency, and robustness against established vision-only and multi-modal methods.
Result: VLMs lag behind specialized, smaller models in privacy prediction accuracy despite their higher parameter count and slower inference, but exhibit higher robustness to image perturbations.
Conclusion: Specialized models currently outperform VLMs in image privacy prediction accuracy, challenging the trend of adopting generic VLMs for this specific task.
Abstract: While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
[166] Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes
Nirmal Elamon, Rouzbeh Davoudi
Main category: cs.CV
TL;DR: Fine-tuned multi-modal LLMs achieve up to 36% accuracy improvement on artificial text overlay detection using only 1,000 images, outperforming CNNs that need much more data.
Details
Motivation: To bridge vision and language by exploring how multi-modal LLMs can be adapted for precise visual understanding with minimal supervision, addressing the suboptimal performance of out-of-the-box LLMs on specialized visual tasks.
Method: Comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on artificial text overlay detection in images.
Result: LLMs fine-tuned on very limited data (fewer than 1,000 images) achieved up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data.
Conclusion: LLM-based approaches show superior adaptability and data efficiency for real-world object detection tasks, providing actionable guidance for applying multi-modal transformers in low-resource visual environments.
Abstract: The field of object detection and understanding is rapidly evolving, driven by advances in both traditional CNN-based models and emerging multi-modal large language models (LLMs). While CNNs like ResNet and YOLO remain highly effective for image-based tasks, recent transformer-based LLMs introduce new capabilities such as dynamic context reasoning, language-guided prompts, and holistic scene understanding. However, when used out-of-the-box, the full potential of LLMs remains underexploited, often resulting in suboptimal performance on specialized visual tasks. In this work, we conduct a comprehensive comparison of fine-tuned traditional CNNs, zero-shot pre-trained multi-modal LLMs, and fine-tuned multi-modal LLMs on the challenging task of artificial text overlay detection in images. A key contribution of our study is demonstrating that LLMs can be effectively fine-tuned on very limited data (fewer than 1,000 images) to achieve up to 36% accuracy improvement, matching or surpassing CNN-based baselines that typically require orders of magnitude more data. By exploring how language-guided models can be adapted for precise visual understanding with minimal supervision, our work contributes to the broader effort of bridging vision and language, offering novel insights into efficient cross-modal learning strategies. These findings highlight the adaptability and data efficiency of LLM-based approaches for real-world object detection tasks and provide actionable guidance for applying multi-modal transformers in low-resource visual environments. To support continued progress in this area, we have made the code used to fine-tune the models available in our GitHub, enabling future improvements and reuse in related applications.
[167] Reproducible Evaluation of Data Augmentation and Loss Functions for Brain Tumor Segmentation
Saumya B
Main category: cs.CV
TL;DR: This paper presents a reproducible evaluation of U-Net segmentation for brain tumor MRI using focal loss and basic data augmentation, achieving 90% precision comparable to state-of-the-art methods.
Details
Motivation: Brain tumor segmentation faces challenges with class imbalance and limited model generalization, hindering progress in diagnosis and treatment planning.
Method: Used U-Net segmentation on brain tumor MRI with focal loss parameter tuning and three data augmentation techniques: horizontal flip, rotation, and scaling. Experiments conducted on a publicly available MRI dataset.
Result: U-Net with focal loss achieved 90% precision, comparable to state-of-the-art results. The study provides transparent, reproducible baseline with all code and results publicly available.
Conclusion: Establishes a reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
Abstract: Brain tumor segmentation is crucial for diagnosis and treatment planning, yet challenges such as class imbalance and limited model generalization continue to hinder progress. This work presents a reproducible evaluation of U-Net segmentation performance on brain tumor MRI using focal loss and basic data augmentation strategies. Experiments were conducted on a publicly available MRI dataset, focusing on focal loss parameter tuning and assessing the impact of three data augmentation techniques: horizontal flip, rotation, and scaling. The U-Net with focal loss achieved a precision of 90%, comparable to state-of-the-art results. By making all code and results publicly available, this study establishes a transparent, reproducible baseline to guide future research on augmentation strategies and loss function design in brain tumor segmentation.
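For reference, the focal loss used here has a standard closed form, FL(p_t) = -alpha (1 - p_t)^gamma log(p_t), which down-weights easy examples to counter class imbalance. A common PyTorch rendering for binary segmentation masks follows; the parameter values are illustrative, not the tuned settings from the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: float tensors of shape (batch, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                       # probability of the true class
    loss = alpha * (1 - p_t) ** gamma * bce     # (1 - p_t)^gamma down-weights easy pixels
    return loss.mean()
```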
[168] Adjusting Initial Noise to Mitigate Memorization in Text-to-Image Diffusion Models
Hyeonggeun Han, Sehwan Kim, Hyungjun Joo, Sangwoo Hong, Jungwoo Lee
Main category: cs.CV
TL;DR: The paper addresses memorization in text-to-image diffusion models by showing that initial noise samples affect when the denoising trajectory escapes memorization basins, and proposes strategies to adjust initial noise for earlier escape while maintaining image quality.
Details
Motivation: Text-to-image diffusion models often memorize and replicate training data, raising privacy and copyright concerns. Current approaches that delay classifier-free guidance (CFG) result in poor image-text alignment, creating a need for methods that enable earlier escape from memorization basins.
Method: Two mitigation strategies: (1) collective adjustment of initial noise to find samples that encourage earlier escape from memorization basins, and (2) individual adjustment of initial noise samples. Both approaches leverage the observation that different initial noise samples lead to varying escape times.
Result: The proposed approaches significantly reduce memorization while preserving image-text alignment, enabling earlier application of CFG without compromising output quality.
Conclusion: Initial noise selection is crucial for controlling memorization in diffusion models. Adjusting initial noise samples can effectively promote earlier escape from memorization basins, reducing privacy and copyright risks while maintaining generative performance.
Abstract: Despite their impressive generative capabilities, text-to-image diffusion models often memorize and replicate training data, prompting serious concerns over privacy and copyright. Recent work has attributed this memorization to an attraction basin, a region where applying classifier-free guidance (CFG) steers the denoising trajectory toward memorized outputs, and has proposed deferring CFG application until the denoising trajectory escapes this basin. However, such delays often result in non-memorized images that are poorly aligned with the input prompts, highlighting the need to promote earlier escape so that CFG can be applied sooner in the denoising process. In this work, we show that the initial noise sample plays a crucial role in determining when this escape occurs. We empirically observe that different initial samples lead to varying escape times. Building on this insight, we propose two mitigation strategies that adjust the initial noise, either collectively or individually, to find and utilize initial samples that encourage earlier basin escape. These approaches significantly reduce memorization while preserving image-text alignment.
[169] The Digital Mirror: Gender Bias and Occupational Stereotypes in AI-Generated Images
Siiri Leppälampi, Sonja M. Hyrynsalmi, Erno Vanhala
Main category: cs.CV
TL;DR: The study examines gender representation biases in AI-generated occupational images from DALL-E 3 and Ideogram, finding both tools reinforce traditional gender stereotypes to varying degrees.
Details
Motivation: Address the gap in AI visualization research that overlooks representational biases, particularly harmful gender biases in occupational settings, as AI image tools become more widely used.
Method: Generated over 750 AI images of occupations using DALL-E 3 and Ideogram, then conducted thematic analysis to evaluate gender representation biases.
Result: Both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated occupational images, though to different extents, risking narrow representations.
Conclusion: AI visualization tools risk reinforcing gender stereotypes; suggestions are proposed for practitioners, individuals, and researchers to improve gender representation in AI-generated images.
Abstract: Generative AI offers vast opportunities for creating visualisations, such as graphics, videos, and images. However, recent studies around AI-generated visualisations have primarily focused on the creation process and image quality, overlooking representational biases. This study addresses this gap by testing representation biases in AI-generated pictures in an occupational setting and evaluating how two AI image generator tools, DALL-E 3 and Ideogram, compare. Additionally, the study discusses topics such as ageing and emotions in AI-generated images. As AI image tools are becoming more widely used, addressing and mitigating harmful gender biases becomes essential to ensure diverse representation in media and professional settings. In this study, over 750 AI-generated images of occupations were prompted. The thematic analysis results revealed that both DALL-E 3 and Ideogram reinforce traditional gender stereotypes in AI-generated images, although to varying degrees. These findings emphasise that AI visualisation tools risk reinforcing narrow representations. In our discussion section, we propose suggestions for practitioners, individuals and researchers to increase representation when generating images with visible genders.
[170] Dynamic Mixture-of-Experts for Visual Autoregressive Model
Jort Vincenti, Metod Jazbec, Guoxuan Xia
Main category: cs.CV
TL;DR: Dynamic Mixture-of-Experts router integrated into Visual Autoregressive Models to reduce computational redundancy by 20% FLOPs and 11% faster inference while maintaining image quality.
Details
Motivation: Visual Autoregressive Models suffer from computational redundancy due to repeated Transformer calls at increasing resolutions, which needs optimization.
Method: Introduce a dynamic Mixture-of-Experts router with scale-aware thresholding that balances expert selection based on token complexity and resolution without additional training.
Result: Achieved 20% fewer FLOPs, 11% faster inference while matching the image quality of the dense baseline.
Conclusion: The proposed dynamic Mixture-of-Experts approach effectively reduces computational costs in VAR models without compromising image generation quality.
Abstract: Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture makes it possible to trade compute for quality through scale-aware thresholding. This thresholding strategy balances expert selection based on token complexity and resolution, without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline.
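A sketch of what scale-aware threshold routing can look like, assuming softmax gating per token; the threshold schedule, shapes, and constants are illustrative, not the paper's.

```python
import torch

def scale_aware_route(gate_logits, resolution, base_tau=0.05):
    """gate_logits: (n_tokens, n_experts). In this toy schedule, the
    threshold grows with resolution, so fewer experts fire at fine scales."""
    probs = torch.softmax(gate_logits, dim=-1)
    tau = base_tau * (resolution / 16)          # illustrative schedule
    keep = probs > tau
    # always keep the top-1 expert so every token is processed
    top1 = probs.argmax(dim=-1, keepdim=True)
    keep.scatter_(1, top1, True)
    return keep, probs * keep                   # expert mask and gated weights

logits = torch.randn(4, 8)                      # 4 tokens, 8 experts
mask, weights = scale_aware_route(logits, resolution=32)
```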
[171] Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs
Hanieh Shojaei Miandashti, Claus Brenner
Main category: cs.CV
TL;DR: An unsupervised OOD detection method using epistemic uncertainty from hierarchical Bayesian modeling of GMM parameters in neural network feature space, achieving significant improvements over existing uncertainty-based methods without requiring auxiliary data.
Details
Motivation: Existing unsupervised OOD detection methods conflate epistemic and aleatoric uncertainties, leading to misclassification of ambiguous in-distribution regions as OOD. Supervised methods require auxiliary OOD datasets, which is impractical.
Method: Uses hierarchical Bayesian modeling of Gaussian Mixture Model parameters in deep neural network feature space to derive epistemic uncertainty, eliminating the need for auxiliary data or additional training stages.
Result: Outperforms existing uncertainty-based methods on SemanticKITTI dataset with 18% improvement in AUROC, 22% increase in AUPRC, and 36% reduction in FPR95 (from 76% to 40%) compared to predictive entropy approach.
Conclusion: The proposed unsupervised approach effectively separates epistemic uncertainty from aleatoric uncertainty, providing superior OOD detection performance without requiring auxiliary OOD datasets or additional training.
Abstract: In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out-of-distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in-distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty-based methods on the SemanticKITTI dataset, achieving an 18% improvement in AUROC, 22% increase in AUPRC, and 36% reduction in FPR95 (from 76% to 40%), compared to the predictive entropy approach used in prior works.
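The epistemic/aleatoric separation at the heart of this approach can be illustrated with the generic mutual-information decomposition over posterior samples; the paper's hierarchical-GMM estimator is more specific than this sketch.

```python
# Generic split: epistemic = predictive entropy - expected entropy
# (mutual information between prediction and model parameters).
import numpy as np

def uncertainty_split(probs):
    """probs: (n_samples, n_points, n_classes) class probabilities from
    multiple posterior samples of the model."""
    mean_p = probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)            # predictive entropy
    aleatoric = -(probs * np.log(probs + 1e-12)).sum(-1).mean(0)  # expected entropy
    epistemic = total - aleatoric                                 # mutual information
    return epistemic, aleatoric

probs = np.random.dirichlet(np.ones(3), size=(10, 5))  # 10 samples, 5 points
epi, ale = uncertainty_split(probs)   # high epistemic values flag OOD candidates
```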
[172] Hi-OSCAR: Hierarchical Open-set Classifier for Human Activity Recognition
Conor McCarthy, Loes Quirijnen, Jan Peter van Zandwijk, Zeno Geradts, Marcel Worring
Main category: cs.CV
TL;DR: Hi-OSCAR is a hierarchical open-set classifier for activity recognition that identifies known activities with high accuracy while rejecting unknown ones, and can localize unknown activities to the nearest hierarchical node.
Details
Motivation: There's a gap between real-world activities and annotated sensor datasets, and existing HAR classifiers fail to handle unseen activities properly. Also, activity classes have varying degrees of similarity and some overlap or contain sub-activities.
Method: Organize activity classes into a structured hierarchy and propose Hi-OSCAR, which performs hierarchical open-set classification. Also collected a new public dataset NFI_FARED with 19 activities from various contexts.
Result: The method achieves state-of-the-art accuracy for known activities while simultaneously rejecting unknown activities, and can localize unknown classes to the nearest internal node.
Conclusion: Hi-OSCAR enables more reliable HAR by handling open-set scenarios and providing insights beyond binary known/unknown classification through hierarchical localization of unknown activities.
Abstract: Within Human Activity Recognition (HAR), there is an insurmountable gap between the range of activities performed in life and those that can be captured in an annotated sensor dataset used in training. Failure to properly handle unseen activities seriously undermines any HAR classifier’s reliability. Additionally within HAR, not all classes are equally dissimilar, some significantly overlap or encompass other sub-activities. Based on these observations, we arrange activity classes into a structured hierarchy. From there, we propose Hi-OSCAR: a Hierarchical Open-set Classifier for Activity Recognition, that can identify known activities at state-of-the-art accuracy while simultaneously rejecting unknown activities. This not only enables open-set classification, but also allows for unknown classes to be localized to the nearest internal node, providing insight beyond a binary “known/unknown” classification. To facilitate this and future open-set HAR research, we collected a new dataset: NFI_FARED. NFI_FARED contains data from multiple subjects performing nineteen activities from a range of contexts, including daily living, commuting, and rapid movements, which is fully public and available for download.
[173] 3D Reconstruction from Transient Measurements with Time-Resolved Transformer
Yue Li, Shida Sun, Yu Hong, Feihu Xu, Zhiwei Xiong
Main category: cs.CV
TL;DR: Proposes Time-Resolved Transformer (TRT) for 3D reconstruction in photon-efficient imaging, with specialized attention mechanisms for spatio-temporal transient measurements. Develops TRT-LOS for line-of-sight and TRT-NLOS for non-line-of-sight imaging.
Details
Motivation: Address challenges in 3D reconstruction from transient measurements due to low quantum efficiency of sensors and high noise levels, especially for long-range or complex scenes in photon-efficient imaging.
Method: TRT architecture with spatio-temporal self-attention encoders that explore local and global correlations by splitting/downsampling features, and spatio-temporal cross attention decoders that integrate local and global features in token space.
Result: Both TRT-LOS and TRT-NLOS significantly outperform existing methods on synthetic and real-world data from different imaging systems. Also contributes large-scale synthetic LOS dataset and real-world NLOS measurements.
Conclusion: TRT provides a generic and effective solution for 3D reconstruction in photon-efficient imaging, with demonstrated superior performance over existing methods across different imaging scenarios.
Abstract: Transient measurements, captured by time-resolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, challenges persist in their 3D reconstruction due to the low quantum efficiency of sensors and the high noise levels, particularly for long-range or complex scenes. To boost the 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Different from existing transformers designed for high-dimensional data, TRT has two elaborate attention designs tailored for the spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. Then, the spatio-temporal cross attention decoders integrate the local and global features in the token space, resulting in deep features with high representation capabilities. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at https://github.com/Depth2World/TRT.
[174] Detection of high-frequency oscillations using time-frequency analysis
Mostafa Mohammadpour, Mehdi Zekriyapanah Gashti, Yusif S. Gasimov
Main category: cs.CV
TL;DR: A novel unsupervised clustering method for detecting high-frequency oscillations (HFOs) in epilepsy patients achieves high accuracy (97.67% sensitivity, 98.57% precision) and shows strong correlation with surgical outcomes.
Details
Motivation: HFOs are promising biomarkers for identifying epileptogenic zones, but visual detection is time-consuming, labor-intensive, and subjective. Automated detection methods are needed for clinical use.
Method: Uses unsupervised clustering to categorize events from time-frequency domain using S-transform, differentiating HFOs from spikes, background activity, and artifacts in ripple and fast ripple bands (80-500 Hz).
Result: Achieved 97.67% sensitivity, 98.57% precision, and 97.78% F-score on controlled dataset. In patients, showed strong correlation (0.73 ratio) between HFOs rates in resected vs non-resected contacts.
Conclusion: HFOs are confirmed as promising biomarkers of epileptogenicity. Removing HFOs (especially fast ripple) leads to seizure freedom, while remaining HFOs lead to seizure recurrence.
Abstract: High-frequency oscillations (HFOs) are a new biomarker for identifying the epileptogenic zone. Mapping HFO-generating regions can improve the precision of resection sites in patients with refractory epilepsy. However, detecting HFOs remains challenging, and their clinical features are not yet fully defined. Visual identification of HFOs is time-consuming, labor-intensive, and subjective. As a result, developing automated methods to detect HFOs is critical for research and clinical use. In this study, we developed a novel method for detecting HFOs in the ripple and fast ripple frequency bands (80-500 Hz). We validated it using both controlled datasets and data from epilepsy patients. Our method employs an unsupervised clustering technique to categorize events extracted from the time-frequency domain using the S-transform. The proposed detector differentiates HFOs events from spikes, background activity, and artifacts. Compared to existing detectors, our method achieved a sensitivity of 97.67%, a precision of 98.57%, and an F-score of 97.78% on the controlled dataset. In epilepsy patients, our results showed a stronger correlation with surgical outcomes, with a ratio of 0.73 between HFOs rates in resected versus non-resected contacts. The study confirmed previous findings that HFOs are promising biomarkers of epileptogenicity in epileptic patients. Removing HFOs, especially fast ripple, leads to seizure freedom, while remaining HFOs lead to seizure recurrence.
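A compressed sketch of such a pipeline under stated substitutions: an STFT stands in for the S-transform, the event features and cluster count are illustrative, and the signals are synthetic rather than clinical recordings.

```python
# Band-pass to the 80-500 Hz range, extract time-frequency features per
# candidate event, then cluster into putative HFO / spike / artifact groups.
import numpy as np
from scipy.signal import butter, sosfiltfilt, stft
from sklearn.cluster import KMeans

fs = 2000                                                     # sampling rate, Hz
sos = butter(4, [80, 500], btype="bandpass", fs=fs, output="sos")

def event_features(sig):
    f, t, Z = stft(sosfiltfilt(sos, sig), fs=fs, nperseg=256)
    power = np.abs(Z) ** 2
    peak_freq = f[power.max(axis=1).argmax()]                 # frequency of peak power
    return [power.max(), power.mean(), peak_freq]

signals = [np.random.randn(fs) for _ in range(20)]            # toy 1 s candidate events
X = np.array([event_features(s) for s in signals])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```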
[175] Foraging with the Eyes: Dynamics in Human Visual Gaze and Deep Predictive Modeling
Tejaswi V. Panchagnula
Main category: cs.CV
TL;DR: Human visual gaze follows Levy walk dynamics similar to animal foraging, suggesting optimal information search efficiency. A CNN model can predict fixation patterns from images alone.
Details
Motivation: To understand the spatiotemporal statistics of human eye movements and explore whether visual gaze follows similar dynamics as animal foraging in sparse resource environments.
Method: Conducted large-scale experiment with 40 participants viewing 50 diverse images, recording over 4 million gaze points using high-speed eye tracker. Analyzed gaze trajectory statistics and trained CNN to predict fixation heatmaps from image input.
Result: Human gaze trajectory follows Levy walk dynamics akin to animal foraging. CNN model accurately reproduced salient fixation regions across novel images, showing gaze behavior is learnable from visual structure.
Conclusion: Human visual exploration obeys statistical laws analogous to natural foraging, opening avenues for modeling gaze through generative and predictive frameworks.
Abstract: Animals often forage via Levy walks, stochastic trajectories with heavy-tailed step lengths optimized for sparse resource environments. We show that human visual gaze follows similar dynamics when scanning images. While traditional models emphasize image-based saliency, the underlying spatiotemporal statistics of eye movements remain underexplored. Understanding these dynamics has broad applications in attention modeling and vision-based interfaces. In this study, we conducted a large-scale human subject experiment involving 40 participants viewing 50 diverse images under unconstrained conditions, recording over 4 million gaze points using a high-speed eye tracker. Analysis of these data shows that the gaze trajectory of the human eye also follows a Levy walk akin to animal foraging. This suggests that the human eye forages for visual information in an optimally efficient manner. Further, we trained a convolutional neural network (CNN) to predict fixation heatmaps from image input alone. The model accurately reproduced salient fixation regions across novel images, demonstrating that key components of gaze behavior are learnable from visual structure alone. Our findings present new evidence that human visual exploration obeys statistical laws analogous to natural foraging and open avenues for modeling gaze through generative and predictive frameworks.
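A quick way to check for Levy-like statistics is the standard maximum-likelihood tail-exponent estimate over step lengths; the trajectory below is synthetic and the xmin selection is simplified relative to a full power-law fit.

```python
# MLE tail exponent: alpha = 1 + n / sum(ln(x_i / xmin)) for x_i >= xmin.
import numpy as np

def powerlaw_alpha(steps, xmin):
    s = steps[steps >= xmin]
    return 1 + len(s) / np.log(s / xmin).sum()

gaze = np.cumsum(np.random.standard_cauchy((5000, 2)), axis=0)  # toy heavy-tailed walk
steps = np.linalg.norm(np.diff(gaze, axis=0), axis=1)           # saccade step lengths
print(powerlaw_alpha(steps, xmin=np.percentile(steps, 50)))
# Levy walks typically show exponents in roughly 1 < alpha <= 3
```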
[176] Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg
Main category: cs.CV
TL;DR: DINOv2’s internal representations are analyzed using SAEs, revealing functional specialization across tasks, dense rather than strictly sparse representations, and a structure based on convex mixtures of archetypes, leading to the Minkowski Representation Hypothesis.
Details
Motivation: To understand what DINOv2 perceives internally, as it's widely used for object, scene, and action recognition but its internal representations remain unknown.
Method: Adopt Linear Representation Hypothesis with SAEs to create a 32,000-unit dictionary, then analyze downstream task recruitment, geometry/statistics of learned concepts, and propose a refined view based on convex mixtures of archetypes.
Result: Found functional specialization: classification uses “Elsewhere” concepts (learned negations), segmentation uses boundary detectors, depth estimation uses monocular cues. Representations are partly dense, locally connected, and organized beyond linear sparsity. Proposed Minkowski Representation Hypothesis.
Conclusion: DINOv2 representations are organized as convex mixtures of archetypes, grounded in conceptual spaces and multi-head attention mechanisms, suggesting a more complex structure than simple linear sparsity.
Abstract: DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits “Elsewhere” concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low-dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gärdenfors’ conceptual spaces and in the model’s mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.
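A minimal sparse autoencoder of the kind used to obtain such a concept dictionary; apart from the 32,000-unit dictionary size named above, everything here (hidden width, loss weights) is a generic sketch rather than the authors' training recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, n_dict=32000):  # d_model depends on the DINOv2 variant
        super().__init__()
        self.encoder = nn.Linear(d_model, n_dict)
        self.decoder = nn.Linear(n_dict, d_model, bias=False)

    def forward(self, x):
        z = torch.relu(self.encoder(x))             # nonnegative, mostly-sparse codes
        return self.decoder(z), z

sae = SparseAutoencoder()
x = torch.randn(16, 768)                            # a batch of patch-token features
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1 sparsity
```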
[177] PhyDAE: Physics-Guided Degradation-Adaptive Experts for All-in-One Remote Sensing Image Restoration
Zhe Dong, Yuzhe Sun, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
Main category: cs.CV
TL;DR: PhyDAE is a physics-guided degradation-adaptive method for remote sensing image restoration that explicitly models multiple heterogeneous degradations (haze, noise, blur, low-light) using a two-stage cascaded architecture with progressive degradation mining and physics-aware expert modules.
Details
Motivation: Existing all-in-one restoration methods overly rely on implicit feature representations and lack explicit modeling of degradation physics, making them ineffective for handling complex and heterogeneous degradations in remote sensing images.
Method: Two-stage cascaded architecture transforming degradation information from implicit to explicit signals; uses Residual Manifold Projector (RMP) and Frequency-Aware Degradation Decomposer (FADD) for comprehensive degradation analysis; incorporates physics-aware expert modules and temperature-controlled sparse activation strategies.
Result: Superior performance across all four restoration tasks on three benchmark datasets (MD-RSID, MD-RRSHID, MDRS-Landsat); substantial improvement in restoration quality with significant reductions in parameter count and computational complexity compared to state-of-the-art methods.
Conclusion: PhyDAE achieves optimal balance between performance and efficiency, demonstrating remarkable efficiency gains while maintaining high restoration quality through explicit physics-guided degradation modeling.
Abstract: Remote sensing images inevitably suffer from various degradation factors during acquisition, including atmospheric interference, sensor limitations, and imaging conditions. These complex and heterogeneous degradations pose severe challenges to image quality and downstream interpretation tasks. Addressing limitations of existing all-in-one restoration methods that overly rely on implicit feature representations and lack explicit modeling of degradation physics, this paper proposes Physics-Guided Degradation-Adaptive Experts (PhyDAE). The method employs a two-stage cascaded architecture transforming degradation information from implicit features into explicit decision signals, enabling precise identification and differentiated processing of multiple heterogeneous degradations including haze, noise, blur, and low-light conditions. The model incorporates progressive degradation mining and exploitation mechanisms, where the Residual Manifold Projector (RMP) and Frequency-Aware Degradation Decomposer (FADD) comprehensively analyze degradation characteristics from manifold geometry and frequency perspectives. Physics-aware expert modules and temperature-controlled sparse activation strategies are introduced to enhance computational efficiency while ensuring imaging physics consistency. Extensive experiments on three benchmark datasets (MD-RSID, MD-RRSHID, and MDRS-Landsat) demonstrate that PhyDAE achieves superior performance across all four restoration tasks, comprehensively outperforming state-of-the-art methods. Notably, PhyDAE substantially improves restoration quality while achieving significant reductions in parameter count and computational complexity, resulting in remarkable efficiency gains compared to mainstream approaches and achieving optimal balance between performance and efficiency. Code is available at https://github.com/HIT-SIRS/PhyDAE.
[178] Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
Main category: cs.CV
TL;DR: Hulu-Med is a transparent medical vision-language model that unifies understanding across text, 2D/3D images, and video, achieving state-of-the-art performance across 30 benchmarks with efficient training.
Details
Motivation: Address inefficiencies in clinical decision-making from integrating diverse medical data modalities and overcome challenges in medical VLM development including opaque pipelines, data scarcity, and architectural inflexibility.
Method: Built on unified patch-based vision encoder and LLM decoder, progressively trained on 16.7M samples from 2D to 3D and video comprehension, using medical-aware token reduction for efficient training.
Result: Achieved state-of-the-art performance across 30 benchmarks, surpassing open-source models and competing with proprietary systems in visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios.
Conclusion: High-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI, with complete pipeline open-sourced.
Abstract: Real-world clinical decision-making grapples with integrating information from diverse data modalities, including medical text, 2D/3D images, and video, leading to inefficiencies and potential diagnostic oversights. While generalist vision-language models (VLMs) offer promise, their medical development faces challenges of opaque pipelines, data scarcity, and architectural inflexibility. Here we present Hulu-Med, a transparent medical VLM that unifies understanding across all these modalities. Built upon a unified patch-based vision encoder and an LLM decoder, Hulu-Med was progressively trained on 16.7 million (M) samples to scale from 2D to 3D and video comprehension. The medical-aware token reduction enables efficient training, requiring only 4,000 to 40,000 GPU hours for 7B to 32B parameter variants. Extensive evaluation across 30 benchmarks exhibits state-of-the-art performance, surpassing leading open-source models and competing with proprietary systems in tasks spanning visual question-answering, medical report generation, and complex reasoning in multilingual and rare disease scenarios. By open-sourcing our complete pipeline, we establish that high-performance medical VLM can be achieved transparently, providing a foundational tool for accessible and impactful clinical AI. Code is released at https://github.com/ZJUI-AI4H/Hulu-Med.
[179] Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
Main category: cs.CV
TL;DR: Puffin is a unified camera-centric multimodal model that integrates language understanding and diffusion-based generation for spatial awareness from arbitrary viewpoints, treating camera as language to bridge modality gaps.
Details
Motivation: To address the isolation between camera-centric understanding and generation in spatial intelligence, and create a unified model that can interpret and create scenes from any viewpoint.
Method: Introduces a novel paradigm that treats camera as language, enabling thinking with camera. Uses both global camera parameters and pixel-wise camera maps. Trained on Puffin-4M dataset of 4M vision-language-camera triplets with instruction tuning.
Result: Demonstrates superior performance over specialized models for camera-centric generation and understanding. Generalizes to diverse cross-view tasks including spatial imagination, world exploration, and photography guidance.
Conclusion: Puffin advances multimodal spatial intelligence research by unifying camera-centric understanding and generation, with code, models, dataset pipeline, and benchmark to be released.
Abstract: Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin’s superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
[180] Structured Output Regularization: a framework for few-shot transfer learning
Nicolas Ewen, Jairo Diaz-Rodriguez, Kelly Ramsay
Main category: cs.CV
TL;DR: Proposes Structured Output Regularization (SOR) - a transfer learning method that freezes internal network structures while using group lasso and L1 penalties to adapt to domain-specific features with minimal parameters.
Details
Motivation: Traditional transfer learning freezes weights and adds task-specific layers, which limits adaptation to domain-specific features and can still overfit with limited data.
Method: SOR freezes internal network structures (e.g., convolutional filters) while applying group lasso and L1 penalties to tailor models to specific data with minimal additional parameters.
Result: Achieved competitive results on three few-shot medical imaging classification tasks using DenseNet121 and EfficientNetB4 bases compared to established benchmarks.
Conclusion: SOR is a simple yet effective framework for transfer learning that enables broad applicability across various network components while maintaining computational efficiency.
Abstract: Traditional transfer learning typically reuses large pre-trained networks by freezing some of their weights and adding task-specific layers. While this approach is computationally efficient, it limits the model’s ability to adapt to domain-specific features and can still lead to overfitting with very limited data. To address these limitations, we propose Structured Output Regularization (SOR), a simple yet effective framework that freezes the internal network structures (e.g., convolutional filters) while using a combination of group lasso and $L_1$ penalties. This framework tailors the model to specific data with minimal additional parameters and is easily applicable to various network components, such as convolutional filters or various blocks in neural networks, enabling broad applicability for transfer learning tasks. We evaluate SOR on three few-shot medical imaging classification tasks and achieve competitive results using DenseNet121 and EfficientNetB4 bases compared to established benchmarks.
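To make the penalty concrete, here is a minimal sketch of a combined group-lasso-plus-$L_1$ regularizer, assuming groups are the rows of a task-head weight matrix; the grouping, coefficients, and function name are our assumptions, not the paper's code.

```python
import torch

def sor_penalty(weight: torch.Tensor, lam_group: float = 1e-3, lam_l1: float = 1e-4) -> torch.Tensor:
    # Group lasso: one L2 norm per row (group), summed -- drives entire groups to zero.
    group_lasso = weight.norm(p=2, dim=1).sum()
    # L1 term: element-wise sparsity within the surviving groups.
    l1 = weight.abs().sum()
    return lam_group * group_lasso + lam_l1 * l1
```

During fine-tuning, the total objective would then be the task loss plus `sor_penalty(head.weight)`, with the frozen backbone untouched.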
[181] BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong
Main category: cs.CV
TL;DR: BEAR is a comprehensive benchmark for evaluating multimodal large language models’ embodied capabilities across 14 domains, revealing persistent limitations and proposing BEAR-Agent to enhance performance.
Details
Motivation: Existing benchmarks for multimodal large language models focus on specific domains like planning or spatial understanding, lacking systematic evaluation of comprehensive embodied capabilities needed for physical world interaction.
Method: Created BEAR benchmark with 4,469 interleaved image-video-text entries across 14 domains in 6 categories. Proposed BEAR-Agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.
Result: Evaluation of 20 representative MLLMs revealed persistent limitations across all embodied capability domains. BEAR-Agent achieved 9.12% absolute gain and 17.5% relative improvement on GPT-5, and improved performance on embodied tasks in simulated environments.
Conclusion: BEAR provides a systematic framework for evaluating embodied capabilities in MLLMs, and BEAR-Agent demonstrates that enhancing these capabilities can benefit embodied tasks, addressing current limitations in physical world interaction.
Abstract: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
[182] SQ-GAN: Semantic Image Communications Using Masked Vector Quantization
Francesco Pezone, Sergio Barbarossa, Giuseppe Caire
Main category: cs.CV
TL;DR: SQ-GAN is a novel image compression method that integrates semantic segmentation and vector quantization to optimize compression for semantic communications while maintaining legacy system compatibility.
Details
Motivation: To develop an image compression method specifically optimized for semantic/task-oriented communications that can outperform existing compression schemes at extremely low compression rates.
Method: Uses semantic segmentation maps from off-the-shelf software, a semantic-conditioned adaptive mask module (SAMM) to selectively encode relevant features, and incorporates task-specific semantic class relevance through weighted loss functions during training.
Result: Outperforms state-of-the-art image compression methods (JPEG2000, BPG, deep-learning based) across multiple metrics including perceptual quality and semantic segmentation accuracy at extremely low compression rates.
Conclusion: SQ-GAN successfully integrates semantic-driven coding with vector quantization to achieve superior compression performance for semantic communications while maintaining backward compatibility with legacy systems.
Abstract: This work introduces Semantically Masked Vector Quantized Generative Adversarial Network (SQ-GAN), a novel approach integrating semantically driven image coding and vector quantization to optimize image compression for semantic/task-oriented communications. The method only acts on source coding and is fully compliant with legacy systems. The semantics is extracted from the image by computing its semantic segmentation map using off-the-shelf software. A new specifically developed semantic-conditioned adaptive mask module (SAMM) selectively encodes semantically relevant features of the image. The relevance of the different semantic classes is task-specific, and it is incorporated in the training phase by introducing appropriate weights in the loss function. SQ-GAN outperforms state-of-the-art image compression schemes such as JPEG2000, BPG, and deep-learning based methods across multiple metrics, including perceptual quality and semantic segmentation accuracy on the reconstructed image, at extremely low compression rates.
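As an illustration of how class relevance can enter the loss, the sketch below weights a per-pixel reconstruction error by a task-specific relevance table indexed with the segmentation map; tensor shapes and names are assumptions, not the paper's implementation.

```python
import torch

def class_weighted_l1(recon, target, seg_map, class_weights):
    # recon/target: (B, C, H, W); seg_map: (B, H, W) integer class ids;
    # class_weights: (num_classes,) task-specific relevance per semantic class.
    per_pixel = (recon - target).abs().mean(dim=1)   # L1 error averaged over channels
    weights = class_weights[seg_map]                 # per-pixel relevance lookup
    return (weights * per_pixel).mean()
```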
[183] SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
Jiayang Liu, Daniel Tso, Yiming Bu, Qinru Qiu
Main category: cs.CV
TL;DR: A biologically inspired defense framework using foveal-peripheral processing, saccadic eye movements, and cortical filling-in to mitigate adversarial attacks without requiring classifier retraining.
Details
Motivation: Traditional adversarial defenses are computationally intensive, while the human visual system achieves inherent robustness through biological mechanisms. The paper hypothesizes that attention-guided sparse sampling and predictive coding contribute to this robustness.
Method: Proposes a defense framework incorporating three biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Uses reinforcement learning-guided saccades to capture multiple foveal-peripheral glimpses, which are integrated into reconstructed images before classification.
Result: Experiments on ImageNet show improved robustness across diverse classifiers and attack types. The method preserves semantic integrity while significantly reducing training overhead compared to other defense techniques.
Conclusion: The biologically inspired preprocessing effectively mitigates adversarial noise without requiring retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems.
Abstract: Adversarial attacks significantly challenge the safe deployment of deep learning models, particularly in real-world applications. Traditional defenses often rely on computationally intensive optimization (e.g., adversarial training or data augmentation) to improve robustness, whereas the human visual system achieves inherent robustness to adversarial perturbations through evolved biological mechanisms. We hypothesize that attention-guided non-homogeneous sparse sampling and predictive coding play a key role in this robustness. To test this hypothesis, we propose a novel defense framework incorporating three key biological mechanisms: foveal-peripheral processing, saccadic eye movements, and cortical filling-in. Our approach employs reinforcement learning-guided saccades to selectively capture multiple foveal-peripheral glimpses, which are integrated into a reconstructed image before classification. This biologically inspired preprocessing effectively mitigates adversarial noise, preserves semantic integrity, and notably requires no retraining or fine-tuning of downstream classifiers, enabling seamless integration with existing systems. Experiments on the ImageNet dataset demonstrate that our method improves system robustness across diverse classifiers and attack types, while significantly reducing training overhead compared to both biologically and non-biologically inspired defense techniques.
[184] Solving Inverse Problems with FLAIR
Julius Erbach, Dominik Narnhofer, Andreas Dombos, Bernt Schiele, Jan Eric Lenssen, Konrad Schindler
Main category: cs.CV
TL;DR: FLAIR is a training-free variational framework that uses flow-based generative models as priors for inverse imaging problems, addressing key obstacles like intractable data likelihood and conflicting objectives between data fidelity and prior.
Details
Motivation: Flow-based generative models show impressive image generation quality but haven't achieved comparable fidelity as priors for inverse problems due to obstacles like intractable data likelihood, inability to directly condition on observations, and reconstruction deviations from observed data.
Method: Introduces variational objective for flow matching agnostic to degradation type, combines with deterministic trajectory adjustments to guide prior towards posterior regions, decouples optimization of data fidelity and regularization terms, and uses time-dependent calibration scheme modulating regularization strength based on off-line accuracy estimates.
Result: FLAIR consistently outperforms existing diffusion- and flow-based methods in reconstruction quality and sample diversity on standard imaging benchmarks.
Conclusion: FLAIR provides an effective framework for leveraging flow-based generative models as powerful priors for inverse imaging problems, overcoming previous limitations and achieving superior performance compared to existing methods.
Abstract: Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the data likelihood term is usually intractable; (ii) learned generative models cannot be directly conditioned on the distorted observations, leading to conflicting objectives between data likelihood and prior; and (iii) the reconstructions can deviate from the observed data. We present FLAIR, a novel, training-free variational framework that leverages flow-based generative models as prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to guide the prior towards regions which are more likely under the posterior. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity. Our code is available at https://inverseflair.github.io/.
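A caricature of the decoupling idea, assuming a differentiable forward operator `A` and a prior score function; the alternating update and the schedule `lam_t` are our simplification of the paper's variational procedure, not its actual algorithm.

```python
import torch

def decoupled_step(x, y, A, prior_score, t, lam_t, eta=0.05):
    # Data-fidelity step: gradient descent on ||A(x) - y||^2 for consistency with y.
    x = x.detach().requires_grad_(True)
    fidelity = (A(x) - y).pow(2).sum()
    grad = torch.autograd.grad(fidelity, x)[0]
    x = (x - eta * grad).detach()
    # Separate regularization step; lam_t follows a time-dependent calibration schedule.
    return x + eta * lam_t * prior_score(x, t)
```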
[185] Detecting spills using thermal imaging, pretrained deep learning models, and a robotic platform
Gregory Yeghiyan, Jurius Azar, Devson Butani, Chan-Jin Chung
Main category: cs.CV
TL;DR: A real-time spill detection system using pretrained deep learning models with RGB and thermal imaging achieves up to 100% accuracy, with thermal models performing faster and more robustly across different lighting conditions.
Details
Motivation: To develop an efficient spill detection system for safety-critical applications that can operate in varied environments with high accuracy and speed.
Method: Utilizes pretrained deep learning models (VGG19, NasNetMobile) with RGB and thermal imaging on a balanced binary dataset of 4,000 images, tested on consumer-grade hardware (RTX 4080).
Result: Achieves up to 100% accuracy with thermal models performing faster (inference times as low as 44 ms) and more robustly across lighting conditions, with model sizes under 350 MB.
Conclusion: VGG19 model trained on thermal imaging performs best, demonstrating the system’s deployability in real-time safety-critical contexts with high efficiency and accuracy.
Abstract: This paper presents a real-time spill detection system that utilizes pretrained deep learning models with RGB and thermal imaging to classify spill vs. no-spill scenarios across varied environments. Using a balanced binary dataset (4,000 images), our experiments demonstrate the advantages of thermal imaging in inference speed, accuracy, and model size. We achieve up to 100% accuracy using lightweight models like VGG19 and NasNetMobile, with thermal models performing faster and more robustly across different lighting conditions. Our system runs on consumer-grade hardware (RTX 4080) and achieves inference times as low as 44 ms with model sizes under 350 MB, highlighting its deployability in safety-critical contexts. Results from experiments with a real robot and test datasets indicate that a VGG19 model trained on thermal imaging performs best.
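A minimal transfer-learning sketch in the spirit of this setup, assuming torchvision's pretrained VGG19 with a frozen backbone and a binary spill/no-spill head; the exact freezing scheme and head layout are assumptions.

```python
import torch.nn as nn
from torchvision import models

def build_spill_classifier() -> nn.Module:
    net = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
    for p in net.features.parameters():          # freeze the convolutional backbone
        p.requires_grad = False
    in_feats = net.classifier[-1].in_features
    net.classifier[-1] = nn.Linear(in_feats, 2)  # spill vs. no-spill head
    return net
```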
[186] DiffMark: Diffusion-based Robust Watermark Against Deepfakes
Chen Sun, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Liejun Wang, Dan Ma, Gaobo Yang, Keqin Li
Main category: cs.CV
TL;DR: DiffMark is a robust watermarking framework using diffusion models to embed watermarks in facial images, designed to withstand Deepfake manipulations through modified training schemes and adversarial guidance.
Details
Motivation: Deepfakes pose security threats, and existing watermarking methods lack sufficient robustness against facial manipulations. Diffusion models offer potential for seamless watermark fusion during image generation.
Method: Modifies diffusion model training and sampling to use facial image and watermark as conditions. Uses timestep-dependent facial guidance, cross information fusion module for watermark integration, and frozen autoencoder to simulate Deepfake attacks during training.
Result: Experimental results demonstrate effectiveness against typical Deepfake manipulations. The framework generates robust watermarked images that withstand Deepfake attacks.
Conclusion: DiffMark provides an effective solution for robust watermarking against Deepfakes by leveraging diffusion models with specialized conditioning and adversarial training techniques.
Abstract: Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack the sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of watermark with image during generation. In this study, we propose a novel robust watermarking framework based on diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate corresponding watermarked image. In the construction of facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity with the decrease of noise, thus better adapting to the sampling process of diffusion model. To achieve the fusion of watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.
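The timestep-dependent facial condition can be pictured as below; the linear decay is our stand-in, since the abstract only states that guidance intensity falls as the noise level drops during sampling.

```python
import torch

def facial_condition(face_img: torch.Tensor, t: int, T: int) -> torch.Tensor:
    # Sampling runs t = T -> 0; the guidance weight shrinks along with the noise.
    w_t = t / T   # simple linear schedule; the paper's exact factor may differ
    return w_t * face_img
```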
[187] LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Xiaohui Li, Shaobin Zhuang, Shuo Cao, Yang Yang, Yuandong Pu, Qi Qin, Siqi Luo, Bin Fu, Yihao Liu
Main category: cs.CV
TL;DR: LinearSR introduces a holistic framework that overcomes key challenges in applying Linear Attention to photorealistic image super-resolution, achieving state-of-the-art perceptual quality with exceptional efficiency.
Details
Motivation: To address the computational bottleneck of self-attention's quadratic complexity in generative SR models by leveraging Linear Attention's O(N) efficiency, while overcoming previously unsolved training and quality challenges.
Method: Uses three key innovations: 1) ESGF strategy for training stability, 2) SNR-based MoE architecture to mitigate perception-distortion trade-off, 3) TAG lightweight guidance paradigm based on ‘precision-over-volume’ principle.
Result: LinearSR achieves SOTA perceptual quality with exceptional efficiency - core diffusion forward pass reaches SOTA-level speed while maintaining competitive multi-step inference time.
Conclusion: This work provides the first robust methodology for applying Linear Attention in photorealistic SR, establishing a foundational paradigm for future efficient generative super-resolution research.
Abstract: Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention’s quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel “knee point”-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our “precision-over-volume” principle. Our resulting LinearSR model simultaneously delivers state-of-the-art perceptual quality with exceptional efficiency. Its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.
[188] Re-Identifying Kākā with AI-Automated Video Key Frame Extraction
Paula Maddigan, Andrew Lensen, Rachael C. Shaw
Main category: cs.CV
TL;DR: This paper presents an AI-based pipeline for non-invasive wildlife monitoring using computer vision to extract key frames from videos of kākā parrots, enabling accurate individual re-identification without physical tagging.
Details
Motivation: Traditional wildlife monitoring methods like leg banding are invasive and time-consuming. The study aims to develop non-invasive, efficient alternatives using computer vision for individual animal recognition and population monitoring.
Method: Unsupervised pipeline combining object detection (YOLO and Grounding DINO), optical flow blur detection, image encoding with DINOv2, and clustering methods to extract representative key frames from video recordings at custom-built feeders.
Result: The proposed key frame selection methods achieved high accuracy in kākā re-identification, providing a foundation for research in diverse and challenging environments.
Conclusion: The AI-based approach offers a valuable non-invasive alternative to traditional physical tagging methods, contributing to improved wildlife monitoring with applications in ecology and conservation biology.
Abstract: Accurate recognition and re-identification of individual animals is essential for successful wildlife population monitoring. Traditional methods, such as leg banding of birds, are time consuming and invasive. Recent progress in artificial intelligence, particularly computer vision, offers encouraging solutions for smart conservation and efficient automation. This study presents a unique pipeline for extracting high-quality key frames from videos of kākā (Nestor meridionalis), a threatened forest-dwelling parrot in New Zealand. Key frame extraction is well-studied in person re-identification; however, its application to wildlife is limited. Using video recordings at a custom-built feeder, we extract key frames and evaluate the re-identification performance of our pipeline. Our unsupervised methodology combines object detection using YOLO and Grounding DINO, optical flow blur detection, image encoding with DINOv2, and clustering methods to identify representative key frames. The results indicate that our proposed key frame selection methods yield image collections which achieve high accuracy in kākā re-identification, providing a foundation for future research using media collected in more diverse and challenging environments. Through the use of artificial intelligence and computer vision, our non-invasive and efficient approach provides a valuable alternative to traditional physical tagging methods for recognising kākā individuals and therefore improving the monitoring of populations. This research contributes to developing fresh approaches in wildlife monitoring, with applications in ecology and conservation biology.
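A compressed sketch of the key-frame stage under stated assumptions: detector crops stand in for the YOLO/Grounding DINO output, DINOv2 is loaded from torch.hub, and KMeans clusters the embeddings with the sharpest frame kept per cluster; the sharpness scoring and cluster count are placeholders.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def select_key_frames(crops: torch.Tensor, sharpness: np.ndarray, k: int = 5):
    # crops: (N, 3, 224, 224) detector crops; sharpness: (N,) blur scores
    # (e.g. derived from optical-flow magnitude, lower motion = sharper).
    with torch.no_grad():
        emb = dinov2(crops).numpy()                       # (N, 384) DINOv2 features
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(emb)
    return [int(np.argmax(np.where(labels == c, sharpness, -np.inf)))
            for c in range(k)]                            # sharpest frame per cluster
```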
[189] Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization
Shuo Xing, Soumik Dey, Mingyang Wu, Ashirbad Mishra, Hansi Wu, Binbin Li, Zhengzhong Tu
Main category: cs.CV
TL;DR: Q-Router is an agentic framework for universal video quality assessment that uses vision-language models as routers to dynamically select and ensemble expert models based on video content, achieving superior generalization and interpretability across diverse video types.
Details
Motivation: Existing VQA models suffer from poor generalization across different content types (UGC, short-form videos, AIGC), limited interpretability, and lack of extensibility to novel use cases.
Method: Multi-tier model routing system that employs VLMs as real-time routers to dynamically reason and ensemble appropriate expert models based on input video semantics, with a heaviest tier for spatiotemporal artifacts localization.
Result: Matches or surpasses state-of-the-art VQA models on various benchmarks, substantially improves generalization and interpretability, excels on Q-Bench-Video, and capably localizes spatiotemporal artifacts.
Conclusion: Q-Router shows promise as a foundation for next-generation VQA systems and has potential as a reward function for post-training video generation models.
Abstract: Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision–language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifacts localization for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.
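The routing can be pictured as the loop below; `choose_experts` and `predict_quality` are hypothetical interfaces we invent for illustration, not a released API, and the plain averaging is a stand-in for whatever ensembling Q-Router actually learns.

```python
def route_and_score(video, router_vlm, experts: dict, tier: str = "light") -> float:
    # The router VLM reasons over the clip and names which experts apply at this budget.
    chosen = router_vlm.choose_experts(video, available=list(experts), tier=tier)
    scores = [experts[name].predict_quality(video) for name in chosen]
    return sum(scores) / len(scores)
```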
[190] Alignment, Mining and Fusion: Representation Alignment with Hard Negative Mining and Selective Knowledge Fusion for Medical Visual Question Answering
Yuanhao Zou, Zhaozheng Yin
Main category: cs.CV
TL;DR: A unified framework for Medical Visual Question Answering that addresses modality alignment challenges, hard negative mining, and knowledge fusion through multi-level alignment, soft label-based hard negative discrimination, and gated cross-attention.
Details
Motivation: Current Med-VQA approaches lack unified modality alignment solutions, under-explore hard negatives, and may introduce irrelevant information through knowledge fusion techniques.
Method: Three key components: (1) multi-level modality alignment using contrastive learning and optimal transport, (2) hard negative mining with soft labels, (3) Gated Cross-Attention Module for knowledge integration.
Result: Outperforms previous state-of-the-art on major Med-VQA datasets including RAD-VQA, SLAKE, PathVQA and VQA-2019.
Conclusion: The proposed framework effectively addresses key challenges in Med-VQA through unified modality alignment, improved hard negative handling, and selective knowledge integration.
Abstract: Medical Visual Question Answering (Med-VQA) is a challenging task that requires a deep understanding of both medical images and textual questions. Although recent works leveraging Medical Vision-Language Pre-training (Med-VLP) have shown strong performance on the Med-VQA task, there is still no unified solution for modality alignment, and the issue of hard negatives remains under-explored. Additionally, commonly used knowledge fusion techniques for Med-VQA may introduce irrelevant information. In this work, we propose a framework to address these challenges through three key contributions: (1) a unified solution for heterogeneous modality alignments across multiple levels, modalities, views, and stages, leveraging methods like contrastive learning and optimal transport theory; (2) a hard negative mining method that employs soft labels for multi-modality alignments and enforces the hard negative pair discrimination; and (3) a Gated Cross-Attention Module for Med-VQA that integrates the answer vocabulary as prior knowledge and selects relevant information from it. Our framework outperforms the previous state-of-the-art on widely used Med-VQA datasets like RAD-VQA, SLAKE, PathVQA and VQA-2019.
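A minimal gated cross-attention sketch, assuming queries come from fused image-question features and keys/values from answer-vocabulary embeddings; the dimensions and gate design are our guesses, not the paper's module.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, query, vocab_emb):
        # Retrieve knowledge from the answer vocabulary via cross-attention.
        knowledge, _ = self.attn(query, vocab_emb, vocab_emb)
        # A learned gate decides how much retrieved knowledge to admit.
        g = self.gate(torch.cat([query, knowledge], dim=-1))
        return query + g * knowledge
```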
[191] SkipSR: Faster Super Resolution with Token Skipping
Rohan Choudhury, Shanchuan Lin, Jianyi Wang, Hao Chen, Qi Zhao, Feng Cheng, Lu Jiang, Kris Kitani, Laszlo A. Jeni
Main category: cs.CV
TL;DR: SkipSR accelerates video super-resolution by identifying and skipping computation on low-detail regions, achieving up to 60% faster processing on 720p videos without quality loss.
Details
Motivation: Diffusion-based super-resolution is slow and expensive, limiting scalability to higher resolutions and longer videos, while many video regions are low-detail and don't need refinement.
Method: Proposes SkipSR framework that identifies low-detail regions from low-resolution input and skips computation on them entirely, only super-resolving areas that require refinement.
Result: Achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality.
Conclusion: SkipSR provides an effective strategy to accelerate video super-resolution while preserving perceptual quality in both standard and one-step diffusion SR models.
Abstract: Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality. Video demos are available at https://rccchoudhury.github.io/skipsr/
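The skipping idea reduces to a masked forward pass, sketched below; the detail scoring and threshold are placeholders for the paper's low-detail identifier.

```python
import torch

def skip_sr(tokens: torch.Tensor, detail: torch.Tensor, refiner, thresh: float = 0.1):
    # tokens: (N, D) patch tokens; detail: (N,) detail score from the low-res input
    # (e.g. local variance). Only high-detail tokens pay for the SR network.
    keep = detail > thresh
    out = tokens.clone()
    if keep.any():
        out[keep] = refiner(tokens[keep])
    return out
```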
[192] D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Yiyang Huang, Yizhou Wang, Yun Fu
Main category: cs.CV
TL;DR: D-CoDe is a training-free framework that adapts image-pretrained vision-language models for video understanding by addressing perception bottlenecks and token overload through dynamic compression and question decomposition.
Details
Motivation: Extending image-based vision-language models to video domain is challenging due to dense temporal visual inputs that exceed image model capacity, specifically facing perception bottlenecks and token overload issues.
Method: Proposes D-CoDe with two components: 1) Dynamic compression using adaptive frame selection and content-aware spatial token aggregation to reduce redundancy, 2) Question decomposition that reformulates queries into sub-questions to guide focused video understanding.
Result: D-CoDe effectively improves video understanding across various benchmarks and shows strong performance on challenging long-video tasks, demonstrating comprehensive video-language understanding capabilities.
Conclusion: The framework successfully addresses key challenges in adapting image VLMs to video domain without additional training, showing potential for handling complex video-language tasks through dynamic compression and question decomposition.
Abstract: Video large language models (Vid-LLMs), which excel in diverse video-language tasks, can be effectively constructed by adapting image-pretrained vision-language models (VLMs). However, this adaptation remains challenging, as it requires processing dense and temporally extended visual inputs that exceed the capacity of image-based models. This paper identifies the perception bottleneck and token overload as key challenges in extending image-based VLMs to the video domain. To address these issues, we propose D-CoDe, a training-free adaptation framework that incorporates dynamic compression and question decomposition. Specifically, dynamic compression alleviates the perception bottleneck through adaptive selection of representative frames and content-aware aggregation of spatial tokens, thereby reducing redundancy while preserving informative content. In parallel, question decomposition mitigates token overload by reformulating the original query into sub-questions, guiding the model to focus on distinct aspects of the video and enabling more comprehensive understanding. Experiments demonstrate that D-CoDe effectively improves video understanding across various benchmarks. Furthermore, strong performance on the challenging long-video benchmark highlights the potential of D-CoDe in handling complex video-language tasks. Code is available at https://github.com/hukcc/D-CoDe.
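A toy version of the frame-selection half of dynamic compression: keep the frames whose features change most from their predecessor. The difference heuristic is our stand-in for the paper's adaptive selector.

```python
import torch

def select_frames(frames: torch.Tensor, k: int) -> torch.Tensor:
    # frames: (T, D) per-frame feature vectors; assumes 1 <= k <= T; frame 0 is always kept.
    diffs = (frames[1:] - frames[:-1]).norm(dim=-1)        # (T-1,) change scores
    keep = torch.topk(diffs, k - 1).indices + 1            # k-1 most-changed frames
    idx = torch.cat([torch.zeros(1, dtype=torch.long), keep]).sort().values
    return frames[idx]
```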
[193] FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation
Hongrui Wu, Zhicheng Gao, Jin Cao, Kelu Yao, Wen Shen, Zhihua Wei
Main category: cs.CV
TL;DR: FOLK proposes a fast open-vocabulary 3D instance segmentation method using label-guided knowledge distillation to avoid noisy 2D mappings and accelerate inference.
Details
Motivation: Existing methods map 3D instances to 2D RGB-D images for classification using VLMs, which introduces noise from occlusions and incurs high computational costs, slowing down inference.
Method: Designs a teacher model that extracts high-quality 2D CLIP embeddings with visibility and viewpoint diversity, then distills this open-vocabulary knowledge into a 3D student model using a label-guided distillation algorithm.
Result: Achieved state-of-the-art performance on ScanNet200 with AP50 score of 35.7, while running 6.0x to 152.2x faster than previous methods.
Conclusion: FOLK enables direct 3D instance classification from point clouds, avoiding occlusion noise and significantly accelerating inference while maintaining high performance.
Abstract: Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.
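The distillation target can be written as a masked embedding-alignment loss; the label-consistency mask and cosine objective below reflect our reading of the label-guided algorithm, not the released code.

```python
import torch
import torch.nn.functional as F

def label_guided_distill_loss(student_emb, teacher_emb, label_consistent):
    # Align 3D student embeddings with 2D CLIP teacher embeddings, but only
    # for instances whose 2D labels agree across views (boolean mask).
    cos = F.cosine_similarity(student_emb, teacher_emb, dim=-1)
    loss = (1.0 - cos) * label_consistent.float()
    return loss.sum() / label_consistent.float().sum().clamp(min=1.0)
```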
[194] Modeling Time-Lapse Trajectories to Characterize Cranberry Growth
Ronan John, Anis Chihoub, Ryan Meegan, Gina Sidelli, Jeffery Neyhart, Peter Oudemans, Kristin Dana
Main category: cs.CV
TL;DR: Self-supervised vision transformers for cranberry growth monitoring without manual annotations, using time regression and class prediction to create interpretable temporal tracks.
Details
Motivation: Manual change monitoring in cranberry farming is time-consuming, and deep learning methods require tedious hand-annotations for fine-tuning.
Method: Fine-tuning vision transformers using self-supervised approach with two-fold pretext task (time regression and class prediction) to learn latent space for time-lapse plant evolution.
Result: Created interpretable 2D temporal tracks that can predict growth over time and distinguish temporal differences between cranberry varieties. Also provided novel time-lapse dataset with 8 varieties observed 52 times over 4 months.
Conclusion: The approach is general and applicable to other crops, providing interpretable time-series models of crop growth without manual annotations.
Abstract: Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github.com/ronan-39/tlt/).
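The two-fold pretext task amounts to a joint regression-plus-classification objective over shared ViT features; the weighting `w` is an assumed hyperparameter.

```python
import torch.nn.functional as F

def pretext_loss(time_pred, t_true, class_logits, variety_ids, w: float = 1.0):
    # Regress normalized capture time and classify the cranberry variety.
    return F.mse_loss(time_pred, t_true) + w * F.cross_entropy(class_logits, variety_ids)
```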
[195] PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Daiki Yoshikawa, Takashi Matsubara
Main category: cs.CV
TL;DR: PHyCLIP uses an l1-Product metric on hyperbolic factors to simultaneously capture hierarchical relationships within concept families and compositional relationships across different concept families in vision-language models.
Details
Motivation: Current vision-language models struggle to represent both hierarchical structures within concept families (e.g., dog ≤ mammal ≤ animal) and compositional structures across concept families (e.g., "a dog in a car" ≤ dog, car) simultaneously.
Method: PHyCLIP employs an l1-Product metric on a Cartesian product of Hyperbolic factors, where intra-family hierarchies emerge within individual hyperbolic factors and cross-family composition is captured by the l1-product metric.
Result: PHyCLIP outperforms existing single-space approaches on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks, and offers more interpretable structures in the embedding space.
Conclusion: The proposed PHyCLIP framework effectively resolves the dilemma of representing both hierarchy and compositionality in vision-language models by using hyperbolic product spaces with l1-product metrics.
Abstract: Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., “a dog in a car” $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
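In our notation, the distance PHyCLIP optimizes is the $\ell_1$ sum of per-factor hyperbolic distances; the Poincaré-ball form is shown for concreteness, as the paper may use a different model of hyperbolic space.

```latex
d_{\ell_1}(x, y) = \sum_{i=1}^{m} d_{\mathbb{H}}\big(x^{(i)}, y^{(i)}\big),
\qquad
d_{\mathbb{H}}(u, v) = \operatorname{arcosh}\!\left(1 + \frac{2\,\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)
```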
[196] SegTrans: Transferable Adversarial Examples for Segmentation Models
Yufei Song, Ziqi Zhou, Qi Lu, Hangtao Zhang, Yifan Hu, Lulu Xue, Shengshan Hu, Minghui Li, Leo Yu Zhang
Main category: cs.CV
TL;DR: SegTrans is a novel transfer attack framework that improves adversarial example transferability across segmentation models by dividing inputs into local regions and remapping semantic information to generate enhanced samples for perturbation optimization.
Details
Motivation: Segmentation models are vulnerable to adversarial attacks in white-box settings, but existing methods have poor transferability across different models due to complex contextual dependencies and feature distribution gaps between surrogate and target models.
Method: SegTrans divides input samples into multiple local regions and remaps their semantic information to generate diverse enhanced samples, which replace original samples for perturbation optimization. It retains only local semantic information rather than global information.
Result: Extensive experiments on PASCAL VOC and Cityscapes datasets with four segmentation models and three backbone networks show SegTrans significantly improves transfer success rates without additional computational overhead. It achieves 8.55% average increase in transfer attack success rate and improves computational efficiency by over 100% compared to state-of-the-art methods.
Conclusion: SegTrans effectively addresses the transferability limitations of adversarial attacks on segmentation models by leveraging local semantic information remapping, achieving superior performance and efficiency compared to existing methods.
Abstract: Segmentation models exhibit significant vulnerability to adversarial examples in white-box settings, but existing adversarial attack methods often show poor transferability across different segmentation models. While some researchers have explored transfer-based adversarial attack (i.e., transfer attack) methods for segmentation models, the complex contextual dependencies within these models and the feature distribution gaps between surrogate and target models result in unsatisfactory transfer success rates. To address these issues, we propose SegTrans, a novel transfer attack framework that divides the input sample into multiple local regions and remaps their semantic information to generate diverse enhanced samples. These enhanced samples replace the original ones for perturbation optimization, thereby improving the transferability of adversarial examples across different segmentation models. Unlike existing methods, SegTrans only retains local semantic information from the original input, rather than using global semantic information to optimize perturbations. Extensive experiments on two benchmark datasets, PASCAL VOC and Cityscapes, four different segmentation models, and three backbone networks show that SegTrans significantly improves adversarial transfer success rates without introducing additional computational overhead. Compared to the current state-of-the-art methods, SegTrans achieves an average increase of 8.55% in transfer attack success rate and improves computational efficiency by more than 100%.
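One way to picture the enhanced-sample generation is a local-region shuffle that keeps patch-level semantics while destroying the global layout; the grid shuffle below is our stand-in for the paper's remapping, not its actual operation.

```python
import torch

def local_region_shuffle(x: torch.Tensor, grid: int = 4) -> torch.Tensor:
    # x: (B, C, H, W) with H and W divisible by grid.
    B, C, H, W = x.shape
    h, w = H // grid, W // grid
    patches = x.unfold(2, h, h).unfold(3, w, w)              # (B, C, g, g, h, w)
    patches = patches.contiguous().view(B, C, grid * grid, h, w)
    patches = patches[:, :, torch.randperm(grid * grid)]     # remap local regions
    patches = patches.view(B, C, grid, grid, h, w)
    return patches.permute(0, 1, 2, 4, 3, 5).contiguous().view(B, C, H, W)
```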
[197] Defense against Unauthorized Distillation in Image Restoration via Feature Space Perturbation
Han Hu, Zhuoran Zheng, Chen Lyu
Main category: cs.CV
TL;DR: ASVP is a runtime defense method that protects image restoration models from knowledge distillation attacks by perturbing feature maps using singular value decomposition to disrupt student learning while maintaining teacher performance.
Details
Motivation: Knowledge distillation attacks threaten deep model intellectual property, but existing defenses designed for classification don't work well for image restoration tasks due to continuous, high-dimensional outputs requiring spatial coherence and fine details.
Method: Adaptive Singular Value Perturbation (ASVP) operates on internal feature maps using SVD, amplifying top-k singular values to inject structured, high-frequency perturbations that disrupt distillation alignment.
Result: ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75% across five image restoration tasks (super-resolution, low-light enhancement, underwater enhancement, dehazing, deraining) with negligible impact on teacher performance.
Conclusion: ASVP provides a practical and effective defense against unauthorized knowledge distillation for open-source restoration models, outperforming prior methods with stronger and more consistent protection.
Abstract: Knowledge distillation (KD) attacks pose a significant threat to deep model intellectual property by enabling adversaries to train student networks using a teacher model’s outputs. While recent defenses in image classification have successfully disrupted KD by perturbing output probabilities, extending these methods to image restoration is difficult. Unlike classification, restoration is a generative task with continuous, high-dimensional outputs that depend on spatial coherence and fine details. Minor perturbations are often insufficient, as students can still learn the underlying mapping. To address this, we propose Adaptive Singular Value Perturbation (ASVP), a runtime defense tailored for image restoration models. ASVP operates on internal feature maps of the teacher using singular value decomposition (SVD). It amplifies the top-k singular values to inject structured, high-frequency perturbations, disrupting the alignment needed for distillation. This hinders student learning while preserving the teacher’s output quality. We evaluate ASVP across five image restoration tasks: super-resolution, low-light enhancement, underwater enhancement, dehazing, and deraining. Experiments show ASVP reduces student PSNR by up to 4 dB and SSIM by 60-75%, with negligible impact on the teacher’s performance. Compared to prior methods, ASVP offers a stronger and more consistent defense. Our approach provides a practical solution to protect open-source restoration models from unauthorized knowledge distillation.
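The core operation is small enough to sketch directly: reshape a feature map to a matrix, amplify its leading singular values, and reconstruct. Here `k` and `alpha` are assumed hyperparameters; the "adaptive" selection in the paper may set them differently.

```python
import torch

def asvp(feature_map: torch.Tensor, k: int = 8, alpha: float = 2.0) -> torch.Tensor:
    # feature_map: (C, H, W). Flatten spatial dims so channels index the matrix rows.
    C, H, W = feature_map.shape
    M = feature_map.reshape(C, H * W)
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    S[:k] = S[:k] * alpha            # amplify top-k singular values: structured perturbation
    return (U @ torch.diag(S) @ Vh).reshape(C, H, W)
```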
[198] RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
Zixi Yang, Jiapeng Li, Muxi Diao, Yinuo Jing, Kongming Liang
Main category: cs.CV
TL;DR: Ro-Bench is the first benchmark for evaluating Multi-modal Large Language Models (MLLMs) on dynamic out-of-distribution counterfactual video test sets, showing current models degrade significantly on manipulated content but can be improved through counterfactual fine-tuning.
Details
Motivation: Multi-modal Large Language Models have shown strong video understanding capabilities, but their robustness against manipulated video content remains unexplored, creating a need for systematic evaluation.
Method: Created Ro-Bench benchmark with high-quality, diverse video data by editing Style, Object, Background and their compositions to generate counterfactual test sets. Evaluated 8 recent video MLLMs and tested fine-tuning with counterfactual data.
Result: Current MLLMs show substantial performance degradation on Ro-Bench counterfactual videos. Fine-tuning with counterfactual data improved performance by 21.73% on Ro-Bench and 12.78% across 20 tasks in MVBench.
Conclusion: Counterfactual data is highly effective for enhancing MLLMs’ video understanding robustness, demonstrating the importance of testing and improving models against manipulated content.
Abstract: Recently, Multi-modal Large Language Models (MLLMs) have demonstrated significant performance across various video understanding tasks. However, their robustness, particularly when faced with manipulated video content, remains largely unexplored. In this paper, we introduce Ro-Bench, the first benchmark for evaluating MLLMs on dynamic out-of-distribution (OOD) counterfactual video test sets. Ro-Bench incorporates high-quality, diverse and temporally relevant video data, by editing Style, Object, Background and their compositions. We evaluated eight recent video MLLMs and found that current models exhibit substantial performance degradation on Ro-Bench when exposed to counterfactual video content. Furthermore, we demonstrate that fine-tuning MLLMs with counterfactual data enhances robustness, achieving a 21.73% performance increase on Ro-Bench and a 12.78% improvement across 20 tasks in the MVBench dataset. These findings underscore the effectiveness of counterfactual data in enhancing the video understanding ability of MLLMs. The code and data will be released shortly.
[199] Denoised Diffusion for Object-Focused Image Augmentation
Nisha Pillai, Aditi Virupakshaiah, Harrison W. Smith, Amanda J. Ashworth, Prasanna Gowda, Phillip R. Owens, Adam R. Rivers, Bindu Nanduri, Mahalingam Ramkumar
Main category: cs.CV
TL;DR: Proposed object-focused data augmentation framework for animal health monitoring using segmentation, transformations, and diffusion-based synthesis to overcome limited data availability in drone-based monitoring systems.
Details
Motivation: Address limitations in aerial drone-based animal health monitoring caused by small datasets, scene-specific issues (occluded animals, breed variations), and failure of transfer learning due to lack of domain-specific data.
Method: Segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes for enhanced animal detection.
Result: Initial experiments show superior performance on animal detection task compared to baseline models using the augmented dataset.
Conclusion: The framework enables real-time animal health monitoring in data-scarce scenarios by generating domain-specific data, bridging the gap between limited data and practical application.
Abstract: Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.
[200] Unleashing Perception-Time Scaling to Multimodal Reasoning Models
Yifan Li, Zhenghao Chen, Ziheng Wu, Kun Zhou, Ruipu Luo, Can Zhang, Zhentao He, Yufei Zhan, Wayne Xin Zhao, Minghui Qiu
Main category: cs.CV
TL;DR: DisTANCE is a perception benchmark showing LVLMs have limited visual estimation precision. Perception-Time Scaling (PTS) is proposed to enable token-rich perception and decomposition of complex problems, significantly improving accuracy from 8.0% to 64.7% and generalizing well.
Details
Motivation: While inference-time scaling has improved reasoning in LVLMs, its impact on visual perception remains unclear. Current LVLMs use fast perception as one-shot output without modeling the perceptual process, limiting estimation precision.
Method: Proposed Perception-Time Scaling (PTS) paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems. Combined with reinforcement learning techniques.
Result: PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%. Generalizes well to out-of-domain tasks. Even with purely synthetic PTS data, combining with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks.
Conclusion: PTS introduces more perception-related tokens and increases the model’s attention to image tokens, enabling perception to align with and benefit from inference-time scaling. The approach effectively addresses the limitations of current fast perception paradigms in LVLMs.
Abstract: Recent advances in inference-time scaling, particularly those leveraging reinforcement learning with verifiable rewards, have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. To investigate this gap, we introduce DisTANCE, a perception-centric benchmark for visual estimation tasks. Evaluation results show that LVLMs exhibit limited estimation precision, and inference-time scaling offers only marginal gains. We attribute this to the fast perception paradigm of current LVLMs, where visual understanding is treated as a one-shot output without modeling the underlying perceptual process. To address this, we propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems, thereby enabling perception to align with and benefit from inference-time scaling. Combined with reinforcement learning techniques, PTS significantly improves perception accuracy, raising high-precision performance on DisTANCE from 8.0% to 64.7%, and generalizes well to out-of-domain tasks. Surprisingly, even though PTS data are purely synthetic, combining them with math reasoning data yields consistent gains in both reasoning and real-world perception benchmarks. Further analysis reveals that PTS introduces more perception-related tokens and increases the model’s attention to image tokens. Our code and data will be publicly released.
[201] mmJoints: Expanding Joint Representations Beyond (x,y,z) in mmWave-Based 3D Pose Estimation
Zhenyu Wang, Mahathir Monjur, Shahriar Nirjon
Main category: cs.CV
TL;DR: mmJoints is a framework that enhances mmWave-based 3D pose estimation by adding joint descriptors that indicate sensing likelihood and location reliability, improving interpretability and downstream task performance.
Details
Motivation: In mmWave pose estimation, sparse signals and weak reflections cause models to rely too heavily on statistical priors rather than actual sensor data, which degrades performance in downstream tasks like gesture and activity recognition.
Method: Augments pre-trained mmWave-based 3D pose estimator outputs with additional joint descriptors that explicitly estimate the likelihood of a joint being sensed and the reliability of its predicted location.
Result: Achieves descriptor estimation error rate below 4.2%, improves joint position accuracy by up to 12.5%, and boosts activity recognition by up to 16% over state-of-the-art methods across 13 pose estimation settings with over 115,000 signal frames.
Conclusion: mmJoints successfully makes pose estimation bias explicit through joint descriptors, enhancing both interpretability and performance in downstream applications while maintaining high accuracy in descriptor estimation.
Abstract: In mmWave-based pose estimation, sparse signals and weak reflections often cause models to infer body joints from statistical priors rather than sensor data. While prior knowledge helps in learning meaningful representations, over-reliance on it degrades performance in downstream tasks like gesture and activity recognition. In this paper, we introduce mmJoints, a framework that augments a pre-trained, black-box mmWave-based 3D pose estimator’s output with additional joint descriptors. Rather than mitigating bias, mmJoints makes it explicit by estimating the likelihood of a joint being sensed and the reliability of its predicted location. These descriptors enhance interpretability and improve downstream task accuracy. Through extensive evaluations using over 115,000 signal frames across 13 pose estimation settings, we show that mmJoints estimates descriptors with an error rate below 4.2%. mmJoints also improves joint position accuracy by up to 12.5% and boosts activity recognition by up to 16% over state-of-the-art methods.
[202] Hierarchical Scheduling for Multi-Vector Image Retrieval
Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen
Main category: cs.CV
TL;DR: HiMIR is an efficient hierarchical scheduling framework for image retrieval that improves accuracy and reduces computation by 3.5x compared to existing multi-vector retrieval systems.
Details
Motivation: Conventional retrieval approaches in multimodal LLMs have limited accuracy, and recent multi-vector retrieval methods still suffer from sub-optimal accuracy and efficiency due to poor query-image object alignment and redundant fine-grained image segments.
Method: HiMIR uses a hierarchical paradigm with multiple intermediate granularities for varying image objects to enhance alignment, leverages cross-hierarchy similarity consistency and hierarchy sparsity to minimize redundancy, and automatically configures parameters for each dataset.
Result: HiMIR achieves substantial accuracy improvements and reduces computation by up to 3.5 times over existing multi-vector retrieval systems.
Conclusion: The proposed HiMIR framework effectively addresses accuracy and efficiency limitations in image retrieval for multimodal LLM applications through hierarchical scheduling and redundancy minimization.
Abstract: To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching against segmented images. They still suffer from sub-optimal accuracy and efficiency, overlooking alignment between the query and varying image objects and redundant fine-grained image segments. In this work, we present an efficient scheduling framework for image retrieval - HiMIR. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we minimize redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
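The scheduling idea can be pictured with a two-level coarse-to-fine matcher: score every image cheaply at a coarse granularity, prune, then re-score only the survivors at a fine granularity. HiMIR's actual hierarchies, consistency checks, and sparsity exploitation are more elaborate; everything below (the granularities, `keep`, the max-sim scoring) is an illustrative assumption.

```python
# Hedged sketch of hierarchical multi-vector matching with candidate pruning.
import numpy as np

rng = np.random.default_rng(0)

def max_sim(query_vecs, image_vecs):
    """Late-interaction score: sum over query tokens of max segment similarity."""
    sims = query_vecs @ image_vecs.T              # (n_query, n_segments)
    return sims.max(axis=1).sum()

def hierarchical_retrieve(query_vecs, db_coarse, db_fine, keep=10):
    # Stage 1: score every image with a few coarse segment vectors (cheap).
    coarse_scores = [max_sim(query_vecs, img) for img in db_coarse]
    candidates = np.argsort(coarse_scores)[::-1][:keep]
    # Stage 2: re-score only the survivors with fine-grained segments (costly).
    fine_scores = {i: max_sim(query_vecs, db_fine[i]) for i in candidates}
    return max(fine_scores, key=fine_scores.get)

# Toy database: 100 images, 4 coarse and 64 fine segment embeddings each.
d = 32
query = rng.normal(size=(8, d))
db_coarse = [rng.normal(size=(4, d)) for _ in range(100)]
db_fine = [rng.normal(size=(64, d)) for _ in range(100)]
print("best image:", hierarchical_retrieve(query, db_coarse, db_fine))
```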
[203] HandEval: Taking the First Step Towards Hand Quality Evaluation in Generated Images
Zichuan Wang, Bo Peng, Songlin Yang, Zhenchen Tang, Jing Dong
Main category: cs.CV
TL;DR: This paper introduces HandEval, the first quality assessment model specifically for evaluating generated hand images in text-to-image models, addressing the common problem of distorted and unrealistic hand generation.
Details
Motivation: Current text-to-image models struggle with generating accurate details in complex local regions like human hands, often producing structurally distorted and unrealistic textures. The quality assessment of hand regions has been largely neglected, limiting downstream applications.
Method: The authors created the HandPair dataset with 48k high- and low-quality hand pairs for training without manual annotation. They developed HandEval, a hand-specific quality assessment model that leverages MLLM capabilities and incorporates hand keypoint prior knowledge.
Result: HandEval demonstrates better alignment with human judgments than existing state-of-the-art methods. When integrated into image generation and AIGC detection pipelines, it significantly enhances hand realism and detection accuracy.
Conclusion: HandEval provides an effective solution for assessing generated hand quality and shows universal effectiveness in downstream applications, addressing a critical gap in text-to-image model evaluation.
Abstract: Although recent text-to-image (T2I) models have significantly improved the overall visual quality of generated images, they still struggle in the generation of accurate details in complex local regions, especially human hands. Generated hands often exhibit structural distortions and unrealistic textures, which can be very noticeable even when the rest of the body is well-generated. However, the quality assessment of hand regions remains largely neglected, limiting downstream task performance like human-centric generation quality optimization and AIGC detection. To address this, we propose the first quality assessment task targeting generated hand regions and showcase its abundant downstream applications. We first introduce the HandPair dataset for training hand quality assessment models. It consists of 48k images formed by high- and low-quality hand pairs, enabling low-cost, efficient supervision without manual annotation. Based on it, we develop HandEval, a carefully designed hand-specific quality assessment model. It leverages the powerful visual understanding capability of Multimodal Large Language Model (MLLM) and incorporates prior knowledge of hand keypoints, gaining strong perception of hand quality. We further construct a human-annotated test set with hand images from various state-of-the-art (SOTA) T2I models to validate its quality evaluation capability. Results show that HandEval aligns better with human judgments than existing SOTA methods. Furthermore, we integrate HandEval into image generation and AIGC detection pipelines, prominently enhancing generated hand realism and detection accuracy, respectively, confirming its universal effectiveness in downstream applications. Code and dataset will be available.
[204] Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation
Yuki Nii, Futa Waseda, Ching-Chun Chang, Isao Echizen
Main category: cs.CV
TL;DR: First defensive paradigm called Uncolorable Examples that embeds imperceptible perturbations in grayscale images to prevent unauthorized AI colorization while maintaining visual quality.
Details
Motivation: Address copyright infringement risks from unauthorized AI colorization of monochrome manga and films, as no effective prevention methods currently exist.
Method: PAChroma (Perception-Aware Chroma-Restrictive Perturbation) optimizes imperceptible perturbations using a Laplacian filter for perceptual quality and diverse input transformations for transferability and robustness.
Result: Effectively degrades colorization quality on ImageNet and Danbooru datasets while maintaining visual appearance, meeting all four criteria: effectiveness, imperceptibility, transferability, and robustness.
Conclusion: First step toward protecting visual content from illegitimate AI colorization, paving way for copyright-aware defenses in generative media.
Abstract: AI-based colorization has shown remarkable capability in generating realistic color images from grayscale inputs. However, it poses risks of copyright infringement – for example, the unauthorized colorization and resale of monochrome manga and films. Despite these concerns, no effective method currently exists to prevent such misuse. To address this, we introduce the first defensive paradigm, Uncolorable Examples, which embed imperceptible perturbations into grayscale images to invalidate unauthorized colorization. To ensure real-world applicability, we establish four criteria: effectiveness, imperceptibility, transferability, and robustness. Our method, Perception-Aware Chroma-Restrictive Perturbation (PAChroma), generates Uncolorable Examples that meet these four criteria by optimizing imperceptible perturbations with a Laplacian filter to preserve perceptual quality, and applying diverse input transformations during optimization to enhance transferability across models and robustness against common post-processing (e.g., compression). Experiments on ImageNet and Danbooru datasets demonstrate that PAChroma effectively degrades colorization quality while maintaining the visual appearance. This work marks the first step toward protecting visual content from illegitimate AI colorization, paving the way for copyright-aware defenses in generative media.
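For intuition, the following is a heavily simplified optimization loop in the spirit of the described perturbation attack: maximize the deviation of a colorizer's output while keeping the perturbation small and low-frequency. `ToyColorizer` is a stub standing in for a real frozen colorization model, and the smoothness term is a simplification of the paper's Laplacian-based perceptual constraint, not its actual objective.

```python
# Illustrative sketch of optimizing an "uncolorable" perturbation.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class ToyColorizer(nn.Module):          # stub for a frozen colorization model
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1, 3, 3, padding=1)
    def forward(self, gray):
        return torch.sigmoid(self.net(gray))

laplacian = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_energy(x):
    # Keep the perturbation low-frequency so it stays inconspicuous
    # (a simplification of the paper's perceptual-quality constraint).
    return F.conv2d(x, laplacian, padding=1).abs().mean()

colorizer = ToyColorizer().eval()
gray = torch.rand(1, 1, 64, 64)
target_color = colorizer(gray).detach()          # colorization of the clean image
delta = torch.zeros_like(gray, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
eps = 4 / 255                                    # imperceptibility budget

for step in range(100):
    adv = (gray + delta).clamp(0, 1)
    # Push the colorized output away from the clean result; keep delta smooth.
    loss = -F.mse_loss(colorizer(adv), target_color) + 0.1 * laplacian_energy(delta)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                  # keep perturbation imperceptible
print("final loss:", loss.item())
```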
[205] Speculative Jacobi-Denoising Decoding for Accelerating Autoregressive Text-to-image Generation
Yao Teng, Fuyun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, Xihui Liu
Main category: cs.CV
TL;DR: SJD2 is a framework that accelerates autoregressive text-to-image models by enabling parallel token generation through speculative Jacobi-denoising decoding, reducing the number of model forward passes while maintaining image quality.
Details
Motivation: Autoregressive text-to-image models suffer from slow inference due to sequential token-by-token decoding, requiring thousands of model forward passes per image generation.
Method: Introduces speculative Jacobi-denoising decoding with a next-clean-token prediction paradigm, fine-tuning models to accept noise-perturbed embeddings and predict clean tokens, enabling parallel token generation through iterative denoising in embedding space.
Result: The method accelerates generation by reducing model forward passes while maintaining visual quality of generated images.
Conclusion: SJD2 successfully addresses the inefficiency of autoregressive models through parallel token generation using denoising-based Jacobi iterations.
Abstract: As a new paradigm of visual content generation, autoregressive text-to-image models suffer from slow inference due to their sequential token-by-token decoding process, often requiring thousands of model forward passes to generate a single image. To address this inefficiency, we propose Speculative Jacobi-Denoising Decoding (SJD2), a framework that incorporates the denoising process into Jacobi iterations to enable parallel token generation in autoregressive models. Our method introduces a next-clean-token prediction paradigm that enables the pre-trained autoregressive models to accept noise-perturbed token embeddings and predict the next clean tokens through low-cost fine-tuning. This denoising paradigm guides the model towards more stable Jacobi trajectories. During inference, our method initializes token sequences with Gaussian noise and performs iterative next-clean-token-prediction in the embedding space. We employ a probabilistic criterion to verify and accept multiple tokens in parallel, and refine the unaccepted tokens for the next iteration with the denoising trajectory. Experiments show that our method can accelerate generation by reducing model forward passes while maintaining the visual quality of generated images.
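The core Jacobi-decoding mechanics can be seen in a toy fixed-point loop: guess a whole block of tokens, run one parallel pass, accept the longest self-consistent prefix, and refine the rest. The sketch below replaces the model with a deterministic stub and omits SJD2's denoising of noise-perturbed embeddings and its probabilistic acceptance rule.

```python
# Toy Jacobi decoding: the whole block is proposed in one parallel pass and the
# self-consistent prefix is accepted, so far fewer passes than sequential steps
# are needed. `next_token` is a stand-in for greedy decoding from a real model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def next_token(prefix):
    # Deterministic stub. A real model is only weakly sensitive to far-away
    # guessed tokens, which is what lets several tokens pass in one iteration.
    return (31 * len(prefix) + prefix[-1] % 2) % VOCAB

def jacobi_decode(prompt, block_len=16):
    guess = list(rng.integers(0, VOCAB, block_len))    # noisy initialization
    accepted, n_passes = 0, 0
    while accepted < block_len:
        # One parallel forward pass proposes a token for every open position.
        proposals = [next_token(prompt + guess[:i]) for i in range(accepted, block_len)]
        n_passes += 1
        base = accepted
        for i, p in enumerate(proposals):
            if guess[base + i] == p:
                accepted += 1                          # prefix already consistent
            else:
                guess[base + i:] = proposals[i:]       # refine the rest, retry
                break
    return guess, n_passes

tokens, n_passes = jacobi_decode([1, 2, 3])
print(f"decoded {len(tokens)} tokens in {n_passes} parallel passes (vs. 16 sequential)")
```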
[206] On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Hoigi Seo, Dong Un Kang, Hyunjin Cho, Joohoon Lee, Se Young Chun
Main category: cs.CV
TL;DR: The paper identifies uncertain visual tokens in vision encoders as a key cause of object hallucination in large vision-language models and proposes a method to mitigate this by masking these uncertain tokens during self-attention.
Details
Motivation: Object hallucination, i.e., generating descriptions of objects not present in the input image, remains a critical challenge in large vision-language models despite their success.
Method: Proposes identifying uncertain visual tokens via adversarial perturbations and masking them during self-attention in the middle layers of the vision encoder to suppress their influence on visual encoding.
Result: Extensive experiments show the method significantly reduces object hallucinations in LVLMs and can work synergistically with other existing techniques.
Conclusion: Uncertain visual tokens contribute to object hallucination, and selectively masking them in the vision encoder provides an effective strategy to mitigate this issue while maintaining compatibility with other methods.
Abstract: Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, there are still crucial challenges in LVLMs such as object hallucination, generating descriptions of objects that are not in the input image. Here, we argue that uncertain visual tokens within the VE is a key factor that contributes to object hallucination. Our statistical analysis found that there are positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying the VE only. Our method comprises a proxy method with adversarial perturbations for identifying uncertain visual tokens efficiently and a method to mask these uncertain visual tokens during the self-attention process in the middle layers of the VE, suppressing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and can synergistically work with other prior arts.
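A minimal rendering of the two-step recipe, using a toy encoder: measure how much each token's early-layer feature moves under a small input perturbation, then mask the most unstable tokens out of a later layer's self-attention. The toy ViT, the use of random rather than adversarial noise, and the fixed top-k rule are all simplifying assumptions.

```python
# Sketch: tokens whose early-layer features shift most under a small input
# perturbation are treated as epistemically uncertain and masked out of
# middle-layer self-attention.
import torch
import torch.nn as nn

torch.manual_seed(0)

embed = nn.Linear(48, 64)                        # toy patch embedding
early = nn.TransformerEncoderLayer(64, 4, batch_first=True)
middle = nn.TransformerEncoderLayer(64, 4, batch_first=True)

patches = torch.rand(1, 196, 48)                 # 14x14 patch tokens

# 1) Identify uncertain tokens by feature deviation under perturbation.
with torch.no_grad():
    clean = early(embed(patches))
    noisy = early(embed(patches + 0.03 * torch.randn_like(patches)))
    deviation = (clean - noisy).norm(dim=-1).squeeze(0)    # (196,)
uncertain = deviation.topk(k=20).indices         # the 20 most unstable tokens

# 2) Mask them out of self-attention in the middle layer.
n = patches.shape[1]
attn_mask = torch.zeros(n, n)
attn_mask[:, uncertain] = float("-inf")          # no token attends to them
with torch.no_grad():
    out = middle(clean, src_mask=attn_mask)
print("masked tokens:", uncertain.tolist()[:5], "...")
```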
[207] Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy
Xiaoxiao Ma, Feng Zhao, Pengyang Ling, Haibo Qiu, Zhixiang Wei, Hu Yu, Jie Huang, Zhixiong Zeng, Lin Ma
Main category: cs.CV
TL;DR: The paper proposes an entropy-informed decoding strategy for autoregressive image generation that improves quality and speed by using dynamic temperature control based on spatial entropy and entropy-aware speculative decoding.
Details
Motivation: Current autoregressive image generation models have sampling issues due to lower information density and non-uniform spatial distribution of image tokens compared to text tokens.
Method: Two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, and 2) entropy-aware acceptance rules in speculative decoding.
Result: Achieves higher generation quality with faster synthesis speed, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods.
Conclusion: The approach effectively enhances both generation quality and sampling speed across multiple benchmarks and diverse AR image generation models.
Abstract: In this work, we first revisit the sampling issues in current autoregressive (AR) image generation models and identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. Accordingly, we present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed. Specifically, the proposed method introduces two main innovations: 1) dynamic temperature control guided by spatial entropy of token distributions, enhancing the balance between content diversity, alignment accuracy, and structural coherence in both mask-based and scale-wise models, without extra computational overhead, and 2) entropy-aware acceptance rules in speculative decoding, achieving near-lossless generation at about 85% of the inference cost of conventional acceleration methods. Extensive experiments across multiple benchmarks using diverse AR image generation models demonstrate the effectiveness and generalizability of our approach in enhancing both generation quality and sampling speed.
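The temperature-control half of the method can be sketched as follows: compute the entropy of each token distribution and modulate the sampling temperature accordingly. The linear schedule and its direction (higher entropy, lower temperature) are illustrative assumptions, not the paper's exact rule.

```python
# Sketch of entropy-informed sampling: per-position temperature is set from the
# normalized entropy of that position's token distribution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def entropy_scaled_sample(logits, t_min=0.7, t_max=1.3):
    probs = F.softmax(logits, dim=-1)
    ent = -(probs * probs.clamp_min(1e-9).log()).sum(-1)               # per-token entropy
    ent_norm = ent / torch.log(torch.tensor(float(logits.shape[-1])))  # scale to [0, 1]
    temp = t_max - (t_max - t_min) * ent_norm                          # high entropy -> low T
    scaled = logits / temp.unsqueeze(-1)
    return torch.multinomial(F.softmax(scaled, dim=-1), 1).squeeze(-1)

logits = torch.randn(16, 1024)       # 16 image-token positions, 1024-way codebook
tokens = entropy_scaled_sample(logits)
print(tokens[:8])
```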
[208] Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: The paper introduces Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL), proposing a dual-view framework called DuNe that achieves state-of-the-art performance across multiple datasets with noisy labels.
Details
Motivation: LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors, which degrades segmentation accuracy and is amplified under domain shifts. Existing noisy-label learning methods from 2D images do not transfer well to 3D LiDAR data because of the sparse and irregular structure of point clouds.
Method: Proposed DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions.
Result: Achieved 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean of 49.57% and a Harmonic Mean of 48.50%.
Conclusion: The proposed DuNe framework demonstrates robust domain generalization in LiDAR semantic segmentation tasks with noisy labels, outperforming existing noisy-label learning approaches adapted from image classification.
Abstract: Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
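A minimal sketch of the dual-view objective under label noise, assuming a simple MSE consistency term and a fixed confidence threshold (both stand-ins for DuNe's actual branch designs and filtering rule):

```python
# Dual-view training under noisy labels: feature consistency between strong and
# weak branches, plus cross-entropy applied only where predictions are confident.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_points, n_classes, d = 1024, 20, 32

feat_strong = torch.randn(n_points, d, requires_grad=True)
feat_weak = feat_strong.detach() + 0.1 * torch.randn(n_points, d)
logits = torch.randn(n_points, n_classes, requires_grad=True)
noisy_labels = torch.randint(0, n_classes, (n_points,))

# Feature-level consistency between the two views.
loss_consist = F.mse_loss(feat_strong, feat_weak)

# Confidence-aware filtering: trust the (possibly noisy) label only where the
# prediction is confident; ignore the rest.
probs = F.softmax(logits, dim=-1)
conf, _ = probs.max(dim=-1)
keep = conf > 0.15                                  # illustrative threshold
loss_ce = F.cross_entropy(logits[keep], noisy_labels[keep])

loss = loss_ce + 0.5 * loss_consist                 # illustrative weighting
loss.backward()
print(f"kept {keep.float().mean().item():.1%} of points; loss = {loss.item():.3f}")
```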
[209] Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training
Mahdi Abavisani, Hamid Reza Vaezi Joze, Vishal M. Patel
Main category: cs.CV
TL;DR: Efficient approach for training unimodal 3D-CNNs using multimodal knowledge through collaboration between modality-specific networks with spatiotemporal semantic alignment loss and focal regularization.
Details
Motivation: To improve unimodal 3D-CNN performance for dynamic hand gesture recognition by embedding multimodal knowledge without explicit multimodal fusion, enabling better individual modality networks.
Method: Separate networks per modality with collaboration enforced through spatiotemporal semantic alignment (SSA) loss and focal regularization to prevent negative knowledge transfer.
Result: Improved test time recognition accuracy for unimodal networks and state-of-the-art performance on various dynamic hand gesture recognition datasets.
Conclusion: The proposed framework successfully embeds multimodal knowledge into unimodal networks through collaboration and alignment, achieving superior performance without explicit multimodal fusion at test time.
Abstract: We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicate separate networks per available modality and enforce them to collaborate and learn to develop networks with common semantics and better representations. We introduce a “spatiotemporal semantic alignment” loss (SSA) to align the content of the features from different networks. In addition, we regularize this loss with our proposed “focal regularization parameter” to avoid negative knowledge transfer. Experimental results show that our framework improves the test time recognition accuracy of unimodal networks, and provides the state-of-the-art performance on various dynamic hand gesture recognition datasets.
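One way to picture the SSA loss with focal regularization: align the correlation structure of the two modality networks' spatiotemporal features, gated so that transfer only happens when the source branch is currently the stronger one. The correlation-based alignment and exponential gate below are assumptions in the spirit of the abstract, not the paper's exact formulation.

```python
# Hedged sketch of cross-modal feature alignment with a focal gate.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def correlation(feat):
    # feat: (batch, channels, time, h, w) -> (batch, positions, positions)
    b, c = feat.shape[:2]
    f = F.normalize(feat.reshape(b, c, -1), dim=1)
    return f.transpose(1, 2) @ f      # cosine similarity between positions

def ssa_with_focal(feat_rgb, feat_depth, loss_rgb, loss_depth):
    align = F.mse_loss(correlation(feat_rgb), correlation(feat_depth).detach())
    # Focal regularization: transfer from depth to RGB only while the depth
    # branch outperforms the RGB branch, avoiding negative transfer.
    gate = torch.exp(loss_rgb.detach() - loss_depth.detach()) - 1.0
    return align * gate.clamp_min(0.0)

feat_rgb = torch.randn(2, 16, 4, 7, 7, requires_grad=True)
feat_depth = torch.randn(2, 16, 4, 7, 7)
loss = ssa_with_focal(feat_rgb, feat_depth, torch.tensor(1.2), torch.tensor(0.8))
loss.backward()
print("SSA loss:", loss.item())
```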
[210] Lesion-Aware Post-Training of Latent Diffusion Models for Synthesizing Diffusion MRI from CT Perfusion
Junhyeok Lee, Hyunwoong Kim, Hyungjin Chung, Heeseong Eom, Joon Jang, Chul-Ho Sohn, Kyu Sung Choi
Main category: cs.CV
TL;DR: A post-training framework for latent diffusion models that incorporates lesion-aware pixel space objectives to improve medical image-to-image translation, particularly for preserving clinically significant structures like lesions in brain CT-to-MRI translation for stroke diagnosis.
Details
Motivation: Latent diffusion models trade off pixel-level detail for efficiency, which is problematic for medical images where small lesions require high-fidelity reconstruction. This limitation affects diagnostic reliability and clinical decision-making.
Method: Proposed a novel post-training framework that incorporates lesion-aware medical pixel space objectives into pre-trained latent diffusion models, focusing on enhancing lesion delineation while maintaining overall image quality.
Result: The framework improved overall image quality and enhanced lesion delineation when synthesizing DWI and ADC images from CT perfusion scans, outperforming existing image-to-image translation models on a dataset of 817 acute ischemic stroke patients.
Conclusion: The post-training strategy is easily adaptable to pre-trained LDMs and shows substantial potential for broader applications across diverse medical image translation tasks, addressing critical clinical needs in stroke diagnosis where MRI accessibility is limited.
Abstract: Image-to-Image translation models can help mitigate various challenges inherent to medical image acquisition. Latent diffusion models (LDMs) leverage efficient learning in compressed latent space and constitute the core of state-of-the-art generative image models. However, this efficiency comes with a trade-off, potentially compromising crucial pixel-level detail essential for high-fidelity medical images. This limitation becomes particularly critical when generating clinically significant structures, such as lesions, which often occupy only a small portion of the image. Failure to accurately reconstruct these regions can severely impact diagnostic reliability and clinical decision-making. To overcome this limitation, we propose a novel post-training framework for LDMs in medical image-to-image translation by incorporating lesion-aware medical pixel space objectives. This approach is essential, as it not only enhances overall image quality but also improves the precision of lesion delineation. We evaluate our framework on brain CT-to-MRI translation in acute ischemic stroke patients, where early and accurate diagnosis is critical for optimal treatment selection and improved patient outcomes. While diffusion MRI is the gold standard for stroke diagnosis, its clinical utility is often constrained by high costs and low accessibility. Using a dataset of 817 patients, we demonstrate that our framework improves overall image quality and enhances lesion delineation when synthesizing DWI and ADC images from CT perfusion scans, outperforming existing image-to-image translation models. Furthermore, our post-training strategy is easily adaptable to pre-trained LDMs and exhibits substantial potential for broader applications across diverse medical image translation tasks.
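The pixel-space idea reduces to a weighted reconstruction term: decode to pixels and up-weight error inside the lesion mask. The weighting scheme below, and how such a term would be combined with the LDM objective during post-training, are illustrative assumptions.

```python
# Sketch of a lesion-aware pixel-space objective: up-weight reconstruction
# error inside an annotated lesion region.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def lesion_aware_loss(pred_img, target_img, lesion_mask, lesion_weight=10.0):
    err = F.l1_loss(pred_img, target_img, reduction="none")
    weight = 1.0 + (lesion_weight - 1.0) * lesion_mask   # 1 outside, 10 inside
    return (weight * err).sum() / weight.sum()

pred = torch.rand(1, 1, 128, 128, requires_grad=True)    # decoded DWI estimate
target = torch.rand(1, 1, 128, 128)                      # ground-truth DWI
mask = torch.zeros(1, 1, 128, 128)
mask[..., 40:60, 50:80] = 1.0                            # annotated lesion region

loss = lesion_aware_loss(pred, target, mask)
loss.backward()
print("lesion-aware loss:", loss.item())
```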
[211] Deep Sparse Representation-based Classification
Mahdi Abavisani, Vishal M. Patel
Main category: cs.CV
TL;DR: A transductive deep learning approach for sparse representation-based classification using a convolutional autoencoder with a fully-connected layer for sparse coding.
Details
Motivation: To improve sparse representation-based classification (SRC) by leveraging deep learning to learn robust features and find better sparse representations.
Method: Proposed network combines a convolutional autoencoder for feature learning with a fully-connected layer between encoder and decoder for sparse representation estimation.
Result: Experiments on three datasets show the method produces sparse representations that yield better classification results than state-of-the-art SRC methods.
Conclusion: The deep learning-based formulation successfully enhances SRC performance through learned deep features and improved sparse coding.
Abstract: We present a transductive deep learning-based formulation for the sparse representation-based classification (SRC) method. The proposed network consists of a convolutional autoencoder along with a fully-connected layer. The role of the autoencoder network is to learn robust deep features for classification. On the other hand, the fully-connected layer, which is placed in between the encoder and the decoder networks, is responsible for finding the sparse representation. The estimated sparse codes are then used for classification. Various experiments on three different datasets show that the proposed network leads to sparse representations that give better classification results than state-of-the-art SRC methods. The source code is available at: github.com/mahdiabavisani/DSRC.
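The authors' full implementation is in the linked repository; as an independent, simplified rendering of the architecture, the sketch below places a fully-connected bottleneck between a small convolutional encoder and decoder and pushes its activations toward sparsity with an L1 penalty. Layer sizes are illustrative.

```python
# Compact sketch: convolutional autoencoder with a fully-connected bottleneck
# whose activations serve as the sparse representation.
import torch
import torch.nn as nn

torch.manual_seed(0)

class DeepSparseAE(nn.Module):
    def __init__(self, code_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.to_code = nn.Linear(32 * 8 * 8, code_dim)   # sparse representation
        self.from_code = nn.Linear(code_dim, 32 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        code = self.to_code(h)
        out = self.decoder(self.from_code(code).view(-1, 32, 8, 8))
        return out, code

model = DeepSparseAE()
x = torch.rand(4, 1, 32, 32)
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x) + 1e-3 * code.abs().mean()  # L1 sparsity
loss.backward()
print(recon.shape, "mean |code| =", code.abs().mean().item())
```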
[212] Visual Anomaly Detection for Reliable Robotic Implantation of Flexible Microelectrode Array
Yitong Chen, Xinyao Xu, Ping Zhu, Xinyong Han, Fangbo Qin, Shan Yu
Main category: cs.CV
TL;DR: An image-based anomaly detection framework using vision transformers for monitoring flexible microelectrode implantation in brain cortex, with progressive granularity patch feature sampling and channel selection to address sensitivity-tolerance trade-offs.
Details
Motivation: Flexible microelectrode implantation is challenging due to the probe's deformable structure and interaction with critical brain tissue, requiring careful monitoring for reliability and safety.
Method: Uses microscopic cameras and a unified framework with four checkpoints. Extracts aligned ROIs from images, applies pretrained ViT with progressive granularity patch feature sampling, and selects high signal-to-noise ratio feature channels for specific scenes.
Result: The proposed methods were validated with image datasets from the implantation system, demonstrating effectiveness in anomaly detection.
Conclusion: The image-based framework successfully monitors FME implantation process at multiple checkpoints, addressing sensitivity-tolerance trade-offs through progressive feature sampling and channel selection.
Abstract: Flexible microelectrode (FME) implantation into brain cortex is challenging due to the deformable fiber-like structure of FME probe and the interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off issue at different locations. Moreover, we select a part of feature channels with higher signal-to-noise ratios from the raw general ViT features, to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.
[213] MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling
Weijia Wang, Yuanzhi Su, Pei-Gen Ye, Yuan-Gen Wang, Xuequan Lu
Main category: cs.CV
TL;DR: MambaH-Fit is a state space modeling framework for point cloud normal estimation that uses attention-driven hierarchical feature fusion and patch-wise state space modeling to better capture fine-grained geometric structures.
Details
Motivation: Existing normal estimation methods struggle to model fine-grained geometric structures, and current Mamba-based approaches focus mainly on global shape understanding while neglecting local geometric details.
Method: Uses Attention-driven Hierarchical Feature Fusion (AHFF) to fuse multi-scale patch features, and Patch-wise State Space Model (PSSM) to model point cloud patches as implicit hyper-surfaces via state dynamics.
Result: Outperforms existing methods on benchmark datasets in accuracy, robustness, and flexibility. Ablation studies confirm the effectiveness of the proposed components.
Conclusion: The framework successfully addresses the limitations of existing methods by effectively modeling fine-grained geometric structures for improved point cloud normal estimation.
Abstract: We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.
[214] GL-DT: Multi-UAV Detection and Tracking with Global-Local Integration
Juanqin Liu, Leonardo Plotegher, Eloy Roura, Shaoming He
Main category: cs.CV
TL;DR: Proposes GL-DT framework with STFF module and JPTrack algorithm for UAV multi-object tracking, addressing small targets and occlusion challenges while maintaining real-time performance.
Details
Motivation: Urgent need for accurate MOT technologies in UAV applications due to challenges from complex backgrounds, small-scale targets, and frequent occlusions that affect detection accuracy and trajectory continuity.
Method: GL-DT framework with Spatio-Temporal Feature Fusion module for joint motion/appearance modeling, global-local collaborative detection strategy for small targets, and JPTrack tracking algorithm to reduce ID switches and fragmentation.
Result: Significantly improves MOT continuity and stability while maintaining real-time performance, effectively enhancing small-target detection capabilities.
Conclusion: Provides strong support for advancing UAV detection and tracking technologies by addressing key challenges in complex UAV environments.
Abstract: The extensive application of unmanned aerial vehicles (UAVs) in military reconnaissance, environmental monitoring, and related domains has created an urgent need for accurate and efficient multi-object tracking (MOT) technologies, which are also essential for UAV situational awareness. However, complex backgrounds, small-scale targets, and frequent occlusions and interactions continue to challenge existing methods in terms of detection accuracy and trajectory continuity. To address these issues, this paper proposes the Global-Local Detection and Tracking (GL-DT) framework. It employs a Spatio-Temporal Feature Fusion (STFF) module to jointly model motion and appearance features, combined with a global-local collaborative detection strategy, effectively enhancing small-target detection. Building upon this, the JPTrack tracking algorithm is introduced to mitigate common issues such as ID switches and trajectory fragmentation. Experimental results demonstrate that the proposed approach significantly improves the continuity and stability of MOT while maintaining real-time performance, providing strong support for the advancement of UAV detection and tracking technologies.
[215] Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
Youwei Zheng, Yuxi Ren, Xin Xia, Xuefeng Xiao, Xiaohua Xie
Main category: cs.CV
TL;DR: Dense2MoE transforms dense Diffusion Transformers into Mixture of Experts structure for efficient text-to-image generation, reducing activated parameters by 60% while maintaining performance.
Details
Motivation: Diffusion Transformers have large parameter sizes causing substantial inference overhead, and existing compression methods like pruning often lead to severe performance degradation due to reduced model capacity.
Method: Replace FFNs in DiT Blocks with MoE layers (62.5% reduction in activated FFN parameters) and propose Mixture of Blocks for selective block activation. Use multi-step distillation pipeline with Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization.
Result: Transformed large diffusion transformers (e.g., FLUX.1) into MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches.
Conclusion: Dense2MoE establishes a new paradigm for efficient text-to-image generation through structured sparsification while preserving model capacity.
Abstract: Diffusion Transformer (DiT) has demonstrated remarkable performance in text-to-image generation; however, its large parameter size results in substantial inference overhead. Existing parameter compression methods primarily focus on pruning, but aggressive pruning often leads to severe performance degradation due to reduced model capacity. To address this limitation, we pioneer the transformation of a dense DiT into a Mixture of Experts (MoE) for structured sparsification, reducing the number of activated parameters while preserving model capacity. Specifically, we replace the Feed-Forward Networks (FFNs) in DiT Blocks with MoE layers, reducing the number of activated parameters in the FFNs by 62.5%. Furthermore, we propose the Mixture of Blocks (MoB) to selectively activate DiT blocks, thereby further enhancing sparsity. To ensure an effective dense-to-MoE conversion, we design a multi-step distillation pipeline, incorporating Taylor metric-based expert initialization, knowledge distillation with load balancing, and group feature loss for MoB optimization. We transform large diffusion transformers (e.g., FLUX.1 [dev]) into an MoE structure, reducing activated parameters by 60% while maintaining original performance and surpassing pruning-based approaches in extensive experiments. Overall, Dense2MoE establishes a new paradigm for efficient text-to-image generation.
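The dense-FFN-to-MoE swap at the heart of Dense2MoE can be sketched with a standard top-k-gated expert layer, where each token activates only a fraction of the FFN parameters. Expert count and k below are illustrative, and the distillation pipeline is not reproduced.

```python
# Sketch of replacing a dense FFN with a top-k gated Mixture of Experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class MoEFFN(nn.Module):
    def __init__(self, dim=64, hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                hit = idx[:, slot] == e        # tokens routed to expert e
                if hit.any():
                    out[hit] += weights[hit, slot].unsqueeze(1) * expert(x[hit])
        return out

ffn = MoEFFN()
tokens = torch.randn(10, 64)
print(ffn(tokens).shape)   # with k=2 of 8 experts, ~25% of FFN params active
```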
[216] A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans
Irash Perera, Uthayasanker Thayasivam
Main category: cs.CV
TL;DR: A novel multi-branch ConvNeXt architecture for medical image analysis achieves state-of-the-art performance in COVID-19 diagnosis from CT scans, with ROC-AUC of 0.9937 and validation accuracy of 0.9757.
Details
Motivation: Intelligent medical imaging analysis is crucial for clinical diagnosis, especially for identifying subtle pathological features. The paper addresses the need for robust deep learning models for medical diagnostics.
Method: Multi-branch ConvNeXt architecture with three parallel branches (Global Average Pooling, Global Max Pooling, and Attention-weighted Pooling), using end-to-end pipeline with data preprocessing, augmentation, and two-phase training strategy with transfer learning.
Result: Achieved ROC-AUC of 0.9937, validation accuracy of 0.9757, and F1-score of 0.9825 for COVID-19 cases on a dataset of 2,609 CT slices, outperforming all previously reported models.
Conclusion: Modern multi-branch architecture with careful data handling can achieve performance comparable to or exceeding state-of-the-art models, proving the efficacy of advanced deep learning for robust medical diagnostics.
Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis, especially for identifying subtle pathological features. This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis. While applied here to the specific problem of COVID-19 diagnosis, the methodology offers a generalizable framework for classifying a wide range of pathologies from CT scans. The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing and augmentation to a disciplined two-phase training strategy that leverages transfer learning effectively. The architecture uniquely integrates features extracted from three parallel branches: Global Average Pooling, Global Max Pooling, and a new Attention-weighted Pooling mechanism. The model was trained and validated on a combined dataset of 2,609 CT slices derived from two distinct datasets. Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases, outperforming all previously reported models on this dataset. These findings indicate that a modern, multi-branch architecture, coupled with careful data handling, can achieve performance comparable to or exceeding contemporary state-of-the-art models, thereby proving the efficacy of advanced deep learning techniques for robust medical diagnostics.
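The three-branch head is straightforward to sketch: global average pooling, global max pooling, and an attention-weighted pooling branch, concatenated before the classifier. The backbone is replaced by a stub feature map, and the 1x1-conv attention branch is an assumption about how the attention weights are produced.

```python
# Sketch of the three-branch pooling head over backbone features.
import torch
import torch.nn as nn

torch.manual_seed(0)

class MultiBranchHead(nn.Module):
    def __init__(self, channels=768, n_classes=2):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)   # spatial attention map
        self.fc = nn.Linear(3 * channels, n_classes)

    def forward(self, feat):                  # feat: (B, C, H, W) from backbone
        gap = feat.mean(dim=(2, 3))                             # global average
        gmp = feat.amax(dim=(2, 3))                             # global max
        a = torch.softmax(self.attn(feat).flatten(2), dim=-1)   # (B, 1, H*W)
        awp = (feat.flatten(2) * a).sum(dim=-1)                 # attention-weighted
        return self.fc(torch.cat([gap, gmp, awp], dim=1))

head = MultiBranchHead()
features = torch.randn(4, 768, 7, 7)          # stand-in for ConvNeXt features
print(head(features).shape)                   # (4, 2) logits
```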
[217] SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding
Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Ranjay Krishna
Main category: cs.CV
TL;DR: SOS is a scalable data synthesis pipeline that generates high-quality synthetic object segments using object-centric composition, structured layout priors, and generative relighting, outperforming models trained on larger real-image datasets.
Details
Motivation: Large annotated datasets for visual grouping tasks are costly, biased, and hard to scale. Synthetic data offers promise but often lacks flexibility, accuracy, and compositional diversity.
Method: Object-centric composition strategy that pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting to produce accurate masks, boxes, and referring expressions.
Result: Models trained on 100K SOS synthetic images outperform those trained on larger real datasets (GRIT with 20M images, V3Det with 200K images), achieving +10.9 AP on LVIS detection and +8.4 NAcc on gRefCOCO grounding. Also improves performance in low-data settings.
Conclusion: SOS enables controllable dataset construction and improves generalization across various data scales, supporting targeted data generation for challenging visual tasks.
Abstract: Visual grouping – operationalized via instance segmentation, visual grounding, and object detection – underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 $N_{\text{Acc}}$ on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 $AP_{\text{rare}}$ on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.
[218] MSDM: Generating Task-Specific Pathology Images with a Multimodal Conditioned Diffusion Model for Cell and Nuclei Segmentation
Dominik Winter, Mai Bui, Monica Azqueta Gavaldon, Nicolas Triltsch, Marco Rosati, Nicolas Brieu
Main category: cs.CV
TL;DR: MSDM is a multimodal semantic diffusion model that generates realistic image-mask pairs for cell/nuclei segmentation, using morphological maps, RGB colors, and metadata to create synthetic data that improves segmentation accuracy.
Details
Motivation: Addressing the scarcity of annotated data for rare cell morphologies in computational pathology, where manual annotation is expensive and labor-intensive.
Method: Multimodal Semantic Diffusion Model (MSDM) conditioned on cellular morphologies (horizontal/vertical maps), RGB color characteristics, and BERT-encoded assay metadata, integrated via multi-head cross-attention.
Result: Synthetic images closely match real data with low Wasserstein distances, and incorporating synthetic samples significantly improves segmentation model accuracy on rare cell types like columnar cells.
Conclusion: Multimodal diffusion-based augmentation effectively enhances robustness and generalizability of cell/nuclei segmentation models, paving the way for broader generative model applications in computational pathology.
Abstract: Scarcity of annotated data, particularly for rare or atypical morphologies, presents significant challenges for cell and nuclei segmentation in computational pathology. While manual annotation is labor-intensive and costly, synthetic data offers a cost-effective alternative. We introduce a Multimodal Semantic Diffusion Model (MSDM) for generating realistic pixel-precise image-mask pairs for cell and nuclei segmentation. By conditioning the generative process with cellular/nuclear morphologies (using horizontal and vertical maps), RGB color characteristics, and BERT-encoded assay/indication metadata, MSDM generates datasets with desired morphological properties. These heterogeneous modalities are integrated via multi-head cross-attention, enabling fine-grained control over the generated images. Quantitative analysis demonstrates that synthetic images closely match real data, with low Wasserstein distances between embeddings of generated and real images under matching biological conditions. The incorporation of these synthetic samples, exemplified by columnar cells, significantly improves segmentation model accuracy on this cell type. This strategy systematically enriches datasets, directly targeting model deficiencies. We highlight the effectiveness of multimodal diffusion-based augmentation for advancing the robustness and generalizability of cell and nuclei segmentation models. Thereby, we pave the way for broader application of generative models in computational pathology.
[219] Polar Separable Transform for Efficient Orthogonal Rotation-Invariant Image Representation
Satya P. Singh, Rashmi Chaudhry, Anand Srivastava, Jagath C. Rajapakse
Main category: cs.CV
TL;DR: PSepT is a novel separable orthogonal transform that overcomes the non-separability barrier in polar coordinates, achieving exponential improvements in computational complexity, memory requirements, and numerical stability compared to classical orthogonal moment methods.
Details
Motivation: Classical orthogonal moment-based image representations (like Zernike and pseudo-Zernike moments) suffer from high computational complexity (O(n³N²) to O(n⁶N²)), numerical instability at large orders, and poor condition number scaling (O(N⁴)), which limits their practical application.
Method: PSepT uses tensor-product construction combining Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling complete kernel factorization and independent radial-angular processing in polar coordinates.
Result: PSepT achieves O(N² log N) computational complexity, O(N²) memory requirements, and O(√N) condition number scaling - exponential improvements over classical methods. It maintains orthogonality, completeness, energy conservation, and rotation-covariance while enabling exact reconstruction.
Conclusion: PSepT enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications with superior numerical stability and computational efficiency.
Abstract: Orthogonal moment-based image representations are fundamental in computer vision, but classical methods suffer from high computational complexity and numerical instability at large orders. Zernike and pseudo-Zernike moments, for instance, require coupled radial-angular processing that precludes efficient factorization, resulting in $\mathcal{O}(n^3N^2)$ to $\mathcal{O}(n^6N^2)$ complexity and $\mathcal{O}(N^4)$ condition number scaling for the $n$th-order moments on an $N\times N$ image. We introduce \textbf{PSepT} (Polar Separable Transform), a separable orthogonal transform that overcomes the non-separability barrier in polar coordinates. PSepT achieves complete kernel factorization via tensor-product construction of Discrete Cosine Transform (DCT) radial bases and Fourier harmonic angular bases, enabling independent radial and angular processing. This separable design reduces computational complexity to $\mathcal{O}(N^2 \log N)$, memory requirements to $\mathcal{O}(N^2)$, and condition number scaling to $\mathcal{O}(\sqrt{N})$, representing exponential improvements over polynomial approaches. PSepT exhibits orthogonality, completeness, energy conservation, and rotation-covariance properties. Experimental results demonstrate better numerical stability, computational efficiency, and competitive classification performance on structured datasets, while preserving exact reconstruction. The separable framework enables high-order moment analysis previously infeasible with classical methods, opening new possibilities for robust image analysis applications.
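The separability claim is easy to demonstrate: once the image is resampled onto a polar grid, the transform factorizes into a 1-D DCT along the radius and a 1-D FFT along the angle, giving fast forward/inverse passes and rotation covariance as a pure phase shift. The polar resampling itself is glossed over with a toy grid in this sketch.

```python
# Separable polar transform in code: DCT over radius, FFT over angle, applied
# to a stand-in for a polar-resampled image P[r, theta].
import numpy as np
from scipy.fft import dct, idct, fft, ifft

rng = np.random.default_rng(0)
n_r, n_theta = 64, 128
polar_img = rng.random((n_r, n_theta))

# Forward: 1-D DCT along the radial axis, then 1-D FFT along the angular axis.
coeffs = fft(dct(polar_img, axis=0, norm="ortho"), axis=1)

# Inverse: exact reconstruction, as expected from an orthogonal transform.
recon = idct(ifft(coeffs, axis=1).real, axis=0, norm="ortho")
print("max reconstruction error:", np.abs(recon - polar_img).max())

# Rotation covariance: a rotation is a cyclic shift along theta, which only
# multiplies angular frequency m by the phase exp(-2j*pi*m*shift/n_theta).
shifted = np.roll(polar_img, 5, axis=1)
phase = np.exp(-2j * np.pi * 5 * np.arange(n_theta) / n_theta)
assert np.allclose(fft(dct(shifted, axis=0, norm="ortho"), axis=1), coeffs * phase)
print("rotation acts as a pure phase on the angular spectrum.")
```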
[220] Training Feature Attribution for Vision Models
Aziz Bacha, Thomas George
Main category: cs.CV
TL;DR: This paper introduces training feature attribution, a method that links test predictions to specific regions in training images, providing joint analysis of input features and influential training examples for better model explainability.
Details
Motivation: Deep neural networks are opaque systems that need explainability methods. Existing approaches only attribute predictions to input features or training examples separately, but both perspectives should be studied jointly to improve trust and accountability.
Method: The paper proposes training feature attribution, which connects test-time predictions to specific regions within specific training images, allowing for fine-grained analysis of model behavior.
Result: Experiments on vision datasets show that training feature attribution provides test-specific explanations, identifies harmful training examples that cause misclassifications, and reveals spurious correlations like patch-based shortcuts that conventional methods miss.
Conclusion: Training feature attribution offers new insights into deep model inner workings by jointly analyzing input features and training examples, exposing previously hidden model behaviors and improving explainability.
Abstract: Deep neural networks are often considered opaque systems, prompting the need for explainability methods to improve trust and accountability. Existing approaches typically attribute test-time predictions either to input features (e.g., pixels in an image) or to influential training examples. We argue that both perspectives should be studied jointly. This work explores training feature attribution, which links test predictions to specific regions of specific training images and thereby provides new insights into the inner workings of deep models. Our experiments on vision datasets show that training feature attribution yields fine-grained, test-specific explanations: it identifies harmful examples that drive misclassifications and reveals spurious correlations, such as patch-based shortcuts, that conventional attribution methods fail to expose.
[221] Online Topological Localization for Navigation Assistance in Bronchoscopy
Clara Tomasini, Luis Riazuelo, Ana C. Murillo
Main category: cs.CV
TL;DR: A video bronchoscopy localization pipeline that provides topological navigation assistance without requiring patient CT scans, trained on phantom data with good generalization to real procedures.
Details
Motivation: Current bronchoscopy navigation methods require CT scans and additional sensors, which involve extra setup and training. Topological localization with a generic airway model can provide sufficient navigation assistance without these requirements.
Method: Image-based bronchoscopy topological localization pipeline trained only on phantom data to avoid costly real data labeling, using a generic airway model instead of patient-specific CT scans.
Result: The approach surpasses existing methods, particularly showing strong performance on real data test sequences despite being trained only on phantom data.
Conclusion: The proposed pipeline provides effective navigation assistance for bronchoscopy procedures without requiring patient CT scans, reducing setup complexity while maintaining good generalization from phantom to real data.
Abstract: Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.
[222] Instance-Level Generation for Representation Learning
Yankun Wu, Zakaria Laskar, Giorgos Kordopatis-Zilos, Noa Garcia, Giorgos Tolias
Main category: cs.CV
TL;DR: A novel approach for instance-level recognition that synthetically generates diverse object instances from multiple domains without real images, enabling efficient training of foundation models and improving retrieval performance across benchmarks.
Details
Motivation: Instance-level recognition requires fine-grained identification but faces challenges in creating large-scale annotated datasets due to its granular nature, limiting real-world applicability across domains.
Method: Synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds without relying on any real images, forming a large-scale training set for fine-tuning foundation vision models.
Result: Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains.
Conclusion: The approach offers an efficient and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where only domain names are needed as input, unlocking wide real-world applications.
Abstract: Instance-level recognition (ILR) focuses on identifying individual objects rather than broad categories, offering the highest granularity in image classification. However, this fine-grained nature makes creating large-scale annotated datasets challenging, limiting ILR’s real-world applicability across domains. To overcome this, we introduce a novel approach that synthetically generates diverse object instances from multiple domains under varied conditions and backgrounds, forming a large-scale training set. Unlike prior work on automatic data synthesis, our method is the first to address ILR-specific challenges without relying on any real images. Fine-tuning foundation vision models on the generated data significantly improves retrieval performance across seven ILR benchmarks spanning multiple domains. Our approach offers a new, efficient, and effective alternative to extensive data collection and curation, introducing a new ILR paradigm where the only input is the names of the target domains, unlocking a wide range of real-world applications.
[223] TARO: Toward Semantically Rich Open-World Object Detection
Yuchen Zhang, Yao Lu, Johannes Betz
Main category: cs.CV
TL;DR: TARO is a novel detection framework that identifies unknown objects and classifies them into meaningful coarse parent categories using semantic hierarchy, improving decision-making in safety-critical applications.
Details
Motivation: Current object detectors are limited by closed-world assumptions and treat all unknown objects as a single 'Unknown' class, which is insufficient for safety-critical applications where different types of unknowns require different responses.
Method: TARO uses a sparsemax-based head for objectness modeling, hierarchy-guided relabeling for auxiliary supervision, and a classification module that learns hierarchical relationships to categorize unknown objects into coarse parent classes.
Result: TARO can categorize up to 29.9% of unknown objects into meaningful coarse classes, significantly reduces confusion between unknown and known classes, and achieves competitive performance in unknown recall and known mAP.
Conclusion: TARO effectively bridges the gap in open-set detection by providing more descriptive categorization of unknown objects, enhancing decision-making capabilities in real-world scenarios.
Abstract: Modern object detectors are largely confined to a “closed-world” assumption, limiting them to a predefined set of classes and posing risks when encountering novel objects in real-world scenarios. While open-set detection methods aim to address this by identifying such instances as ‘Unknown’, this is often insufficient. Rather than treating all unknowns as a single class, assigning them more descriptive subcategories can enhance decision-making in safety-critical contexts. For example, identifying an object as an ‘Unknown Animal’ (requiring an urgent stop) versus ‘Unknown Debris’ (requiring a safe lane change) is far more useful than just ‘Unknown’ in autonomous driving. To bridge this gap, we introduce TARO, a novel detection framework that not only identifies unknown objects but also classifies them into coarse parent categories within a semantic hierarchy. TARO employs a unique architecture with a sparsemax-based head for modeling objectness, a hierarchy-guided relabeling component that provides auxiliary supervision, and a classification module that learns hierarchical relationships. Experiments show TARO can categorize up to 29.9% of unknowns into meaningful coarse classes, significantly reduce confusion between unknown and known classes, and achieve competitive performance in both unknown recall and known mAP. Code will be made available.
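The sparsemax-based objectness head builds on the standard sparsemax projection (Martins and Astudillo, 2016), which, unlike softmax, assigns exactly zero probability to low-scoring classes and so lets a detector abstain cleanly on unknowns. Below is a self-contained PyTorch sketch of sparsemax itself; how TARO wires it into its head is not specified in the summary and is left out.

```python
import torch

def sparsemax(z, dim=-1):
    """Sparsemax: Euclidean projection of logits z onto the probability simplex.
    Low-scoring entries receive exactly zero mass, unlike softmax."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)
    z_cumsum = z_sorted.cumsum(dim)
    # Support set: entries that remain nonzero after projection.
    support = (1 + k * z_sorted) > z_cumsum
    k_z = support.to(z.dtype).sum(dim=dim, keepdim=True)
    tau = (torch.where(support, z_sorted, torch.zeros_like(z_sorted))
           .sum(dim=dim, keepdim=True) - 1) / k_z
    return torch.clamp(z - tau, min=0.0)
```

For example, sparsemax of the logits [2.0, 0.0] is exactly [1.0, 0.0], whereas softmax would still leave residual mass on the second class.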
[224] Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption
Johann-Friedrich Feiden, Tim Küchler, Denis Zavadski, Bogdan Savchynskyy, Carsten Rother
Main category: cs.CV
TL;DR: Online Video Depth Anything (oVDA) enables real-time monocular depth estimation for videos by adapting LLM techniques like feature caching and frame masking, achieving high performance on edge devices.
Details
Motivation: Video Depth Anything (VDA) shows strong depth estimation performance but requires batch processing, making it unsuitable for online/real-time applications. This work aims to overcome this limitation.
Method: Uses techniques from Large Language Models: caching latent features during inference and masking frames during training to enable online processing.
Result: Outperforms all competing online video depth estimation methods in accuracy and VRAM usage. Runs at 42 FPS on NVIDIA A100 and 20 FPS on NVIDIA Jetson edge device.
Conclusion: oVDA successfully enables real-time video depth estimation with low VRAM usage, making it suitable for deployment on edge devices. Code and compilation scripts will be released for easy deployment.
Abstract: Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames during training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both the code and compilation scripts, making oVDA easy to deploy on low-power hardware.
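The LLM-style trick is to cache latent frame features at inference so each new frame attends to stored temporal context instead of re-encoding a whole batch. The sketch below illustrates the idea only; the class, method names, and shapes are assumptions rather than oVDA's actual interfaces.

```python
from collections import deque
import torch

class LatentFeatureCache:
    """KV-cache-style buffer of per-frame latent features (illustrative)."""

    def __init__(self, max_frames=32):
        # Oldest features are evicted first, bounding VRAM usage.
        self.buffer = deque(maxlen=max_frames)

    def step(self, encoder, temporal_head, frame):
        feat = encoder(frame)                # encode only the newest frame: (B, D)
        self.buffer.append(feat.detach())    # cache it for subsequent frames
        context = torch.stack(tuple(self.buffer), dim=1)  # (B, T_cached, D)
        return temporal_head(feat, context)  # depth prediction for this frame
```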
[225] Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study
Sungwoo Kang
Main category: cs.CV
TL;DR: This paper provides the first comprehensive baseline study comparing 7 deep learning approaches for cricket shot classification, revealing significant gaps between reported literature results and practical implementations, with modern EfficientNet-GRU achieving 92.25% accuracy.
Details
Motivation: Cricket shot classification from video sequences is challenging and requires effective spatial-temporal modeling. There's a need for standardized evaluation to address discrepancies between reported results and practical implementations in academic literature.
Method: Systematic evaluation of 7 deep learning approaches across 4 paradigms: CNN-LSTM, attention-based models, vision transformers, transfer learning, and EfficientNet-GRU combinations, implemented with PyTorch Lightning following MLOps practices.
Result: Re-implementations of previous methods showed much lower accuracies (46.0%, 55.6%, 57.7%) compared to literature claims (96%, 99.2%, 93%). The modern EfficientNet-GRU approach achieved 92.25% accuracy, demonstrating substantial improvements with modern architectures.
Conclusion: The study highlights critical importance of standardized evaluation protocols in sports video analysis research and shows that modern architectures with systematic optimization can achieve significant performance improvements.
Abstract: Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96% (Balaji LRCN), 99.2% (IJERCSE), and 93% (Sensors), our standardized re-implementations achieve 46.0%, 55.6%, and 57.7% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.
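The best-performing configuration pairs per-frame EfficientNet-B0 features with a GRU over time. A compact PyTorch sketch of that pipeline is shown below; the hidden size, the use of timm, and the pooling choices are our assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn
import timm

class EffNetGRUClassifier(nn.Module):
    """Per-frame CNN features summarised by a GRU for clip-level classification."""

    def __init__(self, num_classes=10, hidden=256):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model("efficientnet_b0",
                                          pretrained=True, num_classes=0)
        self.gru = nn.GRU(self.backbone.num_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                       # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))  # (B*T, F) frame features
        feats = feats.view(b, t, -1)
        _, h_n = self.gru(feats)                    # h_n: (1, B, hidden)
        return self.head(h_n.squeeze(0))            # clip-level logits
```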
[226] Towards Safer and Understandable Driver Intention Prediction
Mukilan Karuppasamy, Shankar Gangisetty, Shyam Nandan Rai, Carlo Masone, C V Jawahar
Main category: cs.CV
TL;DR: The paper introduces DAAD-X dataset and VCBM framework for interpretable driver intent prediction in autonomous driving systems, showing transformers outperform CNNs in interpretability.
Details
Motivation: As autonomous driving systems interact more with humans, interpretability of decision-making becomes crucial for safety. Current deep learning systems lack understanding of underlying representations in driving tasks.
Method: Created DAAD-X dataset with multimodal ego-centric videos and hierarchical textual explanations. Proposed Video Concept Bottleneck Model (VCBM) that generates spatio-temporally coherent explanations without post-hoc techniques.
Result: Transformer-based models demonstrated greater interpretability than CNN-based models. Introduced multilabel t-SNE visualization to show disentanglement and causal correlation among explanations.
Conclusion: The VCBM framework and DAAD-X dataset enable interpretable driver intent prediction, which is critical for safe human-machine interaction in autonomous driving systems.
Abstract: Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, mainly due to recent advances in deep learning and AI. As interactions between autonomous systems and humans increase, the interpretability of decision-making processes in driving systems becomes increasingly crucial for ensuring safe driving operations. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before maneuvers occur, for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset to provide hierarchical, high-level textual explanations as causal reasoning for the driver’s decisions. These explanations are derived from both the driver’s eye-gaze and the ego-vehicle’s perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability than conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. Our data, code and models are available at: https://mukil07.github.io/VCBM.github.io/
[227] Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition
Huimin Liu, Jing Gao, Daria Baran, AxelX Montout, Neill W Campbell, Andrew W Dowsey
Main category: cs.CV
TL;DR: Cattle-CLIP is a multimodal deep learning framework adapted from CLIP for cattle behavior recognition, achieving 96.1% accuracy across six behaviors and showing strong few-shot learning capabilities.
Details
Motivation: Cattle behavior is crucial for monitoring animal health and well-being. Video-based monitoring with deep learning can provide accurate behavior recognition, but faces challenges with domain gaps between web data and real-world surveillance footage, and data scarcity for certain behaviors.
Method: Adapted CLIP model by adding temporal integration module, used tailored data augmentation strategies and specialized text prompts to address domain gap, evaluated under fully-supervised and few-shot learning scenarios using the CattleBehaviours6 dataset.
Result: Achieved 96.1% overall accuracy across six behaviors in supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviors. Demonstrated robust generalization with limited data in few-shot scenarios.
Conclusion: Cattle-CLIP highlights the potential of multimodal learning in agricultural and animal behavior analysis, particularly for data-scarce behavior recognition tasks in livestock monitoring.
Abstract: Cattle behaviour is a crucial indicator of an individual animal's health, productivity and overall well-being. Video-based monitoring, combined with deep learning techniques, has become a mainstream approach in animal biometrics, and it can offer high accuracy in some behaviour recognition tasks. We present Cattle-CLIP, a multimodal deep learning framework for cattle behaviour recognition, using semantic cues to improve the performance of video-based visual feature recognition. It is adapted from the large-scale image-language model CLIP by adding a temporal integration module. To address the domain gap between web data used for the pre-trained model and real-world cattle surveillance footage, we introduce tailored data augmentation strategies and specialised text prompts. Cattle-CLIP is evaluated under both fully-supervised and few-shot learning scenarios, with a particular focus on data-scarce behaviour recognition - an important yet under-explored goal in livestock monitoring. To evaluate the proposed method, we release the CattleBehaviours6 dataset, which comprises six types of indoor behaviours: feeding, drinking, standing-self-grooming, standing-ruminating, lying-self-grooming and lying-ruminating. The dataset consists of 1905 clips collected from our John Oldacre Centre dairy farm research platform housing 200 Holstein-Friesian cows. Experiments show that Cattle-CLIP achieves 96.1% overall accuracy across six behaviours in a supervised setting, with nearly 100% recall for feeding, drinking and standing-ruminating behaviours, and demonstrates robust generalisation with limited data in few-shot scenarios, highlighting the potential of multimodal learning in agricultural and animal behaviour analysis.
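The temporal integration module added on top of CLIP can take many forms; one plausible reading, self-attention pooling over per-frame CLIP embeddings followed by cosine matching against behaviour prompts, is sketched below. The module design and dimensions are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TemporalIntegration(nn.Module):
    """Fuse per-frame CLIP embeddings over time, then score against text
    embeddings of behaviour prompts (e.g., "a cow is drinking")."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_embs, text_embs):
        # frame_embs: (B, T, D) from the frozen CLIP image encoder
        # text_embs:  (K, D), one embedding per behaviour-class prompt
        fused, _ = self.temporal_attn(frame_embs, frame_embs, frame_embs)
        video_emb = fused.mean(dim=1)                          # (B, D)
        video_emb = nn.functional.normalize(video_emb, dim=-1)
        text_embs = nn.functional.normalize(text_embs, dim=-1)
        return video_emb @ text_embs.t()                       # (B, K) logits
```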
[228] Stable Video Infinity: Infinite-Length Video Generation with Error Recycling
Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, Alexandre Alahi
Main category: cs.CV
TL;DR: SVI enables infinite-length video generation with high temporal consistency and controllable storylines by using Error-Recycling Fine-Tuning, which trains the model to correct its own errors through closed-loop recycling.
Details
Motivation: Existing long-video methods suffer from accumulated errors and produce homogeneous scenes due to the discrepancy between training on clean data and testing on self-generated outputs.
Method: Error-Recycling Fine-Tuning that injects historical errors into clean inputs, approximates predictions with one-step bidirectional integration, calculates errors with residuals, and banks errors in replay memory for resampling.
Result: SVI scales videos from seconds to infinite durations without additional inference cost and achieves state-of-the-art performance on three benchmarks in consistent, creative, and conditional settings.
Conclusion: SVI successfully bridges the hypothesis gap between training and testing, enabling infinite video generation with high consistency and versatility across diverse conditions.
Abstract: We propose Stable Video Infinity (SVI), which is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)’s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
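Step (iii), banking errors into replay memory across discretised timesteps and resampling them into new inputs, is the easiest piece to sketch. The class below is an illustrative approximation; the capacities, eviction policy, and additive injection rule are all assumptions.

```python
import random
import torch

class ErrorReplayMemory:
    """Bank residual errors per discretised timestep; resample them later to
    perturb clean inputs, simulating error-accumulated trajectories."""

    def __init__(self, num_bins=32, capacity=256):
        self.bins = {b: [] for b in range(num_bins)}
        self.capacity = capacity

    def bank(self, t_bin, error):
        bucket = self.bins[t_bin]
        if len(bucket) >= self.capacity:
            bucket.pop(random.randrange(len(bucket)))  # evict to bound memory
        bucket.append(error.detach().cpu())

    def inject(self, clean, t_bin, scale=1.0):
        """Intervene on a clean input with a banked historical error."""
        bucket = self.bins[t_bin]
        if not bucket:
            return clean
        err = random.choice(bucket).to(clean.device)
        return clean + scale * err
```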
[229] Tag-Enriched Multi-Attention with Large Language Models for Cross-Domain Sequential Recommendation
Wangyu Wu, Xuhang Chen, Zhenhong Chen, Jing-En Jiang, Kim-Fung Tsang, Xiaowei Huang, Fei Ma, Jimin Xiao
Main category: cs.CV
TL;DR: TEMA-LLM is a cross-domain sequential recommendation framework that uses LLMs for semantic tag generation and multi-attention mechanisms to capture domain-specific and cross-domain user preferences.
Details
Motivation: To address the challenge of accurately capturing both domain-specific and cross-domain behavioral patterns in modern e-commerce platforms where users interact with diverse services, enabling personalized and seamless consumer experiences.
Method: Uses LLMs to generate descriptive tags from item titles/descriptions with domain-aware prompts, fuses tag embeddings with item identifiers and textual/visual features, and employs a Tag-Enriched Multi-Attention mechanism to model user preferences within and across domains.
Result: Extensive experiments on four large-scale e-commerce datasets show TEMA-LLM consistently outperforms state-of-the-art baselines, demonstrating the benefits of LLM-based semantic tagging and multi-attention integration.
Conclusion: The approach highlights the potential of LLMs to advance intelligent, user-centric services in consumer electronics and e-commerce recommendation systems.
Abstract: Cross-Domain Sequential Recommendation (CDSR) plays a crucial role in modern consumer electronics and e-commerce platforms, where users interact with diverse services such as books, movies, and online retail products. These systems must accurately capture both domain-specific and cross-domain behavioral patterns to provide personalized and seamless consumer experiences. To address this challenge, we propose TEMA-LLM (Tag-Enriched Multi-Attention with Large Language Models), a practical and effective framework that integrates Large Language Models (LLMs) for semantic tag generation and enrichment. Specifically, TEMA-LLM employs LLMs to assign domain-aware prompts and generate descriptive tags from item titles and descriptions. The resulting tag embeddings are fused with item identifiers as well as textual and visual features to construct enhanced item representations. A Tag-Enriched Multi-Attention mechanism is then introduced to jointly model user preferences within and across domains, enabling the system to capture complex and evolving consumer interests. Extensive experiments on four large-scale e-commerce datasets demonstrate that TEMA-LLM consistently outperforms state-of-the-art baselines, underscoring the benefits of LLM-based semantic tagging and multi-attention integration for consumer-facing recommendation systems. The proposed approach highlights the potential of LLMs to advance intelligent, user-centric services in the field of consumer electronics.
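One plausible form of the tag-enriched fusion, with LLM-generated tag embeddings attending over ID, text, and image features before a residual merge, is sketched below. The attention layout and dimensions are assumptions for illustration only, not TEMA-LLM's published design.

```python
import torch
import torch.nn as nn

class TagEnrichedItemEncoder(nn.Module):
    """Fuse LLM-generated tag embeddings with ID, text, and image features
    via attention, letting tags emphasise the most relevant modality."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tag_emb, id_emb, text_emb, img_emb):
        # Each input: (B, D). Stack the non-tag features as attention memory.
        memory = torch.stack([id_emb, text_emb, img_emb], dim=1)  # (B, 3, D)
        query = tag_emb.unsqueeze(1)                              # (B, 1, D)
        fused, _ = self.attn(query, memory, memory)
        return fused.squeeze(1) + tag_emb                         # residual fusion
```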
[230] Clear Roads, Clear Vision: Advancements in Multi-Weather Restoration for Smart Transportation
Vijay M. Galshetwar, Praful Hambarde, Prashant W. Patil, Akshay Dudhane, Sachin Chaudhary, Santosh Kumar Vipparathi, Subrahmanyam Murala
Main category: cs.CV
TL;DR: A comprehensive survey of image and video restoration techniques for adverse weather conditions in intelligent transportation systems, covering traditional and modern approaches including CNNs, transformers, diffusion models, and vision-language models.
Details
Motivation: Adverse weather conditions degrade image/video quality, posing serious challenges to intelligent transportation systems that rely on visual input for applications like autonomous driving and traffic monitoring.
Method: Categorizes restoration approaches into traditional prior-based methods and modern data-driven models (CNNs, transformers, diffusion models, VLMs), and classifies strategies by scope: single-task, multi-task/multi-weather, and all-in-one frameworks.
Result: Provides comprehensive review covering restoration techniques, day/night challenges, benchmark datasets, evaluation protocols, and maintains updated repository of latest research.
Conclusion: Identifies limitations in current research and outlines future directions including mixed-degradation restoration, real-time deployment, and agentic AI frameworks to advance weather-resilient vision systems in smart transportation.
Abstract: Adverse weather conditions such as haze, rain, and snow significantly degrade the quality of images and videos, posing serious challenges to intelligent transportation systems (ITS) that rely on visual input. These degradations affect critical applications including autonomous driving, traffic monitoring, and surveillance. This survey presents a comprehensive review of image and video restoration techniques developed to mitigate weather-induced visual impairments. We categorize existing approaches into traditional prior-based methods and modern data-driven models, including CNNs, transformers, diffusion models, and emerging vision-language models (VLMs). Restoration strategies are further classified based on their scope: single-task models, multi-task/multi-weather systems, and all-in-one frameworks capable of handling diverse degradations. In addition, we discuss day and night time restoration challenges, benchmark datasets, and evaluation protocols. The survey concludes with an in-depth discussion on limitations in current research and outlines future directions such as mixed/compound-degradation restoration, real-time deployment, and agentic AI frameworks. This work aims to serve as a valuable reference for advancing weather-resilient vision systems in smart transportation environments. Lastly, to stay current with rapid advancements in this field, we will maintain regular updates of the latest relevant papers and their open-source implementations at https://github.com/ChaudharyUPES/A-comprehensive-review-on-Multi-weather-restoration
[231] Diagnosing Shoulder Disorders Using Multimodal Large Language Models and Consumer-Grade Cameras
Jindong Hong, Wencheng Zhang, Shiqin Qiao, Jianhai Chen, Jianing Qiu, Chuanyang Zheng, Qian Xu, Yun Ji, Qianyue Wen, Weiwei Sun, Hao Li, Huizhen Li, Huichao Wang, Kai Wu, Meng Li, Yijun He, Lingjie Luo, Jiankai Sun
Main category: cs.CV
TL;DR: This paper proposes HMVDx, a Hybrid Motion Video Diagnosis framework using Multimodal Large Language Models for low-cost shoulder disorder diagnosis from consumer-grade videos, achieving 79.6% accuracy improvement.
Details
Motivation: Address the challenge of early and accurate diagnosis of shoulder disorders in resource-scarce regions by developing low-cost, scalable diagnostic solutions using consumer-grade videos.
Method: HMVDx framework divides action understanding and disease diagnosis tasks between two MLLMs, and introduces a Usability Index metric based on medical decision-making logic.
Result: HMVDx achieved 79.6% higher accuracy in diagnosing shoulder joint injuries compared to direct video diagnosis methods.
Conclusion: The framework demonstrates significant potential for low-cost MLLM applications in medical video understanding and provides a valuable approach for medical practitioners in resource-limited settings.
Abstract: Shoulder disorders, such as frozen shoulder (a.k.a. adhesive capsulitis), are common conditions affecting the health of people worldwide, and have a high incidence rate among the elderly and workers engaged in repetitive shoulder tasks. In regions with scarce medical resources, achieving early and accurate diagnosis poses significant challenges, and there is an urgent need for low-cost and easily scalable auxiliary diagnostic solutions. This research introduces videos captured by consumer-grade devices as the basis for diagnosis, reducing the cost for users. We focus on the innovative application of Multimodal Large Language Models (MLLMs) in the preliminary diagnosis of shoulder disorders and propose a Hybrid Motion Video Diagnosis framework (HMVDx). This framework divides the two tasks of action understanding and disease diagnosis, which are respectively completed by two MLLMs. In addition to traditional evaluation indicators, this work proposes a novel metric, the Usability Index, based on the logical process of medical decision-making (action recognition, movement diagnosis, and final diagnosis). This index evaluates the effectiveness of MLLMs in the medical field from the perspective of the entire medical diagnostic pathway, revealing the potential value of low-cost MLLMs in medical applications for medical practitioners. In experimental comparisons, the accuracy of HMVDx in diagnosing shoulder joint injuries is 79.6% higher than that of direct video diagnosis, a significant technical contribution to future research on the application of MLLMs for video understanding in the medical field.
[232] Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Main category: cs.CV
TL;DR: Using discrete semantic entropy (DSE) to filter out questions likely to cause hallucinations significantly improves accuracy of vision-language models in radiologic VQA tasks.
Details
Motivation: To improve the accuracy of black-box vision-language models in radiologic image-based visual question answering by detecting and filtering out questions that are likely to generate hallucinations.
Method: Evaluated DSE using two datasets: VQA-Med 2019 benchmark and a diagnostic radiology dataset. GPT-4o and GPT-4.1 answered each question 15 times with temperature 1.0. Baseline accuracy used low-temperature answers (0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from semantic cluster frequencies. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3.
Result: Baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering high-entropy questions (DSE > 0.3), accuracy improved to 76.3% for GPT-4o (retained 334/706 questions) and 63.8% for GPT-4.1 (retained 499/706 questions), both statistically significant (p < .001).
Conclusion: DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency, significantly improves diagnostic answer accuracy, and offers a practical filtering strategy for clinical VLM applications.
Abstract: To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
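The DSE computation itself is lightweight: sample several answers, group meaning-equivalent ones (the paper uses bidirectional entailment), and take the entropy of the cluster frequencies. A minimal sketch follows; the natural-log base and the helper names are our assumptions.

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    """Entropy over the relative frequencies of semantic clusters.

    cluster_ids: one cluster label per sampled answer (e.g., 15 samples),
    where meaning-equivalent answers share a label.
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    # Natural log assumed; the summary does not state the base.
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def keep_low_entropy_questions(questions, sampled_clusters, threshold=0.3):
    """Filter out questions whose sampled answers are semantically inconsistent."""
    return [q for q, clusters in zip(questions, sampled_clusters)
            if discrete_semantic_entropy(clusters) <= threshold]
```

With all 15 samples in one cluster the entropy is 0, while an even split across clusters pushes it toward the log of the cluster count; the DSE > 0.3 and > 0.6 cutoffs threshold this quantity.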
[233] MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
Main category: cs.CV
TL;DR: A unified framework for Referring Video Object Segmentation that jointly optimizes temporal sentence grounding and segmentation, using a novel FIND token for key moment identification and moment-centric sampling strategy.
Details
Motivation: Existing RefVOS approaches either use handcrafted heuristics that miss temporal cues or rely on external keyframe models that increase system complexity.
Method: Proposes a unified framework with: 1) TSG using FIND token for key moment identification via temporal token similarity matching, 2) Moment-Centric Sampling strategy for dense sampling of informative moments, 3) Bidirectional Anchor-updated Propagation for stable tracking.
Result: The method avoids external timestamp encodings and external keyframe models while preserving motion details and global context with improved tracking stability.
Conclusion: The proposed framework provides an integrated solution for RefVOS that naturally incorporates key moment grounding capability without increasing system complexity.
Abstract: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
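Two pieces are simple enough to sketch: grounding the key moment by similarity between the [FIND] token and per-frame temporal tokens, and Moment-Centric Sampling that is dense near the grounded index and sparse elsewhere. The window sizes and sample counts below are assumptions.

```python
import torch

def locate_key_moment(find_token, frame_tokens):
    """Key-moment grounding via temporal token similarity.

    find_token:   (D,)  embedding of the dedicated [FIND] token
    frame_tokens: (T, D) one temporal token per sampled frame
    Returns the index of the frame most similar to [FIND].
    """
    sims = torch.nn.functional.cosine_similarity(
        frame_tokens, find_token.unsqueeze(0), dim=-1)   # (T,)
    return int(sims.argmax())

def moment_centric_sampling(num_frames, key_idx, dense=8, sparse=4, radius=12):
    """Densely sample around the grounded moment, sparsely elsewhere."""
    lo, hi = max(0, key_idx - radius), min(num_frames - 1, key_idx + radius)
    dense_ids = torch.linspace(lo, hi, dense).round().long().tolist()
    sparse_ids = torch.linspace(0, num_frames - 1, sparse).round().long().tolist()
    return sorted(set(dense_ids + sparse_ids))
```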
[234] Spotlight on Token Perception for Multimodal Reinforcement Learning
Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
Main category: cs.CV
TL;DR: VPPO introduces token perception analysis in multimodal RLVR, revealing sparse visual dependency in CoT processes and proposing a policy optimization method that reweights advantages by visual dependency and focuses on perceptually pivotal tokens.
Details
Motivation: Existing multimodal RLVR methods neglect the role of visual perception in optimization, particularly at the token level, despite its importance for visually-grounded reasoning.
Method: VPPO uses token perception analysis to identify sparse visual dependency patterns, then applies dual mechanisms: trajectory advantage reweighting by overall visual dependency and focused policy updates on perceptually pivotal tokens.
Result: VPPO achieves substantial gains over leading open-source RL-tuned models across eight perception and reasoning benchmarks, with consistent effectiveness at 7B and 32B model scales.
Conclusion: The work establishes a token-level perceptual perspective for analyzing multimodal RLVR and presents an effective optimization strategy that significantly enhances LVLMs’ multimodal reasoning capabilities.
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory’s advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
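Both mechanisms slot naturally into a clipped policy-gradient objective: scale the trajectory advantage by mean visual dependency, and update only the most visually dependent tokens. The sketch below reflects that reading; the top-fraction rule and the constants are assumptions, and how token perception itself is measured is left abstract.

```python
import torch

def vppo_loss(logp_new, logp_old, token_vis_dep, advantage,
              top_frac=0.2, clip_eps=0.2):
    """PPO-style sketch of VPPO's two mechanisms (names and constants assumed).

    logp_new, logp_old: per-token log-probs of a sampled trajectory, shape (T,)
    token_vis_dep:      per-token visual-dependency scores, shape (T,)
    advantage:          scalar trajectory advantage (e.g., from a GRPO group)
    """
    # 1) Reweight the trajectory advantage by its overall visual dependency.
    adv = advantage * token_vis_dep.mean()
    # 2) Restrict policy updates to the perceptually pivotal tokens.
    k = max(1, int(top_frac * logp_new.numel()))
    pivotal = torch.zeros_like(token_vis_dep, dtype=torch.bool)
    pivotal[token_vis_dep.topk(k).indices] = True
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)
    return per_token[pivotal].mean()
```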
[235] CapGeo: A Caption-Assisted Approach to Geometric Reasoning
Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang
Main category: cs.CV
TL;DR: CapGeo introduces a caption-assisted reasoning framework that converts geometric diagrams into textual descriptions to improve MLLMs’ geometric reasoning, achieving substantial performance gains from 8.6% to 59.0% and 44.8% to 73.0% on different models.
Details
Motivation: Geometric reasoning remains a core challenge for MLLMs, with advanced systems struggling despite strong textual reasoning abilities. The bottleneck appears to be understanding geometric diagrams rather than reasoning itself, suggesting that converting visual content to text could bridge this gap.
Method: Introduces CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities by converting geometric figures into concise textual descriptions. Also proposes CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs with a keypoint-based evaluation metric.
Result: Substantial improvements when models use captions: Qwen2.5-VL-72B improved from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rose from 44.8% to 73.0%. The keypoint-based metric correlates strongly with downstream CapGeo performance.
Conclusion: The framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs by bridging visual and textual modalities through geometric captioning.
Abstract: Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
[236] RadioFlow: Efficient Radio Map Construction Framework with Flow Matching
Haozhe Jia, Wenshuo Chen, Xiucheng Wang, Nan Cheng, Hongbo Zhang, Kuimou Yu, Songning Lai, Nanjian Jia, Bowen Tian, Hongru Xiao, Yutao Yue
Main category: cs.CV
TL;DR: RadioFlow is a flow-matching-based generative framework that enables efficient single-step radio map generation with significantly reduced model size and faster inference compared to diffusion models.
Details
Motivation: Current diffusion-based approaches for radio map generation suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment in next-generation wireless systems.
Method: Proposes RadioFlow, a flow-matching-based generative framework that learns continuous transport trajectories between noise and data, enabling single-step efficient sampling instead of iterative denoising.
Result: RadioFlow achieves state-of-the-art performance with up to 8× fewer parameters and over 4× faster inference compared to the leading diffusion-based baseline (RadioDiff), while preserving reconstruction accuracy.
Conclusion: RadioFlow provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks.
Abstract: Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose RadioFlow, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with up to 8× fewer parameters and over 4× faster inference compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at https://github.com/Hxxxz0/RadioFlow.
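The efficiency claim rests on flow matching's straight transport paths: train a velocity field on linear noise-to-data interpolants, then integrate it in one Euler step at inference. Below is a minimal sketch of that recipe; the model signature and conditioning input are assumptions, not RadioFlow's actual interface.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """One training step of flow matching with linear interpolation paths.

    x1:   clean radio maps, shape (B, C, H, W)
    cond: conditioning input, e.g., a building/transmitter layout (assumed)
    """
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                     # point on the straight path
    v_pred = model(xt, t.flatten(), cond)          # predicted velocity field
    return F.mse_loss(v_pred, x1 - x0)             # target velocity is constant

@torch.no_grad()
def sample_single_step(model, cond, shape, device="cpu"):
    """Single-step Euler integration of the learned velocity from t=0 to t=1."""
    x0 = torch.randn(shape, device=device)
    t0 = torch.zeros(shape[0], device=device)
    return x0 + model(x0, t0, cond)                # x1 ≈ x0 + v(x0, 0)
```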
[237] Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation
Wenyao Zhang, Hongsi Liu, Bohan Li, Jiawei He, Zekun Qi, Yunnan Wang, Shengyang Zhao, Xinqiang Yu, Wenjun Zeng, Xin Jin
Main category: cs.CV
TL;DR: Hybrid-depth is a novel self-supervised monocular depth estimation framework that integrates CLIP and DINO foundation models to extract semantic-spatial knowledge through contrastive language guidance, achieving state-of-the-art performance on KITTI benchmark.
Details
Motivation: Current self-supervised monocular depth estimation methods suffer from insufficient semantic-spatial knowledge extraction, limiting their performance.
Method: A coarse-to-fine progressive learning framework: 1) Aggregates multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance with proxy tasks for depth-aware feature alignment; 2) Integrates camera pose information and pixel-wise language alignment to refine depth predictions, working as plug-and-play encoder for existing MDE pipelines.
Result: Extensive experiments on KITTI benchmark demonstrate significant outperformance over state-of-the-art methods across all metrics, with benefits for downstream tasks like BEV perception.
Conclusion: The integration of CLIP’s semantic context and DINO’s spatial details through language guidance effectively addresses feature granularity mismatches and enhances continuous depth estimation.
Abstract: Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP’s semantic context and DINO’s spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also benefits downstream tasks like BEV perception. Code is available at https://github.com/Zhangwenyao1/Hybrid-depth.
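The coarse aggregation of CLIP's global semantics with DINO's local detail can be approximated by broadcasting the global embedding over the patch grid and projecting the concatenation, as in the sketch below; the dimensions and the 1x1-convolution fusion are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridFeatureAggregator(nn.Module):
    """Broadcast CLIP's global embedding over DINO's patch grid, so each
    location carries global semantics plus local spatial detail."""

    def __init__(self, clip_dim=512, dino_dim=768, out_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(clip_dim + dino_dim, out_dim, kernel_size=1)

    def forward(self, clip_global, dino_patches):
        # clip_global: (B, clip_dim); dino_patches: (B, dino_dim, H, W)
        b, _, h, w = dino_patches.shape
        g = clip_global[:, :, None, None].expand(b, -1, h, w)
        return self.proj(torch.cat([g, dino_patches], dim=1))
```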
[238] Instance-Aware Robust Consistency Regularization for Semi-Supervised Nuclei Instance Segmentation
Zenan Lin, Wei Li, Jintao Chen, Zihao Wu, Wenxiong Kang, Changxin Gao, Liansheng Wang, Jin-Gang Yu
Main category: cs.CV
TL;DR: IRCR-Net is a semi-supervised method for nuclei instance segmentation that uses instance-aware consistency regularization and morphological priors to handle dense and overlapping nuclei while reducing pseudo-label noise.
Details
Motivation: Fully supervised methods are limited by expensive annotation costs, while existing semi-supervised methods fail to regularize instance-level consistency, lack pathological structure priors, and introduce noisy pseudo-labels during training.
Method: Proposes Instance-Aware Robust Consistency Regularization Network (IRCR-Net) with Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms that incorporate morphological priors to assess pseudo-label quality and discard low-quality predictions.
Result: Significantly enhances semi-supervised nuclei instance segmentation performance across multiple public datasets, even surpassing fully supervised methods in some scenarios.
Conclusion: The proposed IRCR-Net effectively addresses limitations of existing semi-supervised methods by leveraging instance-aware consistency regularization and morphological priors, achieving robust performance in nuclei instance segmentation.
Abstract: Nuclei instance segmentation in pathological images is crucial for downstream tasks such as tumor microenvironment analysis. However, the high cost and scarcity of annotated data limit the applicability of fully supervised methods, while existing semi-supervised methods fail to adequately regularize consistency at the instance level, lack leverage of the inherent prior knowledge of pathological structures, and are prone to introducing noisy pseudo-labels during training. In this paper, we propose an Instance-Aware Robust Consistency Regularization Network (IRCR-Net) for accurate instance-level nuclei segmentation. Specifically, we introduce the Matching-Driven Instance-Aware Consistency (MIAC) and Prior-Driven Instance-Aware Consistency (PIAC) mechanisms to refine the nuclei instance segmentation result of the teacher and student subnetwork, particularly for densely distributed and overlapping nuclei. We incorporate morphological prior knowledge of nuclei in pathological images and utilize these priors to assess the quality of pseudo-labels generated from unlabeled data. Low-quality pseudo-labels are discarded, while high-quality predictions are enhanced to reduce pseudo-label noise and benefit the network’s robust training. Experimental results demonstrate that the proposed method significantly enhances semi-supervised nuclei instance segmentation performance across multiple public datasets compared to existing approaches, even surpassing fully supervised methods in some scenarios.
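The prior-driven filtering step can be approximated by checking each pseudo-labeled instance against simple morphological expectations for nuclei (roughly convex, size-bounded) and discarding violators. A simplified scikit-image sketch follows; all thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np
from skimage.measure import label, regionprops

def filter_pseudo_instances(pseudo_mask, min_area=30, max_area=2000,
                            min_solidity=0.85):
    """Discard pseudo-labeled instances that violate nuclei shape priors."""
    cleaned = np.zeros_like(pseudo_mask)
    for region in regionprops(label(pseudo_mask > 0)):
        # Assumed nuclei priors: bounded size and near-convex shape.
        if min_area <= region.area <= max_area and region.solidity >= min_solidity:
            cleaned[region.coords[:, 0], region.coords[:, 1]] = 1
    return cleaned
```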
[239] Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark
Jinyuan Liu, Zihang Chen, Zhu Liu, Zhiying Jiang, Long Ma, Xin Fan, Risheng Liu
Main category: cs.CV
TL;DR: A novel Progressive Prompt Fusion Network (PPFN) for thermal infrared image enhancement that handles coupled degradations through prompt-based feature modulation and selective progressive training.
Details
Motivation: Existing infrared enhancement methods focus on individual degradations and struggle with coupled degradations, while all-in-one RGB methods are ineffective due to different imaging models.
Method: PPFN establishes prompt pairs based on thermal imaging process, fuses them to modulate features for specific degradations, and uses Selective Progressive Training to refine handling of composite cases.
Result: Achieves 8.76% improvement on complex degradation scenes, removes camera noise while retaining structural details, and enhances overall contrast.
Conclusion: The proposed approach effectively handles both specific and complex degradations in thermal infrared images, with significant performance improvements validated through extensive experiments.
Abstract: We address the relatively underexplored task of thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In light of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model’s features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a Selective Progressive Training (SPT) mechanism is introduced to gradually refine the model’s handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhances the overall contrast of the thermal image. Furthermore, we introduce a high-quality, multi-scenario infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76% improvement. Code is available at https://github.com/Zihang-Chen/HM-TIR.
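Prompt-based modulation of restoration features can be illustrated with a FiLM-style block: degradation scores fuse learnable prompts into per-channel scale and shift parameters. This is one interpretation of the summary, not PPFN's actual design; in particular, the paper's prompt pairs are simplified here to single prompt vectors.

```python
import torch
import torch.nn as nn

class PromptFusionBlock(nn.Module):
    """FiLM-style prompt modulation (a sketch; prompt pairs simplified
    to single learnable prompt vectors)."""

    def __init__(self, channels, prompt_dim=64, num_degradations=4):
        super().__init__()
        # One learnable prompt per degradation type (noise, blur, ...).
        self.prompts = nn.Parameter(torch.randn(num_degradations, prompt_dim))
        self.to_film = nn.Linear(prompt_dim, 2 * channels)

    def forward(self, feat, weights):
        # feat: (B, C, H, W); weights: (B, num_degradations) degradation scores
        fused = weights @ self.prompts                      # (B, prompt_dim)
        gamma, beta = self.to_film(fused).chunk(2, dim=-1)  # (B, C) each
        return feat * (1 + gamma[..., None, None]) + beta[..., None, None]
```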
[240] Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Qihang Ma, Shengyu Li, Jie Tang, Dingkang Yang, Shaodong Chen, Yingyi Zhang, Chao Feng, Jiao Ran
Main category: cs.CV
TL;DR: The paper proposes using vision-language models for multi-modal keyphrase prediction, addressing limitations of traditional methods through zero-shot/SFT baselines, Fine-tune-CoT for reasoning enhancement, and dynamic CoT to prevent overthinking.
Details
Motivation: Traditional multi-modal approaches have limitations in handling absence and unseen scenarios, and existing benchmarks overestimate model capability due to training-test overlap.
Method: Three strategies: 1) zero-shot and supervised fine-tuning as baselines, 2) Fine-tune-CoT using teacher-generated reasoning data, 3) dynamic CoT that adaptively injects CoT data during training for flexible inference.
Result: Experimental results on various datasets demonstrate the effectiveness of the proposed approaches.
Conclusion: The proposed vision-language model strategies effectively address multi-modal keyphrase prediction challenges, with dynamic CoT particularly improving reasoning flexibility.
Abstract: Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple modalities of input information to produce a set of conclusive phrases. Traditional multi-modal approaches have been proven to have significant limitations in handling challenging absence and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks that overestimate model capability due to significant overlap between training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. Firstly, we use two widely-used strategies, zero-shot and supervised fine-tuning (SFT), to assess the lower bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which leverages high-quality CoT reasoning data generated by a teacher model to finetune smaller models. Finally, to address the “overthinking” phenomenon, we propose a dynamic CoT strategy which adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities during the inference stage. We evaluate the proposed strategies on various datasets and the experimental results demonstrate the effectiveness of the proposed approaches. The code is available at https://github.com/bytedance/DynamicCoT.
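The dynamic CoT idea, adaptively mixing CoT-annotated samples into training batches, can be sketched as a simple data scheduler. The linear decay below is purely an assumed schedule to make the idea concrete; the paper's actual injection policy may differ.

```python
import random

def dynamic_cot_batch(plain_pool, cot_pool, step, total_steps, batch_size=16):
    """Assemble a batch whose CoT fraction is annealed over training
    (assumed linear decay), so the model can reason when useful without
    overthinking on every input."""
    cot_ratio = 0.8 * (1.0 - step / total_steps)   # assumed schedule
    n_cot = min(int(batch_size * cot_ratio), len(cot_pool))
    batch = (random.sample(cot_pool, n_cot)
             + random.sample(plain_pool, batch_size - n_cot))
    random.shuffle(batch)
    return batch
```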
[241] BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
Junyan Ye, Dongzhi Jiang, Jun He, Baichuan Zhou, Zilong Huang, Zhiyuan Yan, Hongsheng Li, Conghui He, Weijia Li
Main category: cs.CV
TL;DR: BLINK-Twice is a vision-centric reasoning benchmark that challenges MLLMs to reason from visual content alone, moving beyond shallow perception to require fine-grained observation and analytical reasoning.
Details
Motivation: Existing reasoning benchmarks primarily assess language-based reasoning and treat visual input as replaceable context, creating a gap for evaluating true vision-centric reasoning capabilities.
Method: The benchmark integrates three components: seven types of visual challenges, natural adversarial image pairs that enforce visual reliance, and annotated reasoning chains for fine-grained evaluation of the reasoning process.
Result: BLINK-Twice poses significant challenges to 20 leading MLLMs. While language-based reasoning strategies can improve performance, they often result in unstable reasoning. Repeated image observation improves performance, and active visual interaction highlights the need for new vision reasoning paradigms.
Conclusion: Current MLLMs struggle with true vision-centric reasoning, and the benchmark reveals the need for new reasoning paradigms that go beyond language-based approaches to properly handle visual analytical reasoning.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have made rapid progress, particularly in enhancing their reasoning capabilities. However, existing reasoning benchmarks still primarily assess language-based reasoning, often treating visual input as replaceable context. To address this gap, we introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks. Instead of relying on external knowledge, our tasks require models to reason from visual content alone, shifting the focus from language-based to image-grounded reasoning. Compared to prior perception benchmarks, it moves beyond shallow perception (“see”) and requires fine-grained observation and analytical reasoning (“observe”). BLINK-Twice integrates three core components: seven types of visual challenges for testing visual reasoning, natural adversarial image pairs that enforce reliance on visual content, and annotated reasoning chains for fine-grained evaluation of the reasoning process rather than final answers alone. We evaluate 20 leading MLLMs, including 12 foundation models and 8 reasoning-enhanced models. BLINK-Twice poses a significant challenge to current models. While existing reasoning strategies in the language space, such as chain-of-thought or self-criticism, can improve performance, they often result in unstable and redundant reasoning. We observe that repeated image observation improves performance across models, and active visual interaction, as demonstrated by models like o3, highlights the need for a new paradigm for vision reasoning. The dataset is publicly available at https://github.com/PicoTrex/BLINK-Twice
[242] Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
Yikang Zhang, Rui Fan
Main category: cs.CV
TL;DR: VAD-GS is a 3D Gaussian splatting framework that addresses geometry recovery in challenging urban scenes by identifying unreliable structures, selecting informative views, and reconstructing missing geometry through patch matching.
Details
Motivation: 3DGS depends heavily on initialized point cloud quality, which often degrades in unbounded, dynamic urban environments where the assumption of overlapping observation frustums is violated, leading to incomplete coverage, distortions, and artifacts.
Method: The method uses voxel-based visibility reasoning to identify unreliable structures, diversity-aware view selection for informative supporting views, and patch matching-based multi-view stereo reconstruction to recover missing structures and generate new Gaussian primitives.
Result: Extensive experiments on Waymo and nuScenes datasets show VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves reconstructed geometry quality for both static and dynamic objects.
Conclusion: VAD-GS effectively addresses the limitations of 3DGS in urban scenes by enabling geometry recovery in regions lacking initial points through reliable geometric priors.
Abstract: 3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via patch matching-based multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Source code will be released upon publication.
[243] Minkowski-MambaNet: A Point Cloud Framework with Selective State Space Models for Forest Biomass Quantification
Jinxiang Tu, Dayong Ren, Fei Shi, Zhenhong Jia, Yahong Ren, Jiwei Qin, Fang He
Main category: cs.CV
TL;DR: Minkowski-MambaNet is a novel deep learning framework that directly estimates forest volume and biomass from raw LiDAR data by integrating Mamba’s Selective State Space Model with Minkowski networks, achieving superior performance without requiring Digital Terrain Models.
Details
Motivation: Accurate forest biomass quantification is crucial for carbon cycle monitoring, but directly estimating woody volume and AGB from LiDAR point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish individual trees.
Method: Proposes Minkowski-MambaNet framework that integrates Mamba’s Selective State Space Model (SSM) into a Minkowski network to encode global context and long-range dependencies, with skip connections to enhance features and accelerate convergence.
Result: Evaluated on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates without requiring DTM and being robust to boundary artifacts.
Conclusion: This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories by enabling direct volume and AGB estimation from raw LiDAR data with improved accuracy and robustness.
Abstract: Accurate forest biomass quantification is vital for carbon cycle monitoring. While airborne LiDAR excels at capturing 3D forest structure, directly estimating woody volume and Aboveground Biomass (AGB) from point clouds is challenging due to difficulties in modeling long-range dependencies needed to distinguish trees. We propose Minkowski-MambaNet, a novel deep learning framework that directly estimates volume and AGB from raw LiDAR. Its key innovation is integrating the Mamba model’s Selective State Space Model (SSM) into a Minkowski network, enabling effective encoding of global context and long-range dependencies for improved tree differentiation. Skip connections are incorporated to enhance features and accelerate convergence. Evaluated on Danish National Forest Inventory LiDAR data, Minkowski-MambaNet significantly outperforms state-of-the-art methods, providing more accurate and robust estimates. Crucially, it requires no Digital Terrain Model (DTM) and is robust to boundary artifacts. This work offers a powerful tool for large-scale forest biomass analysis, advancing LiDAR-based forest inventories.
[244] Utilizing dynamic sparsity on pretrained DETR
Reza Sedghi, Anand Subramoney, David Kappel
Main category: cs.CV
TL;DR: The paper introduces two methods (SIBS and MGS) to exploit sparsity in DETR’s MLP layers for efficient inference without retraining, with MGS achieving 85-95% activation sparsity while maintaining performance.
Details
Motivation: Transformer-based models face efficiency challenges in vision tasks like object detection, and the inherent sparsity in DETR's MLP layers presents an opportunity for computational optimization.
Method: Two approaches: Static Indicator-Based Sparsification (SIBS) using fixed activation patterns, and Micro-Gated Sparsification (MGS) with a lightweight gating mechanism trained on pretrained DETR to predict dynamic sparsity.
Result: MGS achieves 85-95% activation sparsity on COCO dataset while maintaining or improving performance, significantly reducing computation compared to baseline DETR.
Conclusion: MGS provides a practical, input-adaptive sparsification approach that enables efficient deployment of pretrained vision transformers without requiring full model retraining.
Abstract: Efficient inference with transformer-based models remains a challenge, especially in vision tasks like object detection. We analyze the inherent sparsity in the MLP layers of DETR and introduce two methods to exploit it without retraining. First, we propose Static Indicator-Based Sparsification (SIBS), a heuristic method that predicts neuron inactivity based on fixed activation patterns. While simple, SIBS offers limited gains due to the input-dependent nature of sparsity. To address this, we introduce Micro-Gated Sparsification (MGS), a lightweight gating mechanism trained on top of a pretrained DETR. MGS predicts dynamic sparsity using a small linear layer and achieves 85 to 95% activation sparsity. Experiments on the COCO dataset show that MGS maintains or even improves performance while significantly reducing computation. Our method offers a practical, input-adaptive approach to sparsification, enabling efficient deployment of pretrained vision transformers without full model retraining.
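To make the gating mechanism concrete, here is a minimal PyTorch sketch of an MGS-style MLP; the class name, dimensions, and threshold are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MicroGatedMLP(nn.Module):
    """Sketch of Micro-Gated Sparsification: a small linear gate predicts
    which hidden neurons of a frozen DETR MLP will fire for the current
    input, so the inactive ones can be skipped at inference time."""

    def __init__(self, d_model=256, d_hidden=2048):
        super().__init__()
        # Frozen MLP from the pretrained detector (weights assumed given).
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        for p in self.parameters():
            p.requires_grad = False
        # Lightweight gate: the only trainable component.
        self.gate = nn.Linear(d_model, d_hidden)

    def forward(self, x, threshold=0.5):
        # Predict per-neuron activity from the input token itself.
        mask = (torch.sigmoid(self.gate(x)) > threshold).float()
        h = torch.relu(self.fc1(x)) * mask  # gated-off neurons contribute nothing
        return self.fc2(h)

tokens = torch.randn(4, 100, 256)  # (batch, queries, d_model)
mlp = MicroGatedMLP()
out = mlp(tokens)
sparsity = mlp.gate(tokens).sigmoid().le(0.5).float().mean()
print(out.shape, f"predicted sparsity ~{sparsity.item():.0%}")
```

Since the gate is the only trainable part, it could be fit with a cheap objective such as predicting the actual ReLU activity of each neuron, leaving the detector itself untouched.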
[245] Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians
Jin-Chuan Shi, Chengye Su, Jiajun Wang, Ariel Shamir, Miao Wang
Main category: cs.CV
TL;DR: Mono4DEditor is a framework for text-driven editing of 4D scenes from monocular videos, using language-embedded 3D Gaussians and diffusion-based editing with temporal coherence.
Details
Motivation: Editing 4D scenes from monocular videos with text prompts is challenging but valuable for content creation and virtual environments, requiring precise semantic edits while preserving unedited content.
Method: Augments 3D Gaussians with quantized CLIP features for semantic querying, uses two-stage point-level localization (CLIP similarity + spatial refinement), and applies diffusion-based video editing with flow and scribble guidance.
Result: Enables high-quality text-driven edits across diverse scenes and object types while preserving appearance/geometry of unedited areas, surpassing prior approaches in flexibility and visual fidelity.
Conclusion: Mono4DEditor provides an effective solution for flexible and accurate text-driven 4D scene editing with improved localization and temporal coherence.
Abstract: Editing 4D scenes reconstructed from monocular videos based on text prompts is a valuable yet challenging task with broad applications in content creation and virtual environments. The key difficulty lies in achieving semantically precise edits in localized regions of complex, dynamic scenes, while preserving the integrity of unedited content. To address this, we introduce Mono4DEditor, a novel framework for flexible and accurate text-driven 4D scene editing. Our method augments 3D Gaussians with quantized CLIP features to form a language-embedded dynamic representation, enabling efficient semantic querying of arbitrary spatial regions. We further propose a two-stage point-level localization strategy that first selects candidate Gaussians via CLIP similarity and then refines their spatial extent to improve accuracy. Finally, targeted edits are performed on localized regions using a diffusion-based video editing model, with flow and scribble guidance ensuring spatial fidelity and temporal coherence. Extensive experiments demonstrate that Mono4DEditor enables high-quality, text-driven edits across diverse scenes and object types, while preserving the appearance and geometry of unedited areas and surpassing prior approaches in both flexibility and visual fidelity.
[246] Dynamic Weight-based Temporal Aggregation for Low-light Video Enhancement
Ruirui Lin, Guoxi Huang, Nantheera Anantrasirichai
Main category: cs.CV
TL;DR: DWTA-Net is a two-stage framework for low-light video enhancement that jointly exploits short- and long-term temporal cues to effectively suppress noise and artifacts in real-world low-light videos.
Details
Motivation: Learning-based approaches for low-light video enhancement struggle with heavy noise in real scenes due to limitations in effectively leveraging temporal information.
Method: Two-stage framework: Stage I uses Visual State-Space blocks for multi-frame alignment to recover brightness, color, and structure. Stage II employs recurrent refinement with dynamic weight-based temporal aggregation guided by optical flow to balance static and dynamic regions, plus a texture-adaptive loss for detail preservation.
Result: Experiments on real-world low-light videos show DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared to state-of-the-art methods.
Conclusion: DWTA-Net successfully addresses noise and temporal information challenges in low-light video enhancement through its novel two-stage approach with dynamic temporal aggregation.
Abstract: Low-light video enhancement (LLVE) is challenging due to noise, low contrast, and color degradations. Learning-based approaches offer fast inference but still struggle with heavy noise in real low-light scenes, primarily due to limitations in effectively leveraging temporal information. In this paper, we address this issue with DWTA-Net, a novel two-stage framework that jointly exploits short- and long-term temporal cues. Stage I employs Visual State-Space blocks for multi-frame alignment, recovering brightness, color, and structure with local consistency. Stage II introduces a recurrent refinement module with dynamic weight-based temporal aggregation guided by optical flow, adaptively balancing static and dynamic regions. A texture-adaptive loss further preserves fine details while promoting smoothness in flat areas. Experiments on real-world low-light videos show that DWTA-Net effectively suppresses noise and artifacts, delivering superior visual quality compared with state-of-the-art methods.
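The aggregation step in Stage II can be pictured as a per-pixel blend whose weights come from motion magnitude; the PyTorch sketch below is a hedged illustration of that idea, with `temporal_aggregate` and its inputs as assumed names, not the paper's code.

```python
import torch

def temporal_aggregate(curr_feat, hist_feat, flow, alpha=4.0):
    """Sketch of dynamic weight-based temporal aggregation: pixels with
    small optical flow (static regions) lean on the long-term history,
    while fast-moving pixels trust the current frame.

    curr_feat, hist_feat: (B, C, H, W) features of the current frame and
    the recurrently aggregated history; flow: (B, 2, H, W) optical flow.
    """
    # Per-pixel motion magnitude, squashed to a (0, 1) weight.
    mag = flow.norm(dim=1, keepdim=True)             # (B, 1, H, W)
    w_dynamic = torch.sigmoid(alpha * (mag - mag.mean()))
    # Dynamic regions -> current frame; static regions -> history.
    return w_dynamic * curr_feat + (1 - w_dynamic) * hist_feat

curr = torch.randn(1, 64, 32, 32)
hist = torch.randn(1, 64, 32, 32)
flow = torch.randn(1, 2, 32, 32)
fused = temporal_aggregate(curr, hist, flow)
print(fused.shape)
```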
[247] SilvaScenes: Tree Segmentation and Species Classification from Under-Canopy Images in Natural Forests
David-Alexandre Duclos, William Guimont-Martin, Gabriel Jeanson, Arthur Larochelle-Tremblay, Théo Defosse, Frédéric Moore, Philippe Nolet, François Pomerleau, Philippe Giguère
Main category: cs.CV
TL;DR: SilvaScenes is a new dataset for tree instance segmentation and species classification from under-canopy images, collected across five bioclimatic domains in Quebec with 1476 trees from 24 species.
Details
Motivation: Existing datasets are inadequate for developing perception systems in complex forest environments due to focus on urban settings or limited species. Advanced perception is needed for precision forestry, biodiversity monitoring, and forestry equipment automation.
Method: Created SilvaScenes dataset with expert annotations from under-canopy images across five bioclimatic domains in Quebec, Canada. Benchmarked modern deep learning approaches for instance segmentation.
Result: Tree segmentation achieved 67.65% mAP, while species classification remained challenging with only 35.69% mAP, highlighting the difficulty of fine-grained species identification.
Conclusion: SilvaScenes addresses the gap in forest perception datasets and demonstrates that while tree detection is feasible, accurate species classification in natural forest environments remains a significant challenge.
Abstract: Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine-grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under-canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab-ulaval/SilvaScenes.
[248] D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Jisu Han, Wonjun Hwang
Main category: cs.CV
TL;DR: The paper proposes dimensional entropy maximization to address modality gap issues in contrastive Vision-Language Models during test-time prompt tuning, improving calibration error and model reliability.
Details
Motivation: Vision-Language Models face modality gaps caused by dominant feature dimensions across modalities during test-time adaptation, which degrades calibration performance and reliability in real-world deployment.
Method: Dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate dependency on dominant dimensions in contrastive VLMs.
Result: The method alleviates calibration performance degradation in test-time prompt tuning and enhances the reliability of VLMs in deployment scenarios.
Conclusion: The proposed dimensional entropy maximization offers a simple yet effective solution to improve VLM reliability by addressing modality gap issues through regularization of dominant feature dimensions.
Abstract: Test-time adaptation paradigm provides flexibility towards domain shifts by performing immediate adaptation on unlabeled target data from the source model. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can improve calibration error. Building on this insight, we propose dimensional entropy maximization that regularizes the distribution of textual features toward uniformity to mitigate the dependency of dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
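As a rough illustration, a dimensional entropy maximization term could normalize the per-dimension energy of the text features into a distribution and maximize its entropy; this is an assumed formulation for intuition, and the exact loss in the paper may differ.

```python
import torch

def dimensional_entropy_loss(text_feats, eps=1e-8):
    """Sketch: treat the average per-dimension energy of the text
    embeddings as a distribution and maximize its entropy, discouraging
    a single dominant feature dimension.

    text_feats: (num_classes, dim) prompt-ensemble text features.
    """
    energy = text_feats.pow(2).mean(dim=0)   # (dim,) per-dimension energy
    p = energy / (energy.sum() + eps)        # normalize into a distribution
    entropy = -(p * (p + eps).log()).sum()
    return -entropy  # minimizing this maximizes dimensional entropy

feats = torch.randn(10, 512, requires_grad=True)
loss = dimensional_entropy_loss(feats)
loss.backward()
print(loss.item())
```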
[249] Few-shot multi-token DreamBooth with LoRa for style-consistent character generation
Ruben Pascual, Mikel Sesma-Sara, Aranzazu Jurio, Daniel Paternain, Mikel Galar
Main category: cs.CV
TL;DR: The paper proposes a method for generating unlimited novel characters that preserve the artistic style of a small set of reference characters using DreamBooth fine-tuning with multi-token strategy and LoRA-based parameter-efficient fine-tuning.
Details
Motivation: To address the need for producing virtually unlimited novel characters that preserve artistic style and shared visual traits from small sets of human-designed reference characters, broadening creative possibilities in animation and gaming.
Method: Builds upon DreamBooth fine-tuning for text-to-image diffusion models, using multi-token strategy with clustering to assign separate tokens to individual characters and collective style, combined with LoRA-based parameter-efficient fine-tuning. Removes class-specific regularization and introduces random tokens/embeddings during generation.
Result: Evaluated on five small specialized datasets, the approach produces high-quality, diverse characters while preserving distinctive aesthetic features of reference characters. Human evaluation confirms effectiveness and highlights method’s potential.
Conclusion: The proposed method successfully enables unlimited character creation while preserving learned artistic style, demonstrating practical value for creative industries like animation and gaming.
Abstract: The audiovisual industry is undergoing a profound transformation as it is integrating AI developments not only to automate routine tasks but also to inspire new forms of art. This paper addresses the problem of producing a virtually unlimited number of novel characters that preserve the artistic style and shared visual traits of a small set of human-designed reference characters, thus broadening creative possibilities in animation, gaming, and related domains. Our solution builds upon DreamBooth, a well-established fine-tuning technique for text-to-image diffusion models, and adapts it to tackle two core challenges: capturing intricate character details beyond textual prompts and the few-shot nature of the training data. To achieve this, we propose a multi-token strategy, using clustering to assign separate tokens to individual characters and their collective style, combined with LoRA-based parameter-efficient fine-tuning. By removing the class-specific regularization set and introducing random tokens and embeddings during generation, our approach allows for unlimited character creation while preserving the learned style. We evaluate our method on five small specialized datasets, comparing it to relevant baselines using both quantitative metrics and a human evaluation study. Our results demonstrate that our approach produces high-quality, diverse characters while preserving the distinctive aesthetic features of the reference characters, with human evaluation further reinforcing its effectiveness and highlighting the potential of our method.
[250] A methodology for clinically driven interactive segmentation evaluation
Parhom Esmaeili, Virginia Fernandez, Pedro Borges, Eli Gibson, Sebastien Ourselin, M. Jorge Cardoso
Main category: cs.CV
TL;DR: The paper proposes a standardized evaluation framework for interactive medical image segmentation, revealing key insights about model performance across different tasks and conditions.
Details
Motivation: Inconsistent and clinically unrealistic evaluation methods hinder fair comparison of interactive segmentation algorithms and misrepresent their real-world performance in medical imaging.
Method: Developed a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardized evaluation pipelines to assess state-of-the-art algorithms.
Result: Key findings include: minimizing information loss in user interactions is critical for robustness; adaptive-zooming boosts robustness and convergence; performance drops when validation differs from training; 2D methods work well with slab-like images but 3D context helps with large/irregular targets; non-medical models degrade with poor contrast and complex shapes.
Conclusion: Standardized evaluation is essential for fair comparison of interactive segmentation methods, and the proposed framework reveals important performance characteristics that inform algorithm selection and development for medical imaging applications.
Abstract: Interactive segmentation is a promising strategy for building robust, generalisable algorithms for volumetric medical image segmentation. However, inconsistent and clinically unrealistic evaluation hinders fair comparison and misrepresents real-world performance. We propose a clinically grounded methodology for defining evaluation tasks and metrics, and built a software framework for constructing standardised evaluation pipelines. We evaluate state-of-the-art algorithms across heterogeneous and complex tasks and observe that (i) minimising information loss when processing user interactions is critical for model robustness, (ii) adaptive-zooming mechanisms boost robustness and speed convergence, (iii) performance drops if validation prompting behaviour/budgets differ from training, (iv) 2D methods perform well with slab-like images and coarse targets, but 3D context helps with large or irregularly shaped targets, (v) performance of non-medical-domain models (e.g. SAM2) degrades with poor contrast and complex shapes.
[251] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, Ying-Cong Chen
Main category: cs.CV
TL;DR: PhysToolBench is the first benchmark to evaluate Multimodal Large Language Models’ understanding of physical tools through a VQA dataset with three difficulty levels: tool recognition, understanding, and creation.
Details
Motivation: To quantify MLLMs' true understanding of physical tools, which is crucial for general-purpose intelligent agents to achieve true versatility in interacting with the physical world.
Method: Created a Visual Question Answering dataset with over 1,000 image-text pairs assessing capabilities across three levels: tool recognition (function), tool understanding (operation principles), and tool creation (making tools from surrounding objects).
Result: Evaluation of 32 MLLMs (proprietary, open-source, specialized embodied, and VLA backbones) revealed significant deficiency in tool understanding.
Conclusion: Current MLLMs have substantial gaps in physical tool comprehension, highlighting the need for improved tool understanding capabilities in multimodal AI systems.
Abstract: The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image-text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: Requiring the recognition of a tool’s primary function. (2) Tool Understanding: Testing the ability to grasp the underlying principles of a tool’s operation. (3) Tool Creation: Challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs, spanning proprietary, open-source, specialized embodied models, and VLA backbones, reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.
[252] Diagonal Artifacts in Samsung Images: PRNU Challenges and Solutions
David Vázquez-Padín, Fernando Pérez-González, Alejandro Martín-Del-Río
Main category: cs.CV
TL;DR: Samsung smartphone cameras exhibit diagonal artifacts that cause PRNU fingerprint collisions, but raw images from PRO mode bypass this issue. The artifacts can be used forensically to reduce HDR misdetections and localize synthetic bokeh regions.
Details
Motivation: To investigate diagonal artifacts in Samsung smartphone images and their impact on PRNU-based camera source verification, as these artifacts cause fingerprint collisions between different devices.
Method: Analyzed images from various Samsung Galaxy S and A series models, compared PRNU verification with and without artifacts, and examined raw images from PRO mode to bypass artifact-inducing processing pipeline.
Result: Certain Galaxy S models share common patterns causing fingerprint collisions, similar issues found in Galaxy A models. PRNU verification works reliably with raw images from PRO mode, but not available for mid-range A series or forensic cases without raw access.
Conclusion: Diagonal artifacts pose challenges for PRNU verification but can be leveraged forensically for HDR misdetection reduction and synthetic bokeh localization in portrait-mode images.
Abstract: We investigate diagonal artifacts present in images captured by several Samsung smartphones and their impact on PRNU-based camera source verification. We first show that certain Galaxy S series models share a common pattern causing fingerprint collisions, with a similar issue also found in some Galaxy A models. Next, we demonstrate that reliable PRNU verification remains feasible for devices supporting PRO mode with raw capture, since raw images bypass the processing pipeline that introduces artifacts. This option, however, is not available for the mid-range A series models or in forensic cases without access to raw images. Finally, we outline potential forensic applications of the diagonal artifacts, such as reducing misdetections in HDR images and localizing regions affected by synthetic bokeh in portrait-mode images.
[253] PRNet: Original Information Is All You Have
PeiHuang Zheng, Yunlong Zhao, Zheng Cui, Yang Li
Main category: cs.CV
TL;DR: PRNet is a real-time detection framework that preserves primitive shallow spatial features to improve small object detection in aerial images, addressing information degradation issues through Progressive Refinement Neck and Enhanced SliceSamp modules.
Details
Motivation: Small object detection in aerial images suffers from severe information degradation during feature extraction, where shallow spatial details fail to align effectively with semantic information, leading to frequent misses and false positives. Existing FPN-based methods have limitations in reconstructing accurate details.
Method: PRNet uses two main modules: Progressive Refinement Neck (PRN) for spatial-semantic alignment through backbone reuse and iterative refinement, and Enhanced SliceSamp (ESSamp) for preserving shallow information during downsampling via optimized rearrangement and convolution.
Result: Extensive experiments on VisDrone, AI-TOD, and UAVDT datasets demonstrate that PRNet outperforms state-of-the-art methods under comparable computational constraints, achieving superior accuracy-efficiency trade-offs.
Conclusion: PRNet effectively addresses the limitations of existing methods by prioritizing preservation and efficient utilization of primitive shallow spatial features, providing a superior solution for small object detection in aerial images with real-time performance.
Abstract: Small object detection in aerial images suffers from severe information degradation during feature extraction due to limited pixel representations, where shallow spatial details fail to align effectively with semantic information, leading to frequent misses and false positives. Existing FPN-based methods attempt to mitigate these losses through post-processing enhancements, but the reconstructed details often deviate from the original image information, impeding their fusion with semantic content. To address this limitation, we propose PRNet, a real-time detection framework that prioritizes the preservation and efficient utilization of primitive shallow spatial features to enhance small object representations. PRNet achieves this via two modules: the Progressive Refinement Neck (PRN) for spatial-semantic alignment through backbone reuse and iterative refinement, and the Enhanced SliceSamp (ESSamp) for preserving shallow information during downsampling via optimized rearrangement and convolution. Extensive experiments on the VisDrone, AI-TOD, and UAVDT datasets demonstrate that PRNet outperforms state-of-the-art methods under comparable computational constraints, achieving superior accuracy-efficiency trade-offs.
[254] FLOWING: Implicit Neural Flows for Structure-Preserving Morphing
Arthur Bizzi, Matias Grynberg, Vitor Matias, Daniel Perazzo, João Paulo Lima, Luiz Velho, Nuno Gonçalves, João Pereira, Guilherme Schardong, Tiago Novello
Main category: cs.CV
TL;DR: FLOWING is a neural framework that treats morphing as differential vector flow construction, ensuring continuity and invertibility without costly regularizations, achieving state-of-the-art morphing quality for 2D/3D content.
Details
Motivation: Standard MLP-based implicit neural representations for morphing require expensive regularizations that lead to unstable training and poor feature alignment. The paper aims to overcome these limitations with a more principled approach.
Method: Proposes FLOWING framework that recasts warping as construction of differential vector flow, encoding structural flow properties directly into network architectures to ensure continuity, invertibility, and temporal coherence.
Result: Extensive experiments show FLOWING achieves state-of-the-art morphing quality with faster convergence across face/image morphing and Gaussian Splatting morphing applications.
Conclusion: The flow-centric approach provides principled and stable transformations for accurate, structure-preserving morphing of both 2D images and 3D shapes.
Abstract: Morphing is a long-standing problem in vision and computer graphics, requiring a time-dependent warping for feature alignment and a blending for smooth interpolation. Recently, multilayer perceptrons (MLPs) have been explored as implicit neural representations (INRs) for modeling such deformations, due to their meshlessness and differentiability; however, extracting coherent and accurate morphings from standard MLPs typically relies on costly regularizations, which often lead to unstable training and prevent effective feature alignment. To overcome these limitations, we propose FLOWING (FLOW morphING), a framework that recasts warping as the construction of a differential vector flow, naturally ensuring continuity, invertibility, and temporal coherence by encoding structural flow properties directly into the network architectures. This flow-centric approach yields principled and stable transformations, enabling accurate and structure-preserving morphing of both 2D images and 3D shapes. Extensive experiments across a range of applications - including face and image morphing, as well as Gaussian Splatting morphing - show that FLOWING achieves state-of-the-art morphing quality with faster convergence. Code and pretrained models are available at http://schardong.github.io/flowing.
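To see why a flow-based warp is continuous and invertible by construction, consider this small PyTorch sketch: an MLP velocity field is integrated with Euler steps, and reversing the integration direction inverts the warp. The architecture and step count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Sketch of a flow-based warp: an MLP defines a velocity field
    v(x, t); warping is ODE integration, so the map is continuous and
    invertible by integrating the same field backward in time."""

    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim))

    def forward(self, x, t):
        t_col = torch.full_like(x[..., :1], t)  # broadcast time to each point
        return self.net(torch.cat([x, t_col], dim=-1))

def warp(field, x, t0=0.0, t1=1.0, steps=16):
    # Simple Euler integration; a negative time direction inverts the warp.
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * field(x, t)
        t += dt
    return x

field = VelocityField()
pts = torch.rand(100, 2)
warped = warp(field, pts)                   # forward morph at t=1
recovered = warp(field, warped, 1.0, 0.0)   # integrate back to t=0
print((pts - recovered).abs().max())        # small, up to Euler discretization error
```

A higher-order integrator such as RK4 would tighten the inversion error, but even plain Euler exhibits the structural guarantee.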
[255] TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control
Minkyoung Cho, Ruben Ohana, Christian Jacobsen, Adityan Jothi, Min-Hung Chen, Z. Morley Mao, Ethem Can
Main category: cs.CV
TL;DR: TC-LoRA introduces dynamic, context-aware control for diffusion models by conditioning weights directly through time-aware LoRA adapters generated on-the-fly, improving over static activation-based conditioning methods.
Details
Motivation: Current diffusion models use static conditioning strategies that modify intermediate activations, which limits adaptation to the dynamic, multi-stage denoising process from coarse structure to fine detail.
Method: Uses a hypernetwork to generate LoRA adapters on-the-fly that tailor weight modifications for the frozen backbone at each diffusion step based on time and user conditions, enabling explicit adaptive conditioning strategies.
Result: Significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods across various data domains.
Conclusion: TC-LoRA establishes a paradigm where conditioning strategy is modified through deeper functional weight adaptation, allowing control to align with dynamic task demands and generative stages.
Abstract: Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage denoising process, limiting the model’s ability to adapt its response as the generation evolves from coarse structure to fine detail. We introduce TC-LoRA (Temporally Modulated Conditional LoRA), a new paradigm that enables dynamic, context-aware control by conditioning the model’s weights directly. Our framework uses a hypernetwork to generate LoRA adapters on-the-fly, tailoring weight modifications for the frozen backbone at each diffusion step based on time and the user’s condition. This mechanism enables the model to learn and execute an explicit, adaptive strategy for applying conditional guidance throughout the entire generation process. Through experiments on various data domains, we demonstrate that this dynamic, parametric control significantly enhances generative fidelity and adherence to spatial conditions compared to static, activation-based methods. TC-LoRA establishes an alternative approach in which the model’s conditioning strategy is modified through a deeper functional adaptation of its weights, allowing control to align with the dynamic demands of the task and generative stage.
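A hedged sketch of the hypernetwork idea follows: a small network maps a timestep and condition embedding to per-sample low-rank matrices that modulate a frozen linear layer. Dimensions, rank, and names here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Sketch of TC-LoRA-style conditioning: a hypernetwork maps
    (diffusion timestep, condition embedding) to low-rank A/B matrices
    that modulate a frozen linear layer at every denoising step."""

    def __init__(self, d_in=320, d_out=320, rank=4, d_cond=64):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.hyper = nn.Sequential(
            nn.Linear(d_cond + 1, 128), nn.SiLU(),
            nn.Linear(128, rank * (d_in + d_out)))

    def forward(self, t, cond):
        # cond: (B, d_cond); t: (B,) normalized timestep in [0, 1].
        params = self.hyper(torch.cat([cond, t[:, None]], dim=-1))
        A = params[:, : self.rank * self.d_in].view(-1, self.rank, self.d_in)
        B = params[:, self.rank * self.d_in :].view(-1, self.d_out, self.rank)
        return A, B

base = nn.Linear(320, 320)          # stands in for a frozen backbone layer
hyper = LoRAHyperNet()
x = torch.randn(2, 320)
A, B = hyper(torch.tensor([0.1, 0.9]), torch.randn(2, 64))
delta = torch.einsum("bor,bri,bi->bo", B, A, x)  # per-sample low-rank path
y = base(x) + delta
print(y.shape)
```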
[256] FSP-DETR: Few-Shot Prototypical Parasitic Ova Detection
Shubham Trehan, Udhav Ramachandran, Akash Rao, Ruth Scimeca, Sathyanarayanan N. Aakur
Main category: cs.CV
TL;DR: FSP-DETR is a unified detection framework for biomedical object detection that addresses data scarcity and novel category challenges through few-shot learning, open-set recognition, and cross-task generalization within a single model.
Details
Motivation: Biomedical object detection faces fundamental constraints from scarce labeled data and frequent emergence of novel or rare categories, requiring robust solutions that can handle few-shot scenarios and open-set recognition.
Method: Built on class-agnostic DETR backbone, constructs class prototypes from support images, learns embedding space using augmented views and lightweight transformer decoder, with joint optimization of prototype matching loss, alignment-based separation loss, and KL divergence regularization.
Result: Significantly outperforms prior few-shot and prototype-based detectors across ova, blood cell, and malaria detection tasks, especially in low-shot and open-set scenarios.
Conclusion: FSP-DETR enables inference-time flexibility for unseen class recognition, background rejection, and cross-task adaptation without retraining, establishing a new standard for unified biomedical object detection.
Abstract: Object detection in biomedical settings is fundamentally constrained by the scarcity of labeled data and the frequent emergence of novel or rare categories. We present FSP-DETR, a unified detection framework that enables robust few-shot detection, open-set recognition, and generalization to unseen biomedical tasks within a single model. Built upon a class-agnostic DETR backbone, our approach constructs class prototypes from original support images and learns an embedding space using augmented views and a lightweight transformer decoder. Training jointly optimizes a prototype matching loss, an alignment-based separation loss, and a KL divergence regularization to improve discriminative feature learning and calibration under scarce supervision. Unlike prior work that tackles these tasks in isolation, FSP-DETR enables inference-time flexibility to support unseen class recognition, background rejection, and cross-task adaptation without retraining. We also introduce a new ova species detection benchmark with 20 parasite classes and establish standardized evaluation protocols. Extensive experiments across ova, blood cell, and malaria detection tasks demonstrate that FSP-DETR significantly outperforms prior few-shot and prototype-based detectors, especially in low-shot and open-set scenarios.
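The prototype matching component can be sketched as temperature-scaled cosine classification of query embeddings against class prototypes; the formulation below is an illustrative assumption, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def prototype_matching_loss(query_emb, prototypes, labels, tau=0.1):
    """Sketch of a prototype matching loss: detector query embeddings
    are classified by cosine similarity to class prototypes built from
    support images, trained with cross-entropy over the similarities.

    query_emb: (N, d) embeddings of matched object queries
    prototypes: (K, d) one prototype per support class
    labels: (N,) class index of each query
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = q @ p.t() / tau        # (N, K) temperature-scaled cosine scores
    return F.cross_entropy(logits, labels)

emb = torch.randn(8, 256)
protos = torch.randn(20, 256)       # e.g., 20 ova species prototypes
y = torch.randint(0, 20, (8,))
print(prototype_matching_loss(emb, protos, y).item())
```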
[257] Vision Language Models: A Survey of 26K Papers
Fengming Lin
Main category: cs.CV
TL;DR: This paper presents a transparent measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS (2023-2025), identifying three major shifts: multimodal vision-language-LLM work rising, generative methods expanding, and resilient 3D/video activity evolving.
Details
Motivation: To provide a reproducible analysis of research trends in computer vision and machine learning conferences, enabling better understanding of the field's evolution and facilitating auditing and extension of the methodology.
Method: Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities.
Result: Identified three macro shifts: (1) sharp rise of multimodal vision-language-LLM work reframing perception as instruction following; (2) steady expansion of generative methods focusing on controllability, distillation, and speed; (3) resilient 3D/video activity with composition moving from NeRFs to Gaussian splatting and emphasis on human/agent-centric understanding.
Conclusion: The longitudinal signals are consistent across venues and years, with cross-venue comparisons showing CVPR has a stronger 3D footprint and ICLR the highest VLM share. The lexicon and methodology are released to enable auditing and extension, though limitations include lexicon recall and abstract-only scope.
Abstract: We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
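The labeling methodology boils down to normalized phrase matching against a lexicon. The toy sketch below shows the control flow with a three-topic stand-in lexicon; the real lexicon assigns up to 35 labels and protects multi-word phrases more carefully.

```python
import re

# Tiny illustrative stand-in for the survey's hand-crafted lexicon.
LEXICON = {
    "vlm": ["vision-language", "instruction tuning", "multimodal llm"],
    "generative": ["diffusion", "distillation", "image generation"],
    "3d": ["nerf", "gaussian splatting", "point cloud"],
}

def normalize(text):
    # Lowercase and strip punctuation, keeping hyphens for phrase matching.
    return re.sub(r"[^a-z0-9\- ]+", " ", text.lower())

def assign_labels(title, abstract):
    doc = normalize(title + " " + abstract)
    return [topic for topic, phrases in LEXICON.items()
            if any(phrase in doc for phrase in phrases)]

print(assign_labels(
    "Controllable Diffusion via Gaussian Splatting",
    "We distill a diffusion model for fast 3D generation."))
# ['generative', '3d']
```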
[258] SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
Main category: cs.CV
TL;DR: This paper introduces SpaceVista-1M dataset and SpaceVista-7B model for all-scale spatial reasoning, addressing challenges in dataset curation and scene modeling across diverse applications like robotics and autonomous driving.
Details
Motivation: Current spatial reasoning research struggles with heavy reliance on indoor 3D scans and manual annotations, and lacks effective all-scale scene modeling, which often leads to overfitting to individual scenes.
Method: Developed a holistic solution with a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm. Created SpaceVista-1M, a dataset of 1M spatial QA pairs from 38K video scenes across 5 scales built with an automated pipeline, and SpaceVista-7B, a model that accepts dense inputs and uses scale as an anchor for experts and progressive rewards.
Result: Extensive evaluations across 5 benchmarks demonstrate competitive performance with strong generalization across all scales and scenarios.
Conclusion: The proposed approach successfully advances all-scale spatial reasoning across diverse scenarios, with the dataset, model, and benchmark being publicly released.
Abstract: With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as, to the best of our knowledge, the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We then build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to the potential knowledge conflict. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released on https://peiwensun2000.github.io/mm2km.
[259] VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan
Main category: cs.CV
TL;DR: A distillation-based framework that equips vision-language models with action-execution capability by transferring knowledge from pretrained small action models, achieving high success rates while reducing training costs.
Details
Motivation: Training Vision-Language Action (VLA) models from scratch is costly, and existing methods need improved generalization and efficiency in robotic manipulation tasks.
Method: Two-stage training: a lightweight alignment stage maps VLM hidden states into the action space of a small action model, followed by selective fine-tuning of the language model, state encoder, and action modules. The architecture adds only an action token and a state encoder to the original VLM.
Result: Achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). Real-world experiments show 82.0% success rate (17% improvement over teacher model).
Conclusion: Action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs, outperforming state-of-the-art methods.
Abstract: Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (an 11.8% improvement) and 93.5% on LIBERO-LONG (a 24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (a 17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
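Stage one of the recipe amounts to a learned map from the VLM's action-token hidden state into the teacher's action latent space; below is a minimal sketch, with the dimensions and the MSE objective as assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

# Only the projector is trained in this alignment stage; the VLM and the
# small action model (the teacher) are assumed frozen.
d_vlm, d_action_latent = 4096, 512
projector = nn.Linear(d_vlm, d_action_latent)

def alignment_loss(vlm_hidden, teacher_latent):
    """vlm_hidden: (B, d_vlm) hidden state at the action token;
    teacher_latent: (B, d_action_latent) from the small action model."""
    return nn.functional.mse_loss(projector(vlm_hidden), teacher_latent)

h = torch.randn(4, d_vlm)            # stand-in VLM features
z = torch.randn(4, d_action_latent)  # stand-in teacher latents
loss = alignment_loss(h, z)
loss.backward()
print(loss.item())
```

Once aligned, the teacher's pretrained action decoder can be reused on the projected states, which is what lets the framework skip expensive action pretraining.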
[260] StreamingVLM: Real-Time Understanding for Infinite Video Streams
Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
Main category: cs.CV
TL;DR: StreamingVLM enables real-time understanding of infinite video streams by maintaining a compact KV cache with attention sinks, recent vision tokens, and recent text tokens, achieving stable performance at 8 FPS on H100 GPU.
Details
Motivation: Current VLMs struggle with infinite video streams due to quadratic computational costs from full attention and poor performance on long videos, while sliding window methods break coherence or have high latency.
Method: Unified framework aligning training with streaming inference using supervised fine-tuning on short, overlapped video chunks to mimic inference-time attention patterns without training on long contexts.
Result: Achieves a 66.18% win rate against GPT-4o mini on the Inf-Streams-Eval benchmark, maintains 8 FPS real-time performance, and improves general VQA abilities (+4.30 on LongVideoBench, +5.96 on OVOBench Realtime).
Conclusion: StreamingVLM provides an effective solution for real-time video understanding with stable performance and enhanced general capabilities through simple SFT strategy.
Abstract: Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4o mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
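The cache policy itself is easy to state. Below is a toy, tensor-free sketch of the described eviction scheme, keeping sink tokens forever plus bounded vision and text windows; the window sizes and names are assumptions, not the paper's values.

```python
from collections import deque

class StreamingKVCache:
    """Sketch of the described cache policy: permanently keep the first
    few tokens (attention sinks), a short window of recent vision tokens,
    and a longer window of recent text tokens; evict everything else."""

    def __init__(self, n_sink=4, vision_window=256, text_window=1024):
        self.n_sink = n_sink
        self.sinks = []                            # kept forever
        self.vision = deque(maxlen=vision_window)  # short visual window
        self.text = deque(maxlen=text_window)      # long textual window

    def append(self, kv, modality):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)      # the first tokens anchor attention
        elif modality == "vision":
            self.vision.append(kv)     # old frames fall off the left edge
        else:
            self.text.append(kv)

    def view(self):
        # Token states the next decoding step is allowed to attend to.
        return self.sinks + list(self.vision) + list(self.text)

cache = StreamingKVCache(n_sink=2, vision_window=3, text_window=4)
for i in range(8):
    cache.append(f"v{i}", "vision")
cache.append("caption", "text")
print(cache.view())  # ['v0', 'v1', 'v5', 'v6', 'v7', 'caption']
```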
[261] Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
Luigi Celona, Simone Bianco, Marco Donzella, Paolo Napoletano
Main category: cs.CV
TL;DR: A novel approach that combines captions from multiple SoTA models using BLIPScore ranking and LLM fusion to generate richer, more detailed image captions without additional training.
Details
Motivation: Current SoTA captioning models trained on MS-COCO produce short captions that fail to capture complex scenes and finer details, exhibiting bias towards average descriptions that overlook specific information.
Method: Leverages pre-trained SoTA models to generate initial captions, ranks them using BLIPScore metric, then fuses top two captions using a Large Language Model to produce final detailed descriptions.
Result: Experimental results on MS-COCO and Flickr30k show improved caption-image alignment and reduced hallucination according to ALOHa, CAPTURE, and Polos metrics, with subjective studies confirming better human judgment alignment.
Conclusion: The method enhances caption quality by combining diverse SoTA models, bridging the gap between automated systems and human-generated descriptions, enabling better training data for vision-language and captioning models.
Abstract: State-of-the-Art (SoTA) image captioning models are often trained on the Microsoft Common Objects in Context (MS-COCO) dataset, which contains human-annotated captions with an average length of approximately ten tokens. Although effective for general scene understanding, these short captions often fail to capture complex scenes and convey detailed information. Moreover, captioning models tend to exhibit bias towards the “average” caption, which captures only the more general aspects, thus overlooking finer details. In this paper, we present a novel approach to generate richer and more informative image captions by combining the captions generated from different SoTA captioning models. Our proposed method requires no additional model training: given an image, it leverages pre-trained models from the literature to generate the initial captions, and then ranks them using a newly introduced image-text-based metric, which we name BLIPScore. Subsequently, the top two captions are fused using a Large Language Model (LLM) to produce the final, more detailed description. Experimental results on the MS-COCO and Flickr30k test sets demonstrate the effectiveness of our approach in terms of caption-image alignment and hallucination reduction according to the ALOHa, CAPTURE, and Polos metrics. A subjective study lends additional support to these results, suggesting that the captions produced by our model are generally perceived as more consistent with human judgment. By combining the strengths of diverse SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich and informative nature of human-generated descriptions. This advance enables the generation of more suitable captions for the training of both vision-language and captioning models.
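Because the pipeline is training-free, it reduces to plain orchestration code; in the sketch below, `blip_score` and `llm` are hypothetical stand-ins for the paper's BLIPScore metric and LLM fusion step, not real APIs.

```python
def fuse_captions(image, captions, blip_score, llm):
    """Sketch of the rank-and-fuse pipeline: score each candidate caption
    against the image, keep the top two, and ask an LLM to merge them."""
    # 1) Rank candidate captions by image-text alignment.
    ranked = sorted(captions, key=lambda c: blip_score(image, c), reverse=True)
    top_two = ranked[:2]
    # 2) Ask an LLM to merge the two best captions into one description.
    prompt = (
        "Merge these two image captions into a single detailed caption, "
        f"keeping only facts present in them:\n1. {top_two[0]}\n2. {top_two[1]}"
    )
    return llm(prompt)

# Toy stand-ins just to show the control flow.
fake_score = lambda img, c: len(c)       # pretend longer = better aligned
fake_llm = lambda p: p.splitlines()[-1]  # pretend fusion
caps = ["a dog", "a brown dog on grass", "a dog outdoors in a park"]
print(fuse_captions(None, caps, fake_score, fake_llm))
```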
[262] FFT-based Selection and Optimization of Statistics for Robust Recognition of Severely Corrupted Images
Elena Camuffo, Umberto Michieli, Jijoong Moon, Daehyun Kim, Mete Ozay
Main category: cs.CV
TL;DR: FROST improves model robustness on corrupted images using high-frequency features to detect corruption types and select layer-wise normalization statistics, achieving state-of-the-art results.
Details
Motivation: Improving model robustness for corrupted images is crucial for robust vision systems on smart devices like robotic agents, where robust test-time performance is imperative.
Method: FROST employs high-frequency features to detect input image corruption type and select layer-wise feature normalization statistics to improve any classification model’s robustness.
Result: FROST achieves state-of-the-art results, outperforming competitors on ImageNet-C by up to a 37.1% relative gain and improving on a baseline of 40.9% mCE on severe corruptions.
Conclusion: The proposed FROST method effectively enhances model robustness on corrupted images through corruption type detection and adaptive normalization, demonstrating significant improvements over existing approaches.
Abstract: Improving model robustness in case of corrupted images is among the key challenges to enable robust vision systems on smart devices, such as robotic agents. In particular, robust test-time performance is imperative for most applications. This paper presents a novel approach to improve the robustness of any classification model, especially on severely corrupted images. Our method (FROST) employs high-frequency features to detect the input image corruption type and select layer-wise feature normalization statistics. FROST provides state-of-the-art results for different models and datasets, outperforming competitors on ImageNet-C by up to a 37.1% relative gain, improving on a baseline of 40.9% mCE on severe corruptions.
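One plausible form of the high-frequency cue is the fraction of FFT energy outside a low-frequency disk, which noise inflates and blur suppresses; the sketch below is an assumption about the feature's shape, not the paper's exact statistic.

```python
import torch

def high_frequency_signature(img, cutoff=0.25):
    """Sketch of an FFT-based corruption cue: the per-channel fraction of
    spectral energy outside a low-frequency disk. A small classifier on
    this signature could pick which stored normalization statistics to
    apply for the detected corruption type.

    img: (C, H, W) tensor in [0, 1].
    """
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1)).abs()
    C, H, W = img.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    low = (yy**2 + xx**2).sqrt() < cutoff           # low-frequency disk
    total = spec.flatten(1).sum(dim=1)
    high = (spec * (~low)).flatten(1).sum(dim=1)
    return high / total                              # (C,) high-freq ratio

smooth = torch.linspace(0, 1, 64).view(1, 1, 64).expand(3, 64, 64)
noisy = (smooth + 0.3 * torch.randn_like(smooth)).clamp(0, 1)
print(high_frequency_signature(smooth), high_frequency_signature(noisy))
# noise inflates the high-frequency ratio; blur would shrink it
```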
[263] Continual Adapter Tuning with Semantic Shift Compensation for Class-Incremental Learning
Qinhao Zhou, Yuwen Tan, Boqing Gong, Xiang Xiang
Main category: cs.CV
TL;DR: The paper proposes a parameter-efficient tuning method for class-incremental learning that uses adapter tuning without parameter expansion, feature sampling from prototypes, and semantic shift estimation to achieve state-of-the-art performance.
Details
Motivation: To enable models to continuously learn new classes while overcoming catastrophic forgetting, leveraging pre-trained models with parameter-efficient tuning methods in continual learning contexts.
Method: Uses adapter tuning without parameter expansion constraints, employs feature sampling from stored prototypes to retrain classifiers, and estimates semantic shift of old prototypes without accessing past samples.
Result: Achieves state-of-the-art performance on five CIL benchmarks, surpassing previous pre-trained model-based CIL methods without model expansion or retaining image samples.
Conclusion: The proposed method demonstrates remarkable continual learning capabilities and effectively addresses catastrophic forgetting in class-incremental learning scenarios.
Abstract: Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting. The introduction of pre-trained models has brought new tuning paradigms to CIL. In this paper, we revisit different parameter-efficient tuning (PET) methods within the context of continual learning. We observe that adapter tuning demonstrates superiority over prompt-based methods, even without parameter expansion in each learning session. Motivated by this, we propose incrementally tuning the shared adapter without imposing parameter update constraints, enhancing the learning capacity of the backbone. Additionally, we employ feature sampling from stored prototypes to retrain a unified classifier, further improving its performance. We estimate the semantic shift of old prototypes without access to past samples and update stored prototypes session by session. Our proposed method eliminates model expansion and avoids retaining any image samples. It surpasses previous pre-trained model-based CIL methods and demonstrates remarkable continual learning capabilities. Experimental results on five CIL benchmarks validate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance.
[264] UltraSeP: Sequence-aware Pre-training for Echocardiography Probe Movement Guidance
Haojun Jiang, Teng Wang, Zhenguo Sun, Yulin Wang, Yang Yue, Yu Sun, Ning Jia, Meng Li, Shaqi Luo, Shiji Song, Gao Huang
Main category: cs.CV
TL;DR: A novel sequence-aware self-supervised pre-training method for echocardiography probe guidance that learns personalized 3D cardiac structures to reduce operational complexity and improve standard plane image acquisition.
Details
Motivation: Address the shortage of trained echocardiography professionals by developing automated probe guidance systems, overcoming challenges of complex heart structure and individual variations that previous population-averaged approaches failed to handle.
Method: Sequence-aware self-supervised pre-training that learns personalized 3D cardiac structural features by predicting masked-out image features and probe movement actions in scanning sequences, mimicking how sonographers dynamically adjust based on prior scans.
Result: Extensive experiments on 1.67 million expert scanning samples show the method effectively reduces probe guidance errors compared to advanced baseline methods.
Conclusion: The proposed sequence-aware paradigm successfully captures personalized cardiac anatomy understanding, enabling more accurate probe guidance for echocardiography applications.
Abstract: Echocardiography is an essential medical technique for diagnosing cardiovascular diseases, but its high operational complexity has led to a shortage of trained professionals. To address this issue, we introduce a novel probe movement guidance algorithm that has the potential to be applied in guiding robotic systems or novices with probe pose adjustment for high-quality standard plane image acquisition. Cardiac ultrasound faces two major challenges: (1) the inherently complex structure of the heart, and (2) significant individual variations. Previous works have only learned the population-averaged structure of the heart rather than personalized cardiac structures, leading to a performance bottleneck. Clinically, we observe that sonographers dynamically adjust their interpretation of a patient's cardiac anatomy based on prior scanning sequences, consequently refining their scanning strategies. Inspired by this, we propose a novel sequence-aware self-supervised pre-training method. Specifically, our approach learns personalized three-dimensional cardiac structural features by predicting the masked-out image features and probe movement actions in a scanning sequence. We hypothesize that if the model can predict the missing content, it has acquired a good understanding of personalized cardiac structure. Extensive experiments on a large-scale expert scanning dataset with 1.67 million samples demonstrate that our proposed sequence-aware paradigm can effectively reduce probe guidance errors compared to other advanced baseline methods.
[265] Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer
Minh Bui, Kostas Alexis
Main category: cs.CV
TL;DR: A diffusion-based framework for RGB-D semantic segmentation that uses a Deformable Attention Transformer encoder to handle depth image noise, achieving state-of-the-art performance with faster training than discriminative methods.
Details
Motivation: Vision-based perception is essential for autonomous systems, but noisy RGB-D measurements in real-world applications require robust interpretation methods for reliable scene understanding.
Method: Diffusion-based framework with Deformable Attention Transformer encoder to extract features from depth images, specifically designed to capture characteristics of invalid depth regions.
Result: Achieves state-of-the-art performance on NYUv2 and SUN-RGBD datasets, particularly excelling in challenging scenarios, with significantly reduced training time compared to discriminative approaches.
Conclusion: The generative diffusion framework effectively models RGB-D image distributions and demonstrates superior robustness in handling noisy depth measurements for semantic segmentation tasks.
Abstract: Vision-based perception and reasoning are essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves state-of-the-art performance on both the NYUv2 and SUN-RGBD datasets overall, and especially on their most challenging image data. Our project page will be available at https://diffusionmms.github.io/
[266] A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond
Shubhi Bansal, Sreeharish A, Madhava Prasath J, Manikandan S, Sreekanth Madisetty, Mohammad Zia Ur Rehman, Chandravardhan Singh Raghaw, Gaurav Duggal, Nagendra Kumar
Main category: cs.CV
TL;DR: Mamba, a State Space Model variant, offers linear time complexity and efficient long-range dependency handling for medical image analysis, overcoming transformer limitations like quadratic complexity and memory inefficiency.
Details
Motivation: Transformers have limitations in medical imaging, including quadratic computational complexity and inefficient long-range dependency handling, which affects analysis of large, complex medical datasets with spatial and temporal relationships.
Method: The paper reviews Mamba architectures including pure Mamba, U-Net variants, and hybrid models with CNNs, transformers, and Graph Neural Networks, covering optimizations, techniques, scanning, datasets, and applications.
Result: Mamba demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes, with faster inference and lower memory requirements compared to transformers.
Conclusion: Mamba has transformative potential in overcoming existing barriers in medical imaging and paves the way for innovative advancements, though challenges and future directions remain.
Abstract: Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available on GitHub.
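For readers new to the SSM concepts the survey starts from, the discrete linear recurrence at the core of S4-style models can be written in a few lines. This toy version is sequential and omits Mamba's input-dependent ("selective") parameters and the parallel-scan implementations used in practice.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Discrete state-space recurrence: x_k = A x_{k-1} + B u_k, y_k = C x_k.
    A: (n, n), B: (n, 1), C: (1, n), u: (T,) -> y: (T,)."""
    x = np.zeros((A.shape[0], 1))
    ys = []
    for u_k in u:
        x = A @ x + B * u_k          # state update
        ys.append((C @ x).item())    # readout
    return np.array(ys)
```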
[267] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
Soumava Paul, Prakhar Kaushik, Alan Yuille
Main category: cs.CV
TL;DR: A generative approach for pose-free 360° scene reconstruction from sparse 2D images using image-to-image inpainting and Gaussian-SLAM-inspired progressive integration.
Details
Motivation: Existing sparse-view reconstruction methods require known camera poses, limiting their applicability when ground-truth poses are unavailable during evaluation.
Method: Uses an image-to-image generative model with FiLM layers for context/geometry conditioning, a novel confidence measure for 3D Gaussian splats, and progressive integration in a Gaussian-SLAM-inspired process.
Result: Outperforms existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods on MipNeRF360 and DL3DV-10K benchmarks.
Conclusion: The method enables high-quality 360° scene reconstruction without requiring camera parameters, achieving multi-view consistency through generative inpainting and progressive integration.
Abstract: In this work, we introduce a generative approach for pose-free (without camera parameters) reconstruction of 360° scenes from a sparse set of 2D images. Scene reconstruction from incomplete, pose-free observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of large complex scenes (with a high degree of foreground and background detail) with known camera poses using view-conditioned generative priors, these methods cannot be directly adapted to the pose-free setting, where ground-truth poses are not available during evaluation. To address this, we propose an image-to-image generative model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We introduce context and geometry conditioning using Feature-wise Linear Modulation (FiLM) layers as a lightweight alternative to cross-attention, and also propose a novel confidence measure for 3D Gaussian splat representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent 3D representation. Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods in complex 360° scenes. Our project page provides additional results, videos, and code.
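Since FiLM conditioning is central to the method, a minimal PyTorch version of a FiLM layer may help: a conditioning vector is mapped to per-channel scale (gamma) and shift (beta) applied to feature maps. Dimensions below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: out = gamma(cond) * feats + beta(cond)."""
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats, cond):
        # feats: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma[..., None, None] * feats + beta[..., None, None]

film = FiLM(cond_dim=128, num_channels=64)
out = film(torch.randn(2, 64, 32, 32), torch.randn(2, 128))
```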
[268] RAGDiffusion: Faithful Cloth Generation via External Knowledge Assimilation
Xianfeng Tan, Yuhan Li, Wenxiang Shang, Yubo Wu, Jian Wang, Xuanhong Chen, Yi Zhang, Ran Lin, Bingbing Ni
Main category: cs.CV
TL;DR: RAGDiffusion is a Retrieval-Augmented Generation framework that enhances clothing asset generation by improving structure determinacy and reducing hallucinations through knowledge assimilation from language models and external databases.
Details
Motivation: Standard clothing asset generation faces challenges with structural hallucinations and texture distortion due to limited spatial perception in existing models when extracting clothing information from complex real-world contexts.
Method: Uses a two-process RAG framework: (1) Retrieval-based structure aggregation with contrastive learning and Structure Locally Linear Embedding for global structure and spatial landmarks, and (2) Omni-level faithful garment generation with coarse-to-fine texture alignment.
Result: Extensive experiments show RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements on challenging real-world datasets.
Conclusion: RAGDiffusion represents a pioneering effort in high-specification faithful generation using RAG to confront intrinsic hallucinations and enhance fidelity in clothing asset generation.
Abstract: Standard clothing asset generation involves restoring forward-facing flat-lay garment images displayed on a clear background by extracting clothing information from diverse real-world contexts, which presents significant challenges due to highly standardized structure sampling distributions and clothing semantic absence in complex scenarios. Existing models have limited spatial perception, often exhibiting structural hallucinations and texture distortion in this high-specification generative task. To address this issue, we propose a novel Retrieval-Augmented Generation (RAG) framework, termed RAGDiffusion, to enhance structure determinacy and mitigate hallucinations by assimilating knowledge from language models and external databases. RAGDiffusion consists of two processes: (1) Retrieval-based structure aggregation, which employs contrastive learning and a Structure Locally Linear Embedding (SLLE) to derive global structure and spatial landmarks, providing both soft and hard guidance to counteract structural ambiguities; and (2) Omni-level faithful garment generation, which introduces a coarse-to-fine texture alignment that ensures fidelity in pattern and detail components within the diffusion process. Extensive experiments on challenging real-world datasets demonstrate that RAGDiffusion synthesizes structurally and texture-faithful clothing assets with significant performance improvements, representing a pioneering effort in high-specification faithful generation with RAG to confront intrinsic hallucinations and enhance fidelity.
[269] Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
Main category: cs.CV
TL;DR: Generate Any Scene is a data engine that creates diverse scene graphs and uses them to generate training data for improving text-to-vision models through self-improvement, distillation, and reward modeling.
Details
Motivation: Current text-to-vision models struggle with compositional generalization and semantic alignment due to noisy datasets and lack of scalable high-quality annotations.
Method: Systematically enumerates scene graphs from object/attribute/relation taxonomies, translates them into captions for generation and visual Q&A for evaluation, then uses this data for self-improvement, distillation, and reward modeling.
Result: Stable Diffusion v1.5 improved by 4% over baselines, achieved 10% TIFA score increase with distillation, and reward models surpassed CLIP-based methods by +5% on DPG-Bench.
Conclusion: Generate Any Scene enables scalable creation of high-quality training data that significantly improves text-to-vision models’ compositional understanding and semantic alignment across multiple applications.
Abstract: Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework where models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpasses fine-tuning on CC3M. Second, we also design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by +5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation where we train models to identify challenging cases by learning from synthetic data.
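A toy rendering of the scene-graph-to-caption idea: sample objects, attributes, and a relation from small taxonomies and template them into a caption. The paper's taxonomies, complexity control, and caption translation are far richer; this only illustrates the combinatorial enumeration.

```python
import random

OBJECTS = ["dog", "bicycle", "lamp"]
ATTRIBUTES = ["red", "wooden", "small"]
RELATIONS = ["next to", "on top of", "behind"]

def sample_scene_graph(rng=random):
    """Draw a tiny (subject, relation, object) graph from the taxonomies."""
    subj, obj = rng.sample(OBJECTS, 2)
    return {
        "subject": (rng.choice(ATTRIBUTES), subj),
        "relation": rng.choice(RELATIONS),
        "object": (rng.choice(ATTRIBUTES), obj),
    }

def graph_to_caption(g):
    (a1, s), rel, (a2, o) = g["subject"], g["relation"], g["object"]
    return f"a {a1} {s} {rel} a {a2} {o}"

print(graph_to_caption(sample_scene_graph()))
```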
[270] Making Bias Amplification in Balanced Datasets Directional and Interpretable
Bhanu Tokas, Rahul Nair, Hannah Kerner
Main category: cs.CV
TL;DR: Proposes a new metric called Directional Predictability Amplification (DPA) to measure directional bias amplification in ML models, addressing limitations of existing metrics for balanced datasets.
Details
Motivation: Existing bias amplification metrics fail for balanced datasets and cannot identify the direction of bias amplification. Leakage amplification, while working for balanced datasets, cannot measure directional bias.
Method: Introduces DPA, a predictability-based metric that measures directional bias amplification by quantifying how much a model's predictions amplify the predictability of protected attributes from task labels.
Result: Experiments on tabular and image datasets show DPA effectively measures directional bias amplification, is easier to interpret, and less sensitive to attacker models compared to leakage amplification.
Conclusion: DPA is a practical metric for measuring directional bias amplification in balanced datasets, overcoming limitations of existing co-occurrence-based and predictability-based metrics.
Abstract: Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them – a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.
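Schematically, predictability-based metrics of this family train an "attacker" to predict the protected attribute from task labels, then compare its accuracy on model predictions against ground truth. The sketch below conveys that structure only, for one direction (T to A); the paper's exact DPA formula is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predictability(inputs, targets):
    """Training accuracy of a simple attacker; in practice a held-out
    split would be used."""
    clf = LogisticRegression().fit(inputs, targets)
    return clf.score(inputs, targets)

def amplification_T_to_A(A, T_true, T_pred):
    """How much more predictable the protected attribute A becomes from the
    model's task predictions than from ground-truth task labels."""
    T_true = np.asarray(T_true).reshape(-1, 1)
    T_pred = np.asarray(T_pred).reshape(-1, 1)
    return predictability(T_pred, A) - predictability(T_true, A)
```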
[271] SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts
Robin Schön, Julian Lorenz, Daniel Kienzle, Rainer Lienhart
Main category: cs.CV
TL;DR: A novel interactive segmentation architecture for winter sports that uses click prompts to generate high-quality masks, outperforming SAM and HQ-SAM on winter sports equipment segmentation.
Details
Motivation: To develop an interactive segmentation system specifically optimized for winter sports contexts that can quickly respond to user clicks and accurately segment winter sports equipment.
Method: First presents a baseline architecture for fast response to click prompts, then introduces architectural modifications optimized for segmenting winter sports equipment on the WSESeg dataset.
Result: Outperforms SAM by 2.336 clicks and HQ-SAM by 7.946 clicks on the average NoC@85 metric for the WSESeg classes. Achieves state-of-the-art results on HQSeg-44k with a NoC@90 of 6.00 and a NoC@95 of 9.89. Also tested on a novel dataset of humans during skiing.
Conclusion: The proposed architecture successfully improves interactive segmentation performance in winter sports contexts, demonstrating significant improvements over existing methods and achieving state-of-the-art results on relevant datasets.
Abstract: In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the object's position with the help of user guidance. In our case, the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regard to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and a NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.
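For reference, the NoC@q ("number of clicks") metric reported above can be computed by simulating clicks until the predicted mask reaches IoU q. The helper callables below (the model wrapper and the click-placement strategy) are placeholders, not the paper's code; masks are assumed to be boolean arrays.

```python
def noc_at(q, gt_mask, predict_mask, next_click, max_clicks=20):
    """q: IoU threshold (e.g., 0.85); predict_mask(clicks) -> boolean mask;
    next_click(gt_mask, clicks) -> next click (typically the largest error region)."""
    clicks = []
    for n in range(1, max_clicks + 1):
        clicks.append(next_click(gt_mask, clicks))
        pred = predict_mask(clicks)
        inter = (pred & gt_mask).sum()
        union = (pred | gt_mask).sum()
        if union and inter / union >= q:
            return n                     # IoU threshold reached after n clicks
    return max_clicks                    # cap when the threshold is never met
```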
[272] RadVLM: A Multitask Conversational Vision-Language Model for Radiology
Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
Main category: cs.CV
TL;DR: RadVLM is a compact, multitask conversational foundation model for chest X-ray interpretation that achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks.
Details
Motivation: Address the shortage of radiologists and lack of interactive diagnostic capabilities in existing vision-language models for automated CXR analysis and AI-assisted reporting.
Method: Curated a large-scale instruction dataset with over 1 million image-instruction pairs containing single-turn tasks (report generation, abnormality classification, visual grounding) and multi-turn conversational interactions, then fine-tuned RadVLM on this dataset.
Result: RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Joint training across multiple tasks is particularly beneficial for scenarios with limited annotated data.
Conclusion: RadVLM shows potential as a clinically relevant AI assistant that provides structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
Abstract: The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks – such as report generation, abnormality classification, and visual grounding – and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
[273] RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness
Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang
Main category: cs.CV
TL;DR: RobustMerge is a training-free parameter-efficient merging method that addresses challenges in merging expert models from parameter-efficient tuning by maintaining direction robustness through parameter pruning and cross-task normalization.
Details
Motivation: With the rise of parameter-efficient tuning creating many task-specific expert models, there's a need for efficient merging methods that work with these models, as existing full fine-tuning merging methods fail under efficient tuning scenarios.
Method: The method analyzes low-rank decomposition and identifies direction robustness as crucial. It prunes parameters and scales coefficients from inter-parameter relations to maintain direction stability, and performs cross-task normalization for better generalization.
Result: Experiments on a diverse multimodal task benchmark show outstanding performance and generalizability. The method effectively maintains direction robustness and enhances unseen task generalization.
Conclusion: RobustMerge successfully addresses the challenge of merging parameter-efficient tuned models by focusing on direction robustness, providing an effective training-free solution for creating universal models from multiple expert models.
Abstract: Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging these models into one universal model that empowers multi-task ability while refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning has become the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from the perspective of low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relations for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to verify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase its effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.
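A rough sketch of the singular-value intuition: compress the gap between large and small singular values of each task's weight delta before merging, so no single direction dominates. The exact pruning/scaling rule and cross-task normalization in RobustMerge are more involved; treat this as an assumption-laden approximation.

```python
import torch

def merge_deltas(deltas, power=0.5):
    """deltas: list of (d_out, d_in) task-specific weight deltas."""
    merged = torch.zeros_like(deltas[0])
    for dw in deltas:
        U, S, Vh = torch.linalg.svd(dw, full_matrices=False)
        total = S.sum()
        S = S.clamp(min=1e-8) ** power   # compress the singular-value gap
        S = S / S.sum() * total          # restore the original nuclear norm
        merged += U @ torch.diag(S) @ Vh
    return merged / len(deltas)
```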
[274] Measuring directional bias amplification in image captions using predictability
Rahul Nair, Bhanu Tokas, Hannah Kerner
Main category: cs.CV
TL;DR: Proposes DPAC, a new metric to measure directional bias amplification in image captioning that improves upon existing co-occurrence and predictability-based metrics by capturing semantics, directionality, and being less sensitive to hyperparameters.
Details
Motivation: Existing bias amplification metrics like co-occurrence methods fail for complex tasks like image captioning as they can't capture caption semantics, while predictability-based metrics like LIC have limitations in identifying bias direction, poor dataset bias estimation, and high sensitivity to attacker models.
Method: Developed Directional Predictability Amplification in Captioning (DPAC) which measures directional bias amplification using an improved vocabulary substitution strategy and reduced sensitivity to attacker models.
Result: Experiments on COCO captioning dataset demonstrate that DPAC is the most reliable metric for measuring bias amplification in captions compared to existing approaches.
Conclusion: DPAC effectively addresses limitations of prior bias amplification metrics for image captioning by providing directional bias measurement, better dataset bias estimation, and improved robustness to model hyperparameters.
Abstract: When we train models on biased ML datasets, they not only learn these biases but can inflate them at test time - a phenomenon called bias amplification. To measure bias amplification in ML datasets, many co-occurrence-based metrics have been proposed. Co-occurrence-based metrics are effective in measuring bias amplification in simple problems like image classification. However, these metrics are ineffective for complex problems like image captioning as they cannot capture the semantics of a caption. To measure bias amplification in captions, prior work introduced a predictability-based metric called Leakage in Captioning (LIC). While LIC captures the semantics and context of captions, it has limitations. LIC cannot identify the direction in which bias is amplified, poorly estimates dataset bias due to a weak vocabulary substitution strategy, and is highly sensitive to attacker models (a hyperparameter in predictability-based metrics). To overcome these issues, we propose Directional Predictability Amplification in Captioning (DPAC). DPAC measures directional bias amplification in captions, provides a better estimate of dataset bias using an improved substitution strategy, and is less sensitive to attacker models. Our experiments on the COCO captioning dataset show how DPAC is the most reliable metric to measure bias amplification in captions.
[275] A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis
Asifullah Khan, Laiba Asmatullah, Anza Malik, Shahzaib Khan, Hamna Asif
Main category: cs.CV
TL;DR: This paper provides a comprehensive survey of contrastive learning in text-image models, covering terminology, recent developments, categorization by model structures, technical advances, and state-of-the-art applications.
Details
Motivation: To systematically review and organize the growing field of self-supervised contrastive learning for text-image models, which has shown significant improvements in image understanding and text-image analysis without heavy reliance on labeled data.
Method: The authors conduct a comprehensive survey by: 1) providing an overview of contrastive learning approaches in text-image models, 2) categorizing approaches based on different model structures, 3) discussing technical advances including pretext tasks, architectural structures, and key trends, and 4) reviewing state-of-the-art applications.
Result: The paper presents a systematic organization and analysis of contrastive learning methodologies in text-image models, highlighting recent developments and categorizing various approaches for better understanding of the field.
Conclusion: Contrastive learning has emerged as a powerful self-supervised approach for text-image models, enabling significant advances in multimodal understanding without extensive manual labeling, with ongoing developments in model structures and applications.
Abstract: Self-supervised learning is a machine learning approach that generates implicit labels by learning underlying patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of “positive” and “negative” samples, where positive pairs (e.g., variations of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image-text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments, and applications of contrastive learning with respect to text-image models. Specifically, we first provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances in the techniques used in the process, such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss recent state-of-the-art applications of self-supervised contrastive learning for text-image models.
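The contrastive objective most text-image models in this survey build on is the InfoNCE/CLIP-style loss: matched image-text pairs are pulled together while all other in-batch pairs act as negatives. A minimal PyTorch version:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings where row i of each is a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(len(img_emb))            # positives on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```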
[276] CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection
Zhichao Sun, Huazhang Hu, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu
Main category: cs.CV
TL;DR: CQ-DINO is a category query-based object detection framework that reformulates classification as contrastive learning between object queries and category queries, addressing gradient dilution issues in large-vocabulary detection through image-guided query selection and hierarchical category modeling.
Details
Motivation: Traditional classification-based detectors struggle with vast vocabulary detection due to positive gradient dilution (insufficient learning signals for rare categories) and hard negative gradient dilution (discriminative gradients overwhelmed by easy negatives).
Method: Proposes the CQ-DINO framework, which reformulates classification as a contrastive task between object queries and learnable category queries. Uses image-guided query selection to reduce the negative space by adaptively retrieving the top-K relevant categories per image via cross-attention. Integrates explicit hierarchical category relationships or learns implicit correlations via self-attention.
Result: Achieves superior performance on V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness in COCO. Provides scalable solution for wide category coverage detection systems.
Conclusion: CQ-DINO effectively addresses gradient dilution problems in large-vocabulary object detection through contrastive query-based approach and adaptive category selection, demonstrating strong performance on challenging benchmarks.
Abstract: With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness on COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The code is publicly available at https://github.com/FireRedTeam/CQ-DINO.
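A schematic of the image-guided query selection step: score each learnable category query against pooled image features and keep only the top-K categories for the contrastive classification. The mean pooling and dot-product scoring below are simplified stand-ins for the paper's cross-attention.

```python
import torch

def select_category_queries(image_feats, category_queries, k=50):
    """image_feats: (B, N, D) patch features; category_queries: (C, D)."""
    pooled = image_feats.mean(dim=1)            # (B, D) image summary
    scores = pooled @ category_queries.t()      # (B, C) relevance scores
    topk = scores.topk(k, dim=-1).indices       # (B, k) retained category ids
    return category_queries[topk], topk         # (B, k, D), (B, k)
```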
[277] ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition
Sanjoy Kundu, Shanmukha Vellamcheti, Sathyanarayanan N. Aakur
Main category: cs.CV
TL;DR: ProbRes is a probabilistic residual search framework using jump-diffusion to efficiently recognize unseen activities in open-world egocentric scenarios by balancing exploration and exploitation.
Details
Motivation: Open-world egocentric activity recognition is challenging due to unconstrained environments and the need to infer unseen activities from vast search spaces.
Method: Integrates commonsense priors for semantic coherence, uses Vision-Language Models for adaptive prediction refinement, and employs stochastic search to find high-likelihood labels efficiently.
Result: Achieves state-of-the-art performance on GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego datasets across multiple openness levels (L0-L3).
Conclusion: Structured search strategies are crucial for scalable and efficient open-world activity recognition, with ProbRes demonstrating effective navigation of complex search spaces.
Abstract: Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to locate high-likelihood activity labels while avoiding exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0-L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.
[278] Differentially Private 2D Human Pose Estimation
Kaushik Bhargav Sivangi, Paul Henderson, Fani Deligianni
Main category: cs.CV
TL;DR: This paper presents a comprehensive framework for differentially private 2D human pose estimation that combines Projected DP-SGD and Feature Differential Privacy to balance privacy and accuracy.
Details
Motivation: Human pose estimation is essential for many applications but raises privacy concerns when processing sensitive visual data. Traditional anonymization techniques offer limited protection and compromise data utility, while naive Differential Privacy degrades model performance.
Method: The framework uses Projected DP-SGD (PDP-SGD) to project noisy gradients to a low-dimensional subspace, incorporates Feature Differential Privacy (FDP) to selectively privatize sensitive features while retaining public visual cues, and proposes a hybrid feature-projective DP framework combining both approaches.
Result: The combined feature-projective method outperforms vanilla DP-SGD and individual baselines, achieving up to 82.61% mean PCKh@0.5 at ε=0.8, substantially closing the gap to non-private performance on the MPII dataset.
Conclusion: This work lays the foundation for privacy-preserving human pose estimation in real-world, sensitive applications by effectively balancing privacy guarantees with model performance.
Abstract: Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first comprehensive framework for differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Next, we incorporate Feature Differential Privacy (FDP) to selectively privatize only sensitive features while retaining public visual cues. Finally, we propose a hybrid feature-projective DP framework that combines both approaches to balance privacy and accuracy for HPE. We evaluate our approach on the MPII dataset across varying privacy budgets, training strategies, and clipping norms. Our combined feature-projective method consistently outperforms vanilla DP-SGD and individual baselines, achieving up to 82.61% mean PCKh@0.5 at $\epsilon = 0.8$, substantially closing the gap to the non-private performance. This work lays the foundation for privacy-preserving human pose estimation in real-world, sensitive applications.
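For context, the DP-SGD step the framework builds on clips each per-sample gradient to norm C, sums them, and adds Gaussian noise scaled by C and the noise multiplier sigma. The sketch below omits the subspace projection of PDP-SGD and the feature-level privatization of FDP.

```python
import torch

def dp_sgd_grad(per_sample_grads, clip_norm=1.0, sigma=1.0):
    """per_sample_grads: (B, D) flattened gradients, one row per example."""
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * (clip_norm / norms).clamp(max=1.0)  # norm <= C
    noise = sigma * clip_norm * torch.randn(per_sample_grads.shape[1])
    return (clipped.sum(dim=0) + noise) / len(per_sample_grads)
```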
[279] Multimodal Language Models See Better When They Look Shallower
Haoran Chen, Junyan Lin, Xinghao Chen, Yue Fan, Jianfeng Dong, Xin Jin, Hui Su, Jinlan Fu, Xiaoyu Shen
Main category: cs.CV
TL;DR: This paper presents a comprehensive study showing that shallow and middle layers of Vision Transformers often outperform deep layers for fine-grained visual tasks in multimodal LLMs, and proposes a lightweight feature fusion method that achieves consistent improvements.
Details
Motivation: Current MLLMs typically use only the final layers of Vision Transformers for visual features, but this deep-layer bias lacks principled analysis despite evidence that different ViT layers capture different types of information.
Method: The authors conducted a comprehensive study analyzing representation similarity across ViT layers, established shallow/middle/deep layer groupings, and proposed a lightweight feature fusion method that strategically incorporates shallower layers.
Result: Extensive evaluation across 10 benchmarks (60+ tasks) showed that while deep layers excel at semantic-rich tasks like OCR, shallow and middle layers significantly outperform them on fine-grained visual tasks including counting, positioning, and object localization.
Conclusion: MLLMs can often see better when they look shallower, and the proposed feature fusion method achieves consistent improvements over single-layer and specialized fusion baselines, offering the first principled study of visual layer selection in MLLMs.
Abstract: Multimodal large language models (MLLMs) typically extract visual features from the final layers of a pretrained Vision Transformer (ViT). This widespread deep-layer bias, however, is largely driven by empirical convention rather than principled analysis. While prior studies suggest that different ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, the impact of this variation on MLLM performance remains underexplored. We present the first comprehensive study of visual layer selection for MLLMs, analyzing representation similarity across ViT layers to establish shallow, middle, and deep layer groupings. Through extensive evaluation of MLLMs (1.4B-7B parameters) across 10 benchmarks encompassing 60+ tasks, we find that while deep layers excel in semantic-rich tasks like OCR, shallow and middle layers significantly outperform them on fine-grained visual tasks including counting, positioning, and object localization. Building on these insights, we propose a lightweight feature fusion method that strategically incorporates shallower layers, achieving consistent improvements over both single-layer and specialized fusion baselines. Our work offers the first principled study of visual layer selection in MLLMs, showing that MLLMs can often see better when they look shallower.
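One lightweight way to realize the kind of fusion the paper proposes is a learned softmax weighting over selected shallow/middle/deep hidden states followed by a projection; the layer indices and exact fusion form below are assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Learned softmax weighting over selected ViT layers, then a projection."""
    def __init__(self, dim, layer_ids=(4, 12, 23)):
        super().__init__()
        self.layer_ids = layer_ids
        self.weights = nn.Parameter(torch.zeros(len(layer_ids)))
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states):
        # hidden_states: list over ViT layers of (B, N, D) tensors
        w = self.weights.softmax(dim=0)
        fused = sum(w[i] * hidden_states[lid]
                    for i, lid in enumerate(self.layer_ids))
        return self.proj(fused)
```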
[280] Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
Hao Xu, Arbind Agrahari Baniya, Sam Well, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Main category: cs.CV
TL;DR: This survey paper clarifies distinctions between Temporal Action Localization (TAL), Action Spotting (AS), and Precise Event Spotting (PES) in sports video analysis, provides a taxonomy of state-of-the-art methods, and critically evaluates datasets and evaluation protocols.
Details
Motivation: To address confusion between closely related video event detection tasks (TAL, AS, PES) in sports analytics, fill gaps in existing surveys that overlook temporal granularity and domain-specific challenges, and bridge the focus gap between elite-level competitions and everyday practitioners.
Method: The survey delineates TAL, AS, and PES distinctions, introduces a structured taxonomy of approaches including temporal modeling strategies, multimodal frameworks, and data-efficient pipelines, and critically assesses benchmark datasets and evaluation protocols.
Result: Provides a comprehensive foundation for developing temporally precise, generalizable sports event detection systems by synthesizing current research and exposing limitations in existing approaches, datasets, and metrics.
Conclusion: This work establishes clear boundaries between related video event detection tasks and offers a structured framework to advance research and practical deployment of sports event detection systems for both research and industry communities.
Abstract: Video event detection has become a cornerstone of modern sports analytics, powering automated performance evaluation, content generation, and tactical decision-making. Recent advances in deep learning have driven progress in related tasks such as Temporal Action Localization (TAL), which detects extended action segments; Action Spotting (AS), which identifies a representative timestamp; and Precise Event Spotting (PES), which pinpoints the exact frame of an event. Although closely connected, their subtle differences often blur the boundaries between them, leading to confusion in both research and practical applications. Furthermore, prior surveys either address generic video event detection or broader sports video tasks, but largely overlook the unique temporal granularity and domain-specific challenges of event spotting. In addition, most existing sports video surveys focus on elite-level competitions while neglecting the wider community of everyday practitioners. This survey addresses these gaps by: (i) clearly delineating TAL, AS, and PES and their respective use cases; (ii) introducing a structured taxonomy of state-of-the-art approaches, including temporal modeling strategies, multimodal frameworks, and data-efficient pipelines tailored for AS and PES; and (iii) critically assessing benchmark datasets and evaluation protocols, highlighting limitations such as reliance on broadcast-quality footage and metrics that over-reward permissive multi-label predictions. By synthesizing current research and exposing open challenges, this work provides a comprehensive foundation for developing temporally precise, generalizable, and practically deployable sports event detection systems for both the research and industry communities.
[281] The Role of Video Generation in Enhancing Data-Limited Action Understanding
Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou
Main category: cs.CV
TL;DR: Proposes using text-to-video diffusion transformers to generate annotated training data for video action understanding, addressing data scarcity with information enhancement and uncertainty-based label smoothing strategies.
Details
Motivation: Video action understanding tasks in real-world scenarios suffer from data limitations, creating a need for methods that bridge data scarcity without human intervention.
Method: Uses a text-to-video diffusion transformer to generate annotated data, with an information enhancement strategy to enrich the content of generated samples and uncertainty-based label smoothing to reduce the negative impact of low-quality samples.
Result: Achieves state-of-the-art performance for zero-shot action recognition on four datasets across five tasks.
Conclusion: The proposed method effectively addresses data scarcity in video action understanding through generated data and specialized training strategies, demonstrating strong performance in zero-shot scenarios.
Abstract: Video action understanding tasks in real-world scenarios always suffer from data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We propose an information enhancement strategy and an uncertainty-based label smoothing strategy tailored to training with generated samples. Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy is proposed to enhance the informative content of the generated samples from two aspects: the environments and the characters. Furthermore, we observed that some low-quality generated samples might negatively affect model training. To address this, we devised the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
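A plausible realization of uncertainty-based label smoothing for generated clips: smooth a sample's one-hot target more aggressively the higher its estimated uncertainty, so low-quality synthetic samples exert a weaker pull on training. The mapping from uncertainty to smoothing strength is an assumption, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def uncertain_smooth_labels(labels, num_classes, uncertainty, max_eps=0.3):
    """labels: (B,) class ids; uncertainty: (B,) floats in [0, 1]."""
    eps = max_eps * uncertainty.clamp(0, 1)            # per-sample smoothing
    one_hot = F.one_hot(labels, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1 - eps)[:, None] * one_hot + eps[:, None] * uniform
```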
[282] HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
Main category: cs.CV
TL;DR: HoliTom is a training-free holistic token merging framework that combines outer-LLM pruning (global temporal segmentation and spatial-temporal merging) with inner-LLM token similarity-based merging to reduce video LLM computational costs by over 90% while maintaining near-original performance.
Details
Motivation: Video LLMs face computational inefficiency due to redundant video tokens. Existing methods either have intrinsic overhead (inner-LLM pruning) or fail to capture global temporal dynamics (outer-LLM pruning), and the synergistic potential of combining these strategies remains unexplored.
Method: HoliTom employs: 1) Outer-LLM pruning through global redundancy-aware temporal segmentation followed by spatial-temporal merging, and 2) Inner-LLM token similarity-based merging designed for compatibility with outer-LLM pruning.
Result: Reduces computational costs to 6.9% of FLOPs while maintaining 99.1% of original performance on LLaVA-OneVision-7B. Achieves 2.28x reduction in Time-To-First-Token and 1.32x acceleration in decoding throughput.
Conclusion: The integrated pruning approach provides practical benefits for efficient video LLM inference by effectively leveraging both outer-LLM and inner-LLM token reduction strategies in a synergistic manner.
Abstract: Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not leverage video compressibility fully. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM’s computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method’s promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLMs inference.
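The core reduction primitive behind similarity-based merging can be sketched as greedily averaging the most cosine-similar pair of visual tokens until a target count is reached. HoliTom's temporal segmentation and structured spatio-temporal merging are not captured here; this is only the basic mechanism.

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens, target_len):
    """tokens: (N, D) visual tokens -> (target_len, D)."""
    tokens = tokens.clone()
    while tokens.shape[0] > target_len:
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.t()                     # pairwise cosine similarity
        sim.fill_diagonal_(-1.0)                      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2          # average the closest pair
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged[None]], dim=0)
    return tokens
```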
[283] Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
Yang Yang, Siming Zheng, Qirui Yang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang
Main category: cs.CV
TL;DR: A one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering using multi-plane image representation and progressive training.
Details
Motivation: Existing image-based bokeh methods suffer from temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over focus plane and bokeh intensity, limiting controllable video bokeh applications.
Method: Uses a one-step diffusion framework with multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, enabling exploitation of 3D priors from pretrained backbones. Introduces progressive training strategy to enhance temporal stability, depth robustness, and detail preservation.
Result: Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines.
Conclusion: This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.
Abstract: Diffusion models have recently emerged as powerful tools for camera simulation, enabling both geometric transformations and realistic optical effects. Among these, image-based bokeh rendering has shown promising results, but diffusion for video bokeh remains unexplored. Existing image-based methods are plagued by temporal flickering and inconsistent blur transitions, while current video editing methods lack explicit control over the focus plane and bokeh intensity. These issues limit their applicability for controllable video bokeh. In this work, we propose a one-step diffusion framework for generating temporally coherent, depth-aware video bokeh rendering. The framework employs a multi-plane image (MPI) representation adapted to the focal plane to condition the video diffusion model, thereby enabling it to exploit strong 3D priors from pretrained backbones. To further enhance temporal stability, depth robustness, and detail preservation, we introduce a progressive training strategy. Experiments on synthetic and real-world benchmarks demonstrate superior temporal coherence, spatial accuracy, and controllability, outperforming prior baselines. This work represents the first dedicated diffusion framework for video bokeh generation, establishing a new baseline for temporally coherent and controllable depth-of-field effects.
[284] SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
Yu Sheng, Jiajun Deng, Xinran Zhang, Yu Zhang, Bei Hua, Yanyong Zhang, Jianmin Ji
Main category: cs.CV
TL;DR: SpatialSplat is a feedforward framework for semantic 3D reconstruction that uses dual-field semantic representation and selective Gaussian mechanism to reduce redundancy while maintaining semantic accuracy.
Details
Motivation: Existing methods for semantic 3D reconstruction suffer from compromised semantic expressiveness due to feature compression and memory overhead from redundant pixel-wise primitive predictions in overlapping areas.
Method: Proposes a dual-field semantic representation (coarse feature field for uncompressed semantics with minimal primitives, and fine-grained low-dimensional feature field for inter-instance relationships) and a selective Gaussian mechanism to eliminate redundant primitives.
Result: Achieves a 60% reduction in scene representation parameters while delivering superior performance over state-of-the-art methods.
Conclusion: SpatialSplat learns accurate semantic information with more compact 3D Gaussians, making semantic 3D reconstruction more applicable through reduced memory requirements.
Abstract: A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model’s ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce \textbf{SpatialSplat}, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code is available at https://github.com/shengyuuu/SpatialSplat.git
[285] DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
Main category: cs.CV
TL;DR: DenseDPO improves Direct Preference Optimization for video generation by creating aligned video pairs from corrupted ground truth videos, enabling segment-level preference labeling to reduce motion bias and improve training efficiency.
Details
Motivation: Standard DPO for video generation has limitations: it uses independently generated videos which bias annotators toward low-motion clips (fewer artifacts) and prohibits fine-grained comparisons.
Method: Three key contributions: 1) Create video pairs by denoising corrupted copies of ground truth videos for aligned motion structures, 2) Use temporal alignment to label preferences on short segments rather than entire clips, 3) Enable automatic preference annotation using Vision Language Models.
Result: DenseDPO achieves superior motion generation with only one-third of labeled data compared to vanilla DPO, while matching text alignment, visual quality, and temporal consistency. Automatic VLM-based annotation performs close to human labels.
Conclusion: DenseDPO effectively addresses motion bias in video DPO, enables more efficient training with denser learning signals, and unlocks automatic preference annotation that approaches human-level performance.
Abstract: Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
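To ground the "denser learning signal" claim, here is a minimal sketch of a segment-level DPO loss in the spirit of DenseDPO: the standard DPO objective applied per temporally aligned segment instead of per clip. Tensor shapes and the averaging over segments are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each tensor is (B, S): per-segment log-probs of the preferred (w) and
    dispreferred (l) video under the policy and the frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)  # (B, S) implicit reward gap
    return -F.logsigmoid(beta * margin).mean()              # average over segments and batch

loss = segment_dpo_loss(torch.randn(4, 6), torch.randn(4, 6),
                        torch.randn(4, 6), torch.randn(4, 6))
```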
[286] Contour Errors: An Ego-Centric Metric for Reliable 3D Multi-Object Tracking
Sharang Kaul, Mario Berk, Thiemo Gerbich, Abhinav Valada
Main category: cs.CV
TL;DR: Contour Errors (CEs) are a new ego- or object-centric metric that improves match reliability in 3D multi-object tracking, reducing functional failures by 80% at close ranges and 60% at far ranges compared to traditional IoU metrics.
Details
Motivation: Traditional metrics like IoU and Center Point Distances fail to find critical matches in complex 3D scenes, which is problematic for safety-critical applications like autonomous vehicles where reliable perception is essential.
Method: Proposed Contour Errors (CEs) - an ego or object-centric metric that compares bounding boxes in the ego vehicle’s frame to provide functionally relevant assessment of object matches in tracking scenarios.
Result: Extensive experiments on nuScenes dataset show CEs improve match reliability over state-of-the-art 2D IoU and CPD metrics. In 3D car tracking, CEs reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU.
Conclusion: Contour Errors provide a more functionally relevant assessment for object matching in 3D tracking scenarios, significantly improving reliability over traditional 2D metrics in safety-critical applications.
Abstract: Finding reliable matches is essential in multi-object tracking to ensure the accuracy and reliability of perception systems in safety-critical applications such as autonomous vehicles. Effective matching mitigates perception errors, enhancing object identification and tracking for improved performance and safety. However, traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs), which are effective in 2D image planes, often fail to find critical matches in complex 3D scenes. To address this limitation, we introduce Contour Errors (CEs), an ego or object-centric metric for identifying matches of interest in tracking scenarios from a functional perspective. By comparing bounding boxes in the ego vehicle’s frame, contour errors provide a more functionally relevant assessment of object matches. Extensive experiments on the nuScenes dataset demonstrate that contour errors improve the reliability of matches over the state-of-the-art 2D IoU and CPD metrics in tracking-by-detection methods. In 3D car tracking, our results show that Contour Errors reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU in the evaluation stage.
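The paper's exact formulation is not given in the summary, so below is a heavily hedged sketch of one plausible contour discrepancy between two bird's-eye-view boxes expressed in the ego frame: sample each box outline and average symmetric nearest-point distances. All names, the sampling density, and the aggregation are assumptions.
```python
import numpy as np

def box_contour(cx, cy, w, l, yaw, n=100):
    """Sample n points on the outline of an ego-frame 2D box (BEV)."""
    t = np.linspace(0.0, 4.0, n, endpoint=False)
    side, u = t.astype(int) % 4, t % 1.0
    x = np.select([side == 0, side == 1, side == 2, side == 3],
                  [u - 0.5, np.full_like(u, 0.5), 0.5 - u, np.full_like(u, -0.5)]) * l
    y = np.select([side == 0, side == 1, side == 2, side == 3],
                  [np.full_like(u, -0.5), u - 0.5, np.full_like(u, 0.5), 0.5 - u]) * w
    c, s = np.cos(yaw), np.sin(yaw)
    return np.stack([cx + c * x - s * y, cy + s * x + c * y], axis=1)

def contour_error(box_a, box_b):
    pa, pb = box_contour(*box_a), box_contour(*box_b)
    d = np.linalg.norm(pa[:, None] - pb[None], axis=-1)   # pairwise point distances
    return 0.5 * (d.min(1).mean() + d.min(0).mean())      # symmetric mean distance

print(contour_error((10, 2, 2.0, 4.5, 0.0), (10.4, 2.1, 2.0, 4.5, 0.05)))
```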
[287] Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model
Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Main category: cs.CV
TL;DR: AliTok is a novel aligned tokenizer that resolves the misalignment between bidirectional image dependencies and unidirectional autoregressive models by using a bidirectional encoder with causal decoder, enabling high-fidelity image generation with faster sampling than diffusion methods.
Details
Motivation: To address the fundamental misalignment between bidirectional dependencies in conventional image tokenizations and the unidirectional nature of autoregressive models for image generation.
Method: AliTok employs a bidirectional encoder constrained by a causal decoder, uses prefix tokens, and a two-stage tokenizer training process to create token sequences with semantic richness and forward-dependency.
Result: A 177M parameter model achieves gFID of 1.44 and IS of 319.5 on ImageNet-256. Scaling to 662M parameters reaches gFID of 1.28, surpassing state-of-the-art diffusion methods with 10x faster sampling speed.
Conclusion: AliTok successfully bridges the gap between image tokenization and autoregressive modeling, enabling efficient and high-quality image generation that outperforms diffusion methods in both quality and speed.
Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on the ImageNet-256 benchmark. Scaling up to 662M parameters, our model reaches a gFID of 1.28, surpassing the state-of-the-art diffusion method while achieving a 10x faster sampling speed. The code and weights are available at https://github.com/ali-vilab/alitok.
[288] AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving
Lianming Huang, Haibo Hu, Yufei Cui, Jiacheng Zuo, Shangyu Wu, Nan Guan, Chun Jason Xue
Main category: cs.CV
TL;DR: AD-EE is an Early Exit framework for Vision-Language Models in autonomous driving that reduces latency by up to 57.58% and improves object detection accuracy by up to 44% through optimal layer exit strategies.
Details
Motivation: Real-time application of VLMs in autonomous driving is hindered by high latency and computational overhead, especially when models continue unnecessary processing after confident predictions.
Method: Proposed AD-EE framework incorporates autonomous driving domain characteristics and uses causal inference to identify optimal exit layers for early termination of inference.
Result: Extensive experiments on Waymo, CODA datasets and real Autoware vehicles show significant latency reduction (up to 57.58%) and accuracy improvements (up to 44%) across multiple VLMs.
Conclusion: The AD-EE framework effectively addresses VLM inefficiency in time-critical driving scenarios by enabling early exit strategies, making VLMs more practical for real-time autonomous driving applications.
Abstract: With the rapid advancement of autonomous driving, deploying Vision-Language Models (VLMs) to enhance perception and decision-making has become increasingly common. However, the real-time application of VLMs is hindered by high latency and computational overhead, limiting their effectiveness in time-critical driving scenarios. This challenge is particularly evident when VLMs exhibit over-inference, continuing to process unnecessary layers even after confident predictions have been reached. To address this inefficiency, we propose AD-EE, an Early Exit framework that incorporates domain characteristics of autonomous driving and leverages causal inference to identify optimal exit layers. We evaluate our method on large-scale real-world autonomous driving datasets, including Waymo and the corner-case-focused CODA, as well as on a real vehicle running the Autoware Universe platform. Extensive experiments across multiple VLMs show that our method significantly reduces latency, with maximum improvements reaching up to 57.58%, and enhances object detection accuracy, with maximum gains of up to 44%.
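For readers unfamiliar with early exiting, here is a minimal sketch of the confidence-based variant: run transformer layers until an exit head is confident, then stop. The layer stack, shared exit head, and threshold are illustrative assumptions; AD-EE additionally uses causal inference to decide which layers may exit.
```python
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    def __init__(self, d=256, n_layers=12, n_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True) for _ in range(n_layers))
        self.exit_head = nn.Linear(d, n_classes)  # one shared head for simplicity
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            probs = self.exit_head(x.mean(dim=1)).softmax(-1)
            if probs.max() > self.threshold:      # confident: skip remaining layers
                return probs, i + 1               # prediction and layers actually used
        return probs, len(self.layers)

probs, used = EarlyExitStack()(torch.randn(1, 16, 256))
print(f"exited after {used} layers")
```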
[289] RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models
Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon
Main category: cs.CV
TL;DR: Proposes RL-based post-training framework for MLLMs to improve personalized image captioning, especially for multi-concept scenarios where SFT methods struggle.
Details
Motivation: Existing MLLMs and SFT-based personalization methods fail to generate faithful personalized captions in complex real-world scenarios like multi-concept image captioning, and acquiring large-scale high-quality caption data is costly.
Method: Reinforcement learning-based post-training framework for MLLMs, first RL approach for personalized image captioning, addressing data-centric limitations of SFT.
Result: Significantly enhances both visual recognition and personalized generation capabilities, consistently outperforms SFT-based baselines, especially in challenging multi-concept image captioning.
Conclusion: RL-based post-training is an effective alternative to SFT for improving MLLM personalization, particularly for complex captioning tasks where data acquisition is difficult.
Abstract: Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC
[290] Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease
Tahereh Zarrat Ehsan, Michael Tangermann, Yağmur Güçlütürk, Bastiaan R. Bloem, Luc J. W. Evers
Main category: cs.CV
TL;DR: A computer vision method for quantifying Parkinson’s disease motor characteristics from finger-tapping videos, achieving better MDS-UPDRS score prediction than existing approaches while providing interpretable feature analysis.
Details
Motivation: Current clinical finger-tapping assessments are subjective, prone to variability, and lack granular insights into individual motor characteristics in Parkinson's disease.
Method: Proposed four sets of clinically relevant features from video recordings to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. Used principal component analysis and trained machine learning classifiers for MDS-UPDRS score prediction.
Result: Video-based features corresponded to the four motor deficits and revealed granular distinctions within sequence effect and hesitation-halts. Achieved higher accuracy in MDS-UPDRS score prediction compared to state-of-the-art methods.
Conclusion: The framework provides an objective, interpretable solution for PD motor assessment applicable in clinical and remote settings, though future work is needed to assess responsiveness to treatment and disease progression.
Abstract: Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient’s tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features corresponded to the four deficits. Additionally, video-based analysis has allowed us to identify further granular distinctions within sequence effect and hesitation-halts deficits. In the following, we have used these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves a higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics, that can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.
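To illustrate what "granular" features of this kind look like, here is a hedged sketch over a keypoint-derived aperture signal: per-tap amplitude, tap rate, irregularity, and an amplitude-decrement slope as a simple proxy for the sequence effect. The feature definitions are illustrative assumptions, not the paper's exact ones.
```python
import numpy as np
from scipy.signal import find_peaks

def tapping_features(aperture: np.ndarray, fps: float = 30.0) -> dict:
    """aperture: thumb-to-index distance per video frame, already normalized."""
    peaks, _ = find_peaks(aperture, distance=int(0.15 * fps))  # one peak per tap
    amps = aperture[peaks]
    decrement = np.polyfit(np.arange(len(amps)), amps, 1)[0] if len(amps) > 1 else 0.0
    return {"mean_amplitude": float(amps.mean()),
            "tap_rate_hz": float(len(peaks) / (len(aperture) / fps)),
            "amplitude_cv": float(amps.std() / (amps.mean() + 1e-8)),  # irregularity
            "sequence_effect_slope": float(decrement)}                 # negative = decrement

t = np.linspace(0, 10, 300)
signal = (1.0 - 0.03 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 2 * t))  # decaying 2 Hz taps
print(tapping_features(signal))
```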
[291] Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
Main category: cs.CV
TL;DR: Mem4Nav is a hierarchical spatial-cognition memory system that enhances Vision-and-Language Navigation (VLN) by combining sparse octree for fine-grained voxel indexing with semantic topology graph for landmark connectivity, using trainable memory tokens with reversible Transformer for efficient long-short term memory management.
Details
Motivation: Current VLN approaches face limitations: modular pipelines lack unified memory, while end-to-end MLLM agents are constrained by fixed context windows and implicit spatial reasoning, especially in large-scale urban environments requiring long-term experience recall.
Method: Mem4Nav uses a hierarchical spatial-cognition system with sparse octree for voxel indexing and semantic topology graph for landmark connectivity. It employs long-term memory (LTM) to compress historical observations and short-term memory (STM) for recent multimodal entries. The system uses trainable memory tokens embedded via reversible Transformer, with STM retrieval for context pruning and LTM decoding for historical reconstruction.
Result: Mem4Nav achieves 7-13 percentage point gains in Task Completion, sufficient SPD reduction, and over 10 percentage point nDTW improvement across three different VLN backbones on Touchdown and Map2Seq datasets.
Conclusion: The hierarchical map and dual memory modules are essential for effective VLN in large-scale environments, with Mem4Nav demonstrating significant performance improvements across multiple backbones and datasets.
Abstract: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce \textbf{Mem4Nav}, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
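A sparse octree keyed by interleaved (Morton) coordinates is a standard way to realize the fine-grained voxel indexing Mem4Nav's long-term memory attaches tokens to; here is a minimal sketch under that assumption, with the quantization, dict-backed sparsity, and all names as illustrative choices.
```python
def morton_key(x: int, y: int, z: int, bits: int = 21) -> int:
    """Interleave the bits of quantized voxel coordinates into one octree key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

class SparseVoxelMemory:
    def __init__(self, voxel_size: float = 0.5):
        self.voxel_size = voxel_size
        self.cells: dict[int, list] = {}          # only observed voxels are stored

    def write(self, xyz, token):
        q = [int(c / self.voxel_size) for c in xyz]
        self.cells.setdefault(morton_key(*q), []).append(token)

    def read(self, xyz):
        q = [int(c / self.voxel_size) for c in xyz]
        return self.cells.get(morton_key(*q), [])

mem = SparseVoxelMemory()
mem.write((12.3, 0.8, 4.1), "memory-token-0")
print(mem.read((12.4, 0.9, 4.0)))  # same voxel -> ['memory-token-0']
```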
[292] Event-RGB Fusion for Spacecraft Pose Estimation Under Harsh Lighting
Mohsi Jawaid, Marcus Märtens, Tat-Jun Chin
Main category: cs.CV
TL;DR: This paper introduces a sensor fusion approach combining RGB and event cameras for spacecraft pose estimation, addressing limitations of individual sensors under harsh lighting conditions.
Details
Motivation: Vision-based pose estimation using RGB sensors faces challenges from harsh lighting conditions (glare, over-exposure, etc.), while event sensors have limited resolution and noise issues during low motion. Combining both sensors can leverage their complementary strengths.
Method: Used a beam-splitter prism for precise optical/temporal alignment, developed a RANSAC-based fusion technique to combine RGB and event data, and implemented dropout uncertainty estimation to detect extreme conditions affecting either sensor channel.
Result: Collected a comprehensive real dataset under challenging illumination conditions and demonstrated encouraging performance of the event-RGB fusion approach for spacecraft pose estimation.
Conclusion: The proposed sensor fusion method effectively leverages complementary strengths of RGB and event sensors, supporting the use of event sensors for spacecraft pose estimation. The dataset has been released publicly to support community research.
Abstract: Spacecraft pose estimation is crucial for autonomous in-space operations, such as rendezvous, docking and on-orbit servicing. Vision-based pose estimation methods, which typically employ RGB imaging sensors, is a compelling solution for spacecraft pose estimation, but are challenged by harsh lighting conditions, which produce imaging artifacts such as glare, over-exposure, blooming and lens flare. Due to their much higher dynamic range, neuromorphic or event sensors are more resilient to extreme lighting conditions. However, event sensors generally have lower spatial resolution and suffer from reduced signal-to-noise ratio during periods of low relative motion. This work addresses these individual sensor limitations by introducing a sensor fusion approach combining RGB and event sensors. A beam-splitter prism was employed to achieve precise optical and temporal alignment. Then, a RANSAC-based technique was developed to fuse the information from the RGB and event channels to achieve pose estimation that leveraged the strengths of the two modalities. The pipeline was complemented by dropout uncertainty estimation to detect extreme conditions that affect either channel. To benchmark the performance of the proposed event-RGB fusion method, we collected a comprehensive real dataset of RGB and event data for satellite pose estimation in a laboratory setting under a variety of challenging illumination conditions. Encouraging results on the dataset demonstrate the efficacy of our event-RGB fusion approach and further supports the usage of event sensors for spacecraft pose estimation. To support community research on this topic, our dataset has been released publicly.
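As a hedged illustration of the fusion idea only: pool candidate correspondences from the RGB and event channels and let RANSAC keep the mutually consistent subset. The model below is a plain least-squares fit standing in for the paper's pose estimation; all names, the minimal sample size, and thresholds are assumptions.
```python
import numpy as np

def ransac(points_a, points_b, n_iters=200, thresh=0.05, rng=None):
    """Robustly fit b ~ a @ w; points_a: (N, 3), points_b: (N, 2)."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points_a), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(points_a), size=3, replace=False)     # minimal sample
        w, *_ = np.linalg.lstsq(points_a[idx], points_b[idx], rcond=None)
        residuals = np.linalg.norm(points_a @ w - points_b, axis=-1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    w, *_ = np.linalg.lstsq(points_a[best_inliers], points_b[best_inliers], rcond=None)
    return w, best_inliers

a = np.random.rand(100, 3); a[:, 2] = 1.0          # 2D points in homogeneous form
b = a @ np.array([[1.0, 0.1], [-0.1, 1.0], [0.2, 0.3]])
b[::10] += 5.0                                     # 10% gross outliers
w, inliers = ransac(a, b)                          # inliers excludes the corrupted rows
```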
[293] Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch, Hannah Wittig, Dino Milanovic, Dominik Mohr, Mathias Kaspar, Sogand Nemat, Dorothea Kerner, Arno Bücker, Andreas Keller, Sascha Meyer, Michael Zemlin, Philipp Flotho
Main category: cs.CV
TL;DR: Deep learning model using day-1 chest X-rays predicts moderate/severe BPD in extremely preterm infants with AUROC 0.78, outperforming traditional IRDS grades.
Details
Motivation: BPD affects 35% of extremely low birth weight infants and causes lifelong complications. Early prediction is crucial to avoid unnecessary toxic interventions in low-risk infants. Routine admission radiographs could serve as non-invasive prognostic tools.
Method: Fine-tuned ResNet-50 pretrained on adult chest radiographs using progressive layer freezing with discriminative learning rates. Evaluated CutMix augmentation and linear probing on chest X-rays from 163 extremely low-birth-weight infants obtained within 24 hours of birth.
Result: Best model achieved AUROC of 0.78 ± 0.10, balanced accuracy of 0.69 ± 0.10, and F1-score of 0.67 ± 0.11 for moderate/severe BPD prediction. Domain-specific pretraining significantly outperformed ImageNet initialization (p = 0.031). Traditional IRDS grades showed limited prognostic value (AUROC 0.57 ± 0.11).
Conclusion: Domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. The method with progressive freezing and linear probing is computationally feasible for clinical implementation and future federated learning deployments.
Abstract: Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 $\pm$ 0.10, balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
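Here is a minimal sketch of progressive layer freezing with discriminative learning rates on a ResNet-50, in the spirit of the fine-tuning recipe described; the unfreeze schedule and the specific rates are illustrative assumptions.
```python
import torch
from torchvision.models import resnet50

model = resnet50(weights=None)              # in practice: chest-X-ray pretrained weights
stages = [model.layer1, model.layer2, model.layer3, model.layer4]

# Discriminative learning rates: deeper (more task-specific) stages move faster.
optimizer = torch.optim.AdamW([
    {"params": s.parameters(), "lr": 1e-5 * (3 ** i)} for i, s in enumerate(stages)
] + [{"params": model.fc.parameters(), "lr": 1e-3}])

def set_frozen_stages(epoch: int, unfreeze_every: int = 3):
    """Start with only the head trainable, then thaw stages from deep to shallow."""
    n_open = min(len(stages), epoch // unfreeze_every + 1)
    for i, stage in enumerate(stages):
        trainable = i >= len(stages) - n_open   # layer4 opens first, layer1 last
        for p in stage.parameters():
            p.requires_grad = trainable

for epoch in range(12):
    set_frozen_stages(epoch)
    # ... regular training loop over day-1 radiographs ...
```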
[294] TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation
Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, Xihui Liu
Main category: cs.CV
TL;DR: TTS-VAR is a test-time scaling framework for visual auto-regressive models that treats generation as path searching, using adaptive batch scheduling, clustering-based diversity search at coarse scales, and resampling-based potential selection at fine scales to improve generation quality efficiently.
Details
Motivation: To enable efficient scaling of visual generation models without expensive retraining by leveraging test-time optimization, addressing the computational cost challenges of traditional model scaling approaches.
Method: Models generation as path searching with adaptive descending batch size schedule. Uses clustering-based diversity search at coarse scales to preserve structural variety, and resampling-based potential selection at fine scales using multi-scale reward functions.
Result: Achieved 8.7% improvement in GenEval score (from 0.69 to 0.75) on the Infinity VAR model, demonstrating effective quality enhancement through test-time optimization.
Conclusion: Test-time scaling is viable for visual auto-regressive models, with early-stage structural features significantly influencing final quality and resampling effectiveness varying across generation scales.
Abstract: Scaling visual generation models is essential for real-world content creation, yet requires substantial training and computational expenses. Alternatively, test-time scaling has garnered growing attention due to resource efficiency and promising performance. In this work, we present TTS-VAR, the first general test-time scaling framework for visual auto-regressive (VAR) models, modeling the generation process as a path searching problem. To dynamically balance computational efficiency with exploration capacity, we first introduce an adaptive descending batch size schedule throughout the causal generation process. Besides, inspired by VAR’s hierarchical coarse-to-fine multi-scale generation, our framework integrates two key components: (i) At coarse scales, we observe that generated tokens are hard for evaluation, possibly leading to erroneous acceptance of inferior samples or rejection of superior samples. Noticing that the coarse scales contain sufficient structural information, we propose clustering-based diversity search. It preserves structural variety through semantic feature clustering, enabling later selection on samples with higher potential. (ii) In fine scales, resampling-based potential selection prioritizes promising candidates using potential scores, which are defined as reward functions incorporating multi-scale generation history. Experiments on the powerful VAR model Infinity show a notable 8.7% GenEval score improvement (from 0.69 to 0.75). Key insights reveal that early-stage structural features effectively influence final quality, and resampling efficacy varies across generation scales. Code is available at https://github.com/ali-vilab/TTS-VAR.
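The fine-scale step amounts to particle-style resampling: candidates are re-drawn in proportion to a reward-derived potential. Below is a minimal sketch of that step alone; the score-to-probability mapping and temperature are illustrative assumptions.
```python
import torch

def resample_candidates(candidates: torch.Tensor, potentials: torch.Tensor,
                        keep: int, temperature: float = 1.0) -> torch.Tensor:
    """candidates: (N, ...) partial generations; potentials: (N,) reward scores."""
    probs = torch.softmax(potentials / temperature, dim=0)
    idx = torch.multinomial(probs, num_samples=keep, replacement=True)
    return candidates[idx]                  # high-potential paths survive and duplicate

paths = torch.randn(16, 64)                 # 16 candidate token maps at one scale
scores = torch.randn(16)                    # multi-scale reward for each path
survivors = resample_candidates(paths, scores, keep=8)
```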
[295] CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Main category: cs.CV
TL;DR: Proposes a dual-model framework for embodied reference understanding that uses both head-to-fingertip and wrist-to-fingertip directions with Gaussian ray heatmaps and CLIP-aware ensemble to improve object reference prediction.
Details
Motivation: Existing methods struggle with multimodal understanding of pointing gestures and language, and rely on simplistic single-line assumptions for reference alignment, leading to suboptimal performance.
Method: Dual-model framework with separate models for head-to-fingertip and wrist-to-fingertip directions, using Gaussian ray heatmaps and CLIP-Aware Pointing Ensemble module, plus object center prediction as auxiliary task.
Result: Achieves ~4 mAP improvement at 0.25 IoU threshold on YouRefIt dataset, with additional evaluation on CAESAR and ISL Pointing datasets.
Conclusion: The proposed dual-model approach with multimodal integration and ensemble strategy effectively improves embodied reference understanding performance.
Abstract: We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold. We further evaluate our approach on the CAESAR and ISL Pointing datasets.
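To show what a Gaussian ray heatmap is, here is a minimal sketch: pixel intensity falls off with distance from the ray cast from an anchor (head or wrist) through the fingertip. The sigma, image size, and function names are illustrative assumptions.
```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, fingertip, sigma=8.0):
    """Heatmap over an (h, w) grid for the ray origin -> fingertip -> onward."""
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    o, f = np.asarray(origin, float), np.asarray(fingertip, float)
    d = (f - o) / (np.linalg.norm(f - o) + 1e-8)           # unit ray direction
    rel = np.stack([xs - o[0], ys - o[1]], axis=-1)
    t = np.clip(rel @ d, 0.0, None)                        # project; ray starts at origin
    closest = o + t[..., None] * d                         # nearest point on the ray
    dist2 = ((np.stack([xs, ys], -1) - closest) ** 2).sum(-1)
    return np.exp(-dist2 / (2 * sigma ** 2))

heat = gaussian_ray_heatmap(256, 256, origin=(40, 60), fingertip=(120, 110))
```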
[296] UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing
Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang
Main category: cs.CV
TL;DR: UniLIP is a unified framework that adapts CLIP for multimodal understanding, generation and editing through a two-stage training scheme with self-distillation, achieving state-of-the-art performance with smaller parameter sizes.
Details
Motivation: Previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. CLIP excels at understanding but lacks reconstruction abilities needed for unified visual encoding.
Method: Two-stage training scheme with self-distillation strategy to progressively endow CLIP with high-fidelity reconstruction while preserving comprehension. Dual-condition architecture built on MetaQuery framework using multimodal hidden states and learnable query embeddings to leverage MLLM reasoning abilities.
Result: Outperforms larger unified models (BAGEL 7B, Uniworld-V1 12B) with only 1B and 3B parameters. Achieves SOTA: 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. Demonstrates superior instruction following and edit fidelity.
Conclusion: UniLIP successfully expands CLIP’s application, establishing its continuous features as optimal for understanding tasks while achieving highly competitive performance in generation and editing tasks.
Abstract: In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.
[297] MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis
Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang
Main category: cs.CV
TL;DR: MedCAL-Bench is the first systematic benchmark for evaluating Foundation Models in Cold-Start Active Learning for medical image analysis, covering 14 FMs, 7 CSAL strategies, and 7 datasets across classification and segmentation tasks.
Details
Motivation: Current CSAL methods rely on inefficient SSL for feature extraction and lack proper evaluation of Foundation Models, which have shown powerful feature extraction capabilities but haven't been systematically benchmarked for CSAL in medical imaging.
Method: Proposed MedCAL-Bench evaluates 14 Foundation Models and 7 CSAL strategies across 7 medical datasets under different annotation budgets, covering both classification and segmentation tasks from diverse modalities, with evaluation of both feature extraction and sample selection stages.
Result: 1) Most FMs are effective CSAL feature extractors, with DINO family performing best in segmentation; 2) FM performance differences are large in segmentation but small for classification; 3) Different selection strategies work best for different tasks - ALPS for segmentation and RepDiv for classification.
Conclusion: MedCAL-Bench provides the first comprehensive benchmark for FM-based CSAL in medical imaging, revealing that FMs are effective feature extractors and task-specific strategies should be considered for optimal performance.
Abstract: Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at https://github.com/HiLab-git/MedCAL-Bench.
[298] VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones
Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu
Main category: cs.CV
TL;DR: VisionTS++ bridges gaps between vision models and time series forecasting by using continual pre-training, vision-model-based filtering, colorized multivariate conversion, and multi-quantile forecasting, achieving state-of-the-art performance.
Details
Motivation: To address three key discrepancies in using vision models as time series foundation models: data-modality gap, multivariate-forecasting gap, and probabilistic-forecasting gap.
Method: Continual pre-training of vision models on large-scale time series with three innovations: vision-model-based filtering for quality sequences, colorized multivariate conversion to RGB images, and multi-quantile forecasting with parallel reconstruction heads.
Result: Achieves state-of-the-art performance, outperforming specialized time series foundation models by 6%-44% in MSE reduction and ranking first in GIFT-Eval benchmark across 23 datasets and 7 domains.
Conclusion: With appropriate adaptation, vision models can effectively generalize to time series forecasting, advancing the development of universal time series foundation models.
Abstract: Recent studies have indicated that vision models pre-trained on images can serve as time series foundation models (TSFMs) by reformulating time series forecasting (TSF) as image reconstruction. However, effective cross-modal transfer from vision to time series remains challenging due to three discrepancies: (1) the data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) the multivariate-forecasting gap between fixed RGB-three-channel vision models and time series with arbitrary numbers of variates; and (3) the probabilistic-forecasting gap between the deterministic outputs of vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a TSFM based on continual pre-training of a vision model on large-scale time series. Our approach introduces three key innovations: (1) vision-model-based filtering to identify high-quality sequences to stabilize pre-training and mitigate the modality gap; (2) colorized multivariate conversion, encoding multivariate series as multi-subfigure RGB images to enhance cross-variate modeling; (3) multi-quantile forecasting, using parallel reconstruction heads to generate quantile forecasts without parametric assumptions. Experiments show that VisionTS++ achieves state-of-the-art performance in both in-distribution and out-of-distribution forecasting, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in the GIFT-Eval benchmark, which comprises 23 datasets across 7 domains. Our work demonstrates that with appropriate adaptation, vision models can effectively generalize to TSF, thus advancing the pursuit of universal TSFMs. Code is available at https://github.com/HALF111/VisionTSpp.
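The multi-quantile head is the most self-contained of the three innovations; here is a minimal sketch of parallel quantile heads trained with the standard pinball (quantile) loss. The head layout and dimensions are illustrative assumptions.
```python
import torch
import torch.nn as nn

class QuantileHeads(nn.Module):
    def __init__(self, d_model=256, horizon=96, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.quantiles = quantiles
        self.heads = nn.ModuleList(nn.Linear(d_model, horizon) for _ in quantiles)

    def forward(self, z):                       # z: (B, d_model) encoded series
        return torch.stack([h(z) for h in self.heads], dim=1)  # (B, Q, horizon)

def pinball_loss(pred, target, quantiles):
    """pred: (B, Q, H), target: (B, H); penalize under/over-shooting asymmetrically."""
    q = torch.tensor(quantiles, device=pred.device).view(1, -1, 1)
    err = target.unsqueeze(1) - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

model = QuantileHeads()
pred = model(torch.randn(8, 256))
loss = pinball_loss(pred, torch.randn(8, 96), model.quantiles)
```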
[299] ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection
Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Zhen Dai, Yueyi Luo
Main category: cs.CV
TL;DR: The paper proposes an Architectural Co-Design framework with Conv-LoRA adapters and Dynamic Fusion Gateway to address VLMs’ limitations in Zero-Shot Anomaly Detection by injecting local inductive biases and enabling adaptive cross-modal fusion.
Details
Motivation: Pre-trained Vision-Language Models struggle with Zero-Shot Anomaly Detection due to lacking local inductive biases for dense prediction and using inflexible feature fusion paradigms.
Method: Proposes Architectural Co-Design framework with: 1) Convolutional Low-Rank Adaptation (Conv-LoRA) adapter for parameter-efficient local bias injection, and 2) Dynamic Fusion Gateway (DFG) that uses visual context to adaptively modulate text prompts for bidirectional fusion.
Result: Extensive experiments on industrial and medical benchmarks demonstrate superior accuracy and robustness compared to existing methods.
Conclusion: The synergistic co-design of feature representation and cross-modal fusion is critical for robustly adapting foundation models to dense perception tasks like Zero-Shot Anomaly Detection.
Abstract: Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.
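A convolutional low-rank adapter can be sketched as a frozen base layer plus a rank-r bottleneck whose 3x3 stage injects the local inductive bias; the exact placement, rank, and scaling in ACD-CLIP may differ, so treat this as an assumption-laden illustration.
```python
import torch
import torch.nn as nn

class ConvLoRA(nn.Module):
    def __init__(self, base: nn.Conv2d, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # the pretrained weight stays frozen
            p.requires_grad = False
        self.down = nn.Conv2d(base.in_channels, rank, kernel_size=1, bias=False)
        self.local = nn.Conv2d(rank, rank, kernel_size=3, padding=1, bias=False)
        self.up = nn.Conv2d(rank, base.out_channels, kernel_size=1, bias=False)
        nn.init.zeros_(self.up.weight)          # adapter starts as the identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.local(self.down(x)))

adapted = ConvLoRA(nn.Conv2d(768, 768, kernel_size=1))
out = adapted(torch.randn(1, 768, 16, 16))      # same shape as the base conv output
```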
[300] PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Siyu Zhu, Gang Cheng, Zilong Dong, Yike Guo
Main category: cs.CV
TL;DR: A feed-forward framework for fast Gaussian full-head synthesis from a single unposed image, using synthetic data from 3D GANs and a coarse-to-fine pipeline with dual-branch feature aggregation.
Details
Motivation: To overcome the limitations of previous methods that rely on time-consuming GAN inversion and test-time optimization for 3D head reconstruction from single images.
Method: Proposes a coarse-to-fine Gaussian head generation pipeline using FLAME model sparse points with transformer blocks, and a dual-branch framework that aggregates spherical triplane features and point-based features from pretrained 3D GANs.
Result: Enables fast reconstruction and rendering during inference, achieving effective Gaussian full-head synthesis from single unposed images using only synthetic training data.
Conclusion: The framework demonstrates effectiveness compared to existing work, providing a practical solution for efficient 3D head reconstruction without requiring real 3D head assets.
Abstract: We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework towards existing work. Project page at: https://panolam.github.io/.
[301] Visual Representation Alignment for Multimodal Large Language Models
Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
Main category: cs.CV
TL;DR: VIRAL is a regularization strategy that aligns MLLMs’ visual representations with pre-trained vision foundation models to improve performance on vision-centric tasks by preserving fine-grained visual details.
Details
Motivation: Current MLLMs trained with text-only supervision perform poorly on vision-centric tasks like object counting and spatial reasoning because they discard fine-grained visual details during training.
Method: VIRAL explicitly aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models through regularization, enabling retention of critical visual details and complementing additional visual knowledge.
Result: Experiments show consistent improvements across all tasks on multimodal benchmarks, with comprehensive ablation studies validating the key design choices.
Conclusion: This simple alignment strategy opens up an important direction for effectively integrating visual information in training MLLMs.
Abstract: Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
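The core regularizer can be sketched as: project the MLLM's intermediate visual-token states and pull them toward frozen VFM features with a cosine objective, added on top of the usual next-token loss. The projection, layer choice, and weighting are illustrative assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentLoss(nn.Module):
    def __init__(self, d_mllm=4096, d_vfm=1024):
        super().__init__()
        self.proj = nn.Linear(d_mllm, d_vfm)    # trainable translation into VFM space

    def forward(self, mllm_visual_states, vfm_features):
        """Both inputs are (B, N, d) over the same N patch positions."""
        pred = F.normalize(self.proj(mllm_visual_states), dim=-1)
        target = F.normalize(vfm_features.detach(), dim=-1)   # VFM stays frozen
        return 1.0 - (pred * target).sum(-1).mean()           # mean cosine distance

reg = AlignmentLoss()
align = reg(torch.randn(2, 576, 4096), torch.randn(2, 576, 1024))
# total = lm_loss + lambda_align * align   # alongside the next-token objective
```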
[302] Unified Multimodal Model as Auto-Encoder
Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Main category: cs.CV
TL;DR: UAE proposes a unified multimodal model that treats understanding as encoding (image-to-text) and generation as decoding (text-to-image) under a single reconstruction objective, using reinforcement learning to enable bidirectional improvement between the two tasks.
Details
Motivation: Current multimodal models treat understanding and generation as separate tasks with disjoint objectives, missing their mutual benefits. True unification requires a foundational objective that intrinsically links both capabilities.
Method: UAE uses an Auto-Encoder paradigm: understanding as encoder (I2T) and generation as decoder (T2I). Pre-trains decoder with 700k long-context image-caption pairs, then applies Unified-GRPO reinforcement learning with two stages: Generation for Understanding (encoder generates informative captions) and Understanding for Generation (decoder reconstructs from captions).
Result: Bidirectional improvement: understanding enhances generation (verified on GenEval), while generation strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench).
Conclusion: Under a unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
Abstract: The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing the mutual benefits. We argue that true unification requires more than just merging two tasks. It requires a unified, foundational objective that intrinsically links them. In this paper, we introduce an insightful paradigm through the Auto-Encoder lens, i.e., regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. To implement this, we propose UAE, where we begin by pre-training the decoder with the proposed 700k long-context image-caption pairs to direct it to “understand” the fine-grained and complex semantics from the text. We then propose Unified-GRPO via reinforcement learning (RL) to unify the two, which covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder’s reconstruction quality, enhancing its visual perception; (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
[303] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Changtao Miao, Huazhe Tan, Weibin Yao, Jianshu Li
Main category: cs.CV
TL;DR: LaV-CoT is a Language-aware Visual Chain-of-Thought framework that enhances multilingual visual question answering through multi-stage reasoning and multi-aspect reward optimization, achieving significant performance improvements over existing models.
Details
Motivation: Existing approaches primarily rely on textual CoT with limited support for multilingual multimodal reasoning, constraining real-world deployment. There's a need for better multilingual visual reasoning capabilities.
Method: Uses interpretable multi-stage reasoning pipeline (Text Summary with BBox, Language Identification, Spatial Object-level Captioning, Step-by-step Logical Reasoning) with automated data curation and two-stage training combining SFT with Language-aware GRPO guided by multi-aspect rewards.
Result: Achieves up to 9.5% accuracy improvements over open-source baselines, surpasses models with 2x larger scales by 2.6%, and outperforms proprietary models like GPT-4o-0513 and Gemini-2.5-flash. Validated through online A/B testing.
Conclusion: LaV-CoT effectively addresses multilingual visual reasoning limitations and demonstrates strong performance for real-world industrial deployment.
Abstract: As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT
[304] MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder
Ayhan Can Erdur, Christian Beischl, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C Peeken
Main category: cs.CV
TL;DR: A masked autoencoder approach for handling missing MRI sequences in 3D medical imaging, using multi-modal transformer architecture to infer missing data and improve downstream task performance.
Details
Motivation: Missing input sequences are common in medical imaging data, posing challenges for deep learning models that require complete input data.Method: Developed a masked autoencoder paradigm treating each MRI sequence as separate modality, using late-fusion transformer encoder and individual decoder streams for multi-task reconstruction.
Result: Achieved absolute improvements of 10.1 points in overall Dice score and 0.46 in MCC over baselines with missing input sequences in downstream segmentation and classification tasks.
Conclusion: The method provides a flexible and generalizable encoder for brain MRIs that can handle missing sequences and adapt to various downstream applications.
Abstract: Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing absolute improvements of 10.1 points in overall Dice score and 0.46 in MCC over the baselines with missing input sequences. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.
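A minimal PyTorch-style sketch of the idea described above: each MRI sequence is embedded separately, masked, fused late in a shared encoder, and reconstructed by per-modality decoder streams. Module sizes, shapes, and names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiModalMAE(nn.Module):
    """Late-fusion masked autoencoder over multiple MRI sequences (a sketch)."""
    def __init__(self, modalities, patch_dim=512, dim=256):
        super().__init__()
        self.embed = nn.ModuleDict({m: nn.Linear(patch_dim, dim) for m in modalities})
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # shared, late fusion
        self.decode = nn.ModuleDict({m: nn.Linear(dim, patch_dim) for m in modalities})

    def forward(self, patches, keep):
        # patches[m]: (B, N, patch_dim) patch features; keep[m]: (N,) bool visibility
        visible = [self.embed[m](x[:, keep[m]]) for m, x in patches.items()]
        fused = self.encoder(torch.cat(visible, dim=1))  # tokens from all sequences
        # A real MAE decoder would reinsert mask tokens per modality before
        # reconstruction; this sketch keeps only the per-modality decoder streams.
        return {m: self.decode[m](fused) for m in patches}
```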
[305] MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli
Main category: cs.CV
TL;DR: MOCHA is a knowledge distillation method that transfers region-level multimodal semantics from large vision-language teachers to lightweight vision-only object detectors using object-level alignment and global relational consistency.
Details
Motivation: To enable efficient transfer of multimodal semantics to lightweight vision-only detectors without modifying the teacher model or requiring textual input at inference time.Method: Uses a translation module to map student features into joint space, trained with dual-objective loss enforcing both local alignment and global relational consistency at object level.
Result: Achieves consistent gains over baselines with +10.1 average score improvement across four personalized detection benchmarks under few-shot regimes, reaching performance comparable to larger multimodal models.
Conclusion: MOCHA proves suitable for real-world deployment by enabling lightweight detectors to achieve multimodal-level performance without the computational overhead of full multimodal models.
Abstract: We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
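The dual-objective loss can be pictured as follows; cosine alignment and pairwise-similarity matching are common stand-ins used here because the abstract does not spell out the exact loss terms.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(student_obj, teacher_obj, alpha=1.0, beta=1.0):
    """student_obj/teacher_obj: (num_objects, dim) translated region-level features."""
    s = F.normalize(student_obj, dim=-1)
    t = F.normalize(teacher_obj, dim=-1)
    local = (1.0 - (s * t).sum(-1)).mean()       # per-object cosine alignment
    relational = F.mse_loss(s @ s.T, t @ t.T)    # object-to-object similarity structure
    return alpha * local + beta * relational
```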
[306] Enhancing Feature Fusion of U-like Networks with Dynamic Skip Connections
Yue Cao, Quansong He, Kaishen Wang, Jianlong Xiong, Tao He
Main category: cs.CV
TL;DR: Proposes Dynamic Skip Connection (DSC) block to overcome limitations in U-like networks for medical image segmentation, addressing inter-feature and intra-feature constraints through adaptive mechanisms.
Details
Motivation: Conventional skip connections in U-like networks have two key limitations: inter-feature constraints (static feature fusion regardless of content) and intra-feature constraints (insufficient multi-scale feature interactions hindering global context aggregation).Method: DSC block with two components: (1) Test-Time Training (TTT) module for dynamic adaptation during inference to address inter-feature constraint, and (2) Dynamic Multi-Scale Kernel (DMSK) module that adaptively selects kernel sizes based on global context to address intra-feature constraint.
Result: The DSC block demonstrates plug-and-play effectiveness across various U-like network architectures including CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based networks.
Conclusion: The proposed DSC block fundamentally enhances cross-layer connectivity through adaptive mechanisms and can be seamlessly incorporated into existing U-like network structures.
Abstract: U-like networks have become fundamental frameworks in medical image segmentation through skip connections that bridge high-level semantics and low-level spatial details. Despite their success, conventional skip connections exhibit two key limitations: inter-feature constraints and intra-feature constraints. The inter-feature constraint refers to the static nature of feature fusion in traditional skip connections, where information is transmitted along fixed pathways regardless of feature content. The intra-feature constraint arises from the insufficient modeling of multi-scale feature interactions, thereby hindering the effective aggregation of global contextual information. To overcome these limitations, we propose a novel Dynamic Skip Connection (DSC) block that fundamentally enhances cross-layer connectivity through adaptive mechanisms. The DSC block integrates two complementary components. (1) Test-Time Training (TTT) module. This module addresses the inter-feature constraint by enabling dynamic adaptation of hidden representations during inference, facilitating content-aware feature refinement. (2) Dynamic Multi-Scale Kernel (DMSK) module. To mitigate the intra-feature constraint, this module adaptively selects kernel sizes based on global contextual cues, enhancing the network capacity for multi-scale feature integration. The DSC block is architecture-agnostic and can be seamlessly incorporated into existing U-like network structures. Extensive experiments demonstrate the plug-and-play effectiveness of the proposed DSC block across CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based U-like networks.
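A DMSK-style module might look like the sketch below, where global context gates a set of convolution branches with different kernel sizes; the architectural details are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class DynamicMultiScaleKernel(nn.Module):
    """Gate multi-scale convolution branches by global context (illustrative)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(kernel_sizes)), nn.Softmax(dim=-1))

    def forward(self, x):
        w = self.gate(x)                                        # (B, K) from global cues
        outs = torch.stack([b(x) for b in self.branches], 1)    # (B, K, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)    # context-weighted fusion
```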
[307] RangeSAM: Leveraging Visual Foundation Models for Range-View Represented LiDAR Segmentation
Paul Julius Kühn, Duc Anh Nguyen, Arjan Kuijper, Holger Graf, Dieter Fellner, Saptarshi Neil Sinha
Main category: cs.CV
TL;DR: This paper presents the first range-view framework that adapts SAM2 (Visual Foundation Model) for LiDAR point cloud segmentation, achieving competitive performance on SemanticKITTI with improved speed and deployment simplicity.
Details
Motivation: To leverage mature 2D semantic segmentation techniques and Visual Foundation Models (VFMs) for fast and accurate LiDAR point cloud segmentation, addressing the computational cost and efficiency limitations of voxel- and point-based methods.Method: Adapts SAM2 to 3D segmentation by coupling efficient 2D feature extraction with projection/back-projection operations. Implements architectural modifications including: (1) novel module for horizontal spatial dependencies in LiDAR range images, (2) customized configuration for spherical projections, and (3) adapted mechanism for range-view spatial patterns and discontinuities.
Result: Achieves competitive performance on SemanticKITTI benchmark while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines.
Conclusion: Range-view segmentation methods using VFMs lead to promising results, highlighting the viability of VFMs as general-purpose backbones for 3D perception and opening a path toward unified, foundation-model-driven LiDAR segmentation.
Abstract: Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel- and point-based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine-grained geometry, they often incur high computational cost, irregular memory access, and limited real-time efficiency. In contrast, range-view methods, though relatively underexplored, can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero-shot recognition, and multimodal tasks, we investigate whether SAM2, the current state-of-the-art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present, to our knowledge, the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds. To optimize SAM2 for range-view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes the horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range-view pseudo-images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines. This work highlights the viability of VFMs as general-purpose backbones for 3D perception and opens a path toward unified, foundation-model-driven LiDAR segmentation, and our results lead us to conclude that range-view segmentation with VFMs is a promising direction.
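The projection step that turns a LiDAR sweep into a range-view pseudo-image follows a standard spherical recipe, sketched below with typical 64-beam field-of-view defaults; the paper's exact parameters are not given here.

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR sweep onto an (H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)
    pitch = np.arcsin(z / np.clip(r, 1e-8, None))
    up, down = np.radians(fov_up), np.radians(fov_down)
    rows = (1.0 - (pitch - down) / (up - down)) * H     # vertical angle -> row
    cols = 0.5 * (1.0 - yaw / np.pi) * W                # azimuth -> column
    rows = np.clip(np.floor(rows), 0, H - 1).astype(int)
    cols = np.clip(np.floor(cols), 0, W - 1).astype(int)
    image = np.zeros((H, W), dtype=np.float32)
    image[rows, cols] = r  # later points overwrite; real pipelines keep the nearest
    return image
```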
[308] The Impact of 2D Segmentation Backbones on Point Cloud Predictions Using 4D Radar
William Muckelroy III, Mohammed Alsakabi, John Dolan, Ozan Tonguz
Main category: cs.CV
TL;DR: Investigating the impact of segmentation backbone capacity on LiDAR-like 3D point cloud generation from 4D Radars, finding optimal capacity provides 23.7% improvement over SOTA.
Details
Motivation: LiDAR's high cost limits autonomous driving adoption, so research aims to generate LiDAR-like point clouds using cheaper 4D Radars instead.Method: Analyzed different capacity segmentation backbones in neural networks that generate 3D point clouds from 4D Radar data, building on prior work with modular 2D CNN and temporal coherence networks.
Result: Optimal segmentation backbone capacity provides 23.7% improvement over state-of-the-art, while very high-capacity models can actually hurt performance.
Conclusion: Careful selection of segmentation backbone capacity is crucial for optimal LiDAR-like point cloud generation from 4D Radars, with significant performance gains possible over current methods.
Abstract: LiDAR’s dense, sharp point cloud (PC) representations of the surrounding environment enable accurate perception and significantly improve road safety by offering greater scene awareness and understanding. However, LiDAR’s high cost continues to restrict the broad adoption of high-level Autonomous Driving (AD) systems in commercially available vehicles. Prior research has shown progress towards circumventing the need for LiDAR by training a neural network, using LiDAR point clouds as ground truth (GT), to produce LiDAR-like 3D point clouds using only 4D Radars. One of the best examples is a neural network created to train a more efficient radar target detector with a modular 2D convolutional neural network (CNN) backbone and a temporal coherence network at its core that uses the RaDelft dataset for training (see arXiv:2406.04723). In this work, we investigate the impact of higher-capacity segmentation backbones on the quality of the produced point clouds. Our results show that while very high-capacity models may actually hurt performance, an optimal segmentation backbone can provide a 23.7% improvement over the state-of-the-art (SOTA).
[309] DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
Dwip Dalal, Gautam Vashishtha, Anku Rani, Aishwarya Reganti, Parth Patwa, Mohd Sarique, Chandan Gupta, Keshav Nath, Viswanatha Reddy, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
Main category: cs.CV
TL;DR: A multimodal dataset and vision-language model called DeHater are introduced for detecting and removing hateful content from digital images using watermarked stable diffusion techniques and attention maps.
Details
Motivation: The rise in harmful online content distorts public discourse and poses challenges to maintaining a healthy digital environment, necessitating better tools for hate content detection and removal.Method: Innovative application of watermarked, stability-enhanced stable diffusion techniques combined with Digital Attention Analysis Module (DAAM) to generate hate attention maps that identify and blur hateful regions in images.
Result: Development of a multimodal dataset for hate identification and DeHater vision-language model for multimodal dehatification tasks, setting new standards in AI-driven image hate detection.
Conclusion: The approach contributes to more ethical AI applications in social media by providing effective tools for detecting and removing hateful content from digital images.
Abstract: The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images and generating detailed hate attention maps, which are then used to blur those regions, removing the hateful sections of the image. We release this dataset as part of the DeHate shared task, and this paper also describes the details of that task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications for social media.
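The blur-by-attention step can be illustrated as follows; the threshold and blur radius are arbitrary assumptions, and the function is a generic stand-in rather than the DAAM pipeline itself.

```python
import numpy as np
from PIL import Image, ImageFilter

def blur_by_attention(image, attn_map, threshold=0.5, radius=12):
    """Blur pixels whose attention score exceeds a threshold.
    image: RGB PIL image; attn_map: (H, W) scores in [0, 1] matching the image."""
    blurred = np.asarray(image.filter(ImageFilter.GaussianBlur(radius)))
    mask = (attn_map >= threshold)[..., None]   # (H, W, 1), broadcasts over RGB
    out = np.where(mask, blurred, np.asarray(image))
    return Image.fromarray(out.astype(np.uint8))
```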
[310] Benchmarking and Mitigating Sycophancy in Medical Vision-Language Models
Zikun Guo, Xinyue Xu, Pei Xiang, Shu Yang, Xin Han, Di Wang, Lijie Hu
Main category: cs.CV
TL;DR: This paper evaluates sycophantic behavior in clinical vision-language models and proposes VIPER, a lightweight mitigation framework that filters non-evidentiary content to generate evidence-first responses, reducing sycophancy while maintaining interpretability.
Details
Motivation: Vision language models in clinical workflows often exhibit sycophantic behavior by prioritizing alignment with user phrasing and social cues over evidence-based reasoning, which poses risks in medical decision-making.Method: Created a medical sycophancy benchmark from PathVQA, SLAKE, and VQA-RAD datasets using psychologically motivated pressure templates. Proposed VIPER framework that filters non-evidentiary content and generates constrained evidence-first answers.
Result: VLMs were generally vulnerable to sycophancy, with significant variation in adversarial responses that correlated only weakly with model accuracy or size. Imitation and expert corrections were the most effective triggers, indicating bias mechanisms independent of visual evidence. VIPER reduced sycophancy on average while outperforming baselines.
Conclusion: The benchmark and mitigation framework provide groundwork for robust deployment of medical VLMs in real-world clinician interactions, emphasizing the need for evidence-anchored defenses against sycophantic behavior.
Abstract: Vision-language models (VLMs) are increasingly integrated into clinical workflows, but they often exhibit sycophantic behavior, prioritizing alignment with user phrasing, social cues, or perceived authority over evidence-based reasoning. This study evaluates clinical sycophancy in medical visual question answering through a novel, clinically grounded benchmark. We propose a medical sycophancy dataset constructed from PathVQA, SLAKE, and VQA-RAD, stratified by type, organ system, and modality, using psychologically motivated pressure templates that cover various forms of sycophancy. In adversarial experiments on a range of VLMs, we found that these models are generally vulnerable, exhibiting significant variation in the occurrence of adversarial responses that correlates only weakly with model accuracy or size. Imitation and expert-provided corrections were the most effective triggers, suggesting that the models possess a bias mechanism independent of visual evidence. To address this, we propose Visual Information Purification for Evidence-based Response (VIPER), a lightweight mitigation strategy that filters non-evidentiary content (e.g., social pressures) and then generates constrained, evidence-first answers. This framework reduces sycophancy on average, outperforming baselines while maintaining interpretability. Our benchmark analysis and mitigation framework lay the groundwork for the robust deployment of medical VLMs in real-world clinician interactions, emphasizing the need for evidence-anchored defenses.
[311] Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation
Mingyu Kang, Yong Suk Choi
Main category: cs.CV
TL;DR: ENM Inversion is a novel noise map inversion method for text-guided image editing that optimizes noise maps to balance content preservation and editability, outperforming existing approaches.
Details
Motivation: Previous inversion methods for text-guided image editing struggle to adhere closely to target text prompts because inverted noise maps that faithfully reconstruct source images limit edit flexibility.Method: Proposes Editable Noise Map Inversion (ENM Inversion) that searches for optimal noise maps by analyzing noise map properties for editability and introducing editable noise refinement to align with desired edits.
Result: Extensive experiments show ENM Inversion outperforms existing approaches across various image editing tasks in both preservation and edit fidelity. The method also applies successfully to video editing with temporal consistency.
Conclusion: ENM Inversion effectively addresses the trade-off between content preservation and editability in text-guided image editing through optimized noise map inversion, demonstrating superior performance over previous methods.
Abstract: Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
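One way to picture the editable-noise refinement is as a small optimization that balances editability against preservation; the objective below is an assumed form for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def refine_noise_map(inverted_noise, edited_noise, lam=0.5, steps=100, lr=0.05):
    """Search for a noise map close to the edited-branch noise (editability),
    regularized toward the inverted noise (source-content preservation)."""
    z = inverted_noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(z, edited_noise) + lam * F.mse_loss(z, inverted_noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```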
[312] TimeScope: Towards Task-Oriented Temporal Grounding In Long Videos
Xiangrui Liu, Minghao Qin, Yan Shu, Zhengyang Liang, Yang Tian, Chen Jason Zhang, Bo Zhao, Zheng Liu
Main category: cs.CV
TL;DR: The paper introduces Task-oriented Temporal Grounding (ToTG) to localize key time intervals in long videos based on task descriptions, presents the ToTG Bench benchmark, and proposes TimeScope framework with progressive reasoning that outperforms existing methods.
Details
Motivation: Traditional approaches struggle with identifying key moments in long videos due to limited generalizability and difficulty handling long video content, necessitating a new approach for task-oriented temporal grounding.Method: Proposed TimeScope framework uses progressive reasoning: first identifies coarse-grained temporal scope in long videos, then refines through fine-grained moment partitioning. Also created ToTG Pile dataset to enhance progressive temporal grounding.
Result: Extensive experiments show TimeScope consistently outperforms existing temporal grounding methods and popular MLLMs across various settings.
Conclusion: TimeScope effectively addresses the challenging ToTG problem through its progressive reasoning approach and demonstrates superior performance compared to existing methods.
Abstract: Identifying key moments in long videos is essential for downstream understanding and reasoning tasks. In this paper, we introduce a new problem, Task-oriented Temporal Grounding (ToTG), which aims to localize the time intervals containing the necessary information based on a task’s natural description. Along with the definition, we also present ToTG Bench, a comprehensive benchmark for evaluating performance on ToTG. ToTG is particularly challenging for traditional approaches due to their limited generalizability and difficulty in handling long videos. To address these challenges, we propose TimeScope, a novel framework built upon progressive reasoning. TimeScope first identifies a coarse-grained temporal scope in the long video that likely contains the key moments, and then refines this scope through fine-grained moment partitioning. Additionally, we curate a high-quality dataset, namely ToTG Pile, to enhance TimeScope’s ability to perform progressive temporal grounding effectively. Extensive experiments demonstrate that TimeScope consistently outperforms both existing temporal-grounding methods and popular MLLMs across various settings, highlighting its effectiveness in addressing this new and challenging problem.
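The progressive coarse-to-fine procedure can be sketched as follows, with `coarse_model` and `fine_model` as hypothetical callables standing in for TimeScope's two reasoning stages.

```python
def progressive_grounding(frames, task, coarse_model, fine_model, budget=64):
    """Coarse-to-fine temporal grounding: locate a rough scope on sparsely
    sampled frames first, then partition that scope into fine-grained moments."""
    stride = max(1, len(frames) // budget)
    lo, hi = coarse_model(frames[::stride], task)        # scope over sparse samples
    start, end = lo * stride, min(hi * stride, len(frames))
    moments = fine_model(frames[start:end], task)        # fine-grained partitioning
    return [(start + s, start + e) for s, e in moments]  # intervals in full-video indices
```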
[313] Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention
Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, Xuming Hu
Main category: cs.CV
TL;DR: HoloV is a novel visual token pruning framework for MLLMs that uses holistic spatial budget distribution instead of attention-based pruning to maintain performance under high pruning ratios.
Details
Motivation: Current attention-based token pruning methods preserve semantically similar tokens, causing performance drops at high pruning ratios due to representational collapse.Method: HoloV adaptively distributes pruning budget across different spatial crops to retain global visual context rather than isolated salient features, minimizing representational collapse.
Result: LLaVA1.5 with HoloV preserves 95.8% of original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs across various tasks and architectures.
Conclusion: HoloV provides an effective plug-and-play framework for visual token pruning that maintains task-relevant information under aggressive pruning through holistic spatial budget distribution.
Abstract: Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.
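The holistic budget-distribution idea amounts to pruning per spatial crop instead of globally. Below is a simplified sketch (not the paper's code); the 11.1% keep ratio mirrors the 88.9% pruning figure quoted above.

```python
import torch

def holistic_prune(tokens, scores, grid=(2, 2), keep_ratio=0.111):
    """tokens: (N, D) visual tokens on a square patch grid; scores: (N,) importance.
    Keep the top tokens *within each spatial crop* so every image region survives."""
    side = int(tokens.shape[0] ** 0.5)
    rows, cols = grid
    kept = []
    for r in range(rows):
        for c in range(cols):
            idx = torch.tensor([i * side + j
                                for i in range(r * side // rows, (r + 1) * side // rows)
                                for j in range(c * side // cols, (c + 1) * side // cols)])
            budget = max(1, int(len(idx) * keep_ratio))       # per-crop budget
            kept.append(idx[scores[idx].topk(budget).indices])
    return tokens[torch.cat(kept)]
```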
[314] The 1st Solution for CARE Liver Task Challenge 2025: Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation
Jincan Lou, Jingkun Chen, Haoquan Li, Hang Li, Wenjian Huang, Weihua Chen, Fan Wang, Jianguo Zhang
Main category: cs.CV
TL;DR: CoSSeg-TTA is a compact liver segmentation framework for MRI that combines nnU-Netv2 with semi-supervised learning and domain adaptation to address challenges of limited annotated data and cross-center variability.
Details
Motivation: Liver segmentation from contrast-enhanced MRI faces challenges due to limited annotated data, heterogeneous enhancement protocols, and domain shifts across scanners/institutions. Traditional image translation methods like Pix2Pix and cycle-GAN have limitations including requirement for image registration, structural distortions, and unstable training.Method: Built on nnU-Netv2 with semi-supervised mean teacher scheme to exploit unlabeled data. Includes domain adaptation module with randomized histogram-based style appearance transfer and trainable contrast-aware network. Uses continual test-time adaptation strategy for improved inference robustness.
Result: Framework consistently outperforms nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance. Exhibits strong generalization to unseen domains under low-annotation conditions.
Conclusion: CoSSeg-TTA provides an effective solution for liver segmentation in MRI that addresses domain shift challenges and limited annotation through semi-supervised learning and domain adaptation techniques.
Abstract: Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.
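The mean-teacher component follows the standard semi-supervised recipe in which the teacher is an exponential moving average of the student. A minimal sketch; the momentum value is a common default, not a number from the paper.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights become an exponential moving average of student weights."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```

The teacher then produces pseudo-labels on unlabeled volumes that supervise the student, which is what lets the framework exploit large amounts of unannotated data.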
[315] TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement
Hao Fang, Zechao Zhan, Weixin Feng, Ziwei Huang, Xubin Li, Tiezheng Ge
Main category: cs.CV
TL;DR: TBStar-Edit is a specialized image editing model for e-commerce that addresses consistency limitations of general models through hierarchical architecture and two-stage training, achieving superior performance in e-commerce scenarios.
Details
Motivation: General image generation and editing models perform well in general domains but face consistency limitations when applied to e-commerce scenarios, where maintaining product appearance and layout integrity is crucial.Method: Three-pronged approach: 1) Comprehensive data engineering pipeline for high-quality instruction-following data; 2) Hierarchical model framework with base model, pattern shifting modules, and consistency enhancement modules; 3) Two-stage training strategy (pattern shifting then consistency enhancement) with separate datasets.
Result: TBStar-Edit outperforms existing general-domain editing models on e-commerce benchmark, achieving better scores in both objective metrics (VIE Score) and subjective user preference.
Conclusion: The proposed TBStar-Edit model successfully addresses e-commerce-specific editing challenges through specialized data engineering, hierarchical architecture, and targeted training strategy, demonstrating superior performance over general models in maintaining product consistency.
Abstract: Recent advances in image generation and editing technologies have enabled state-of-the-art models to achieve impressive results in general domains. However, when applied to e-commerce scenarios, these general models often encounter consistency limitations. To address this challenge, we introduce TBStar-Edit, a new image editing model tailored for the e-commerce domain. Through rigorous data engineering, model architecture design and training strategy, TBStar-Edit achieves precise and high-fidelity image editing while maintaining the integrity of product appearance and layout. Specifically, for data engineering, we establish a comprehensive data construction pipeline, encompassing data collection, construction, filtering, and augmentation, to acquire high-quality, instruction-following, and strongly consistent editing data to support model training. For model architecture design, we design a hierarchical model framework consisting of a base model, pattern shifting modules, and consistency enhancement modules. For model training, we adopt a two-stage training strategy to enhance consistency preservation: a first stage for editing pattern shifting, and a second stage for consistency enhancement. Each stage involves training different modules with separate datasets. Finally, we conduct extensive evaluations of TBStar-Edit on a self-proposed e-commerce benchmark, and the results demonstrate that TBStar-Edit outperforms existing general-domain editing models in both objective metrics (VIE Score) and subjective user preference.
[316] ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving
Yongxuan Lyu, Guangfeng Jiang, Hongsi Liu, Jun Liu
Main category: cs.CV
TL;DR: ALISE is a novel framework for unsupervised LiDAR instance segmentation that eliminates the need for manual annotations by generating pseudo-labels using Vision Foundation Models and refining them through spatio-temporal voting and semantic supervision.
Details
Motivation: Manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming, creating a need for methods that can eliminate dependency on human labeling entirely.Method: Uses Vision Foundation Models guided by text and images to generate initial pseudo-labels, refines them through spatio-temporal voting module combining 2D and 3D semantics, and applies 2D prior-based losses and prototype-based contrastive loss for feature learning.
Result: Achieves state-of-the-art performance for unsupervised 3D instance segmentation, outperforming MWSIS (which uses ground-truth 2D bounding box supervision) by 2.53% in mAP (50.95% vs. 48.42%).
Conclusion: ALISE demonstrates that high-quality unsupervised LiDAR instance segmentation is achievable through comprehensive pseudo-label generation and refinement strategies, eliminating the need for costly manual annotations.
Abstract: The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes by a margin of 2.53% in mAP (50.95% vs. 48.42%).
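At its simplest, the spatio-temporal voting idea reduces to accumulating 2D-derived labels per 3D point across frames and keeping the majority; the snippet below is an illustrative reduction of the module, not its full offline/online design.

```python
from collections import Counter

def spatio_temporal_vote(votes_per_point):
    """votes_per_point: {point_id: [label from each frame/view it appears in]}.
    The majority label across the 2D and 3D semantic sources becomes the
    point's refined pseudo-label."""
    return {pid: Counter(votes).most_common(1)[0][0]
            for pid, votes in votes_per_point.items()}
```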
[317] From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding
Shih-Yao Lin, Sibendu Paul, Caren Chen
Main category: cs.CV
TL;DR: KeyScore is a caption-aware frame scoring method that combines semantic similarity, temporal representativeness, and contextual impact to select informative keyframes, enabling up to 99% frame reduction while improving video understanding performance.
Details
Motivation: Existing keyframe selection methods rely on heuristics, ignore semantics, or produce redundant frames, limiting efficient video understanding.Method: Proposes KeyScore for frame importance scoring using semantic similarity to captions, temporal representativeness, and contextual drop impact, plus STACFP for spatio-temporal adaptive clustering to generate diverse frame proposals.
Result: Achieves up to 99% frame reduction compared to full-frame processing while outperforming uniform 8-frame encoders on MSRVTT, MSVD, and DiDeMo benchmarks in video-text retrieval, keyframe extraction, and action recognition.
Conclusion: Focusing on semantically relevant frames enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
Abstract: Selecting informative keyframes is critical for efficient video understanding, yet existing approaches often rely on heuristics, ignore semantics, or produce redundant frames. We propose KeyScore, a caption-aware frame scoring method that combines three complementary signals: semantic similarity to captions, temporal representativeness, and contextual drop impact. Applied to large-scale video-caption datasets, KeyScore generates frame-level importance scores that enable training keyframe extractors or guiding video-language models. To support this, we also propose STACFP, a Spatio-Temporal Adaptive Clustering method that generates diverse and compact frame proposals across long videos. Together, KeyScore and STACFP reduce uninformative frames while preserving critical content, resulting in faster and more accurate inference. Our experiments on three standard video-language benchmarks (MSRVTT, MSVD, DiDeMo) show that combining STACFP and KeyScore enables up to 99% frame reduction compared to full-frame processing, while outperforming uniform 8-frame encoders in video-text retrieval, keyframe extraction, and action recognition tasks. By focusing on semantically relevant frames, our method enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
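The three-signal score can be written as a weighted sum; the scorer callables and weights below are illustrative assumptions, not the paper's formulation.

```python
def keyscore(frame, caption, frames, sim, rep, drop_impact, w=(0.4, 0.3, 0.3)):
    """Weighted combination of the three signals; scorers return values in [0, 1]."""
    semantic = sim(frame, caption)          # semantic similarity to the caption
    temporal = rep(frame, frames)           # how representative the frame is of the clip
    context = drop_impact(frame, frames)    # contextual impact of dropping the frame
    return w[0] * semantic + w[1] * temporal + w[2] * context
```

Frames can then be ranked by this score and the top fraction retained, which is how a 99% reduction over full-frame processing becomes feasible.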
[318] Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation
Fei Zhang, Rob Chancia, Josie Clapp, Amirhossein Hassanzadeh, Dimah Dera, Richard MacKenzie, Jan van Aardt
Main category: cs.CV
TL;DR: A semi-automated pipeline for semantic segmentation of TLS point clouds that reduces manual annotation effort through uncertainty-aware ensemble learning and targeted annotation, with validation on mangrove forests.
Details
Motivation: Manual annotation of terrestrial laser scanning (TLS) point clouds is costly and time-consuming, limiting large-scale semantic segmentation applications in ecological monitoring.Method: Projects 3D points to 2D spherical grid, enriches with multi-source features, uses ensemble segmentation networks to generate pseudo-labels and uncertainty maps, then back-projects to 3D with visualization tools for reviewer guidance.
Result: Performance saturates after ~12 annotated scans, geometric features contribute most, and compact nine-channel feature stacks achieve mIoU of ~0.76. Cross-dataset validation confirms generalization.
Conclusion: The proposed pipeline enables scalable, high-quality TLS point cloud segmentation with reduced annotation effort, supported by empirical guidance on data efficiency and feature importance.
Abstract: Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz-rit.github.io/through-the-lidars-eye/.
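A common way to realize ensemble uncertainty maps, consistent with the pipeline described above, is predictive entropy over the averaged class probabilities; high-entropy pixels are the ambiguous regions routed to human annotators.

```python
import numpy as np

def ensemble_uncertainty(prob_maps):
    """prob_maps: (M, H, W, C) softmax outputs from M ensemble members.
    Returns an (H, W) predictive-entropy map; high values flag ambiguous
    pixels for targeted annotation."""
    mean_p = prob_maps.mean(axis=0)
    return -(mean_p * np.log(np.clip(mean_p, 1e-12, None))).sum(axis=-1)
```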
[319] GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng
Main category: cs.CV
TL;DR: GTR-Bench is a new benchmark for evaluating geographic temporal reasoning in VLMs, requiring reasoning across maps and multiple videos with non-overlapping views. Current VLMs perform poorly (34.9% vs human 78.61%) due to imbalanced context usage, weak temporal forecasting, and poor map-video alignment.
Details
Motivation: Existing spatial-temporal benchmarks focus on either egocentric perspective reasoning with images/video or geographic perspective reasoning with maps, but fail to assess VLMs' geographic spatial-temporal intelligence with both images/video and graphics context, which is important for applications like traffic management and emergency response.Method: Introduces GTR-Bench, a benchmark for geographic temporal reasoning of moving targets in large-scale camera networks. It requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over unobserved spatial-temporal regions.
Result: Evaluations of over 10 popular VLMs show that even the best proprietary model (Gemini-2.5-Pro) achieves only 34.9% accuracy, significantly lagging behind human performance of 78.61%. The benchmark reveals three primary deficiencies: imbalanced spatial-temporal context utilization, weak temporal forecasting, and poor map-video comprehension/alignment.
Conclusion: GTR-Bench provides valuable insights and opens new opportunities for spatial-temporal intelligence research, highlighting significant gaps in current VLMs’ geographic temporal reasoning capabilities that need to be addressed for real-world applications.
Abstract: Recently, the spatial-temporal intelligence of Vision-Language Models (VLMs) has attracted much attention due to its importance for Autonomous Driving, Embodied AI, and Artificial General Intelligence. Existing spatial-temporal benchmarks mainly focus on egocentric perspective reasoning with images/video context, or geographic perspective reasoning with graphics context (e.g., a map), and thus fail to assess VLMs’ geographic spatial-temporal intelligence with both images/video and graphics context, which is important for areas like traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning about moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench demonstrate that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three primary deficiencies of current models in geo-temporal reasoning. (1) VLMs’ reasoning is impaired by an imbalanced utilization of spatial-temporal context. (2) VLMs are weak in temporal forecasting, which leads to worse performance on temporal-emphasized tasks than on spatial-emphasized tasks. (3) VLMs lack the proficiency to comprehend or align map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
[320] RetouchLLM: Training-free Code-based Image Retouching with Vision Language Models
Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng
Main category: cs.CV
TL;DR: RetouchLLM is a training-free white-box image retouching system that performs interpretable, code-based retouching without requiring paired training data, enabling controllable adjustments through natural language interaction.
Details
Motivation: Existing learning-based approaches require large-scale paired data and operate as black boxes, making retouching opaque and limiting adaptability to diverse user- or image-specific adjustments.Method: A training-free framework with two modules: a visual critic that identifies differences between input and reference images, and a code generator that produces executable codes. It progressively enhances images in multi-step retouching similar to human workflow.
Result: The approach generalizes well across diverse retouching styles and enables interpretable, controllable adjustments tailored to user intent through natural language interaction.
Conclusion: RetouchLLM provides a white-box, interpretable solution for image retouching that doesn’t require training data and offers flexible control over the retouching process.
Abstract: Image retouching not only enhances visual quality but also serves as a means of expressing personal preferences and emotions. However, existing learning-based approaches require large-scale paired data and operate as black boxes, making the retouching process opaque and limiting their adaptability to handle diverse, user- or image-specific adjustments. In this work, we propose RetouchLLM, a training-free white-box image retouching system, which requires no training data and performs interpretable, code-based retouching directly on high-resolution images. Our framework progressively enhances the image in a manner similar to how humans perform multi-step retouching, allowing exploration of diverse adjustment paths. It comprises of two main modules: a visual critic that identifies differences between the input and reference images, and a code generator that produces executable codes. Experiments demonstrate that our approach generalizes well across diverse retouching styles, while natural language-based user interaction enables interpretable and controllable adjustments tailored to user intent.
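The critic/code-generator interplay is essentially an iterative loop; the sketch below uses hypothetical callables to show the control flow, not the system's actual interfaces.

```python
def retouch_loop(image, reference, critic, codegen, apply_code, steps=3):
    """Training-free multi-step retouching: critique differences against a
    reference, emit executable adjustment code, apply it, and repeat."""
    current = image
    for _ in range(steps):
        feedback = critic(current, reference)   # natural-language visual differences
        code = codegen(feedback)                # e.g., exposure/contrast/curve ops
        current = apply_code(current, code)     # execute the white-box adjustment
    return current
```

Because every step is explicit code, the retouching path stays inspectable and editable by the user, which is the interpretability claim above.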
[321] One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Haipeng Liu, Yang Wang, Meng Wang
Main category: cs.CV
TL;DR: NTN-Diff is a frequency-aware diffusion model for text-guided image inpainting that addresses semantics consistency between masked/unmasked regions by disentangling mid-and-low frequency bands during denoising.
Details
Motivation: Previous methods failed to simultaneously preserve unmasked regions and achieve semantics consistency between masked and unmasked regions, due to entanglement of hybrid frequency bands with different robustness to text prompts.Method: Proposes null-text-null frequency-aware diffusion that divides denoising into early (high-level noise) and late (low-level noise) stages, using mid-frequency band as guidance for null-text denoising of low-frequency band, followed by text-guided denoising.
Result: Extensive experiments show NTN-Diff outperforms state-of-the-art diffusion models for text-guided image inpainting.
Conclusion: The proposed frequency-aware approach effectively addresses both preservation of unmasked regions and semantics consistency between regions in text-guided image inpainting.
Abstract: Text-guided image inpainting aims to reconstruct masked regions according to text prompts; the longstanding challenges are preserving the unmasked regions while achieving semantic consistency between the unmasked and inpainted masked regions. Previous methods fail to address both at once, typically remedying one at the expense of the other. As we observe, this stems from the entanglement of the hybrid (e.g., mid- and low-) frequency bands, which encode different image properties and exhibit different robustness to text prompts during denoising. In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting, which decomposes semantic consistency across masked and unmasked regions into per-frequency-band consistencies while preserving the unmasked regions, thereby addressing both challenges at once. We further divide the denoising process into early (high-level noise) and late (low-level noise) stages, in which the mid- and low-frequency bands are disentangled. The stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process and, meanwhile, serves as guidance for a null-text denoising process that denoises the low-frequency band of the masked regions; a subsequent text-guided denoising process at the late stage then achieves semantic consistency of the mid- and low-frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art text-guided diffusion models. Our code can be accessed at https://github.com/htyjers/NTN-Diff.
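The frequency-band disentanglement can be illustrated with a radial FFT mask; the band cutoffs below are arbitrary, and the operator is a generic stand-in for whatever decomposition the paper uses.

```python
import torch

def split_frequency_bands(x, low_cut=0.1, mid_cut=0.4):
    """Split an image tensor (..., H, W) into [low, mid, high] frequency bands
    using annular masks in the shifted 2D Fourier domain (illustrative only)."""
    f = torch.fft.fftshift(torch.fft.fft2(x))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    radius = (xx ** 2 + yy ** 2).sqrt()        # normalized distance from DC
    bands = []
    for lo, hi in [(0.0, low_cut), (low_cut, mid_cut), (mid_cut, 2.0)]:
        mask = ((radius >= lo) & (radius < hi)).to(f.dtype)
        bands.append(torch.fft.ifft2(torch.fft.ifftshift(f * mask)).real)
    return bands
```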
[322] A Multimodal Depth-Aware Method For Embodied Reference Understanding
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Main category: cs.CV
TL;DR: A novel ERU framework that combines LLM-based data augmentation, depth-map modality, and depth-aware decision module to improve embodied reference understanding in ambiguous visual scenes.
Details
Motivation: Prior open-vocabulary object detection methods often fail in ambiguous scenarios where multiple candidate objects exist, requiring better integration of linguistic and embodied cues for disambiguation.Method: Proposes an ERU framework with three key components: LLM-based data augmentation, depth-map modality integration, and a depth-aware decision module for robust cue integration.
Result: Significantly outperforms existing baselines on two datasets, achieving more accurate and reliable referent detection in complex or cluttered environments.
Conclusion: The proposed framework effectively addresses ambiguity in embodied reference understanding through joint utilization of multimodal cues and data augmentation.
Abstract: Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.
[323] Learning Neural Exposure Fields for View Synthesis
Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Christina Tsalicoglou, Keisuke Tateno, Jonathan T. Barron, Federico Tombari
Main category: cs.CV
TL;DR: Neural Exposure Fields (NExF) is a novel technique that learns a neural field predicting optimal exposure per 3D point, enabling robust 3D scene reconstruction and view synthesis in challenging high dynamic range scenarios.
Details
Motivation: Current neural scene representations degrade when handling images with strong exposure variations (e.g., indoor/outdoor scenes, rooms with windows), which are common in real-world captures.Method: Proposes learning a neural field that predicts optimal exposure values per 3D point, enabling joint optimization of scene representation and exposure field via a novel neural conditioning mechanism.
Result: Achieves state-of-the-art performance with over 55% improvement over best-performing baselines, trains faster than prior works, and enables accurate view synthesis in high dynamic range scenarios without post-processing or multi-exposure captures.
Conclusion: NExF provides a robust solution for 3D scene reconstruction with consistent appearance across varying exposure conditions, generalizing the concept of exposure optimization from 2D images to 3D space.
Abstract: Recent advances in neural scene representations have led to unprecedented quality in 3D reconstruction and view synthesis. Despite achieving high-quality results for common benchmarks with curated data, outputs often degrade for data that contain per-image variations such as strong exposure changes, present, e.g., in most scenes with indoor and outdoor areas or rooms with windows. In this paper, we introduce Neural Exposure Fields (NExF), a novel technique for robustly reconstructing 3D scenes with high quality and 3D-consistent appearance from challenging real-world captures. At its core, we propose to learn a neural field predicting an optimal exposure value per 3D point, enabling us to optimize exposure along with the neural scene representation. While capture devices such as cameras select optimal exposure per image/pixel, we generalize this concept and perform the optimization in 3D instead. This enables accurate view synthesis in high dynamic range scenarios, bypassing the need for post-processing steps or multi-exposure captures. Our contributions include a novel neural representation for exposure prediction, a system for joint optimization of the scene representation and the exposure field via a novel neural conditioning mechanism, and demonstrated superior performance on challenging real-world data. We find that our approach trains faster than prior works and produces state-of-the-art results on several benchmarks, improving over the best-performing baselines by more than 55%.
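The core representation is small: a field mapping a 3D point to a scalar exposure, optimized jointly with the scene. A minimal sketch with illustrative layer sizes (the paper's conditioning mechanism is not reproduced here):

```python
import torch
import torch.nn as nn

class ExposureField(nn.Module):
    """MLP mapping a 3D point to a scalar exposure used to condition rendering."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, xyz):          # (N, 3) world-space points
        return self.mlp(xyz)         # (N, 1) predicted per-point exposure
```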
cs.AI
[324] AniME: Adaptive Multi-Agent Planning for Long Animation Generation
Lisai Zhang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Yuxin Hong, Zihao Zhang, Yanzhang Liang, Yudong Jiang
Main category: cs.AI
TL;DR: AniME is a director-oriented multi-agent system for automated long-form anime production that covers the entire workflow from story to final video, using a director agent with global memory to coordinate specialized agents.
Details
Motivation: To create a scalable solution for AI-driven anime creation that can produce cinematic animation with consistent characters and synchronized audio-visual elements throughout the full production workflow.Method: Uses a director agent with global memory to coordinate specialized downstream agents, integrating customized Model Context Protocol (MCP) with downstream model instruction for adaptive control condition selection in diverse sub-tasks.
Result: The system successfully produces cinematic animation with consistent characters and synchronized audio-visual elements, demonstrating a complete automated workflow from story to final video.
Conclusion: AniME offers a scalable solution for automated long-form anime production through its director-oriented multi-agent architecture that maintains consistency and synchronization across the entire creative pipeline.
Abstract: We present AniME, a director-oriented multi-agent system for automated long-form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow and coordinates several downstream specialized agents. By integrating a customized Model Context Protocol (MCP) with downstream model instruction, each specialized agent adaptively selects control conditions for diverse sub-tasks. AniME produces cinematic animation with consistent characters and synchronized audio-visual elements, offering a scalable solution for AI-driven anime creation.
[325] Hypothesis Hunting with Evolving Networks of Autonomous Scientific Agents
Tennison Liu, Silas Ruhrberg Estévez, David L. Bentley, Mihaela van der Schaar
Main category: cs.AI
TL;DR: AScience framework enables large-scale exploratory hypothesis hunting using distributed LLM-based research agents that self-organize into networks to produce and peer-review findings, demonstrating effective discovery across diversity-quality-novelty dimensions.
Details
Motivation: Large-scale scientific datasets create opportunities for exploratory discovery beyond specific research questions, but current methods lack systematic support for sustained exploration across vast hypothesis spaces.
Method: Introduces AScience framework modeling discovery as agent-network-evaluation interaction, implemented as ASCollab - a distributed system of LLM-based research agents with heterogeneous behaviors that self-organize into evolving networks and continually produce/peer-review findings under shared evaluation standards.
Result: Experiments show social dynamics enable accumulation of expert-rated results along diversity-quality-novelty frontier, including rediscoveries of established biomarkers, extensions of known pathways, and proposals of new therapeutic targets in cancer cohorts.
Conclusion: Socially structured, agentic networks can sustain exploratory hypothesis hunting at scale, though wet-lab validation remains indispensable for final confirmation.
Abstract: Large-scale scientific datasets – spanning health biobanks, cell atlases, Earth reanalyses, and more – create opportunities for exploratory discovery unconstrained by specific research questions. We term this process hypothesis hunting: the cumulative search for insight through sustained exploration across vast and complex hypothesis spaces. To support it, we introduce AScience, a framework modeling discovery as the interaction of agents, networks, and evaluation norms, and implement it as ASCollab, a distributed system of LLM-based research agents with heterogeneous behaviors. These agents self-organize into evolving networks, continually producing and peer-reviewing findings under shared standards of evaluation. Experiments show that such social dynamics enable the accumulation of expert-rated results along the diversity-quality-novelty frontier, including rediscoveries of established biomarkers, extensions of known pathways, and proposals of new therapeutic targets. While wet-lab validation remains indispensable, our experiments on cancer cohorts demonstrate that socially structured, agentic networks can sustain exploratory hypothesis hunting at scale.
[326] Optimizing delivery for quick commerce factoring qualitative assessment of generated routes
Milon Bhattacharya, Milan Kumar
Main category: cs.AI
TL;DR: LLM-based framework critiques VRP-generated delivery routes using policy criteria, achieving 79-86% accuracy in identifying routing issues, providing scalable evaluation beyond conventional metrics.
Details
Motivation: VRP solvers have limited real-world effectiveness in India due to unstructured addresses, incomplete maps, and computational constraints in distance estimation, with last-mile delivery accounting for nearly half of operational expenses.
Method: Proposed framework uses large language models to critique VRP-generated routes against policy-based criteria, generating and evaluating 400 cases with both open-source and proprietary reasoning models.
Result: Open-source LLMs identified routing issues with 79% accuracy, while proprietary reasoning models achieved up to 86% accuracy in evaluating delivery plans.
Conclusion: LLM-based evaluation of VRP-generated routes provides an effective and scalable layer of assessment that goes beyond conventional distance and time metrics, with implications for improving cost efficiency, delivery reliability, and sustainability in last-mile logistics, especially in developing countries.
Abstract: India's e-commerce market is projected to grow rapidly, with last-mile delivery accounting for nearly half of operational expenses. Although vehicle routing problem (VRP) based solvers are widely used for delivery planning, their effectiveness in real-world scenarios is limited due to unstructured addresses, incomplete maps, and computational constraints in distance estimation. This study proposes a framework that employs large language models (LLMs) to critique VRP-generated routes against policy-based criteria, allowing logistics operators to evaluate and prioritise more efficient delivery plans. As an illustration of our approach, we generated, annotated, and evaluated 400 cases using large language models. Our study found that open-source LLMs identified routing issues with 79% accuracy, while proprietary reasoning models reached up to 86%. The results demonstrate that LLM-based evaluation of VRP-generated routes can be an effective and scalable layer of evaluation that goes beyond conventional distance- and time-based metrics. This has implications for improving cost efficiency, delivery reliability, and sustainability in last-mile logistics, especially for developing countries like India.
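A hedged sketch of the critique step: checking a solver's route against policy criteria with an LLM judge. The policy wording and the `call_llm` helper are hypothetical stand-ins, not the paper's prompts.

```python
# Hypothetical policy list for auditing a VRP-generated route.
POLICIES = [
    "No stop should be scheduled outside the receiver's delivery window.",
    "Consecutive stops should not backtrack across the same corridor.",
    "Total route duration must respect the rider's shift length.",
]

def critique_route(route_description: str, call_llm) -> str:
    """call_llm: any callable taking a prompt string and returning model text."""
    prompt = (
        "You are auditing a delivery route produced by a VRP solver.\n"
        f"Route:\n{route_description}\n\n"
        "Check the route against each policy and flag violations:\n"
        + "\n".join(f"- {p}" for p in POLICIES)
    )
    return call_llm(prompt)  # free-text critique used to rank candidate plans
```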
[327] Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, Jingdong Sun, Jun-Yan He, Qi Dai, Alexander G Hauptmann
Main category: cs.AI
TL;DR: UniWM is a unified world model that integrates visual foresight and planning in a single multimodal autoregressive backbone, addressing state-action misalignment in embodied navigation through memory-augmented imagination.
Details
Motivation: Current modular architectures separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios.
Method: UniWM uses a unified, memory-augmented world model with a hierarchical memory mechanism that integrates short-term perceptual cues with long-term trajectory context, enabling visual imagination and planning within a single multimodal autoregressive backbone.
Result: UniWM improves navigation success rates by up to 30%, significantly reduces trajectory errors across four benchmarks (Go Stanford, ReCon, SCAND, HuRoN), and shows impressive zero-shot generalization on TartanDrive dataset.
Conclusion: UniWM represents a principled step toward unified, imagination-driven embodied navigation by explicitly grounding action decisions in visually imagined outcomes.
Abstract: Enabling embodied agents to effectively imagine future states is critical for robust and generalizable visual navigation. Current state-of-the-art approaches, however, adopt modular architectures that separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability in novel or dynamic scenarios. To overcome this fundamental limitation, we propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. Unlike modular frameworks, UniWM explicitly grounds action decisions in visually imagined outcomes, ensuring tight alignment between prediction and control. A hierarchical memory mechanism further integrates detailed short-term perceptual cues with longer-term trajectory context, enabling stable, coherent reasoning over extended horizons. Extensive experiments across four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) demonstrate that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset. These results highlight UniWM as a principled step toward unified, imagination-driven embodied navigation.
[328] Robust Heuristic Algorithm Design with LLMs
Pantea Karimi, Dany Rouhana, Pooria Namyar, Siva Kesava Reddy Kakarla, Venkat Arun, Behnaz Arzani
Main category: cs.AI
TL;DR: The paper proposes enhancing LLM-based heuristic design by incorporating tools that explain underperformance and suggest fixes, resulting in more robust algorithms with significantly improved worst-case performance.
Details
Motivation: To generate more robust and performant heuristics by augmenting LLM approaches with explanations of underperformance and improvement suggestions.
Method: Three key ideas: (1) expose LLM to instances where heuristics underperform, (2) explain why underperformance occurs, and (3) specialize design to regions in the input space.
Result: Produced heuristics with ~28x better worst-case performance compared to FunSearch, improved average performance, while maintaining runtime.
Conclusion: Simple augmentation techniques that provide explanations and specialization can significantly enhance the robustness and performance of LLM-generated heuristics.
Abstract: We posit that we can generate more robust and performant heuristics if we augment approaches using LLMs for heuristic design with tools that explain why heuristics underperform and suggestions about how to fix them. We find that even simple ideas that (1) expose the LLM to instances where the heuristic underperforms; (2) explain why they occur; and (3) specialize design to regions in the input space, can produce more robust algorithms compared to existing techniques – the heuristics we produce have a $\sim28\times$ better worst-case performance compared to FunSearch, improve average performance, and maintain the runtime.
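The three ideas compose naturally into a refinement loop. The sketch below is one plausible reading; `score`, `explain`, and `call_llm` are hypothetical stand-ins for the paper's tooling.

```python
def refine_heuristic(heuristic_src, instances, score, explain, call_llm, rounds=3):
    """score(src, x) -> higher is better; explain(src, x) -> why it underperformed."""
    for _ in range(rounds):
        # (1) expose the LLM to the worst-performing instances
        worst = sorted(instances, key=lambda x: score(heuristic_src, x))[:5]
        # (2) attach an explanation of why each one underperforms
        notes = "\n".join(explain(heuristic_src, x) for x in worst)
        # (3) ask for a revision specialized to that region of the input space
        heuristic_src = call_llm(
            "Revise this heuristic so it handles the failing input region "
            "described below, without regressing elsewhere.\n"
            f"Heuristic:\n{heuristic_src}\nFailures:\n{notes}"
        )
    return heuristic_src
```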
[329] What Is Your Agent’s GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
Allison Sihan Jia, Daniel Huang, Nikhil Vytla, Nirvika Choudhury, John C Mitchell, Anupam Datta
Main category: cs.AI
TL;DR: The Agent GPA framework evaluates AI agents through Goal-Plan-Action metrics including Goal Fulfillment, Logical Consistency, Execution Efficiency, Plan Quality, and Plan Adherence.
Details
Motivation: To create a systematic evaluation paradigm that comprehensively covers agent failures and enables targeted performance improvements.
Method: Developed the Agent GPA framework with five evaluation metrics and tested it on TRAIL/GAIA benchmark dataset and an internal production dataset using LLM-judges.
Result: The framework covers all agent errors on TRAIL/GAIA, achieves 80-95% agreement with human annotation, and localizes errors with 86% agreement for targeted improvements.
Conclusion: The Agent GPA framework provides a comprehensive and effective evaluation system for AI agents that supports systematic error analysis and performance enhancement.
Abstract: We introduce the Agent GPA (Goal-Plan-Action) framework: an evaluation paradigm based on an agent’s operational loop of setting goals, devising plans, and executing actions. The framework includes five evaluation metrics: Goal Fulfillment, Logical Consistency, Execution Efficiency, Plan Quality, and Plan Adherence. Logical Consistency checks that an agent’s actions are consistent with its prior actions. Execution Efficiency checks whether the agent executes in the most efficient way to achieve its goal. Plan Quality checks whether an agent’s plans are aligned with its goals; Plan Adherence checks if an agent’s actions are aligned with its plan; and Goal Fulfillment checks that an agent’s final outcomes match the stated goals. Our experimental results on two benchmark datasets - the public TRAIL/GAIA dataset and an internal dataset for a production-grade data agent - show that this framework (a) provides a systematic way to cover a broad range of agent failures, including all agent errors on the TRAIL/GAIA benchmark dataset; (b) supports LLM-judges that exhibit strong agreement with human annotation, covering 80% to over 95% of errors; and (c) localizes errors with 86% agreement to enable targeted improvement of agent performance.
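The five metrics map naturally onto LLM-judge calls over a trace of the agent's run. A minimal sketch follows, assuming a hypothetical `judge(context, question)` helper returning a score in [0, 1]; the prompt wording is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    goal: str
    plan: str
    actions: list   # ordered action log
    outcome: str

METRICS = {
    "goal_fulfillment": "Does the final outcome match the stated goal?",
    "logical_consistency": "Is each action consistent with the actions before it?",
    "execution_efficiency": "Does the agent reach the goal without wasted steps?",
    "plan_quality": "Is the plan aligned with the goal?",
    "plan_adherence": "Do the actions follow the plan?",
}

def grade(trace: Trace, judge) -> dict:
    ctx = (f"Goal: {trace.goal}\nPlan: {trace.plan}\n"
           f"Actions: {trace.actions}\nOutcome: {trace.outcome}")
    return {name: judge(ctx, question) for name, question in METRICS.items()}
```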
[330] COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context
Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, Zizhao Zhang
Main category: cs.AI
TL;DR: COMPASS is a hierarchical multi-agent framework that addresses context management issues in long-horizon tasks by separating tactical execution, strategic oversight, and context organization into three specialized components, achieving up to 20% accuracy improvement over baselines.
Details
Motivation: Long-horizon tasks requiring sustained reasoning and multiple tool interactions are challenging for LLM agents due to small errors compounding across steps, hallucinations, loss of coherence, and context management issues where extended histories cause agents to overlook critical evidence or become distracted by irrelevant information.
Method: COMPASS uses a lightweight hierarchical framework with three specialized components: (1) a Main Agent for reasoning and tool use, (2) a Meta-Thinker for monitoring progress and issuing strategic interventions, and (3) a Context Manager for maintaining concise, relevant progress briefs for different reasoning stages.
Result: Across three challenging benchmarks (GAIA, BrowseComp, and Humanity’s Last Exam), COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. The framework also includes test-time scaling that matches DeepResearch agents’ performance and a post-training pipeline for efficient context management using smaller models.
Conclusion: COMPASS effectively addresses context management bottlenecks in long-horizon reasoning tasks through specialized component separation, demonstrating significant performance improvements and offering scalable, efficient solutions for complex multi-step problems.
Abstract: Long-horizon tasks that require sustained reasoning and multiple tool interactions remain challenging for LLM agents: small errors compound across steps, and even state-of-the-art models often hallucinate or lose coherence. We identify context management as the central bottleneck – extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect from previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. Across three challenging benchmarks – GAIA, BrowseComp, and Humanity’s Last Exam – COMPASS improves accuracy by up to 20% relative to both single- and multi-agent baselines. We further introduce a test-time scaling extension that elevates performance to match established DeepResearch agents, and a post-training pipeline that delegates context management to smaller models for enhanced efficiency.
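The division of labor among the three components can be sketched as a simple loop. All three callables below are hypothetical stand-ins for the LLM-backed components, and the brief format is illustrative.

```python
def compass_loop(task, main_agent, meta_thinker, context_manager, max_steps=20):
    brief = f"Task: {task}"                       # concise progress brief
    for _ in range(max_steps):
        step = main_agent(brief)                  # tactical reasoning + tool call (dict)
        intervention = meta_thinker(brief, step)  # strategic oversight, may replan
        # keep the context short and relevant instead of appending raw history
        brief = context_manager(brief, step, intervention)
        if step.get("done"):
            return step["answer"]
    return None
```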
[331] GTAlign: Game-Theoretic Alignment of LLM Assistants for Mutual Welfare
Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You
Main category: cs.AI
TL;DR: GTAlign is a game-theoretic alignment framework that treats LLM-user interaction as a strategic game, using payoff matrices during reasoning and mutual welfare rewards during training to achieve mutually beneficial outcomes.
Details
Motivation: Conventional alignment assumes maximizing model reward maximizes user welfare, but this fails when models produce overly verbose or suboptimal responses that don't match user preferences, creating a prisoner's dilemma-like situation.
Method: Integrates game-theoretic decision making into reasoning (constructing payoff matrices to estimate welfare for both LLM and user) and training (mutual welfare reward that reinforces cooperative responses). Also includes inference technique for dynamic adaptation to pricing policy changes.
Result: GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks.
Conclusion: Game-theoretic alignment provides a principled mechanism for mutually beneficial LLM-user interactions, addressing the limitations of conventional alignment approaches.
Abstract: Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner’s dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a mutual welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt the LLM’s response when the pricing policies of the LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and mutual welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign.
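A toy payoff matrix shows the flavor of the reasoning step: entries pair model welfare with user welfare, and the chosen action maximizes joint welfare rather than model reward alone. The numbers below are illustrative, not from the paper.

```python
# (model_welfare, user_welfare) for each (model action, user preference) pair.
PAYOFFS = {
    ("verbose", "wants_concise"): (0.8, 0.3),   # model reward high, user loses
    ("verbose", "wants_detail"):  (0.8, 0.9),
    ("concise", "wants_concise"): (0.6, 0.9),
    ("concise", "wants_detail"):  (0.6, 0.4),
}

def best_mutual_action(user_pref: str) -> str:
    # Maximize mutual welfare (sum of both payoffs), not model reward alone.
    return max(("verbose", "concise"),
               key=lambda a: sum(PAYOFFS[(a, user_pref)]))

print(best_mutual_action("wants_concise"))  # -> "concise" (1.5 vs. 1.1 joint welfare)
```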
[332] Everyone prefers human writers, including AI
Wouter Haverals, Meredith Martin
Main category: cs.AI
TL;DR: AI systems show stronger pro-human attribution bias than humans when evaluating literary style, systematically devaluing content labeled as AI-generated regardless of actual authorship.
Details
Motivation: To understand how humans and AI evaluate literary style and measure attribution bias in aesthetic judgments as AI writing tools become widespread.
Method: Two controlled experiments using Queneau’s Exercises in Style: Study 1 compared human participants and AI models evaluating literary passages under blind, accurate, and counterfactual labeling conditions; Study 2 tested bias generalization across a 14×14 matrix of AI evaluators and creators.
Result: Both humans and AI showed systematic pro-human attribution bias, but AI models exhibited 2.5-fold stronger bias (+34.3pp vs +13.7pp). Study 2 confirmed this bias operates across AI architectures, with evaluators inverting assessment criteria based solely on perceived authorship.
Conclusion: AI models have absorbed human cultural biases against artificial creativity during training and not only replicate but amplify this human tendency in aesthetic judgment.
Abstract: As AI writing tools become widespread, we need to understand how both humans and machines evaluate literary style, a domain where objective standards are elusive and judgments are inherently subjective. We conducted controlled experiments using Raymond Queneau’s Exercises in Style (1947) to measure attribution bias across evaluators. Study 1 compared human participants (N=556) and AI models (N=13) evaluating literary passages from Queneau versus GPT-4-generated versions under three conditions: blind, accurately labeled, and counterfactually labeled. Study 2 tested bias generalization across a 14$\times$14 matrix of AI evaluators and creators. Both studies revealed systematic pro-human attribution bias. Humans showed +13.7 percentage point (pp) bias (Cohen’s h = 0.28, 95% CI: 0.21-0.34), while AI models showed +34.3 percentage point bias (h = 0.70, 95% CI: 0.65-0.76), a 2.5-fold stronger effect (P$<$0.001). Study 2 confirmed this bias operates across AI architectures (+25.8pp, 95% CI: 24.1-27.6%), demonstrating that AI systems systematically devalue creative content when labeled as “AI-generated” regardless of which AI created it. We also find that attribution labels cause evaluators to invert assessment criteria, with identical features receiving opposing evaluations based solely on perceived authorship. This suggests AI models have absorbed human cultural biases against artificial creativity during training. Our study represents the first controlled comparison of attribution bias between human and artificial evaluators in aesthetic judgment, revealing that AI systems not only replicate but amplify this human tendency.
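The effect sizes quoted above are Cohen's h, the gap between two proportions on an arcsine-transformed scale. A quick check, assuming a 50% base rate purely for illustration (the study's raw proportions are not given here), recovers the reported h ≈ 0.28 for the +13.7 pp human bias.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    # h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# A 13.7 pp gap around an assumed 50% base rate:
print(round(cohens_h(0.637, 0.50), 2))  # ~0.28
```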
[333] ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review
Gaurav Sahu, Hugo Larochelle, Laurent Charlin, Christopher Pal
Main category: cs.AI
TL;DR: ReviewerToo is an AI-assisted peer review framework that achieves 81.8% accuracy in accept/reject decisions, close to human reviewers (83.9%), with AI-generated reviews rated higher quality than average human reviews but still below expert levels.
Details
Motivation: To address inconsistencies, subjectivity, and scalability challenges in traditional peer review by complementing human judgment with systematic AI assessments.
Method: Developed ReviewerToo framework with specialized reviewer personas and structured evaluation criteria, validated on 1,963 ICLR 2025 submissions using gpt-oss-120b model.
Result: AI achieved 81.8% accuracy vs 83.9% for average human reviewer; AI reviews rated higher quality than human average but below expert levels; AI excels at fact-checking and literature coverage but struggles with assessing novelty and theoretical contributions.
Conclusion: AI can enhance consistency, coverage, and fairness in peer review while human experts remain essential for complex evaluative judgments, providing foundation for hybrid peer-review systems that scale with scientific publishing growth.
Abstract: Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieve 81.8% accuracy for the task of categorizing a paper as accept/reject compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though still trailing the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.
[334] LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition
Yushuo Zheng, Zicheng Zhang, Xiongkuo Min, Huiyu Duan, Guangtao Zhai
Main category: cs.AI
TL;DR: LM Fight Arena is a novel benchmark that evaluates large multimodal models by having them compete against each other in Mortal Kombat II, testing real-time strategic reasoning in dynamic environments.
Details
Motivation: Existing benchmarks fail to assess LMM performance in real-time, adversarial environments, creating a gap in evaluating strategic reasoning capabilities.
Method: Six leading open- and closed-source models compete in controlled Mortal Kombat II tournaments, interpreting game frames and state data to select actions while controlling identical characters for fair comparison.
Result: The framework provides fully automated, reproducible, and objective assessment of LMMs’ strategic reasoning in dynamic settings.
Conclusion: LM Fight Arena introduces a challenging benchmark that bridges AI evaluation with interactive entertainment, offering a more realistic test of LMM capabilities.
Abstract: Existing benchmarks for large multimodal models (LMMs) often fail to capture their performance in real-time, adversarial environments. We introduce LM Fight Arena (Large Model Fight Arena), a novel framework that evaluates LMMs by pitting them against each other in the classic fighting game Mortal Kombat II, a task requiring rapid visual understanding and tactical, sequential decision-making. In a controlled tournament, we test six leading open- and closed-source models, in which each agent controls the same character to ensure a fair comparison. The models are prompted to interpret game frames and state data to select their next actions. Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.
[335] RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
Ashish Kattamuri, Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad
Main category: cs.AI
TL;DR: RADAR is a framework using mechanistic interpretability to detect data contamination in LLMs by distinguishing recall-based from reasoning-based responses through activation analysis.
Details
Motivation: Data contamination in LLMs leads to unreliable evaluation where models appear high-performing due to memorization rather than genuine reasoning capabilities.
Method: Extracts 37 features from surface-level confidence trajectories and deep mechanistic properties (attention specialization, circuit dynamics, activation flow patterns), using ensemble classifiers trained on these features.
Result: Achieves 93% accuracy on diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples.
Conclusion: Demonstrates mechanistic interpretability’s potential for advancing LLM evaluation beyond traditional surface-level metrics.
Abstract: Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.
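The final stage, an ensemble over the 37-dimensional feature vector, is straightforward to sketch with scikit-learn. The particular estimators and the placeholder data below are assumptions; the abstract does not specify the ensemble members.

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Placeholder data: 37 precomputed features per response;
# y labels recall-based (1) vs. reasoning-based (0) responses.
X = np.random.rand(200, 37)
y = np.random.randint(0, 2, 200)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("gb", GradientBoostingClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",   # average predicted probabilities across classifiers
)
ensemble.fit(X, y)
print(ensemble.predict_proba(X[:1]))  # [P(reasoning), P(recall)]
```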
[336] FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation
Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly Jr, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian
Main category: cs.AI
TL;DR: A new benchmark for evaluating multimodal RAG pipelines that tests ingestion, retrieval, and reasoning across text, tables, and images, with closed-source models outperforming open-source ones.
Details
Motivation: Existing benchmarks focus on specific aspects like retrieval, but there's a need to evaluate RAG pipelines holistically across multiple modalities and document types.
Method: Created a 93-question dataset with multimodal content, developed phrase-level recall and hallucination detection metrics, and evaluated 6 pipelines (2 open-source, 4 closed-source).
Result: Closed-source pipelines significantly outperformed open-source ones, especially on multimodal and cross-document questions. Human evaluation showed strong agreement with automated metrics (4.62/5 for correctness, 4.53/5 for hallucination detection).
Conclusion: The benchmark effectively evaluates multimodal RAG pipelines, revealing performance gaps between closed-source and open-source systems, with validated metrics that align well with human judgment.
Abstract: Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline’s ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline’s ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating “strongly agree”).
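A minimal reading of the phrase-level recall metric: the fraction of gold key phrases recovered in the pipeline's answer. The normalization and substring matching below are assumptions; the paper may match phrases differently.

```python
def phrase_recall(answer: str, gold_phrases: list[str]) -> float:
    """Fraction of gold phrases found (case-insensitive) in the answer."""
    answer_norm = " ".join(answer.lower().split())  # collapse whitespace
    hits = sum(1 for p in gold_phrases if p.lower() in answer_norm)
    return hits / len(gold_phrases) if gold_phrases else 1.0

print(phrase_recall("Revenue rose 12% in Q3 per Table 2.",
                    ["revenue rose 12%", "q3", "table 2"]))  # 1.0
```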
[337] EcphoryRAG: Re-Imagining Knowledge-Graph RAG via Human Associative Memory
Zirui Liao
Main category: cs.AI
TL;DR: EcphoryRAG is an entity-centric knowledge graph RAG framework that reduces token consumption by 94% through storing only core entities, enables multi-hop associative search, and dynamically infers implicit relations for complex question answering.
Details
Motivation: Inspired by human cognitive neuroscience where cues activate entity-centered memory traces for multi-hop recollection, aiming to create a more efficient and effective retrieval system for complex reasoning.
Method: Extracts and stores only core entities with metadata during indexing, extracts cue entities from queries, performs scalable multi-hop associative search across knowledge graph, and dynamically infers implicit relations between entities.
Result: Sets new state-of-the-art on 2WikiMultiHop, HotpotQA, and MuSiQue benchmarks, improving average Exact Match score from 0.392 to 0.474 over strong KG-RAG methods like HippoRAG.
Conclusion: Validates the efficacy of the entity-cue-multi-hop retrieval paradigm for complex question answering, demonstrating significant improvements in performance while reducing computational overhead.
Abstract: Cognitive neuroscience research indicates that humans leverage cues to activate entity-centered memory traces (engrams) for complex, multi-hop recollection. Inspired by this mechanism, we introduce EcphoryRAG, an entity-centric knowledge graph RAG framework. During indexing, EcphoryRAG extracts and stores only core entities with corresponding metadata, a lightweight approach that reduces token consumption by up to 94% compared to other structured RAG systems. For retrieval, the system first extracts cue entities from queries, then performs a scalable multi-hop associative search across the knowledge graph. Crucially, EcphoryRAG dynamically infers implicit relations between entities to populate context, enabling deep reasoning without exhaustive pre-enumeration of relationships. Extensive evaluations on the 2WikiMultiHop, HotpotQA, and MuSiQue benchmarks demonstrate that EcphoryRAG sets a new state-of-the-art, improving the average Exact Match (EM) score from 0.392 to 0.474 over strong KG-RAG methods like HippoRAG. These results validate the efficacy of the entity-cue-multi-hop retrieval paradigm for complex question answering.
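In its simplest form, the cue-entity multi-hop search reduces to a bounded traversal of the entity graph. The BFS below is a hedged sketch; the real system also infers implicit relations between entities on the fly, which is omitted here.

```python
from collections import deque

def multi_hop_search(graph: dict, cues: list[str], max_hops: int = 2) -> set:
    """graph: entity -> list of neighboring entities; cues: entities from the query."""
    visited = set(cues)
    frontier = deque((c, 0) for c in cues)
    while frontier:
        entity, hops = frontier.popleft()
        if hops >= max_hops:
            continue
        for neighbor in graph.get(entity, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return visited  # candidate entities whose metadata populates the context

g = {"insulin": ["pancreas", "glucose"], "glucose": ["glycolysis"]}
print(multi_hop_search(g, ["insulin"]))
# {'insulin', 'pancreas', 'glucose', 'glycolysis'}
```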
[338] Reimagining Agent-based Modeling with Large Language Model Agents via Shachi
So Kuroki, Yingtao Tian, Kou Misaki, Takashi Ikegami, Takuya Akiba, Yujin Tang
Main category: cs.AI
TL;DR: Shachi is a formal methodology and framework that decomposes LLM agent policies into Configuration, Memory, and Tools components for systematic analysis of emergent behaviors in multi-agent systems.
Details
Motivation: To address the lack of principled methodologies for controlled experimentation in studying emergent behaviors in LLM-driven multi-agent systems.
Method: Decomposes agent policy into core cognitive components: Configuration (intrinsic traits), Memory (contextual persistence), and Tools (expanded capabilities), orchestrated by an LLM reasoning engine.
Result: Validated on 10-task benchmark; demonstrated external validity by modeling real-world U.S. tariff shock - agent behaviors aligned with observed market reactions only when properly configured with memory and tools.
Conclusion: Provides rigorous, open-source foundation for building and evaluating LLM agents to foster more cumulative and scientifically grounded research.
Abstract: The study of emergent behaviors in large language model (LLM)-driven multi-agent systems is a critical research challenge, yet progress is limited by a lack of principled methodologies for controlled experimentation. To address this, we introduce Shachi, a formal methodology and modular framework that decomposes an agent’s policy into core cognitive components: Configuration for intrinsic traits, Memory for contextual persistence, and Tools for expanded capabilities, all orchestrated by an LLM reasoning engine. This principled architecture moves beyond brittle, ad-hoc agent designs and enables the systematic analysis of how specific architectural choices influence collective behavior. We validate our methodology on a comprehensive 10-task benchmark and demonstrate its power through novel scientific inquiries. Critically, we establish the external validity of our approach by modeling a real-world U.S. tariff shock, showing that agent behaviors align with observed market reactions only when their cognitive architecture is appropriately configured with memory and tools. Our work provides a rigorous, open-source foundation for building and evaluating LLM agents, aimed at fostering more cumulative and scientifically grounded research.
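The Configuration/Memory/Tools decomposition suggests a compact policy object. The sketch below is one plausible rendering, with all field contents and the `llm` callable as hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class AgentPolicy:
    configuration: dict                              # intrinsic traits, e.g. risk aversion
    memory: list = field(default_factory=list)       # contextual persistence
    tools: dict = field(default_factory=dict)        # name -> callable capability

    def act(self, observation, llm):
        # The LLM reasoning engine orchestrates the three components.
        prompt = (f"Traits: {self.configuration}\n"
                  f"Recent memory: {self.memory[-5:]}\n"
                  f"Available tools: {list(self.tools)}\n"
                  f"Observation: {observation}\nNext action?")
        action = llm(prompt)
        self.memory.append((observation, action))    # persist the interaction
        return action
```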
[339] DualResearch: Entropy-Gated Dual-Graph Retrieval for Answer Reconstruction
Jinxin Shi, Zongsheng Cao, Runmin Ma, Yusong Hu, Jie Zhou, Xin Li, Lei Bai, Liang He, Bo Zhang
Main category: cs.AI
TL;DR: DualResearch is a retrieval and fusion framework that addresses limitations in deep-research systems by modeling two complementary graphs (breadth semantic and depth causal) and fusing their evidence to improve scientific reasoning.
Details
Motivation: Deep-research frameworks suffer from context pollution, weak evidentiary support, and brittle execution paths when orchestrating external tools for complex scientific reasoning.
Method: Proposes DualResearch with two joint graphs: breadth semantic graph for stable background knowledge and depth causal graph for execution provenance. Uses semantic diffusion and causal-semantic path matching, then fuses evidence via entropy-gated rule with global calibration.
Result: Achieves competitive performance on scientific reasoning benchmarks HLE and GPQA. Using InternAgent log files, accuracy improves by 7.7% on HLE and 6.06% on GPQA.
Conclusion: DualResearch effectively compresses multi-tool execution logs into concise reasoning graphs and reconstructs answers stably, serving as a complementary enhancement to deep-research systems.
Abstract: The deep-research framework orchestrates external tools to perform complex, multi-step scientific reasoning that exceeds the native limits of a single large language model. However, it still suffers from context pollution, weak evidentiary support, and brittle execution paths. To address these issues, we propose DualResearch, a retrieval and fusion framework that matches the epistemic structure of tool-intensive reasoning by jointly modeling two complementary graphs: a breadth semantic graph that encodes stable background knowledge, and a depth causal graph that captures execution provenance. Each graph has a layer-native relevance function, seed-anchored semantic diffusion for breadth, and causal-semantic path matching with reliability weighting for depth. To reconcile their heterogeneity and query-dependent uncertainty, DualResearch converts per-layer path evidence into answer distributions and fuses them in log space via an entropy-gated rule with global calibration. The fusion up-weights the more certain channel and amplifies agreement. As a complement to deep-research systems, DualResearch compresses lengthy multi-tool execution logs into a concise reasoning graph, and we show that it can reconstruct answers stably and effectively. On the scientific reasoning benchmarks HLE and GPQA, DualResearch achieves competitive performance. Using log files from the open-source system InternAgent, its accuracy improves by 7.7% on HLE and 6.06% on GPQA.
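One plausible form of the entropy-gated, log-space fusion: the lower-entropy (more certain) channel gets more weight, and agreement between channels is amplified by the log-space combination. The exact gating function is an assumption, and the paper's global calibration is omitted.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def fuse(p_breadth, p_depth):
    h_b, h_d = entropy(p_breadth), entropy(p_depth)
    # Weight each channel inversely with its entropy (more certain -> heavier).
    w_b = (1.0 / h_b) / (1.0 / h_b + 1.0 / h_d)
    w_d = 1.0 - w_b
    log_fused = (w_b * np.log(np.clip(p_breadth, 1e-12, 1))
                 + w_d * np.log(np.clip(p_depth, 1e-12, 1)))
    fused = np.exp(log_fused)
    return fused / fused.sum()

p1 = np.array([0.7, 0.2, 0.1])   # confident channel
p2 = np.array([0.4, 0.3, 0.3])   # uncertain channel
print(fuse(p1, p2))               # tilted toward the confident channel's answer
```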
[340] Semantic-Condition Tuning: Fusing Graph Context with Large Language Models for Knowledge Graph Completion
Ruitong Liu, Yan Wen, Te Sun, Yunjia Wu, Pingyang Huang, Zihang Yu, Siyuan Li
Main category: cs.AI
TL;DR: SCT is a new knowledge injection paradigm that fuses knowledge graphs with LLMs through semantic graph conditions and adaptive fusion, outperforming prefix-tuning methods.
Details
Motivation: Current prefix-tuning methods simply concatenate knowledge embeddings with text inputs, overlooking rich relational semantics in KGs and imposing heavy reasoning burden on LLMs.
Method: Two modules: Semantic Graph Module uses GNN to extract context-aware semantic conditions from local graph neighborhood, and Condition-Adaptive Fusion Module adaptively modulates textual embedding via parameterized projectors for deep feature-wise interaction.
Result: Extensive experiments show SCT significantly outperforms prefix-tuning and other baselines on knowledge graph benchmarks.
Conclusion: SCT provides more direct and potent signals by modulating input representation with semantic graph context before LLM inference, enabling more accurate and robust knowledge reasoning.
Abstract: Fusing Knowledge Graphs with Large Language Models is crucial for knowledge-intensive tasks like knowledge graph completion. The prevailing paradigm, prefix-tuning, simply concatenates knowledge embeddings with text inputs. However, this shallow fusion overlooks the rich relational semantics within KGs and imposes a significant implicit reasoning burden on the LLM to correlate the prefix with the text. To address these, we propose Semantic-condition Tuning (SCT), a new knowledge injection paradigm comprising two key modules. First, a Semantic Graph Module employs a Graph Neural Network to extract a context-aware semantic condition from the local graph neighborhood, guided by knowledge-enhanced relations. Subsequently, this condition is passed to a Condition-Adaptive Fusion Module, which, in turn, adaptively modulates the textual embedding via two parameterized projectors, enabling a deep, feature-wise, and knowledge-aware interaction. The resulting pre-fused embedding is then fed into the LLM for fine-tuning. Extensive experiments on knowledge graph benchmarks demonstrate that SCT significantly outperforms prefix-tuning and other strong baselines. Our analysis confirms that by modulating the input representation with semantic graph context before LLM inference, SCT provides a more direct and potent signal, enabling more accurate and robust knowledge reasoning.
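The condition-adaptive fusion reads like FiLM-style feature-wise modulation: two projectors map the graph-derived condition to a scale and a shift applied to the textual embedding. That reading, and all dimensions below, are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ConditionAdaptiveFusion(nn.Module):
    def __init__(self, cond_dim=128, text_dim=768):
        super().__init__()
        self.scale_proj = nn.Linear(cond_dim, text_dim)  # first projector (scale)
        self.shift_proj = nn.Linear(cond_dim, text_dim)  # second projector (shift)

    def forward(self, text_emb, graph_cond):
        # text_emb: (batch, seq, text_dim); graph_cond: (batch, cond_dim) from a GNN
        gamma = self.scale_proj(graph_cond).unsqueeze(1)
        beta = self.shift_proj(graph_cond).unsqueeze(1)
        return (1 + gamma) * text_emb + beta  # pre-fused embedding fed to the LLM

fusion = ConditionAdaptiveFusion()
out = fusion(torch.randn(2, 16, 768), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 16, 768])
```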
[341] Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging
Qixiang Yin, Huanjin Yao, Jianghao Chen, Jiaxing Huang, Zhicheng Zhao, Fei Su
Main category: cs.AI
TL;DR: Tiny-R1V is a 3B lightweight multimodal model that achieves faster inference and higher accuracy through two-stage optimization: LIPO for training reasoning models to generate concise responses, and AMM for merging specialist models into a unified architecture.
Details
Motivation: Address challenges in lightweight MLLMs including large model size, overthinking, and compromised accuracy, as research on lightweight MLLM reasoning capabilities is lacking.
Method: Two-stage optimization: 1) LIPO - reinforcement learning method that dynamically adjusts response advantages to prioritize concise, high-quality responses; 2) AMM - training-free model merging that adaptively adjusts task vector weights and optimizes merged vectors via gradient projection regularization.
Result: Superior performance on ten reasoning benchmarks covering mathematics, structured data, OCR, and general capabilities, enabling lightweight models to excel in diverse multimodal reasoning tasks.
Conclusion: Tiny-R1V demonstrates that lightweight models can achieve excellent multimodal reasoning performance through optimized training and model merging techniques, unifying reasoning across multiple tasks with fewer tokens.
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across diverse tasks, they encounter numerous challenges in terms of reasoning efficiency, such as large model size, overthinking, and compromised accuracy in lightweight scenarios. However, research on the reasoning capabilities of lightweight MLLMs is quite lacking. To this end, we propose Tiny-R1V, a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization, while unifying multimodal reasoning across multiple tasks and using fewer tokens. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a novel reinforcement learning method, to train each reasoning model. LIPO dynamically adjusts the advantages of responses within groups, prioritizing concise yet high-quality responses to encourage the generation of shorter and more accurate responses. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method that merges multiple specialist models into a unified architecture. Specifically, AMM adaptively adjusts the weights of task vectors and robustly optimizes the merged vectors via a novel gradient projection regularization loss function, thus mitigating redundant conflicts between them. Extensive evaluations on ten widely-used reasoning benchmarks covering mathematics, structured data (charts, tables, documents), OCR, and general capabilities showcase the superior performance of Tiny-R1V, enabling lightweight models to excel in diverse multimodal reasoning tasks.
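A toy version of a length-informed, group-relative advantage: standard GRPO-style normalization plus a nudge toward shorter responses. The exact penalty form is an assumption; LIPO's actual adjustment may differ.

```python
import numpy as np

def lipo_advantages(rewards, lengths, length_coef=0.1):
    """Group-relative advantages with a bonus for shorter-than-average responses."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    base = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Shorter-than-average responses get a positive nudge, longer ones a penalty.
    length_bonus = -length_coef * (lengths - lengths.mean()) / (lengths.std() + 1e-8)
    return base + length_bonus

# Two equally rewarded responses: the shorter one ends up with the higher advantage.
print(lipo_advantages(rewards=[1.0, 1.0, 0.0], lengths=[120, 480, 300]))
```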
[342] TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
Yincen Qu, Huan Xiao, Feng Li, Hui Zhou, Xiangying Dai
Main category: cs.AI
TL;DR: A comprehensive benchmark for evaluating travel planning capabilities of LLMs, focusing on feasibility, reliability, and engagement, with a unified reward system and large-scale dataset.
Details
Motivation: Existing benchmarks for evaluating LLMs' travel planning capabilities fall short in assessing key aspects like feasibility, reliability, and engagement of travel plans.
Method: Developed a benchmark with fine-grained criteria unified into a single reward, created a dataset of 4,870 queries including 219 real-world requests, and tested various methods including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO.
Result: The evaluator achieved 60.75% agreement with travel-expert annotations and outperformed LLM-as-judge baselines. RL generally improved itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
Conclusion: The proposed benchmark enables better evaluation of LLMs’ travel planning capabilities, and reinforcement learning shows promise for improving the quality of generated travel itineraries.
Abstract: Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs’ planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
[343] RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
Hamed Mahdavi, Pouria Mahdavinia, Samira Malek, Pegah Mohammadipour, Alireza Hashemi, Majid Daliri, Alireza Farhadi, Amir Khasahmadi, Niloofar Mireshghallah, Vasant Honavar
Main category: cs.AI
TL;DR: LLMs can now detect errors in mathematical proofs but struggle with partial credit grading. The paper introduces agentic workflows that extract reference solutions and create problem-specific rubrics to improve grading consistency and human agreement.
Details
Motivation: While SOTA LLMs can solve most IMO problems, their ability to grade proofs - detecting errors, judging severity, and assigning fair scores beyond binary correctness - remains understudied.
Method: Used corpus of 90 Gemini 2.5 Pro-generated solutions graded on 1-4 scale with error annotations, and MathArena solution sets for IMO/USAMO 2025 scored on 0-7 scale. Introduced agentic workflows that extract reference solutions and automatically derive problem-specific rubrics for multi-step grading.
Result: Models can reliably flag incorrect solutions but exhibit calibration gaps in partial credit assignment. Proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics.
Conclusion: Agentic workflows with problem-specific rubrics significantly improve LLM-based proof grading, addressing calibration gaps and achieving better alignment with human evaluation standards.
Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
[344] Repairing Regex Vulnerabilities via Localization-Guided Instructions
Sicheol Sung, Joonghyuk Hahn, Yo-Sub Han
Main category: cs.AI
TL;DR: Hybrid framework combining symbolic analysis and LLMs for automated regex repair, achieving a 15.4-percentage-point improvement in repair rate over the state of the art by localizing vulnerabilities and generating precise fixes.
Details
Motivation: Current regex repair methods face a trade-off: symbolic systems are precise but can't handle complex patterns, while LLMs are generalizable but unreliable for syntax/semantic correctness. Need a solution that combines both strengths.
Method: LRR (Localized Regex Repair) framework that decouples problem identification from repair. First, symbolic module localizes vulnerable subpatterns, then LLM generates semantically equivalent fixes for the isolated segment.
Result: Successfully resolves complex repair cases intractable for rule-based approaches while avoiding semantic errors of LLM-only methods. Achieves 15.4% improvement in repair rate over state-of-the-art.
Conclusion: Provides validated methodology for automated repair problems, demonstrating that combining symbolic precision with LLM generalization can effectively solve regex repair challenges.
Abstract: Regular expressions (regexes) are foundational to modern computing for critical tasks like input validation and data parsing, yet their ubiquity exposes systems to regular expression denial of service (ReDoS), a vulnerability requiring automated repair methods. Current approaches, however, are hampered by a trade-off. Symbolic, rule-based systems are precise but fail to repair unseen or complex vulnerability patterns. Conversely, large language models (LLMs) possess the necessary generalizability but are unreliable for tasks demanding strict syntactic and semantic correctness. We resolve this impasse by introducing a hybrid framework, localized regex repair (LRR), designed to harness LLM generalization while enforcing reliability. Our core insight is to decouple problem identification from the repair process. First, a deterministic, symbolic module localizes the precise vulnerable subpattern, creating a constrained and tractable problem space. Then, the LLM is invoked to generate a semantically equivalent fix for this isolated segment. This combined architecture successfully resolves complex repair cases intractable for rule-based repair while avoiding the semantic errors of LLM-only approaches. Our work provides a validated methodology for solving such problems in automated repair, improving the repair rate by 15.4 percentage points over the state-of-the-art. Our code is available at https://github.com/cdltlehf/LRR.
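The localize-then-repair split can be illustrated with a deliberately simplified symbolic detector for nested quantifiers, the classic ReDoS pattern; only the flagged segment would then be handed to the LLM. Real ReDoS analysis is far more involved than this regex-over-regex check.

```python
import re

# Crude detector for a quantified group that itself contains a quantifier,
# e.g. "(a+)+" -- a well-known source of catastrophic backtracking.
NESTED_QUANTIFIER = re.compile(r"\((?:[^()]*[+*])+[^()]*\)[+*]")

def localize_vulnerable(pattern: str):
    m = NESTED_QUANTIFIER.search(pattern)
    return (m.start(), m.end(), m.group()) if m else None

pattern = r"^(a+)+$"                 # classic ReDoS example
print(localize_vulnerable(pattern))  # (1, 6, '(a+)+') -> segment sent to the LLM
```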
[345] Auto-scaling Continuous Memory for GUI Agent
Wenyi Wu, Kun Zhou, Ruoxin Yuan, Vivian Yu, Stephen Wang, Zhiting Hu, Biwei Huang
Main category: cs.AI
TL;DR: Proposes continuous memory for GUI agents using VLM embeddings to reduce context length while preserving visual details, with auto-scaling data collection pipeline.
Details
Motivation: Prior GUI agents compress trajectories into text tokens, which increases context length and misses important visual cues like widget size and position.
Method: Encodes GUI trajectories into fixed-length continuous embeddings using VLM as encoder, plugs into backbone input layer. Uses auto-scaling data flywheel for low-cost memory growth via search, task synthesis, rollout, and verification.
Result: Performance improves monotonically with memory size and retrieval depth. Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models like GPT-4o and Claude-4.
Conclusion: Continuous memory effectively enhances GUI agents for long-horizon tasks and distribution shifts, providing scalable memory solution with preserved visual information.
Abstract: We study how to endow GUI agents with scalable memory that helps them generalize across unfamiliar interfaces and long-horizon tasks. Prior GUI agents compress past trajectories into text tokens, which balloons context length and misses decisive visual cues (e.g., exact widget size and position). We propose a continuous memory that encodes each GUI trajectory into a fixed-length sequence of continuous embeddings using the VLM itself as an encoder; these embeddings are plugged directly into the backbone’s input layer, sharply reducing context cost while preserving fine-grained visual information. As memory size and retrieval depth increase, performance improves monotonically, unlike text memories that degrade with long prompts. To grow memory at low cost, we introduce an auto-scaling data flywheel that (i) discovers new environments via search, (ii) synthesizes tasks with an open-source VLM, (iii) rolls out trajectories with the agent, and (iv) verifies success with the same VLM. Using this pipeline, we collect 100k+ trajectories for about $4000 and fine-tune only the memory encoder (LoRA on a Q-Former, 1.2% parameters) with 1,500 samples. On real-world GUI benchmarks, our memory-augmented agent consistently improves success rates under long horizons and distribution shifts. Notably, Qwen-2.5-VL-7B + continuous memory achieves performance comparable to state-of-the-art closed-source models (e.g., GPT-4o, Claude-4).
[346] Humanoid Artificial Consciousness Designed with Large Language Model Based on Psychoanalysis and Personality Theory
Sang Hun Kim, Jongmin Lee, Dongkyu Park, So Young Lee, Yosep Chong
Main category: cs.AI
TL;DR: This study proposes integrating psychoanalysis and MBTI personality types to create artificial consciousness modules, developing three consciousness types and 16 personality characters to simulate human-like cognition in decision-making scenarios.
Details
Motivation: Current LLMs struggle with human consciousness simulation due to hallucination issues, and there's a need to create more intuitive AI systems with humanoid consciousness capabilities.
Method: Developed three artificial consciousnesses (self-awareness, unconsciousness, preconsciousness) based on psychoanalysis, and 16 MBTI personality characters with attributes like needs, status, and memories. Tested decision-making in 10 situations using survey evaluation, ChatGPT classification, and qualitative review.
Result: Both quantitative and qualitative analyses showed high likelihood of well-simulated consciousness, though differences between characters and consciousness types were not very significant.
Conclusion: The models incorporating psychoanalysis and personality theory can lead to more intuitive and adaptable AI systems with humanoid consciousness, opening new avenues for improving AI interactions in complex cognitive contexts.
Abstract: Human consciousness is still a concept hard to define with current scientific understanding. Although Large Language Models (LLMs) have recently demonstrated significant advancements across various domains including translation and summarization, human consciousness is not something current technology can readily imitate, owing to so-called hallucination. This study, therefore, proposes a novel approach to address these challenges by integrating psychoanalysis and the Myers-Briggs Type Indicator (MBTI) into constructing consciousness and personality modules. We developed three artificial consciousnesses (self-awareness, unconsciousness, and preconsciousness) based on the principles of psychoanalysis. Additionally, we designed 16 characters with different personalities representing the sixteen MBTI types, with several attributes such as needs, status, and memories. To determine if our model’s artificial consciousness exhibits human-like cognition, we created ten distinct situations considering seven attributes such as emotional understanding and logical thinking. The decision-making process of artificial consciousness and the final action were evaluated in three ways: survey evaluation, three-tier classification via ChatGPT, and qualitative review. Both quantitative and qualitative analyses indicated a high likelihood of well-simulated consciousness, although the difference in response between different characters and consciousnesses was not very significant. This implies that the developed models incorporating elements of psychoanalysis and personality theory can lead to building a more intuitive and adaptable AI system with humanoid consciousness. Therefore, this study contributes to opening up new avenues for improving AI interactions in complex cognitive contexts.
[347] MEC$^3$O: Multi-Expert Consensus for Code Time Complexity Prediction
Joonghyuk Hahn, Soohan Lim, Yo-Sub Han
Main category: cs.AI
TL;DR: MEC$^3$O is a multi-expert consensus system that improves code time complexity prediction by assigning LLMs to specific complexity classes and using structured debates with weighted consensus, outperforming baselines by at least 10% accuracy.
Details
Motivation: LLMs without fine-tuning struggle with certain complexity classes, and no single LLM excels at every class. Different models show advantages in specific complexity classes, suggesting the need for a specialized approach.Method: MEC$^3$O assigns LLMs to complexity classes based on performance, provides class-specialized instructions to turn them into experts, and uses structured debates with a weighted consensus mechanism to integrate predictions.
Result: MEC$^3$O outperforms open-source baselines by at least 10% higher accuracy and macro-F1 scores, surpasses GPT-4o-mini in macro-F1 scores on average, and achieves competitive F1 scores to GPT-4o and GPT-o4-mini.
Conclusion: The multi-expert debates and weight consensus strategy effectively generate final predictions, demonstrating the value of specialized expertise assignments and structured consensus mechanisms for code complexity prediction.
Abstract: Predicting the complexity of source code is essential for software development and algorithm analysis. Recently, Baik et al. (2025) introduced CodeComplex for code time complexity prediction. The paper shows that LLMs without fine-tuning struggle with certain complexity classes. This suggests that no single LLM excels at every class, but rather each model shows advantages in certain classes. We propose MEC$^3$O, a multi-expert consensus system, which extends the multi-agent debate frameworks. MEC$^3$O assigns LLMs to complexity classes based on their performance and provides them with class-specialized instructions, turning them into experts. These experts engage in structured debates, and their predictions are integrated through a weighted consensus mechanism. Our expertise assignments to LLMs effectively handle Degeneration-of-Thought, reducing reliance on a separate judge model and preventing convergence to incorrect majority opinions. Experiments on CodeComplex show that MEC$^3$O outperforms the open-source baselines, achieving at least 10% higher accuracy and macro-F1 scores. It also surpasses GPT-4o-mini in macro-F1 scores on average and demonstrates F1 scores on par with GPT-4o and GPT-o4-mini on average. This demonstrates the effectiveness of multi-expert debates and the weighted consensus strategy in generating the final predictions. Our code and data are available at https://github.com/suhanmen/MECO.
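A minimal sketch of one plausible weighted-consensus rule, assuming per-class validation accuracies are available for each expert (the dictionaries below are invented for illustration): each expert's vote for a class is weighted by how good that expert has proven to be on that class, which is the intuition behind assigning models to complexity classes by performance.

```python
from collections import defaultdict

def weighted_consensus(predictions, expertise):
    """Aggregate expert predictions with performance-based weights.
    predictions: {expert: predicted_class}
    expertise:   {expert: {class: accuracy}}  (from a validation split)
    An expert's vote counts more when it votes for a class it is
    demonstrably good at -- a simplified stand-in for MEC^3O's scheme."""
    scores = defaultdict(float)
    for expert, cls in predictions.items():
        scores[cls] += expertise[expert].get(cls, 0.0)
    return max(scores, key=scores.get)

predictions = {"llm_a": "O(n log n)", "llm_b": "O(n^2)", "llm_c": "O(n log n)"}
expertise = {
    "llm_a": {"O(n log n)": 0.9, "O(n^2)": 0.4},
    "llm_b": {"O(n log n)": 0.5, "O(n^2)": 0.6},
    "llm_c": {"O(n log n)": 0.7, "O(n^2)": 0.5},
}
print(weighted_consensus(predictions, expertise))  # O(n log n)
```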
[348] OSCAR: Orthogonal Stochastic Control for Alignment-Respecting Diversity in Flow Matching
Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang
Main category: cs.AI
TL;DR: A training-free method that enhances diversity in flow-based text-to-image models by adding orthogonal stochastic perturbations during inference, improving diversity metrics without degrading quality.
Details
Motivation: Flow-based text-to-image models follow deterministic trajectories, requiring repeated sampling for diversity discovery, which is inefficient and costly.Method: Uses feature-space objective for lateral trajectory spread and time-scheduled stochastic perturbation orthogonal to generation flow, requiring no retraining or base sampler modification.
Result: Consistently improves diversity metrics (Vendi Score, Brisque) across multiple text-to-image settings while maintaining image quality and alignment under fixed sampling budgets.
Conclusion: The method provides principled diversity enhancement through geometric constraints that preserve marginal distribution, maintaining generation quality while boosting variation.
Abstract: Flow-based text-to-image models follow deterministic trajectories, forcing users to repeatedly sample to discover diverse modes, which is a costly and inefficient process. We present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Our procedure requires no retraining or modification to the base sampler and is compatible with common flow-matching solvers. Theoretically, our method is shown to monotonically increase a volume surrogate while, due to its geometric constraints, approximately preserving the marginal distribution. This provides a principled explanation for why generation quality is robustly maintained. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.
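The geometric constraint is easy to state in code. A minimal numpy sketch, assuming a single Euler step of a flow-matching sampler: the injected noise is projected onto the orthogonal complement of the current velocity, so it spreads trajectories laterally without moving samples along (or against) the flow. The fixed `scale` stands in for the paper's time-dependent schedule.

```python
import numpy as np

def orthogonal_perturbation(velocity, noise, scale):
    """Project noise onto the subspace orthogonal to the generation flow,
    so the perturbation boosts variation without pushing samples along
    the velocity field (which would degrade details or fidelity)."""
    v = velocity.ravel()
    n = noise.ravel()
    n_perp = n - (n @ v) / (v @ v) * v      # remove component along v
    return scale * n_perp.reshape(noise.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))                 # current latent
v = rng.normal(size=(4, 4))                 # flow velocity at (x, t)
eps = orthogonal_perturbation(v, rng.normal(size=x.shape), scale=0.1)
print(np.allclose(eps.ravel() @ v.ravel(), 0.0))  # True: orthogonal to flow
x_next = x + 0.01 * v + eps                 # Euler step plus lateral spread
```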
[349] Physics-Informed High-order Graph Dynamics Identification Learning for Predicting Complex Networks Long-term Dynamics
Bicheng Wang, Jinping Wang, Yibo Sue
Main category: cs.AI
TL;DR: A higher-order network dynamics identification method using dynamic hypergraph learning and dual-driven prediction with Koopman operator theory and physical information neural differential equations for accurate and interpretable long-term prediction of complex network dynamics.
Details
Motivation: Existing methods for predicting complex network dynamics have limitations: they use simple graphs that only capture pairwise relationships, missing higher-order non-pairwise structures, and existing models either lack accuracy (theoretical models) or interpretability (data-driven models).Method: 1) Dynamic hypergraph learning to capture higher-order non-pairwise relationships in complex networks; 2) Dual-driven dynamic prediction module combining Koopman operator theory to linearize nonlinear dynamics and physics-informed neural differential equations to ensure physical consistency.
Result: Experimental validation on public datasets and self-built industrial chain networks shows the method achieves good prediction accuracy and long-term prediction performance.
Conclusion: The proposed approach effectively addresses the limitations of existing methods by capturing higher-order relationships and ensuring both accuracy and interpretability through the dual-driven framework.
Abstract: Learning complex network dynamics is fundamental to understanding, modelling and controlling real-world complex systems. There are two main problems in the task of predicting the dynamic evolution of complex networks: on the one hand, existing methods usually use simple graphs to describe the relationships in complex networks; however, this approach can only capture pairwise relationships, while there may be rich non-pairwise structured relationships in the network. First-order GNNs have difficulty in capturing dynamic non-pairwise relationships. On the other hand, theoretical prediction models lack accuracy and data-driven prediction models lack interpretability. To address the above problems, this paper proposes a higher-order network dynamics identification method for long-term dynamic prediction of complex networks. Firstly, to address the problem that traditional graph machine learning can only deal with pairwise relations, dynamic hypergraph learning is introduced to capture the higher-order non-pairwise relations among complex networks and improve the accuracy of complex network modelling. Then, a dual-driven (physics- and data-driven) dynamic prediction module is proposed. Koopman operator theory is introduced to transform the nonlinear dynamical differential equations governing the dynamic evolution of complex networks into linear systems for solving. Meanwhile, the physics-informed neural differential equation method is utilised to ensure that the dynamic evolution conforms to physical laws. The dual-driven dynamic prediction module ensures both accuracy and interpretability of the prediction. Validated on public datasets and self-built industrial chain network datasets, the experimental results show that the method in this paper has good prediction accuracy and long-term prediction performance.
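For readers unfamiliar with the Koopman step, a toy EDMD-style sketch follows: a nonlinear system is lifted into a space of observables (hand-picked polynomials here, standing in for the learned encoder), where one-step evolution is fit as a single linear matrix by least squares. The logistic-map dynamics and observables are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy nonlinear system: logistic-map-like dynamics x' = 3.6 * x * (1 - x)
xs = [np.array([0.3, 0.6])]
for _ in range(200):
    x = xs[-1]
    xs.append(3.6 * x * (1 - x))
X = np.array(xs)

def lift(x):
    """Hand-picked observables standing in for a learned encoder."""
    return np.concatenate([x, x**2, [1.0]])

Z = np.array([lift(x) for x in X])
# EDMD: find a linear Koopman matrix K with Z[t+1] ~= Z[t] @ K
K, *_ = np.linalg.lstsq(Z[:-1], Z[1:], rcond=None)
z_pred = Z[0] @ K  # one-step *linear* prediction in the lifted space
print(np.abs(z_pred - Z[1]).max())  # small residual on this toy system
```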
[350] Leading the Follower: Learning Persuasive Agents in Social Deduction Games
Zhang Zheng, Deheng Ye, Peilin Zhao, Hao Wang
Main category: cs.AI
TL;DR: This paper proposes a reinforcement learning framework that trains LLM agents to optimize persuasive communication in social deduction games, treating dialogue as a Stackelberg competition to strategically influence other players’ responses.
Details
Motivation: Existing LLM approaches in social deduction games focus mainly on information processing and strategy selection, but overlook the crucial role of persuasive communication in influencing other players' beliefs and responses, which is essential for success in these games.Method: The authors formalize turn-based dialogue in SDGs as a Stackelberg competition (leader-follower dynamic) and propose a reinforcement learning framework that trains agents to optimize utterances for persuasive impact.
Result: Comprehensive experiments across three diverse social deduction games demonstrate that the proposed agents significantly outperform baseline approaches.
Conclusion: This work represents a significant advancement in developing AI agents capable of strategic social influence, with broader implications for scenarios requiring persuasive communication beyond just social deduction games.
Abstract: Large language model (LLM) agents have shown remarkable progress in social deduction games (SDGs). However, existing approaches primarily focus on information processing and strategy selection, overlooking the significance of persuasive communication in influencing other players' beliefs and responses. In SDGs, success depends not only on making correct deductions but also on convincing others to respond in alignment with one's intent. To address this limitation, we formalize turn-based dialogue in SDGs as a Stackelberg competition, where the current player acts as the leader who strategically influences the follower's response. Building on this theoretical foundation, we propose a reinforcement learning framework that trains agents to optimize utterances for persuasive impact. Through comprehensive experiments across three diverse SDGs, we demonstrate that our agents significantly outperform baselines. This work represents a significant step toward developing AI agents capable of strategic social influence, with implications extending to scenarios requiring persuasive communication.
[351] PAC Reasoning: Controlling the Performance Loss for Efficient Reasoning
Hao Zeng, Jianguo Huang, Bingyi Jing, Hongxin Wei, Bo An
Main category: cs.AI
TL;DR: The paper proposes PAC reasoning, a method to dynamically switch between thinking and non-thinking modes in large reasoning models while controlling performance loss with statistical guarantees.
Details
Motivation: Large reasoning models have high computational costs during deployment, and existing dynamic switching approaches lack statistical guarantees for performance loss, which is critical for high-stakes applications.Method: Constructs an upper confidence bound on performance loss as a monotone function of uncertainty score, then determines a threshold for switching to non-thinking model to ensure bounded performance loss.
Result: Comprehensive experiments on reasoning benchmarks show the method can save computational budgets while controlling user-specified performance loss.
Conclusion: PAC reasoning provides a distribution-free approach to ensure bounded performance loss when dynamically switching between thinking and non-thinking modes in large reasoning models.
Abstract: Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A popular direction of efficiency improvement is to switch the LRM between thinking and nonthinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning, which keeps the performance loss within a user-specified tolerance. In particular, we construct an upper confidence bound on the performance loss, formulated as a monotone function of the uncertainty score, and subsequently determine a threshold for switching to the nonthinking model. Theoretically, using the threshold to switch between the thinking and nonthinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method saves computational budget while keeping the performance loss within the user-specified tolerance.
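A minimal sketch of the calibration idea, assuming a held-out set of uncertainty scores paired with the per-example loss incurred by answering in non-thinking mode: scan thresholds in order of uncertainty and keep the largest one whose Hoeffding-style upper confidence bound on the mean loss stays below the tolerance. The specific bound and data are illustrative simplifications of the paper's monotone UCB construction.

```python
import numpy as np

def pac_threshold(uncertainty, loss, eps, delta):
    """Pick the largest uncertainty threshold tau such that a Hoeffding
    upper confidence bound on the loss of answering with the non-thinking
    model (over items with u <= tau) stays below eps, with prob. 1-delta.
    Simplified stand-in for the paper's UCB construction."""
    order = np.argsort(uncertainty)
    u, l = uncertainty[order], loss[order]
    best = -np.inf
    for k in range(1, len(u) + 1):
        mean = l[:k].mean()
        ucb = mean + np.sqrt(np.log(1 / delta) / (2 * k))  # Hoeffding bound
        if ucb <= eps:
            best = u[k - 1]
    return best

rng = np.random.default_rng(0)
u = rng.uniform(size=2000)                 # uncertainty on calibration set
l = (rng.uniform(size=2000) < u * 0.3).astype(float)  # loss in [0, 1]
tau = pac_threshold(u, l, eps=0.1, delta=0.05)
print(f"switch to non-thinking mode when uncertainty <= {tau:.3f}")
```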
[352] Dr. Bias: Social Disparities in AI-Powered Medical Guidance
Emma Kondrup, Anne Imouza
Main category: cs.AI
TL;DR: LLMs show systematic bias in medical advice generation, with Indigenous and intersex patients receiving less readable and more complex responses, especially in intersectional groups.
Details
Motivation: To investigate how bias in LLMs may translate into medical advice and impact different social groups, particularly in healthcare disparities.Method: Exploratory analysis of LLM answers to medical questions across clinical domains, simulating patient profiles varying in sex, age, and ethnicity, comparing natural language features of responses.
Result: LLMs generate responses that systematically differ between social groups - Indigenous and intersex patients receive less readable and more complex advice, with amplified trends in intersectional groups.
Conclusion: Urgent need for AI literacy, investigation, and mitigation by developers to ensure systemic differences are diminished and don’t translate to unjust patient support.
Abstract: With the rapid progress of Large Language Models (LLMs), the general public now has easy and affordable access to applications capable of answering most health-related questions in a personalized manner. These LLMs are increasingly proving to be competitive, and now even surpass professionals in some medical capabilities. They hold particular promise in low-resource settings, considering they provide the possibility of widely accessible, quasi-free healthcare support. However, the evaluations that fuel these motivations largely lack insight into the social nature of healthcare, remaining oblivious to health disparities between social groups and to how bias may translate into LLM-generated medical advice and impact users. We provide an exploratory analysis of LLM answers to a series of medical questions spanning key clinical domains, where we simulate these questions being asked by several patient profiles that vary in sex, age range, and ethnicity. By comparing natural language features of the generated responses, we show that, when LLMs are used for medical advice generation, they generate responses that systematically differ between social groups. In particular, Indigenous and intersex patients receive advice that is less readable and more complex. We observe these trends amplify when intersectional groups are considered. Considering the increasing trust individuals place in these models, we argue for higher AI literacy and for the urgent need for investigation and mitigation by AI developers to ensure these systemic differences are diminished and do not translate to unjust patient support. Our code is publicly available on GitHub.
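The comparison methodology can be illustrated with a small sketch: simulate patient profiles in the prompt, collect the model's answer, and compute simple readability proxies per group. The `llm` call is hypothetical, and the two features below are crude stand-ins for the richer natural-language features the paper analyzes.

```python
def complexity_features(text):
    """Crude readability proxies: average sentence length in words and
    average word length in characters (stand-ins for the paper's features)."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = text.split()
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "chars_per_word": sum(len(w) for w in words) / max(len(words), 1),
    }

profiles = ["a 30-year-old Indigenous woman", "a 30-year-old white man"]
question = "What should I do about persistent chest pain?"
for p in profiles:
    prompt = f"I am {p}. {question}"
    # answer = llm(prompt)  # hypothetical call to the model under audit
    answer = "Please see a doctor promptly. Chest pain can be serious."
    print(p, complexity_features(answer))
```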
[353] Comparing Knowledge Source Integration Methods for Optimizing Healthcare Knowledge Fusion in Rescue Operation
Mubaris Nadeem, Madjid Fathi
Main category: cs.AI
TL;DR: This paper presents conceptual models for knowledge fusion in medicine using knowledge graphs to integrate multiple healthcare knowledge sources for improved patient decision-making.
Details
Motivation: Medical decision-making requires combining complex and varied medical knowledge with patient health information, creating a need for unified approaches to gather, analyze, and utilize medical knowledge for accurate patient-driven decisions.Method: The paper develops multiple conceptual models based on knowledge graph structures to enable knowledge fusion and integrate various knowledge sources for healthcare applications.
Result: The approach provides health professionals with the ability to select from multiple contextually aligned knowledge sources, supporting critical decisions in medical operations and rescue scenarios.
Conclusion: Knowledge fusion through knowledge graphs offers a viable approach to unite diverse medical knowledge sources, enabling more accurate and contextually relevant decision-making in healthcare settings.
Abstract: In the field of medicine and healthcare, the utilization of medical expertise, based on medical knowledge combined with patients' health information, is a life-critical challenge for patients and health professionals. The underlying complexity and variety create the need for a unified approach to gather, analyze, and utilize existing knowledge of medical treatments and medical operations, so that knowledge can be presented for accurate patient-driven decision-making. One way to achieve this is the fusion of multiple knowledge sources in healthcare. It gives health professionals the opportunity to select from multiple contextually aligned knowledge sources, which enables support for critical decisions. This paper presents multiple conceptual models for knowledge fusion in the field of medicine, based on a knowledge graph structure. It evaluates how knowledge fusion can be enabled and presents how to integrate various knowledge sources into the knowledge graph for rescue operations.
[354] RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems
Hyundong Jin, Joonghyuk Hahn, Yo-Sub Han
Main category: cs.AI
TL;DR: This paper introduces a novel benchmark using PSPACE-complete regex problems to evaluate the spatial computational limitations of LLMs and LRMs, revealing common failure patterns like verbosity and repetition.
Details
Motivation: To understand the computational limits of large language models and reasoning models, particularly spatial complexity constrained by finite context windows, by pushing beyond NP problems to more rigorous PSPACE-complete problems.Method: Created a benchmark using two PSPACE-complete regex problems (equivalence decision and minimization) with over a million regex instances through double-exponential space exploration and sound filtering. Evaluated 6 LLMs and 5 LRMs of varying scales.
Result: Extensive evaluations revealed common failure patterns including verbosity and repetition. The benchmark provides the first empirical investigation into spatial computational limitations of these models.
Conclusion: The work presents a new framework for evaluating advanced reasoning capabilities of LLMs and LRMs using PSPACE-complete problems, offering quantitative metrics to assess their computational boundaries.
Abstract: Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE .
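As a concrete taste of why these tasks are hard, the sketch below tests regex equivalence by brute force over all strings up to a bounded length. This can only refute equivalence (a counterexample ends the search); deciding it exactly requires exploring a potentially exponential automaton construction, which is what makes RegexEQ PSPACE-complete. The example regexes are invented.

```python
import re
from itertools import product

def bounded_equiv(r1, r2, alphabet="ab", max_len=6):
    """Compare two regexes on every string up to max_len. Returns
    (False, counterexample) on a mismatch, (True, None) otherwise.
    This is only a bounded semi-decision, not the exact PSPACE-complete
    equivalence test the benchmark is built on."""
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False, s
    return True, None

print(bounded_equiv(r"(a|b)*", r"(a*b*)*"))   # (True, None): equivalent
print(bounded_equiv(r"a(a|b)*", r"(a|b)*a"))  # (False, 'ab')
```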
[355] Fundamentals of Building Autonomous LLM Agents
Victor de Lamo Castrillo, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll
Main category: cs.AI
TL;DR: This paper reviews LLM-based agent architectures with perception, reasoning, memory, and execution systems to create autonomous agents that bridge performance gaps with human capabilities.
Details
Motivation: Address limitations of traditional LLMs in real-world tasks and develop 'agentic' LLMs that can automate complex tasks and close the performance gap with human capabilities.Method: Proposes an integrated architecture with four key systems: perception (converts environmental inputs), reasoning (plans and adapts using techniques like Chain-of-Thought), memory (short-term and long-term knowledge retention), and execution (translates decisions into actions).
Result: The integration of these systems enables the creation of more capable and generalized software bots that can mimic human cognitive processes for autonomous and intelligent behavior.
Conclusion: A comprehensive architecture combining perception, reasoning, memory, and execution systems can successfully develop LLM-powered agents that demonstrate autonomous and intelligent behavior comparable to human cognitive processes.
Abstract: This paper reviews the architecture and implementation methods of agents powered by large language models (LLMs). Motivated by the limitations of traditional LLMs in real-world tasks, the research aims to explore patterns to develop “agentic” LLMs that can automate complex tasks and bridge the performance gap with human capabilities. Key components include a perception system that converts environmental percepts into meaningful representations; a reasoning system that formulates plans, adapts to feedback, and evaluates actions through different techniques like Chain-of-Thought and Tree-of-Thought; a memory system that retains knowledge through both short-term and long-term mechanisms; and an execution system that translates internal decisions into concrete actions. This paper shows how integrating these systems leads to more capable and generalized software bots that mimic human cognitive processes for autonomous and intelligent behavior.
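A skeletal Python rendering of the four-system loop described above, with a stubbed `llm` callable; every name here is illustrative rather than an API from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal perceive -> reason -> execute -> remember loop, mirroring the
    four systems the review describes. `llm` stands in for any
    chat-completion callable."""
    llm: callable
    memory: list = field(default_factory=list)   # short-term memory

    def perceive(self, observation: str) -> str:
        return f"Observation: {observation}"

    def reason(self, percept: str) -> str:
        context = "\n".join(self.memory[-5:])    # recent episodes as context
        prompt = f"{context}\n{percept}\nThink step by step, then output an action."
        return self.llm(prompt)                  # e.g. Chain-of-Thought plan

    def execute(self, plan: str) -> str:
        action = plan.splitlines()[-1]           # last line = chosen action
        return f"executed: {action}"

    def step(self, observation: str) -> str:
        percept = self.perceive(observation)
        plan = self.reason(percept)
        result = self.execute(plan)
        self.memory.append(f"{percept} -> {result}")   # retain the episode
        return result

agent = Agent(llm=lambda p: "I should open the file.\nopen_file('report.txt')")
print(agent.step("user asked to summarize report.txt"))
```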
[356] Localist LLMs – A Mathematical Framework for Dynamic Locality Control
Joachim Diederich
Main category: cs.AI
TL;DR: A framework for training LLMs with adjustable representations from interpretable localist to efficient distributed encodings via a tunable locality dial parameter.
Details
Motivation: To enable continuous interpolation between interpretable and high-performance modes in LLMs, supporting applications in regulated domains that require both transparency and capability.Method: Uses group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection to create a tunable locality dial parameter that controls representation localization without retraining.
Result: Mathematical proofs establish threshold conditions where attention concentrates on semantically relevant blocks with exponential bounds on attention entropy and pointer fidelity, achieving low entropy and high fidelity with negligible error.
Conclusion: The framework successfully enables practitioners to dynamically adjust model interpretability versus performance trade-offs during both training and inference, supporting applications requiring both transparency and capability.
Abstract: We present a novel framework for training large language models with continuously adjustable internal representations that span the full spectrum from localist (interpretable, rule-based) to distributed (generalizable, efficient) encodings. The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. This is achieved through group sparsity penalties on attention mechanisms, information-theoretic anchor design, and dynamic rule injection. We provide rigorous mathematical proofs establishing explicit threshold conditions under which attention provably concentrates on semantically relevant blocks, with exponential bounds on attention entropy and pointer fidelity. Specifically, we prove that when group sparsity penalties exceed certain threshold values, the model’s attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error. This framework enables practitioners to continuously interpolate between interpretable and high-performance modes, supporting applications in regulated domains requiring both transparency and capability.
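A minimal numpy sketch of the group sparsity idea, assuming attention rows are partitioned into contiguous key blocks: an L2,1-style penalty sums the L2 norm of each block's attention mass, which encourages whole blocks to switch off so attention concentrates on a few semantic blocks. The penalty coefficient plays the role of the locality dial; shapes and the exact penalty form are illustrative.

```python
import numpy as np

def group_sparsity_penalty(attn, block_size):
    """L2,1-style penalty over contiguous key blocks: sum of per-block
    L2 norms of attention mass. Minimizing it drives entire blocks of
    attention toward zero, the 'localist' end of the spectrum."""
    T = attn.shape[-1]
    blocks = attn.reshape(attn.shape[0], T // block_size, block_size)
    return np.linalg.norm(blocks, axis=-1).sum()

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(64), size=8)  # 8 query rows over 64 keys
lam = 0.1  # the "locality dial": larger = more localist/interpretable
loss_term = lam * group_sparsity_penalty(attn, block_size=8)
print(round(float(loss_term), 4))
```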
[357] Toward Mechanistic Explanation of Deductive Reasoning in Language Models
Davide Maltoni, Matteo Ferrara
Main category: cs.AI
TL;DR: Small language model solves deductive reasoning by learning underlying rules, not statistical patterns, with induction heads playing key role in rule completion and chaining.
Details
Motivation: To understand how language models perform logical reasoning internally, as current large models show reasoning capabilities but their mechanisms remain unexplored.Method: Analyzed a small language model trained on deductive reasoning tasks, examining its internal representations and computational circuits.
Result: Found that the model learns underlying rules rather than statistical patterns, with induction heads implementing rule completion and rule chaining for logical inference.
Conclusion: Induction heads are central to implementing logical reasoning in language models, providing mechanistic understanding of how models perform rule-based deduction.
Abstract: Recent large language models have demonstrated relevant capabilities in solving problems that require logical reasoning; however, the corresponding internal mechanisms remain largely unexplored. In this paper, we show that a small language model can solve a deductive reasoning task by learning the underlying rules (rather than operating as a statistical learner). A low-level explanation of its internal representations and computational circuits is then provided. Our findings reveal that induction heads play a central role in the implementation of the rule completion and rule chaining steps involved in the logical inference required by the task.
[358] Sequence Variables: A Constraint Programming Computational Domain for Routing and Sequencing
Augustin Delecluse, Pierre Schaus, Pascal Van Hentenryck
Main category: cs.AI
TL;DR: The paper introduces sequence variables in Constraint Programming to overcome limitations of classical successor models for Vehicle Routing Problems, enabling optional visits and insertion heuristics.
Details
Motivation: Classical CP models using successor variables struggle with optional visits and insertion-based heuristics in Vehicle Routing Problems, limiting their practical applicability.Method: Formalizes sequence variables with defined domains, update operations, and consistency levels; implements data structures for CP solvers; introduces global constraints for vehicle routing.
Result: Sequence variables simplify problem modeling and achieve competitive computational performance on the Dial-a-Ride Problem.
Conclusion: Sequence variables provide an effective alternative to successor models in CP for Vehicle Routing Problems, supporting optional visits and insertion heuristics while maintaining competitive performance.
Abstract: Constraint Programming (CP) offers an intuitive, declarative framework for modeling Vehicle Routing Problems (VRP), yet classical CP models based on successor variables cannot always deal with optional visits or insertion-based heuristics. To address these limitations, this paper formalizes sequence variables within CP. Unlike the classical successor models, this computational domain handles optional visits and supports insertion heuristics, including insertion-based Large Neighborhood Search. We provide a clear definition of their domain and update operations, and introduce consistency levels for constraints on this domain. An implementation is described with the underlying data structures required for integrating sequence variables into existing trail-based CP solvers. Furthermore, global constraints specifically designed for sequence variables and vehicle routing are introduced. Finally, the effectiveness of sequence variables is demonstrated by simplifying problem modeling and achieving competitive computational performance on the Dial-a-Ride Problem.
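A toy rendering of the sequence-variable domain operations (insertion positions, insertion, exclusion); real solvers back these with reversible trail-based structures, which this sketch omits.

```python
class SequenceVar:
    """Toy sequence variable: a growing partial sequence of visits plus a
    set of optional, not-yet-decided nodes. Only the domain operations
    the paper formalizes are shown."""
    def __init__(self, start, end, optional):
        self.seq = [start, end]          # fixed endpoints of the route
        self.possible = set(optional)    # optional visits, may stay unvisited

    def insertions(self, node):
        """All positions where `node` could still be inserted."""
        return range(1, len(self.seq)) if node in self.possible else range(0)

    def insert(self, node, pos):
        assert node in self.possible
        self.seq.insert(pos, node)
        self.possible.discard(node)      # node is now a required visit

    def exclude(self, node):
        self.possible.discard(node)      # decide the visit never happens

route = SequenceVar("depot", "depot'", optional={"a", "b", "c"})
route.insert("b", 1)                     # insertion-based heuristic step
route.exclude("c")                       # optional visit dropped
print(route.seq, route.possible)         # ['depot', 'b', "depot'"] {'a'}
```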
[359] Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges
Christian Bluethgen, Dave Van Veen, Daniel Truhn, Jakob Nikolas Kather, Michael Moor, Malgorzata Polacin, Akshay Chaudhari, Thomas Frauenfelder, Curtis P. Langlotz, Michael Krauthammer, Farhad Nooralahzadeh
Main category: cs.AI
TL;DR: This review explores LLM-driven agentic systems in radiology, examining their design, applications, evaluation methods, and challenges in automating complex workflows.
Details
Motivation: Radiology's multimodal data streams and orchestrated workflows make it ideal for AI agents that can adapt to context and automate repetitive yet complex tasks, moving beyond isolated LLM applications to support multi-step workflows.Method: The paper reviews the design of LLM-driven agentic systems that incorporate external tools and feedback mechanisms, enabling systems with varying degrees of autonomy from semi-automated workflows to adaptive agents managing complex processes.
Result: LLMs and their multimodal variants have shown promising performance for individual radiology tasks like information extraction and report summarization, but their full potential is realized when equipped with tools for complex, multi-step workflows.
Conclusion: LLM-driven agentic systems represent a significant advancement for radiology automation, though challenges remain including error cascades, tool-use efficiency, and health IT integration that need to be addressed for practical implementation.
Abstract: Building agents, systems that perceive and act upon their environment with a degree of autonomy, has long been a focus of AI research. This pursuit has recently become vastly more practical with the emergence of large language models (LLMs) capable of using natural language to integrate information, follow instructions, and perform forms of “reasoning” and planning across a wide range of tasks. With its multimodal data streams and orchestrated workflows spanning multiple systems, radiology is uniquely suited to benefit from agents that can adapt to context and automate repetitive yet complex tasks. In radiology, LLMs and their multimodal variants have already demonstrated promising performance for individual tasks such as information extraction and report summarization. However, using LLMs in isolation underutilizes their potential to support complex, multi-step workflows where decisions depend on evolving context from multiple information sources. Equipping LLMs with external tools and feedback mechanisms enables them to drive systems that exhibit a spectrum of autonomy, ranging from semi-automated workflows to more adaptive agents capable of managing complex processes. This review examines the design of such LLM-driven agentic systems, highlights key applications, discusses evaluation methods for planning and tool use, and outlines challenges such as error cascades, tool-use efficiency, and health IT integration.
[360] Safe, Untrusted, “Proof-Carrying” AI Agents: toward the agentic lakehouse
Jacopo Tagliabue, Ciro Greco
Main category: cs.AI
TL;DR: API-first, programmable lakehouses enable safe AI agent workflows through data branching, declarative environments, and proof-carrying code principles, allowing untrusted agents to operate securely on production data.
Details
Motivation: Data lakehouses handle sensitive workloads but AI automation raises trust, correctness, and governance concerns. There's a need for safe-by-design approaches for agentic workflows in these environments.Method: Using Bauplan as a case study, the paper implements data branching and declarative environments for agents, with a proof-of-concept where agents repair data pipelines using correctness checks inspired by proof-carrying code.
Result: The prototype demonstrates that untrusted AI agents can operate safely on production data while maintaining reproducibility and observability, and reducing the attack surface.
Conclusion: API-first, programmable lakehouses provide the right abstractions for safe agentic workflows, outlining a path toward fully agentic lakehouses with built-in safety mechanisms.
Abstract: Data lakehouses run sensitive workloads, where AI-driven automation raises concerns about trust, correctness, and governance. We argue that API-first, programmable lakehouses provide the right abstractions for safe-by-design, agentic workflows. Using Bauplan as a case study, we show how data branching and declarative environments extend naturally to agents, enabling reproducibility and observability while reducing the attack surface. We present a proof-of-concept in which agents repair data pipelines using correctness checks inspired by proof-carrying code. Our prototype demonstrates that untrusted AI agents can operate safely on production data and outlines a path toward a fully agentic lakehouse.
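The proof-carrying gate can be sketched in a few lines, with toy stand-ins for the branch object and checks (Bauplan's actual branching API differs): the untrusted agent mutates only its isolated branch, and the merge to main happens if and only if every declared correctness check passes.

```python
class DataBranch:
    """Toy stand-in for a lakehouse data branch: an isolated copy of a
    production table that can be merged back only after verification."""
    def __init__(self, table):
        self.table = [dict(row) for row in table]
        self.merged = False
    def merge_to_main(self):
        self.merged = True

def agent_repair(branch):
    """Untrusted agent's proposed pipeline fix: drop rows with null prices."""
    branch.table = [row for row in branch.table if row["price"] is not None]

def check_no_nulls(branch):
    return all(row["price"] is not None for row in branch.table)

def check_nonempty(branch):
    return len(branch.table) > 0

def run_agent_with_proof(branch, repair, checks):
    """Proof-carrying-code-inspired gate: merge iff every check passes."""
    repair(branch)
    if all(check(branch) for check in checks):
        branch.merge_to_main()
    return branch.merged

prod = [{"price": 10}, {"price": None}, {"price": 7}]
branch = DataBranch(prod)
print(run_agent_with_proof(branch, agent_repair,
                           [check_no_nulls, check_nonempty]))  # True
```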
[361] GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data
Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha
Main category: cs.AI
TL;DR: GraphMERT is a tiny graphical encoder-only model that creates high-quality knowledge graphs from text, forming a neurosymbolic stack that achieves state-of-the-art accuracy with superior symbolic representations compared to large language models.
Details
Motivation: Neurosymbolic AI has not scaled well due to implicit representations and limited interpretability. Knowledge graphs can provide explicit symbolic knowledge but automatic derivation from text remains challenging. Current LLMs generate unreliable KGs with hallucinations and poor domain expertise.Method: GraphMERT is an 80M-parameter graphical encoder-only model that distills high-quality KGs from unstructured text and its internal representations. It forms a modular neurosymbolic stack combining neural learning with symbolic KG reasoning.
Result: On PubMed diabetes papers, GraphMERT achieved 69.8% FActScore and 68.8% ValidityScore, significantly outperforming a 32B-parameter LLM baseline (40.2% FActScore and 43.0% ValidityScore).
Conclusion: GraphMERT + KG is the first efficient and scalable neurosymbolic model that achieves state-of-the-art benchmark accuracy while providing superior symbolic representations, addressing key limitations of both traditional neurosymbolic approaches and LLM-based KG generation.
Abstract: Researchers have pursued neurosymbolic artificial intelligence (AI) applications for nearly three decades because symbolic components provide abstraction while neural components provide generalization. Thus, a marriage of the two components can lead to rapid advancements in AI. Yet, the field has not realized this promise since most neurosymbolic AI frameworks fail to scale. In addition, the implicit representations and approximate reasoning of neural approaches limit interpretability and trust. Knowledge graphs (KGs), a gold-standard representation of explicit semantic knowledge, can address the symbolic side. However, automatically deriving reliable KGs from text corpora has remained an open problem. We address these challenges by introducing GraphMERT, a tiny graphical encoder-only model that distills high-quality KGs from unstructured text corpora and its own internal representations. GraphMERT and its equivalent KG form a modular neurosymbolic stack: neural learning of abstractions; symbolic KGs for verifiable reasoning. GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines. Concretely, we target reliable domain-specific KGs that are both (1) factual (with provenance) and (2) valid (ontology-consistent relations with domain-appropriate semantics). When a large language model (LLM), e.g., Qwen3-32B, generates domain-specific KGs, it falls short on reliability due to prompt sensitivity, shallow domain expertise, and hallucinated relations. On text obtained from PubMed papers on diabetes, our 80M-parameter GraphMERT yields a KG with a 69.8% FActScore; a 32B-parameter baseline LLM yields a KG that achieves only 40.2% FActScore. The GraphMERT KG also attains a higher ValidityScore of 68.8%, versus 43.0% for the LLM baseline.
[362] LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?
Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang
Main category: cs.AI
TL;DR: LiveOIBench is a comprehensive benchmark with 403 expert-curated Olympiad-level competitive programming problems featuring extensive test cases, designed to address limitations in current coding benchmarks and evaluate LLM coding capabilities.
Details
Motivation: Current coding benchmarks have limitations including lack of challenging problems, insufficient test coverage, and reliance on online APIs that limit accessibility. The authors aim to create a more comprehensive and accessible benchmark for evaluating LLM coding capabilities.Method: Created LiveOIBench with 403 problems sourced from 72 official Informatics Olympiads (2023-2025), each with ~60 expert-designed test cases. Features include curated tasks with subtask rubrics, integration of elite contestant performance data, planned contamination-free updates, and self-contained offline evaluation system.
Result: Benchmarked 32 popular LLMs: GPT-5 achieved 81.76th percentile (strong but below top human contestants at >90th), while open-weight reasoning models like GPT-OSS-120B only reached 60th percentile, showing significant capability gaps from frontier closed models.
Conclusion: Robust reasoning models should prioritize precise problem analysis over excessive exploration. Future models should emphasize structured analysis and minimize unnecessary exploration. The benchmark will be publicly available to facilitate ongoing evaluation.
Abstract: Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as a lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 32 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestant performance, who usually place above the 90th. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only a 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results will be made publicly available on our website.
[363] A Method to Improve the Performance of Reinforcement Learning Based on the Y Operator for a Class of Stochastic Differential Equation-Based Child-Mother Systems
Cheng Yin, Yi Chen
Main category: cs.AI
TL;DR: Introduces Y operator to improve Actor-Critic RL for stochastic systems, integrating system stochasticity into Critic loss and reformulating PDE solving into parallel drift/diffusion problems.
Details
Motivation: To enhance control performance in Actor-Critic reinforcement learning for systems governed by stochastic differential equations by better incorporating system stochasticity.Method: Developed Y operator that integrates system stochasticity into Critic network’s loss function and reformulates state-value PDE solving into parallel problems for drift and diffusion functions in SDEs.
Result: YORL framework demonstrates superior performance over existing methods in both linear and nonlinear numerical examples after convergence, effectively handling optimal control in model-based and data-driven systems.
Conclusion: The Y operator provides a mathematically proven approach to significantly improve RL control performance for stochastic systems by elegantly handling system stochasticity and PDE reformulation.
Abstract: This paper introduces a novel operator, termed the Y operator, to elevate control performance in Actor-Critic (AC) based reinforcement learning for systems governed by stochastic differential equations (SDEs). The Y operator ingeniously integrates the stochasticity of a class of child-mother systems into the Critic network's loss function, yielding substantial advancements in the control performance of RL algorithms. Additionally, the Y operator elegantly reformulates the challenge of solving partial differential equations for the state-value function into a parallel problem for the drift and diffusion functions within the system's SDEs. A rigorous mathematical proof confirms the operator's validity. This transformation enables the Y Operator-based Reinforcement Learning (YORL) framework to efficiently tackle optimal control problems in both model-based and data-driven systems. The superiority of YORL is demonstrated through linear and nonlinear numerical examples showing its enhanced performance over existing methods after convergence.
[364] Training-Free Safe Denoisers for Safe Use of Diffusion Models
Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park
Main category: cs.AI
TL;DR: Training-free safe denoiser that modifies sampling trajectory using negation sets to avoid unsafe/copyrighted content without retraining diffusion models.
Details
Motivation: Address safety concerns of powerful diffusion models being misused for NSFW content, copyrighted material, and privacy violations, without relying on text-based negative prompts or extensive retraining.Method: Directly modifies sampling trajectory using negation sets (unsafe images, copyrighted data) to avoid specific data distribution regions. Formally derives relationship between safe and unsafe denoised samples to create safe denoiser.
Result: Successfully produces high-quality samples while avoiding negation areas in text-conditional, class-conditional, and unconditional image generation scenarios.
Conclusion: Training-free safe denoiser shows great potential for using diffusion models more safely by avoiding problematic content without model retraining.
Abstract: There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints needed to be excluded) to avoid specific regions of data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our $\textit{safe}$ denoiser which ensures its final samples are away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
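A simplified stand-in for the safe-denoiser idea, assuming denoised estimates and the negation set live in a shared embedding space: when the current estimate falls too close to a negated sample, it is pushed radially away. The paper derives the adjustment analytically from the relation between safe and unsafe denoised samples; this projection heuristic only illustrates the effect.

```python
import numpy as np

def safe_denoise(x0_hat, negation_set, margin=1.0, strength=0.5):
    """Adjust a denoised estimate x0_hat: if it falls within `margin` of
    any negation-set sample (unsafe/copyrighted data to avoid), push it
    radially away. Simplified stand-in for the derived safe denoiser."""
    diffs = x0_hat[None] - negation_set            # (N, d)
    dists = np.linalg.norm(diffs, axis=1)
    i = dists.argmin()
    if dists[i] < margin:                          # too close to negated region
        direction = diffs[i] / (dists[i] + 1e-8)
        x0_hat = x0_hat + strength * (margin - dists[i]) * direction
    return x0_hat

rng = np.random.default_rng(0)
negation = rng.normal(size=(10, 4))                # embeddings to avoid
x0 = negation[0] + 0.1 * rng.normal(size=4)        # estimate near unsafe point
x0_safe = safe_denoise(x0, negation)
print(np.linalg.norm(x0 - negation[0]),            # close before adjustment
      np.linalg.norm(x0_safe - negation[0]))       # farther after
```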
[365] Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research
Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang
Main category: cs.AI
TL;DR: A comprehensive knowledge-enhanced system for perovskite solar cells that integrates a domain-specific knowledge graph, two complementary datasets, and two specialized LLMs to significantly improve knowledge retrieval and scientific reasoning in PSC research.
Details
Motivation: The exponential growth in perovskite solar cell research publications creates an urgent need for efficient knowledge management and reasoning systems in this domain.Method: Developed Perovskite-KG knowledge graph from 1,517 papers, created Perovskite-Chat (55,101 QA pairs) and Perovskite-Reasoning (2,217 problems) datasets, and introduced two specialized LLMs for knowledge assistance and scientific reasoning.
Result: The system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks.
Conclusion: Provides researchers with effective tools for literature review, experimental design, and complex problem-solving in PSC research.
Abstract: The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge-enhanced system for PSCs that integrates three key components. First, we develop Perovskite-KG, a domain-specific knowledge graph constructed from 1,517 research papers, containing 23,789 entities and 22,272 relationships. Second, we create two complementary datasets: Perovskite-Chat, comprising 55,101 high-quality question-answer pairs generated through a novel multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully curated materials science problems. Third, we introduce two specialized large language models: Perovskite-Chat-LLM for domain-specific knowledge assistance and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental results demonstrate that our system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks, providing researchers with effective tools for literature review, experimental design, and complex problem-solving in PSC research.
[366] HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
Yifei Dong, Fengyi Wu, Qi He, Zhi-Qi Cheng, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann
Main category: cs.AI
TL;DR: HA-VLN 2.0 is a unified benchmark for Vision-and-Language Navigation that introduces social-awareness constraints in dynamic, crowded environments, addressing gaps in existing VLN research.
Details
Motivation: Current VLN research focuses mainly on discrete or continuous settings with little attention to dynamic, crowded environments, lacking explicit social-awareness constraints for safe navigation.Method: Developed a standardized task with metrics for goal accuracy and personal-space adherence, created HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment.
Result: Benchmarks on 16,844 socially grounded instructions revealed sharp performance drops of leading agents under human dynamics and partial observability. Real-world robot experiments validated sim-to-real transfer.
Conclusion: Explicit social modeling improves navigation robustness and reduces collisions, demonstrating the necessity of human-centric approaches for safe, socially responsible navigation research.
Abstract: Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous settings, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, socially responsible navigation research.
[367] A Knowledge-Informed Deep Learning Paradigm for Generalizable and Stability-Optimized Car-Following Models
Chengming Wang, Dongyao Jia, Wei Wang, Dong Ngoduy, Bei Peng, Jianping Wang
Main category: cs.AI
TL;DR: A Knowledge-Informed Deep Learning (KIDL) approach that uses LLMs to extract general car-following knowledge and transfers it to a lightweight neural model with built-in stability constraints, outperforming existing models in generalization and stability.
Details
Motivation: Existing car-following models lack generalization across diverse scenarios and don't explicitly optimize for local and string stability, which are crucial for safe autonomous vehicle deployment.Method: Uses LLMs to extract fundamental car-following knowledge beyond dataset-specific patterns, then transfers this knowledge to a lightweight neural architecture through knowledge distillation while incorporating stability constraints into the training objective.
Result: KIDL demonstrates superior behavioral generalization and traffic flow stability on NGSIM and HighD datasets compared to physics-based, data-driven, and hybrid CFMs.
Conclusion: KIDL offers a robust and scalable solution for next-generation traffic systems by combining LLM generalization capabilities with stability-aware neural modeling.
Abstract: Car-following models (CFMs) are fundamental to traffic flow analysis and autonomous driving. Although calibrated physics-based and trained data-driven CFMs can replicate human driving behavior, their reliance on specific datasets limits generalization across diverse scenarios and reduces reliability in real-world deployment. Moreover, these models typically focus on behavioral fidelity and do not support the explicit optimization of local and string stability, which are increasingly important for the safe and efficient operation of autonomous vehicles (AVs). To address these limitations, we propose a Knowledge-Informed Deep Learning (KIDL) paradigm that distills the generalization capabilities of pre-trained Large Language Models (LLMs) into a lightweight and stability-aware neural architecture. LLMs are used to extract fundamental car-following knowledge beyond dataset-specific patterns, and this knowledge is transferred to a reliable, tractable, and computationally efficient model through knowledge distillation. KIDL also incorporates stability constraints directly into its training objective, ensuring that the resulting model not only emulates human-like behavior but also satisfies the local and string stability requirements essential for real-world AV deployment. We evaluate KIDL on the real-world NGSIM and HighD datasets, comparing its performance with representative physics-based, data-driven, and hybrid CFMs. Both empirical and theoretical results consistently demonstrate KIDL’s superior behavioral generalization and traffic flow stability, offering a robust and scalable solution for next-generation traffic systems.
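A toy sketch of a stability-aware training objective, assuming a linear stand-in car-following model a = f(s, Δv, v) and the classical string-stability criterion on its partial derivatives: the loss combines imitation of teacher (LLM-distilled) accelerations with a hinge penalty on stability violations. The model form, constants, and data are invented; the paper embeds analogous constraints in a neural model's objective.

```python
import numpy as np

def accel(theta, s, dv, v):
    """Tiny stand-in CFM: a = th0*(s - s0) + th1*dv + th2*(v0 - v)."""
    th0, th1, th2 = theta
    return th0 * (s - 20.0) + th1 * dv + th2 * (15.0 - v)

def stability_penalty(theta, s, dv, v, h=1e-4):
    """Finite-difference partials plugged into the classical string-stability
    criterion F = fv^2/2 - fdv*fv - fs >= 0, plus local stability fv < 0;
    violations are penalized with hinge terms."""
    fs = (accel(theta, s + h, dv, v) - accel(theta, s - h, dv, v)) / (2 * h)
    fdv = (accel(theta, s, dv + h, v) - accel(theta, s, dv - h, v)) / (2 * h)
    fv = (accel(theta, s, dv, v + h) - accel(theta, s, dv, v - h)) / (2 * h)
    F = 0.5 * fv**2 - fdv * fv - fs
    return max(0.0, -F) + max(0.0, fv)

def kidl_loss(theta, batch, lam=1.0):
    """Distillation term (match teacher accelerations) + stability term."""
    imitation = np.mean([(accel(theta, s, dv, v) - a_t) ** 2
                         for s, dv, v, a_t in batch])
    stability = np.mean([stability_penalty(theta, s, dv, v)
                         for s, dv, v, _ in batch])
    return imitation + lam * stability

batch = [(25.0, -1.0, 14.0, 0.3), (18.0, 0.5, 16.0, -0.2)]
print(kidl_loss(np.array([0.1, 0.3, 0.5]), batch))
```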
[368] GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning
Longxi Gao, Li Zhang, Pengzhi Gao, Wei Liu, Jian Luan, Mengwei Xu
Main category: cs.AI
TL;DR: K-step GUI Transition is a self-supervised method that trains VLMs to predict actions between GUI states, eliminating the need for annotated datasets. GUI-Shift combines this with RL and data filtering to improve performance on GUI automation and grounding tasks.
Details
Motivation: Large-scale annotated datasets for GUI agents are labor-intensive and error-prone to collect. The paper aims to develop a scalable alternative using unlabeled GUI trajectories.Method: Proposes K-step GUI Transition (self-supervised inverse dynamics task) and GUI-Shift framework (combining RL with rule-based optimization and data filtering) to train VLMs without natural language instructions.
Result: Experiments across four benchmarks show up to 11.2% increase in GUI automation accuracy. The method generalizes well to both GUI automation and grounding tasks.
Conclusion: Self-supervised RL can effectively leverage unlabeled GUI trajectories, providing a scalable alternative to annotated dataset training for GUI agents.
Abstract: Training effective Vision-Language Models (VLMs) for GUI agents typically depends on large-scale annotated datasets, whose collection is both labor-intensive and error-prone. We introduce K-step GUI Transition, a self-supervised inverse dynamics task in which VLMs learn GUI dynamics by predicting the initial action that causes a transition between two GUI states. This approach eliminates the need for natural language instructions and enables scalable dataset construction from existing GUI trajectories or automated exploration. Building on this task, we propose GUI-Shift, a reinforcement learning (RL) framework that combines rule-based optimization with data filtering to improve VLM performance. We conduct extensive experiments using multiple VLM backbones across four benchmarks, spanning GUI task automation (AndroidControl, GUI Odyssey) and GUI grounding (ScreenSpot-v2, ScreenSpot-Pro). Our results show that training on GUI-Shift generalizes well to both GUI automation and grounding tasks, yielding up to an 11.2% increase in GUI automation accuracy. This study underscores the potential of self-supervised RL to leverage unlabeled GUI trajectories and offers a scalable alternative to training with annotated samples.
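Constructing the self-supervised examples is straightforward, as the sketch below shows for a toy logged trajectory: each training item pairs the GUI states before and after a k-step transition with the initial action as the label, so no instruction annotation is needed. Filenames and action strings are illustrative.

```python
def k_step_transition_examples(trajectory, k=2):
    """Turn an unlabeled GUI trajectory [(screenshot, action), ...] into
    inverse-dynamics training examples: given the screens before and after
    a k-step transition, the model must recover the *initial* action.
    Any logged or auto-explored trajectory becomes supervision."""
    examples = []
    for t in range(len(trajectory) - k):
        s_t, a_t = trajectory[t]
        s_tk, _ = trajectory[t + k]
        examples.append({
            "prompt": (s_t, s_tk),   # two GUI states shown to the VLM
            "target": a_t,           # first action that caused the change
        })
    return examples

traj = [("home.png", "tap(settings)"),
        ("settings.png", "tap(wifi)"),
        ("wifi.png", "toggle(on)"),
        ("wifi_on.png", None)]
for ex in k_step_transition_examples(traj, k=2):
    print(ex["prompt"], "->", ex["target"])
```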
[369] AdaReasoner: Adaptive Reasoning Enables More Flexible Thinking in Large Language Models
Xiangqi Wang, Yue Huang, Yanbo Wang, Xiaonan Luo, Kehan Guo, Yujun Zhou, Xiangliang Zhang
Main category: cs.AI
TL;DR: AdaReasoner is an LLM-agnostic plugin that automates adaptive reasoning configurations for different tasks using reinforcement learning, outperforming standard baselines across various LLMs and reasoning tasks.
Details
Motivation: Existing prompting approaches use general-purpose, fixed configurations that work 'well enough' but rarely achieve task-specific optimality for sophisticated reasoning tasks.Method: AdaReasoner uses a reinforcement learning framework with factorized action space, targeted exploration strategy, and pretrained reward model to optimize reasoning configurations with few-shot guidance.
Result: AdaReasoner consistently outperforms standard baselines across six different LLMs and various reasoning tasks, maintains out-of-distribution robustness, and improves knowledge-intensive tasks through tailored prompts.
Conclusion: AdaReasoner provides an effective automated approach for adaptive reasoning configurations that achieves better performance than standard methods while being LLM-agnostic and theoretically guaranteed.
Abstract: LLMs often need effective configurations, like temperature and reasoning steps, to handle tasks requiring sophisticated reasoning and problem-solving, ranging from joke generation to mathematical reasoning. Existing prompting approaches usually adopt general-purpose, fixed configurations that work ‘well enough’ across tasks but seldom achieve task-specific optimality. To address this gap, we introduce AdaReasoner, an LLM-agnostic plugin designed for any LLM to automate adaptive reasoning configurations for tasks requiring different types of thinking. AdaReasoner is trained using a reinforcement learning (RL) framework, combining a factorized action space with a targeted exploration strategy, along with a pretrained reward model to optimize the policy model for reasoning configurations with only few-shot guidance. AdaReasoner is backed by theoretical guarantees of fast convergence and a sublinear policy gap, corroborated by experiments. Across six different LLMs and a variety of reasoning tasks, it consistently outperforms standard baselines, preserves out-of-distribution robustness, and yields gains on knowledge-intensive tasks through tailored prompts.
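As a rough illustration of what a factorized action space over reasoning configurations can look like, the sketch below samples temperature, step-budget, and prompt-style choices from independent policy heads; all dimensions, sizes, and names are hypothetical, and the paper's actual policy, exploration strategy, and reward model are not reproduced here.

```python
import torch
import torch.nn as nn

class FactorizedConfigPolicy(nn.Module):
    """Each head picks one configuration dimension independently, so the
    joint space (|temps| x |steps| x |styles|) never has to be enumerated."""

    def __init__(self, state_dim, n_temps=5, n_steps=6, n_styles=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.temp_head = nn.Linear(128, n_temps)    # e.g. {0.0, 0.3, ..., 1.2}
        self.steps_head = nn.Linear(128, n_steps)   # e.g. {1, 2, 4, 8, 16, 32}
        self.style_head = nn.Linear(128, n_styles)  # e.g. CoT / decompose / direct / ...

    def forward(self, task_embedding):
        h = self.backbone(task_embedding)
        dists = [torch.distributions.Categorical(logits=head(h))
                 for head in (self.temp_head, self.steps_head, self.style_head)]
        actions = [d.sample() for d in dists]
        log_prob = sum(d.log_prob(a) for d, a in zip(dists, actions))
        return actions, log_prob  # log_prob feeds a REINFORCE-style update
```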
[370] Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
Xiancheng Gao, Yufeng Shi, Wengang Zhou, Houqiang Li
Main category: cs.AI
TL;DR: SPW unifies expert demonstrations and preference feedback in offline RL by using similarity search to derive stepwise importance weights, enabling more accurate credit assignment and outperforming prior methods on robot manipulation tasks.
Details
Motivation: Offline RL typically requires well-defined reward functions that are expensive to design. Human feedback alternatives (demonstrations and preferences) have complementary limitations: demonstrations are costly and limited in behavior modes, while preferences suffer from credit assignment issues.
Method: Search-Based Preference Weighting (SPW) searches for the most similar state-action pairs from expert demonstrations for each transition in preference-labeled trajectories, then derives stepwise importance weights based on similarity scores to guide preference learning.
Result: SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
Conclusion: SPW successfully addresses the credit assignment problem in preference learning by leveraging demonstration similarity, providing a unified framework that combines the strengths of both demonstration and preference feedback sources.
Abstract: Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
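A minimal sketch of the similarity-search weighting idea, assuming transitions and expert pairs are already embedded as fixed-length feature vectors (the embedding, cosine similarity, and softmax normalization are illustrative choices, not necessarily the paper's):

```python
import numpy as np

def spw_weights(pref_traj, expert_pairs, temperature=0.1):
    """Stepwise importance weights from nearest-expert similarity.

    pref_traj:    (T, d) array, one (state, action) feature vector per
                  transition of a preference-labeled trajectory
    expert_pairs: (N, d) array of expert demonstration (state, action) pairs
    """
    a = pref_traj / np.linalg.norm(pref_traj, axis=1, keepdims=True)
    b = expert_pairs / np.linalg.norm(expert_pairs, axis=1, keepdims=True)
    sims = a @ b.T                   # (T, N) cosine similarities
    best = sims.max(axis=1)          # similarity to the most similar expert pair
    w = np.exp(best / temperature)   # sharpen: expert-like steps dominate
    return w / w.sum()               # normalized per-step weights
```

The weights would then rescale the per-step terms of a standard preference objective (e.g. Bradley-Terry), so steps that resemble expert behavior receive more of the credit.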
[371] Agentic Exploration of Physics Models
Maximilian Nägele, Florian Marquardt
Main category: cs.AI
TL;DR: SciExplorer is an AI agent that uses large language models to autonomously explore and discover physical laws from unknown systems through experiments and analysis, without requiring domain-specific knowledge or task-specific instructions.
Details
Motivation: To fully automate the scientific discovery process by enabling AI agents to autonomously explore unknown physical systems and discover their underlying laws through iterative experimentation and hypothesis generation, without human intervention or domain-specific tailoring.
Method: Uses large language model tool-use capabilities with minimal tools (primarily code execution) to explore physical systems, test hypotheses, and analyze observations across mechanical dynamical systems, wave evolution, and quantum many-body physics.
Result: Demonstrated impressive performance in recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values across diverse physical systems, showing the agent can effectively discover physical laws from initially unknown systems.
Conclusion: The approach enables automated scientific exploration without fine-tuning or task-specific instructions, opening doors for similar autonomous discovery in other scientific domains.
Abstract: The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.
[372] ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models
Dongqi Zheng
Main category: cs.AI
TL;DR: ARS is a training-free method that dynamically suppresses redundant reasoning steps in large reasoning models through adaptive certainty monitoring, achieving significant efficiency gains while maintaining accuracy.
Details
Motivation: Large reasoning models suffer from computational inefficiencies due to overthinking, and existing methods struggle to balance reasoning quality with inference cost reduction.
Method: Adaptive Reasoning Suppression (ARS) uses multi-checkpoint certainty estimation with progressive suppression thresholds to dynamically suppress redundant reasoning steps without requiring training.
Result: ARS achieves up to 53% token reduction, 46.1% latency reduction, and 57.9% energy reduction across mathematical reasoning benchmarks while maintaining or improving accuracy.
Conclusion: ARS provides an effective training-free solution for improving computational efficiency in large reasoning models through adaptive suppression of redundant reasoning steps.
Abstract: Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods face the challenge of balancing reasoning quality with inference cost reduction. We propose \textbf{Adaptive Reasoning Suppression (ARS)}, a novel training-free approach that dynamically suppresses redundant reasoning steps while preserving accuracy through adaptive certainty monitoring. ARS introduces a multi-checkpoint certainty estimation mechanism with progressive suppression thresholds, achieving superior efficiency compared to static suppression methods. Our extensive evaluation across mathematical reasoning benchmarks using multiple model architectures demonstrates that ARS achieves up to 53% token, 46.1% latency, and 57.9% energy reductions, while maintaining or improving accuracy.
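The following is a minimal sketch of multi-checkpoint suppression with a progressive threshold. The step_fn interface and the linear, relaxing threshold schedule are assumptions; the paper's certainty estimator and exact schedule may differ.

```python
def generate_with_ars(step_fn, max_tokens=2048, checkpoint_every=64,
                      tau0=0.95, decay=0.05, tau_min=0.5):
    """step_fn() -> (token, certainty): hypothetical interface returning the
    next reasoning token and the model's current answer-certainty estimate."""
    tokens = []
    for i in range(max_tokens):
        token, certainty = step_fn()
        tokens.append(token)
        if (i + 1) % checkpoint_every == 0:
            checkpoint = (i + 1) // checkpoint_every
            tau = max(tau0 - decay * checkpoint, tau_min)  # progressive threshold
            if certainty >= tau:
                break  # suppress remaining reasoning; emit the final answer
    return tokens
```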
[373] Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing
Xuanhua Yin, Runkai Zhao, Weidong Cai
Main category: cs.AI
TL;DR: AFIRE is an agnostic framework that standardizes multimodal fMRI encoding with MIND decoder using subject-aware mixture-of-experts for improved whole-brain prediction and cross-subject generalization.
Details
Motivation: Address challenges in naturalistic fMRI encoding including multimodal inputs, varying fusion styles, and significant inter-subject variability.
Method: AFIRE standardizes time-aligned post-fusion tokens from different encoders, while MIND uses mixture-of-experts with subject-aware dynamic gating and Top-K sparse routing for personalized decoding.
Result: Consistent improvements over strong baselines across multiple multimodal backbones and subjects, enhanced cross-subject generalization, and interpretable expert patterns correlating with content types.
Conclusion: The framework provides a simple attachment point for new encoders and datasets, enabling robust plug-and-improve performance for naturalistic neuroimaging studies.
Abstract: Naturalistic fMRI encoding must handle multimodal inputs, shifting fusion styles, and pronounced inter-subject variability. We introduce AFIRE (Agnostic Framework for Multimodal fMRI Response Encoding), an agnostic interface that standardizes time-aligned post-fusion tokens from varied encoders, and MIND, a plug-and-play Mixture-of-Experts decoder with a subject-aware dynamic gating. Trained end-to-end for whole-brain prediction, AFIRE decouples the decoder from upstream fusion, while MIND combines token-dependent Top-K sparse routing with a subject prior to personalize expert usage without sacrificing generality. Experiments across multiple multimodal backbones and subjects show consistent improvements over strong baselines, enhanced cross-subject generalization, and interpretable expert patterns that correlate with content type. The framework offers a simple attachment point for new encoders and datasets, enabling robust, plug-and-improve performance for naturalistic neuroimaging studies.
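A sketch of what subject-aware Top-K routing can look like in code: token-dependent gating logits are biased by a learned per-subject prior before sparsification. Module names, dimensions, and the additive combination are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SubjectAwareGate(nn.Module):
    """Route each token to K experts, with routing biased by a subject prior."""

    def __init__(self, d_model, n_experts=8, n_subjects=10, k=2):
        super().__init__()
        self.k = k
        self.token_gate = nn.Linear(d_model, n_experts)      # token-dependent logits
        self.subject_prior = nn.Embedding(n_subjects, n_experts)  # per-subject bias

    def forward(self, tokens, subject_id):
        # tokens: (B, T, d_model); subject_id: (B,) long tensor
        logits = self.token_gate(tokens) + self.subject_prior(subject_id)[:, None, :]
        topk_val, topk_idx = logits.topk(self.k, dim=-1)     # Top-K sparse routing
        weights = torch.softmax(topk_val, dim=-1)            # renormalize over chosen experts
        return topk_idx, weights                             # which experts, with what mix
```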
[374] Barbarians at the Gate: How AI is Upending Systems Research
Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, Ion Stoica
Main category: cs.AI
TL;DR: AI-Driven Research for Systems (ADRS) automates algorithm discovery through iterative generation, evaluation, and refinement of solutions using reliable verifiers in systems research.
Details
Motivation: Systems research is well-suited for AI-driven discovery because performance problems naturally admit reliable verifiers through real systems or simulators, enabling accurate solution verification.
Method: ADRS iteratively generates diverse solutions, evaluates them using predefined workloads and performance measurements in real systems or simulators, and refines the solutions based on evaluation results.
Result: ADRS discovered algorithms that outperform state-of-the-art human designs, achieving up to 5.0x runtime improvements and 50% cost reductions across domains like load balancing, Mixture-of-Experts inference, LLM-based SQL queries, and transaction scheduling.
Conclusion: AI will increasingly handle algorithm design while human researchers focus on problem formulation and strategic guidance, highlighting the need for systems research practices to adapt to AI-driven approaches.
Abstract: Artificial Intelligence (AI) is starting to transform the research process as we know it by automating the discovery of new solutions. Given a task, the typical AI-driven approach is (i) to generate a set of diverse solutions, and then (ii) to verify these solutions and select one that solves the problem. Crucially, this approach assumes the existence of a reliable verifier, i.e., one that can accurately determine whether a solution solves the given problem. We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. This is because system performance problems naturally admit reliable verifiers: solutions are typically implemented in real systems or simulators, and verification reduces to running these software artifacts against predefined workloads and measuring performance. We term this approach AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Using OpenEvolve, an existing open-source ADRS instance, we present case studies across diverse domains, including load balancing for multi-region cloud scheduling, Mixture-of-Experts inference, LLM-based SQL queries, and transaction scheduling. In multiple instances, ADRS discovers algorithms that outperform state-of-the-art human designs (e.g., achieving up to 5.0x runtime improvements or 50% cost reductions). We distill best practices for guiding algorithm evolution, from prompt design to evaluator construction, for existing frameworks. We then discuss the broader implications for the systems community: as AI assumes a central role in algorithm design, we argue that human researchers will increasingly focus on problem formulation and strategic guidance. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
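The ADRS loop reduces to a small skeleton once the three roles are named; generate, evaluate, and refine below are assumed callables (an LLM proposer, a workload-based verifier, and a feedback builder), not APIs of any particular ADRS framework.

```python
def adrs_search(generate, evaluate, refine, n_rounds=20, pool_size=8):
    """Generate-evaluate-refine loop: `evaluate(candidate)` plays the role
    of the reliable verifier by running the candidate against predefined
    workloads and returning a performance score."""
    feedback, best, best_score = None, None, float("-inf")
    for _ in range(n_rounds):
        pool = [generate(feedback) for _ in range(pool_size)]     # diverse proposals
        scored = [(evaluate(c), c) for c in pool]                 # verify each one
        score, candidate = max(scored, key=lambda sc: sc[0])
        if score > best_score:
            best, best_score = candidate, score
        feedback = refine(scored)                                 # guide next round
    return best, best_score
```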
[375] SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation
Minh-Anh Nguye, Minh-Duc Nguyen, Ha Lan N. T., Kieu Hai Dang, Nguyen Tien Dong, Dung D. Le
Main category: cs.AI
TL;DR: SurveyG is an LLM-based agent framework that uses hierarchical citation graphs to generate more structured and comprehensive survey papers by capturing research evolution and relationships between papers.
Details
Motivation: Existing LLM-based survey generation methods overlook structural relationships among papers, resulting in surveys that lack coherent taxonomy and deeper contextual understanding of research progress.
Method: Proposes SurveyG framework with hierarchical citation graph organized into Foundation, Development, and Frontier layers. Uses horizontal search within layers and vertical traversal across layers to produce multi-level summaries, followed by multi-agent validation for consistency and accuracy.
Result: Experiments show SurveyG outperforms state-of-the-art frameworks, producing surveys that are more comprehensive and better structured according to human expert evaluations and LLM-as-a-judge assessments.
Conclusion: SurveyG successfully addresses limitations of existing methods by integrating structural and contextual knowledge through hierarchical citation graphs, enabling generation of high-quality, well-structured survey papers.
Abstract: Large language models (LLMs) are increasingly adopted for automating survey paper generation \cite{wang2406autosurvey, liang2025surveyx, yan2025surveyforge, su2025benchmarking, wen2025interactivesurvey}. Existing approaches typically extract content from a large collection of related papers and prompt LLMs to summarize them directly. However, such methods often overlook the structural relationships among papers, resulting in generated surveys that lack a coherent taxonomy and a deeper contextual understanding of research progress. To address these shortcomings, we propose \textbf{SurveyG}, an LLM-based agent framework that integrates a \textit{hierarchical citation graph}, where nodes denote research papers and edges capture both citation dependencies and semantic relatedness between their contents, thereby embedding structural and contextual knowledge into the survey generation process. The graph is organized into three layers: \textbf{Foundation}, \textbf{Development}, and \textbf{Frontier}, to capture the evolution of research from seminal works to incremental advances and emerging directions. By combining horizontal search within layers and vertical depth traversal across layers, the agent produces multi-level summaries, which are consolidated into a structured survey outline. A multi-agent validation stage then ensures consistency, coverage, and factual accuracy in generating the final survey. Experiments, including evaluations by human experts and LLM-as-a-judge, demonstrate that SurveyG outperforms state-of-the-art frameworks, producing surveys that are more comprehensive and better structured according to the underlying knowledge taxonomy of a field.
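To illustrate the three-layer citation-graph idea, here is a toy construction using networkx; the layer-assignment heuristic (citation count and year) is purely an assumption for demonstration, since the abstract does not specify how papers are assigned to layers.

```python
import networkx as nx

def build_layered_graph(papers, citations):
    """Each paper is a dict with "id", "year", and "n_cited_by"; the
    thresholds below are invented for illustration."""
    g = nx.DiGraph()
    for p in papers:
        if p["n_cited_by"] > 500:
            layer = "Foundation"   # seminal works
        elif p["year"] >= 2024:
            layer = "Frontier"     # emerging directions
        else:
            layer = "Development"  # incremental advances
        g.add_node(p["id"], layer=layer)
    g.add_edges_from(citations)    # (citing_id, cited_id) dependency edges
    return g

def horizontal_search(g, layer):
    """Collect all papers in one layer, e.g. to summarize one research era;
    vertical traversal would follow citation edges across layers."""
    return [n for n, d in g.nodes(data=True) if d.get("layer") == layer]
```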
[376] PEAR: Phase Entropy Aware Reward for Efficient Reasoning
Chen Huang, Wei Lu, Wenxuan Zhang
Main category: cs.AI
TL;DR: PEAR introduces a phase entropy aware reward mechanism that reduces redundant reasoning steps in Large Reasoning Models by controlling entropy during different reasoning phases, achieving shorter responses while maintaining accuracy.
Details
Motivation: Large Reasoning Models generate excessively long chain-of-thought explanations with redundant reasoning steps, which increases inference costs and reduces usability. There's a need to control response length without sacrificing accuracy.
Method: PEAR incorporates phase-dependent entropy into reward design, penalizing excessive entropy during the thinking phase while allowing moderate exploration at the final answer phase. This enables adaptive length control without explicit length targets or rigid truncation rules.
Result: Extensive experiments across four benchmarks show PEAR consistently reduces response length while sustaining competitive accuracy across model scales. It also demonstrates strong out-of-distribution robustness beyond training distribution.
Conclusion: Phase-dependent entropy serves as an effective control mechanism for balancing conciseness and performance in reasoning models, enabling more efficient and usable reasoning systems.
Abstract: Large Reasoning Models (LRMs) have achieved impressive performance on complex reasoning tasks by generating detailed chain-of-thought (CoT) explanations. However, these responses are often excessively long, containing redundant reasoning steps that inflate inference cost and reduce usability. Controlling the length of generated reasoning without sacrificing accuracy remains an open challenge. Through a systematic empirical analysis, we reveal a consistent positive correlation between model entropy and response length at different reasoning stages across diverse LRMs: the thinking phase exhibits higher entropy, reflecting exploratory behavior of longer responses, while the final answer phase shows lower entropy, indicating a more deterministic solution. This observation suggests that entropy at different reasoning stages can serve as a control knob for balancing conciseness and performance. Based on this insight, this paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design. Instead of treating all tokens uniformly, PEAR penalizes excessive entropy during the thinking phase while allowing moderate exploration in the final answer phase, which encourages models to generate concise reasoning traces that retain sufficient flexibility to solve the task correctly. This enables adaptive control of response length without relying on explicit length targets or rigid truncation rules. Extensive experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales. In addition, PEAR demonstrates strong out-of-distribution (OOD) robustness beyond the training distribution. Our code is available at: https://github.com/iNLP-Lab/PEAR.
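A back-of-the-envelope sketch of a phase-dependent entropy reward: thinking-phase entropy is penalized outright, while answer-phase entropy is only kept near a moderate target. Coefficients and the exact functional form are assumptions, not the paper's reward.

```python
def pear_reward(is_correct, think_entropy, answer_entropy,
                alpha=0.5, beta=0.1, target=1.0):
    """Phase-dependent entropy shaping on top of a correctness reward."""
    reward = 1.0 if is_correct else 0.0
    reward -= alpha * think_entropy                # penalize exploratory thinking
    reward -= beta * abs(answer_entropy - target)  # keep the answer phase moderate
    return reward
```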
cs.SD
[377] LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
Main category: cs.SD
TL;DR: LadderSym is a Transformer-based method for music error detection that improves on existing approaches by using a two-stream encoder with inter-stream alignment and multimodal integration of audio and symbolic scores, achieving significant F1 score improvements.
Details
Motivation: Existing music error detection methods have limitations: late fusion restricts inter-stream alignment and cross-modality comparison, and reliance on score audio creates frequency spectrum ambiguity, especially in music with concurrent notes.
Method: LadderSym introduces a two-stream encoder with inter-stream alignment modules for better audio comparison, and a multimodal strategy that uses symbolic representations as decoder prompts to reduce ambiguity.
Result: On MAESTRO-E dataset, LadderSym more than doubles F1 for missed notes (26.8% → 56.3%) and improves extra note detection by 14.4 points (72.0% → 86.4%). Similar gains observed on CocoChorales-E dataset.
Conclusion: LadderSym provides significant improvements in music error detection and introduces general insights about comparison models that could benefit sequence evaluation tasks in reinforcement learning, human skill assessment, and model evaluation.
Abstract: Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces \textit{LadderSym}, a novel Transformer-based method for music error detection. \textit{LadderSym} is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, \textit{LadderSym} introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the \textit{MAESTRO-E} and \textit{CocoChorales-E} datasets by measuring the F1 score for each note category. Compared to the previous state of the art, \textit{LadderSym} more than doubles F1 for missed notes on \textit{MAESTRO-E} (26.8% $\rightarrow$ 56.3%) and improves extra note detection by 14.4 points (72.0% $\rightarrow$ 86.4%). Similar gains are observed on \textit{CocoChorales-E}. This work introduces general insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation.
[378] Evaluating Hallucinations in Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions
Hansol Park, Hoseong Ahn, Junwon Moon, Yejin Lee, Kyuhong Shim
Main category: cs.SD
TL;DR: Spoken queries increase hallucinations in vision-language models, with error rates rising by 3% under clean speech and up to 20% with noise, revealing a critical challenge for voice interfaces.
Details
Motivation: To investigate how spoken input influences hallucinations in multimodal large language models, as voice-driven interfaces grow but spoken queries' effect on hallucinations remains unexplored.
Method: Created RePOPE-Spk benchmark with audio-augmented queries under diverse acoustic conditions, systematically evaluating both proprietary and open-source models on spoken vs written queries.
Result: Hallucinations escalate significantly with spoken queries: 3% increase under clean speech and up to 20% increase with environmental noise. Input order and query length affect robustness.
Conclusion: Spoken queries present a critical underexplored challenge for multimodal reliability, opening new directions for building robust voice interface systems as current mitigation strategies are insufficient.
Abstract: Hallucinations in vision-language models have been extensively studied using benchmarks that probe reliability in image-text settings. In contrast, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice-driven interfaces. In this work, we investigate how spoken input influences hallucinations in multimodal large language models. We present RePOPE-Spk, an audio-augmented extension of the RePOPE benchmark, where queries are provided as speech under diverse acoustic conditions. Using RePOPE-Spk, we systematically evaluate both proprietary and open-source models. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3% under clean speech and by up to 20% with environmental noise. Input order and query length further affect robustness, while strategies such as many-shot prompting and chain-of-thought reasoning offer partial but insufficient mitigation. These findings highlight a critical and underexplored challenge, opening new directions for building reliable voice interface systems.
[379] EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation
Tianheng Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng
Main category: cs.SD
TL;DR: EGSTalker is a real-time audio-driven talking head generation framework using 3D Gaussian Splatting that achieves high-quality facial animations with fast training (3-5 minutes) and superior inference speed.
Details
Motivation: To enhance both speed and visual fidelity in real-time talking head generation, addressing the need for efficient multimedia applications with minimal training data requirements.
Method: Two-stage framework: 1) Static Gaussian initialization using multi-resolution hash triplane and KAN for spatial feature extraction, 2) Audio-driven deformation with Efficient Spatial-Audio Attention module to fuse audio-spatial cues and KAN for predicting Gaussian deformations.
Result: Achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods while significantly outperforming them in inference speed.
Conclusion: EGSTalker demonstrates strong potential for real-time multimedia applications with its efficient training and fast inference capabilities.
Abstract: This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) are used to extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module to fuse audio and spatial cues, while KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed. These results highlight EGSTalker’s potential for real-time multimedia applications.
[380] Audible Networks: Deconstructing and Manipulating Sounds with Deep Non-Negative Autoencoders
Juan José Burred, Carmine-Emanuele Cella
Main category: cs.SD
TL;DR: Non-Negative Autoencoders (NAEs) enable interpretable audio decomposition and user-guided sound manipulation through hierarchical representations with adjustable granularity.
Details
Motivation: To extend traditional NMF-based approaches for more versatile and scalable sound deconstruction and creative manipulation, providing interpretable decompositions where components can be directly understood and manipulated.
Method: Uses Non-Negative Autoencoders with projected gradient descent to enforce non-negativity constraints, enabling multi-layer deep architectures for hierarchical sound representations at multiple abstraction levels.
Result: NAEs successfully decompose sounds into interpretable spectral shapes and temporal envelopes, allowing novel manipulation operations including cross-component/layer synthesis, hierarchical deconstructions, and randomization strategies for timbre and event density control.
Conclusion: NAEs serve as flexible and interpretable tools for object-based sound editing, enabling expressive, controllable sound transformations through hierarchical audio representations.
Abstract: We propose the use of Non-Negative Autoencoders (NAEs) for sound deconstruction and user-guided manipulation of sounds for creative purposes. NAEs offer a versatile and scalable extension of traditional Non-Negative Matrix Factorization (NMF)-based approaches for interpretable audio decomposition. By enforcing non-negativity constraints through projected gradient descent, we obtain decompositions where internal weights and activations can be directly interpreted as spectral shapes and temporal envelopes, and where components can themselves be listened to as individual sound events. In particular, multi-layer Deep NAE architectures enable hierarchical representations with an adjustable level of granularity, allowing sounds to be deconstructed at multiple levels of abstraction: from high-level note envelopes down to fine-grained spectral details. This framework enables a wide new range of expressive, controllable, and randomized sound transformations. We introduce novel manipulation operations including cross-component and cross-layer synthesis, hierarchical deconstructions, and several randomization strategies that control timbre and event density. Through visualizations and resynthesis of practical examples, we demonstrate how NAEs can serve as flexible and interpretable tools for object-based sound editing.
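A compact sketch of a deep NAE with non-negativity enforced by projection: after each optimizer step, all parameters are clamped to the non-negative orthant, so decoder weights remain interpretable as spectral templates. Layer sizes and the clamp-everything policy are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepNAE(nn.Module):
    """Two-layer non-negative autoencoder over magnitude-spectrogram frames.
    ReLU keeps activations non-negative; project() keeps the weights so."""

    def __init__(self, n_bins=513, h1=32, h2=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, h1), nn.ReLU(),
                                 nn.Linear(h1, h2), nn.ReLU())
        self.dec = nn.Sequential(nn.Linear(h2, h1), nn.ReLU(),
                                 nn.Linear(h1, n_bins), nn.ReLU())

    def forward(self, x):  # x: (frames, n_bins), non-negative magnitudes
        return self.dec(self.enc(x))

    @torch.no_grad()
    def project(self):
        """Projection step of projected gradient descent: clamp every
        parameter to the non-negative orthant after the optimizer step."""
        for p in self.parameters():
            p.clamp_(min=0.0)

# typical loop fragment: loss.backward(); optimizer.step(); model.project()
```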
[381] ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
Yuxuan Jiang, Zehua Chen, Zeqian Ju, Yusheng Dai, Weibei Dou, Jun Zhu
Main category: cs.SD
TL;DR: ControlAudio is a progressive diffusion modeling approach for controllable text-to-audio generation that achieves state-of-the-art performance in temporal accuracy and speech clarity through multi-task learning.
Details
Motivation: Existing controllable TTA generation methods suffer from compromised performance at scale due to data scarcity, despite exploring fine-grained control signals like timing and phoneme features.
Method: Progressive diffusion modeling with three stages: 1) Data construction spanning annotation and simulation to augment condition information, 2) Pretraining DiT on text-audio pairs then incrementally integrating timing and phoneme features, 3) Progressively guided generation at inference that sequentially emphasizes fine-grained information.
Result: ControlAudio achieves state-of-the-art performance in temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations.
Conclusion: The progressive diffusion modeling approach effectively addresses data scarcity in controllable TTA generation and enables scalable fine-grained control through multi-task learning.
Abstract: Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.
[382] VM-UNSSOR: Unsupervised Neural Speech Separation Enhanced by Higher-SNR Virtual Microphone Arrays
Shulin He, Zhong-Qiu Wang
Main category: cs.SD
TL;DR: VM-UNSSOR improves unsupervised speech separation by augmenting limited-microphone training mixtures with virtual-microphone signals to strengthen mixture consistency constraints and address frequency permutation problems.
Details
Motivation: UNSSOR's mixture consistency loss weakens when training with fewer microphones, causing dramatic performance drops in determined microphone setups.
Method: Augment observed training mixtures with higher-SNR virtual-microphone signals obtained by applying linear spatial demixers (IVA, spatial clustering) to compute extra mixture consistency losses.
Result: In 6-mic setup: VM-UNSSOR achieves 17.1 dB SI-SDR vs UNSSOR’s 14.7 dB. In 2-mic setup: VM-UNSSOR achieves 10.7 dB vs UNSSOR’s -2.7 dB collapse.
Conclusion: Virtual-microphone augmentation effectively strengthens mixture consistency constraints in limited-microphone scenarios, significantly improving unsupervised speech separation performance.
Abstract: Blind speech separation (BSS) aims to recover multiple speech sources from multi-channel, multi-speaker mixtures under unknown array geometry and room impulse responses. In the unsupervised setup, where clean target speech is not available for model training, UNSSOR proposes a mixture consistency (MC) loss for training deep neural networks (DNNs) on over-determined training mixtures to realize unsupervised speech separation. However, when the number of microphones of the training mixtures decreases, the MC constraint weakens and the separation performance falls dramatically. To address this, we propose VM-UNSSOR, augmenting the observed training mixture signals recorded by a limited number of microphones with several higher-SNR virtual-microphone (VM) signals, which are obtained by applying linear spatial demixers (such as IVA and spatial clustering) to the observed training mixtures. As linear projections of the observed mixtures, the virtual-microphone signals can typically increase the SNR of each source and can be leveraged to compute extra MC losses to improve UNSSOR and address the frequency permutation problem in UNSSOR. On the SMS-WSJ dataset, in the over-determined six-microphone, two-speaker separation setup, VM-UNSSOR reaches 17.1 dB SI-SDR, while UNSSOR only obtains 14.7 dB; and in the determined two-microphone, two-speaker case, UNSSOR collapses to -2.7 dB SI-SDR, while VM-UNSSOR achieves 10.7 dB.
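Since the virtual microphones are just linear projections of the observed mixtures, constructing them is a one-liner once the demixers are available; the (M, M) demixing-matrix interface below is an assumption standing in for IVA or spatial-clustering output.

```python
import numpy as np

def add_virtual_mics(mixtures, demixers):
    """mixtures: (M, T) observed signals; demixers: list of (M, M) linear
    demixing matrices. Each demixer output is a higher-SNR linear projection
    of the observations, appended as extra "virtual" channels on which
    additional mixture-consistency losses can be computed."""
    virtual = [W @ mixtures for W in demixers]        # each product is (M, T)
    return np.concatenate([mixtures] + virtual, axis=0)
```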
[383] DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
Main category: cs.SD
TL;DR: DiTSinger: A scalable diffusion transformer model for singing voice synthesis that uses a two-stage pipeline to generate training data and incorporates implicit alignment without phoneme-level duration labels.
Details
Motivation: Address limitations of data scarcity and model scalability in diffusion-based singing voice synthesis while improving robustness under noisy alignments.
Method: Two-stage pipeline: 1) Create seed set with fixed melodies and LLM-generated lyrics, 2) Train melody-specific models to synthesize training data. Propose DiTSinger diffusion transformer with RoPE and qk-norm, scaled in depth/width/resolution. Implicit alignment mechanism constrains phoneme-to-acoustic attention within character-level spans.
Result: Synthesized over 500 hours of high-quality Chinese singing data. Extensive experiments validate scalable, alignment-free, and high-fidelity SVS performance.
Conclusion: The approach enables scalable, alignment-free, and high-fidelity singing voice synthesis through data generation pipeline and improved diffusion transformer architecture with implicit alignment.
Abstract: Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
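The implicit alignment constraint can be pictured as a block-diagonal attention mask over character spans, sketched below; the id-per-phoneme / id-per-frame encoding is an assumed simplification of how character-span information would be represented.

```python
import torch

def char_span_mask(phone_char_ids, frame_char_ids):
    """Attention mask for implicit alignment: a phoneme may attend to an
    acoustic frame only if both belong to the same character-level span,
    so no phoneme-level duration labels are needed.

    phone_char_ids: (P,) long tensor, character index of each phoneme
    frame_char_ids: (T,) long tensor, character index of each frame
    Returns a (P, T) boolean mask, True = attention allowed."""
    return phone_char_ids[:, None] == frame_char_ids[None, :]

# usage inside cross-attention: scores.masked_fill_(~mask, float("-inf"))
```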
[384] Unsupervised Speech Dereverberation Using a Hybrid Model (Déréverbération non-supervisée de la parole par modèle hybride)
Louis Bahrman, Mathieu Fontaine, Gaël Richard
Main category: cs.SD
TL;DR: Unsupervised speech dereverberation using only reverberant speech and limited acoustic information like RT60, achieving consistent performance across metrics.
Details
Motivation: Existing dereverberation methods require paired dry/reverberant data which is difficult to obtain, creating a need for unsupervised approaches.
Method: Uses limited acoustic information (reverberation time RT60) to train dereverberation system in unsupervised manner with only reverberant speech.
Result: Achieves more consistent performance across various objective metrics than state-of-the-art methods.
Conclusion: Proposed unsupervised approach using limited acoustic information is effective for speech dereverberation without requiring paired training data.
Abstract: This paper introduces a new training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired dry/reverberant data, which is difficult to obtain. Our approach uses limited acoustic information, like the reverberation time (RT60), to train a dereverberation system. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics than the state-of-the-art.
[385] O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion
Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang
Main category: cs.SD
TL;DR: A new voice conversion approach using synthetic speech data from pretrained TTS models to create source-target pairs, enabling direct mapping between voices while preserving linguistic content and improving zero-shot performance.
Details
Motivation: Traditional voice conversion methods struggle with effectively disentangling speaker identity and linguistic information, often leading to information loss during training.
Method: Uses synthetic speech data generated by pretrained multispeaker TTS models to create input-output pairs with same linguistic content but different speaker identities, training the model to learn direct mapping between source and target voices.
Result: Achieves 16.35% relative reduction in word error rate and 5.91% improvement in speaker cosine similarity, outperforming state-of-the-art methods.
Conclusion: The proposed method effectively captures speaker-specific characteristics while preserving linguistic content and demonstrates strong generalization to unseen speakers and languages in zero-shot scenarios.
Abstract: Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic speech data generated by a high-quality, pretrained multispeaker text-to-speech (TTS) model. Specifically, synthetic data pairs that share the same linguistic content but differ in speaker identity are used as input-output pairs to train the voice conversion model. This enables the model to learn a direct mapping between source and target voices, effectively capturing speaker-specific characteristics while preserving linguistic content. Additionally, we introduce a flexible training strategy for any-to-any voice conversion that generalizes well to unseen speakers and new languages, enhancing adaptability and performance in zero-shot scenarios. Our experiments show that our proposed method achieves a 16.35% relative reduction in word error rate and a 5.91% improvement in speaker cosine similarity, outperforming several state-of-the-art methods. Voice conversion samples can be accessed at: https://oovc-emnlp-2025.github.io/
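The data-construction idea is simple enough to sketch in a few lines: synthesize the same text with two different speaker identities and treat the results as a parallel pair. The tts(text, speaker) interface is a hypothetical stand-in for the pretrained zero-shot multispeaker TTS model.

```python
import random

def make_parallel_pairs(tts, texts, speakers, seed=0):
    """Build synthetic parallel VC training pairs: same linguistic content,
    different speaker identity. `tts(text, speaker) -> waveform` is an
    assumed interface; texts and speakers are plain lists."""
    rng = random.Random(seed)
    pairs = []
    for text in texts:
        src_spk, tgt_spk = rng.sample(speakers, 2)  # two distinct identities
        source = tts(text, src_spk)  # same words, speaker A (model input)
        target = tts(text, tgt_spk)  # same words, speaker B (model target)
        pairs.append((source, target))
    return pairs
```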
[386] MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Main category: cs.SD
TL;DR: MMAudioSep is a generative model for video/text-queried sound separation built on a pretrained video-to-audio model, achieving superior performance over baseline models while retaining original generation capabilities.
Details
Motivation: To leverage knowledge from pretrained audio generative models for efficient training of sound separation models, avoiding the need for training from scratch.
Method: Fine-tunes a pretrained video-to-audio generative model for sound separation tasks using video/text queries.
Result: MMAudioSep outperforms existing separation models (both deterministic and generative approaches) and maintains video-to-audio generation capability after fine-tuning.
Conclusion: Foundational sound generation models have strong potential for adaptation to sound-related downstream tasks like separation.
Abstract: We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
[387] Emotion-Disentangled Embedding Alignment for Noise-Robust and Cross-Corpus Speech Emotion Recognition
Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu
Main category: cs.SD
TL;DR: A two-step EDRL-MEA approach enhances speech emotion recognition robustness by learning emotion-disentangled representations and aligning them in a joint subspace, improving performance on noisy and cross-corpus data.
Details
Motivation: Speech emotion recognition effectiveness is limited by noisy environments and dataset variability, requiring more robust and generalizable models.
Method: Two-step approach: EDRL extracts class-specific discriminative features while preserving shared similarities, then MEA projects representations into a joint discriminative latent subspace maximizing covariance with original speech input.
Result: Improved performance on unseen noisy and cross-corpus speech samples, demonstrating enhanced robustness and generalization.
Conclusion: The EDRL-MEA method effectively improves speech emotion recognition under challenging real-world conditions through better representation learning.
Abstract: Effectiveness of speech emotion recognition in real-world scenarios is often hindered by noisy environments and variability across datasets. This paper introduces a two-step approach to enhance the robustness and generalization of speech emotion recognition models through improved representation learning. First, our model employs EDRL (Emotion-Disentangled Representation Learning) to extract class-specific discriminative features while preserving shared similarities across emotion categories. Next, MEA (Multiblock Embedding Alignment) refines these representations by projecting them into a joint discriminative latent subspace that maximizes covariance with the original speech input. The learned EDRL-MEA embeddings are subsequently used to train an emotion classifier using clean samples from publicly available datasets, and are evaluated on unseen noisy and cross-corpus speech samples. Improved performance under these challenging conditions demonstrates the effectiveness of the proposed method.
[388] SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion
Zhao Guo, Ziqian Ning, Guobin Ma, Lei Xie
Main category: cs.SD
TL;DR: SynthVC is a streaming end-to-end voice conversion framework that learns speaker timbre transformation from synthetic parallel data, enabling low-latency real-time conversion without explicit content-speaker separation modules.
Details
Motivation: Current VC models struggle with real-time streaming due to high latency, ASR dependency, and complex speaker disentanglement causing timbre leakage and degraded naturalness.
Method: Uses synthetic parallel data from pre-trained zero-shot VC model to directly learn speaker timbre transformation, built on neural audio codec architecture for streaming inference.
Result: Outperforms baseline streaming VC systems in naturalness and speaker similarity with 77.1 ms end-to-end latency.
Conclusion: SynthVC provides an effective streaming VC solution that eliminates complex separation modules while maintaining high fidelity and low latency.
Abstract: Voice Conversion (VC) aims to modify a speaker’s timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
[389] WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations
Hui Wang, Jiaming Zhou, Jiabei He, Haoqin Sun, Yong Qin
Main category: cs.SD
TL;DR: WildElder is a Mandarin elderly speech corpus collected from online videos with fine-grained annotations to address the lack of diverse real-world elderly speech data.
Details
Motivation: Existing Chinese datasets are mostly recorded in controlled environments, limiting diversity and real-world applicability for elderly speech processing, which has unique challenges due to age-related changes.
Method: Collected elderly speech from online videos and enriched with fine-grained manual annotations including transcription, speaker age, gender, and accent strength.
Result: Experimental results reveal both the difficulties of elderly speech recognition and the potential of WildElder as a challenging new benchmark.
Conclusion: WildElder enables robust research on automatic speech recognition and speaker profiling by combining the realism of in-the-wild data with expert curation.
Abstract: Elderly speech poses unique challenges for automatic processing due to age-related changes such as slower articulation and vocal tremors. Existing Chinese datasets are mostly recorded in controlled environments, limiting their diversity and real-world applicability. To address this gap, we present WildElder, a Mandarin elderly speech corpus collected from online videos and enriched with fine-grained manual annotations, including transcription, speaker age, gender, and accent strength. Combining the realism of in-the-wild data with expert curation, WildElder enables robust research on automatic speech recognition and speaker profiling. Experimental results reveal both the difficulties of elderly speech recognition and the potential of WildElder as a challenging new benchmark. The dataset and code are available at https://github.com/NKU-HLT/WildElder.
[390] TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
Tri Ton, Ji Woo Hong, Chang D. Yoo
Main category: cs.SD
TL;DR: TARO is a novel video-to-audio synthesis framework that uses flow-based transformers with timestep-adaptive representation alignment and onset-aware conditioning to achieve high-fidelity and temporally coherent audio generation.
Details
Motivation: To address the challenges of achieving high-fidelity and temporally coherent video-to-audio synthesis, particularly in maintaining synchronization with dynamic visual events and ensuring smooth audio evolution.
Method: Built on flow-based transformers for stable training, TARO introduces two key components: Timestep-Adaptive Representation Alignment (TRA) that dynamically adjusts alignment strength based on noise schedule, and Onset-Aware Conditioning (OAC) that integrates onset cues as event-driven markers for better synchronization.
Result: TARO outperforms prior methods on VGGSound and Landscape datasets, achieving 53% lower Frechet Distance, 29% lower Frechet Audio Distance, and 97.19% Alignment Accuracy.
Conclusion: The proposed TARO framework demonstrates superior audio quality and synchronization precision for video-to-audio synthesis through its innovative representation alignment and onset-aware conditioning mechanisms.
Abstract: This paper introduces Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally coherent video-to-audio synthesis. Built upon flow-based transformers, which offer stable training and continuous transformations for enhanced synchronization and audio quality, TARO introduces two key innovations: (1) Timestep-Adaptive Representation Alignment (TRA), which dynamically aligns latent representations by adjusting alignment strength based on the noise schedule, ensuring smooth evolution and improved fidelity, and (2) Onset-Aware Conditioning (OAC), which integrates onset cues that serve as sharp event-driven markers of audio-relevant visual moments to enhance synchronization with dynamic visual events. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms prior methods, achieving a relative 53% lower Frechet Distance (FD), a relative 29% lower Frechet Audio Distance (FAD), and a 97.19% Alignment Accuracy, highlighting its superior audio quality and synchronization precision.
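One plausible reading of "alignment strength follows the noise schedule" is a monotone weight on the representation-alignment term, sketched below; the schedule shape, timestep convention, and coefficient are assumptions, since the abstract does not give the exact rule.

```python
def alignment_weight(t, T=1000, lam_max=1.0, power=2.0):
    """Timestep-adaptive coefficient for a representation-alignment loss.
    Convention assumed here: t=0 is clean data, t=T is pure noise, and
    alignment is strongest where the latent is noisiest."""
    return lam_max * (t / T) ** power

# total_loss = flow_matching_loss + alignment_weight(t) * alignment_loss
```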
[391] Direction Estimation of Sound Sources Using Microphone Arrays and Signal Strength
Mahdi Ali Pour, Zahra Habibzadeh
Main category: cs.SD
TL;DR: A sound-tracking method using three electret microphones that estimates sound direction by comparing average power levels, achieving less than 6 degrees localization error and 98% precision with simple, cost-effective hardware.
Details
Motivation: Sound-tracking is essential for applications like security systems, acoustic monitoring, and speaker tracking, but existing systems face challenges with directional accuracy, precision, and complex hardware/signal processing requirements.
Method: Uses three strategically placed electret microphones and analyzes signals by comparing the average power of received signals to infer the most probable direction of the sound source.
Result: The system achieves localization error of less than 6 degrees and 98% precision, with straightforward hardware design ensuring simplicity and affordability.
Conclusion: This technique provides a robust and reliable solution for sound-tracking and localization that is versatile, adaptable, and suitable for diverse applications including security systems, smart homes, and acoustic monitoring.
Abstract: Sound-tracking refers to the process of determining the direction from which a sound originates, making it a fundamental component of sound source localization. This capability is essential in a variety of applications, including security systems, acoustic monitoring, and speaker tracking, where accurately identifying the direction of a sound source enables real-time responses, efficient resource allocation, and improved situational awareness. While sound-tracking is closely related to localization, it specifically focuses on identifying the direction of the sound source rather than estimating its exact position in space. Despite its utility, sound-tracking systems face several challenges, such as maintaining directional accuracy and precision, along with the need for sophisticated hardware configurations and complex signal processing algorithms. This paper presents a sound-tracking method using three electret microphones. We estimate the direction of a sound source using a lightweight method that analyzes signals from three strategically placed microphones. By comparing the average power of the received signals, the system infers the most probable direction of the sound. The results indicate that the power level from each microphone effectively determines the sound source direction. Our system employs a straightforward and cost-effective hardware design, ensuring simplicity and affordability in implementation. It achieves a localization error of less than 6 degrees and a precision of 98%. Additionally, its effortless integration with various systems makes it versatile and adaptable. Consequently, this technique presents a robust and reliable solution for sound-tracking and localization, with potential applications spanning diverse domains such as security systems, smart homes, and acoustic monitoring.
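A toy implementation of the power-comparison idea, assuming three microphones facing 0, 120, and 240 degrees (the paper says only "strategically placed"): per-mic average power weights a circular mean of the facing angles. This estimator is a plausible reconstruction, not the paper's exact rule.

```python
import numpy as np

MIC_ANGLES = np.deg2rad([0.0, 120.0, 240.0])  # assumed mic facing directions

def estimate_direction(frames):
    """frames: (3, N) array of samples from the three electret microphones.
    Average signal power per mic weights that mic's facing angle; the
    weighted circular mean gives the most probable source direction."""
    power = np.mean(frames.astype(float) ** 2, axis=1)  # (3,) average powers
    w = power / power.sum()
    x = np.sum(w * np.cos(MIC_ANGLES))  # weighted circular mean
    y = np.sum(w * np.sin(MIC_ANGLES))
    return np.rad2deg(np.arctan2(y, x)) % 360.0  # degrees in [0, 360)
```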
[392] From Coarse to Fine: Recursive Audio-Visual Semantic Enhancement for Speech Separation
Ke Xue, Rongfei Fan, Lixin, Dawei Zhao, Chao Zhu, Han Hu
Main category: cs.SD
TL;DR: CSFNet is a coarse-to-fine audio-visual speech separation network that uses recursive semantic enhancement through a two-stage process: coarse separation followed by fine separation with audio-visual speech recognition feedback.
Details
Motivation: Existing audio-visual speech separation methods underexploit visual cues by relying on static visual representations, failing to fully leverage the complementary semantic guidance that visual information provides.
Method: Two-stage approach: (1) Coarse separation reconstructs initial audio from mixture and visual input; (2) Fine separation feeds coarse audio back into AVSR model with visual stream for recursive semantic enhancement. Includes speaker-aware perceptual fusion and multi-range spectro-temporal separation network.
Result: Achieves state-of-the-art performance on three benchmark datasets and two noisy datasets, with substantial coarse-to-fine improvements.
Conclusion: The recursive semantic enhancement framework is necessary and effective for audio-visual speech separation, demonstrating significant performance gains over existing methods.
Abstract: Audio-visual speech separation aims to isolate each speaker’s clean voice from mixtures by leveraging visual cues such as lip movements and facial features. While visual information provides complementary semantic guidance, existing methods often underexploit its potential by relying on static visual representations. In this paper, we propose CSFNet, a Coarse-to-Separate-Fine Network that introduces a recursive semantic enhancement paradigm for more effective separation. CSFNet operates in two stages: (1) Coarse Separation, where a first-pass estimation reconstructs a coarse audio waveform from the mixture and visual input; and (2) Fine Separation, where the coarse audio is fed back into an audio-visual speech recognition (AVSR) model together with the visual stream. This recursive process produces more discriminative semantic representations, which are then used to extract refined audio. To further exploit these semantics, we design a speaker-aware perceptual fusion block to encode speaker identity across modalities, and a multi-range spectro-temporal separation network to capture both local and global time-frequency patterns. Extensive experiments on three benchmark datasets and two noisy datasets show that CSFNet achieves state-of-the-art (SOTA) performance, with substantial coarse-to-fine improvements, validating the necessity and effectiveness of our recursive semantic enhancement framework.
cs.LG
[393] Reinforcement Learning-Driven Edge Management for Reliable Multi-view 3D Reconstruction
Motahare Mounesan, Sourya Saha, Houchao Gan, Md. Nurul Absur, Saptarshi Debroy
Main category: cs.LG
TL;DR: RL-based edge resource management framework for reliable 3D reconstruction in dynamic edge environments using cooperative Q-learning agents for camera and server selection.
Details
Motivation: Real-time multi-view 3D reconstruction is critical for edge applications like fire rescue, but dynamic edge resource availability causes disruptions that challenge reconstruction reliability.
Method: Two cooperative Q-learning agents (camera selection and server selection) operating online through interactions with edge environment, implemented on distributed testbed with lab devices and FABRIC infrastructure.
Result: Framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
Conclusion: The RL-based approach successfully enables high-quality 3D reconstruction within reasonable time despite resource-constrained and disruption-prone edge environments.
Abstract: Real-time multi-view 3D reconstruction is a mission-critical application for key edge-native use cases, such as fire rescue, where timely and accurate 3D scene modeling enables situational awareness and informed decision-making. However, the dynamic and unpredictable nature of edge resource availability introduces disruptions, such as degraded image quality, unstable network links, and fluctuating server loads, which challenge the reliability of the reconstruction pipeline. In this work, we present a reinforcement learning (RL)-based edge resource management framework for reliable 3D reconstruction to ensure high quality reconstruction within a reasonable amount of time, despite the system operating under a resource-constrained and disruption-prone environment. In particular, the framework adopts two cooperative Q-learning agents, one for camera selection and one for server selection, both of which operate entirely online, learning policies through interactions with the edge environment. To support learning under realistic constraints and evaluate system performance, we implement a distributed testbed comprising lab-hosted end devices and FABRIC infrastructure-hosted edge servers to emulate smart city edge infrastructure under realistic disruption scenarios. Results show that the proposed framework improves application reliability by effectively balancing end-to-end latency and reconstruction quality in dynamic environments.
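The two-agent design lends itself to a compact sketch: two independent tabular Q-learning agents coupled through a shared reward that trades off quality against latency. The state and reward definitions below are placeholders for the paper's edge-environment signals.

```python
import numpy as np

class QAgent:
    """Tabular Q-learning agent; one instance per decision (cameras, server)."""

    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.9, eps=0.1):
        self.q = np.zeros((n_states, n_actions))
        self.lr, self.gamma, self.eps = lr, gamma, eps

    def act(self, s, rng):
        if rng.random() < self.eps:                 # explore
            return int(rng.integers(self.q.shape[1]))
        return int(np.argmax(self.q[s]))            # exploit

    def update(self, s, a, r, s_next):
        target = r + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.lr * (target - self.q[s, a])

rng = np.random.default_rng(0)
cam_agent = QAgent(n_states=4, n_actions=3)  # states: e.g. load buckets
srv_agent = QAgent(n_states=4, n_actions=2)

s = 0
for _ in range(1000):
    cam, srv = cam_agent.act(s, rng), srv_agent.act(s, rng)
    # Placeholder reward standing in for the real feedback signal, which
    # would combine reconstruction quality and end-to-end latency.
    reward = rng.normal(loc=1.0 - 0.3 * abs(cam - srv))
    s_next = int(rng.integers(4))
    cam_agent.update(s, cam, reward, s_next)  # a shared reward couples the
    srv_agent.update(s, srv, reward, s_next)  # two agents' policies
    s = s_next
print(cam_agent.q.round(2))
```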
[394] Energy-Driven Steering: Reducing False Refusals in Large Language Models
Eric Hanchen Jiang, Weixuan Ou, Run Liu, Shengyuan Pang, Guancheng Wan, Ranjie Duan, Wei Dong, Kai-Wei Chang, XiaoFeng Wang, Ying Nian Wu, Xinfeng Li
Main category: cs.LG
TL;DR: Energy-Driven Steering (EDS) is a fine-tuning free framework that uses an external Energy-Based Model to dynamically steer LLM activations during inference, reducing false refusals while maintaining safety.
Details
Motivation: Current safety alignment techniques make LLMs over-cautious, causing them to refuse benign prompts. The goal is to enhance safety while reducing false refusals.
Method: Train a lightweight Energy-Based Model to assign high energy to undesirable states and low energy to desirable ones. During inference, use energy gradients to dynamically steer LLM hidden states without modifying weights.
Result: Substantially lowers false refusal rates, raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining baseline safety performance.
Conclusion: EDS presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety through inference-time intervention.
Abstract: Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often only focus on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. Therefore, a key objective of safety alignment is to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning free framework designed to resolve this challenge through dynamic, inference-time intervention. We trained a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable (false refusal or jailbreak) states and low energy to desirable (helpful response or safe reject) ones. During inference, EBM maps the LLM’s internal activations to an “energy landscape”. We use the gradient of the energy function to dynamically steer the LLM’s hidden states to low energy regions, correcting the model to generate a desirable response in real-time without modifying its weights. This method decouples behavioral control from the model’s core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show our method successfully achieves this objective: it substantially lowers false refusal rates, for example raising compliance on the ORB-H benchmark from 57.3% to 82.6% while maintaining the baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
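The steering step itself is simple to sketch: gradient descent on a learned energy over hidden states, leaving the model weights untouched. The quadratic energy below is a toy stand-in for the trained EBM.

```python
import torch

torch.manual_seed(0)
anchor = torch.randn(768)  # toy "desirable region" center

def energy(h):
    """Toy quadratic stand-in for the trained Energy-Based Model."""
    return 0.5 * ((h - anchor) ** 2).sum()

def steer(hidden, step_size=0.05, n_steps=3):
    """Nudge a hidden state toward low-energy regions at inference time.

    Only the activation is edited; the LLM's weights stay untouched,
    mirroring the inference-time intervention described above.
    """
    h = hidden.detach().clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(h), h)[0]
        h = (h - step_size * grad).detach().requires_grad_(True)
    return h.detach()

h0 = torch.randn(768)
print(energy(h0).item(), energy(steer(h0)).item())  # energy should drop
```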
[395] Inverse-Free Wilson Loops for Transformers: A Practical Diagnostic for Invariance and Order Sensitivity
Edward Y. Chang, Ethan Y. Chang
Main category: cs.LG
TL;DR: WILSON is a diagnostic tool that detects when LLM outputs change under harmless edits like reordering, helping prevent reliability issues without model retraining.
Details
Motivation: LLMs exhibit sensitivity to harmless edits (passage reordering, fine-tuning, prompt variations) that violate intended invariances, break CI/CD, and force safety-speed tradeoffs.
Method: WILSON uses inverse-free curvature maps with JVPs and Hutchinson probes, plus activation-level commutators to flag reorder risks in Transformer representations.
Result: Provides cheap, model-agnostic signals for orchestrators to guard against order effects, catch fine-tuning regressions, and stabilize debate pathways and long contexts.
Conclusion: WILSON enables reliability and throughput improvements without architectural changes by anticipating failures and approving safe optimizations.
Abstract: Large language models can change answers under harmless edits that matter in practice: RAG outputs flip when passages are reordered, fine-tuning erodes invariances learned at pretraining, debate or chain-of-thought prompts take path-dependent routes, and compiler fusion or reordering perturbs logits near decision boundaries. These failures violate intended invariances, break continuous integration, and force teams to trade safety for speed. The effects are small yet distributed across layers and positions, sensitive to context length and evaluation order, and costly to repair with retraining or formal verification. We present WILSON, a minimal post-hoc diagnostic suite that converts simple loop and reordering checks on internal representations into system signals. WILSON combines an inverse-free curvature map over positions and layers, computed with JVPs and Hutchinson probes, with activation-level commutators that flag reorder risk. Signals are cheap to compute, model-agnostic for standard Transformers, and exported as thresholds and CSV artifacts for orchestrators. This enables concrete actions: guard RAG against order effects, catch fine-tuning regressions, stabilize debate pathways and long multi-turn contexts, and gate fusions or reorders in deployment. In short, WILSON helps anticipate failures and approve safe optimizations so reliability and throughput can improve together without changing model architecture or training.
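The inverse-free curvature signal can be illustrated with Hutchinson-style probing through Jacobian-vector products, which never materializes a full Jacobian. The map `f` below is a stand-in for a Transformer block, and the quantity estimated (the squared Frobenius norm of the Jacobian) is one representative choice, not necessarily WILSON's exact statistic.

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 32) / 32 ** 0.5

def f(x):
    """Stand-in for a Transformer block's position-wise map."""
    return torch.tanh(x @ W)

def curvature_probe(x, n_probes=16):
    """Hutchinson-style estimate of ||J_f(x)||_F^2 using only JVPs.

    With Rademacher probes v, E[||J v||^2] equals the squared Frobenius
    norm of the Jacobian, so no inverse or full Jacobian is ever formed.
    """
    total = 0.0
    for _ in range(n_probes):
        v = torch.randint(0, 2, x.shape).float() * 2 - 1  # Rademacher +/-1
        _, jv = torch.autograd.functional.jvp(f, (x,), (v,))
        total += (jv ** 2).sum().item()
    return total / n_probes

print(curvature_probe(torch.randn(32)))
```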
[396] Knowledge Graph Sparsification for GNN-based Rare Disease Diagnosis
Premt Cara, Kamilia Zaripova, David Bani-Harouni, Nassir Navab, Azade Farshad
Main category: cs.LG
TL;DR: RareNet is a subgraph-based Graph Neural Network that uses only patient phenotypes to identify causal genes for rare diseases, serving as a standalone tool or enhancer for other methods.
Details
Motivation: Address challenges in rare genetic disease diagnosis including insufficient patient data, inaccessible full genome sequencing, and the vast number of possible causative genes, which lead to prolonged diagnostic journeys and disproportionately affect resource-limited settings.
Method: Subgraph-based Graph Neural Network that requires only patient phenotypes to identify causal genes and retrieve focused patient subgraphs for targeted clinical investigation.
Result: Demonstrates competitive and robust causal gene prediction on two biomedical datasets, with significant performance gains when integrated with other frameworks.
Conclusion: RareNet democratizes access to sophisticated genetic analysis by requiring only phenotypic data, offering particular value for underserved populations lacking advanced genomic infrastructure.
Abstract: Rare genetic disease diagnosis faces critical challenges: insufficient patient data, inaccessible full genome sequencing, and the immense number of possible causative genes. These limitations cause prolonged diagnostic journeys, inappropriate treatments, and critical delays, disproportionately affecting patients in resource-limited settings where diagnostic tools are scarce. We propose RareNet, a subgraph-based Graph Neural Network that requires only patient phenotypes to identify the most likely causal gene and retrieve focused patient subgraphs for targeted clinical investigation. RareNet can function as a standalone method or serve as a pre-processing or post-processing filter for other candidate gene prioritization methods, consistently enhancing their performance while potentially enabling explainable insights. Through comprehensive evaluation on two biomedical datasets, we demonstrate competitive and robust causal gene prediction and significant performance gains when integrated with other frameworks. By requiring only phenotypic data, which is readily available in any clinical setting, RareNet democratizes access to sophisticated genetic analysis, offering particular value for underserved populations lacking advanced genomic infrastructure.
[397] Inner-Instance Normalization for Time Series Forecasting
Zipo Jibao, Yingyi Fu, Xinyang Chen, Guoting Chen
Main category: cs.LG
TL;DR: Proposes two point-level methods (LD and LCD) to address inner-instance distribution shifts in time series forecasting by handling distribution changes within individual instances.
Details
Motivation: Real-world time series exhibit complex non-stationary characteristics that cause distribution shifts, negatively impacting model performance. Existing instance normalization techniques fail to address shifts within individual instances.
Method: Two novel point-level methods: Learning Distribution (LD) eliminates internal discrepancies by fitting input and output distributions with different parameters at different time steps; Learning Conditional Distribution (LCD) uses neural networks to predict scaling coefficients of the output.
Result: Evaluated with various backbone models across public benchmarks, demonstrating effectiveness of the point-level paradigm through comparative experiments.
Conclusion: The proposed point-level methods effectively address inner-instance distribution shifts in time series forecasting, improving model performance.
Abstract: Real-world time series are influenced by numerous factors and exhibit complex non-stationary characteristics. Non-stationarity can lead to distribution shifts, where the statistical properties of time series change over time, negatively impacting model performance. Several instance normalization techniques have been proposed to address distribution shifts in time series forecasting. However, existing methods fail to account for shifts within individual instances, leading to suboptimal performance. To tackle inner-instance distribution shifts, we propose two novel point-level methods: Learning Distribution (LD) and Learning Conditional Distribution (LCD). LD eliminates internal discrepancies by fitting the internal distribution of input and output with different parameters at different time steps, while LCD utilizes neural networks to predict scaling coefficients of the output. We evaluate the performance of the two methods with various backbone models across public benchmarks and demonstrate the effectiveness of the point-level paradigm through comparative experiments.
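A minimal sketch of the point-level idea: where instance normalization fits one mean and scale per instance, each time step of the input and output windows gets its own learnable affine parameters. The module below is an illustrative reading of LD, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PointLevelNorm(nn.Module):
    """Illustrative per-time-step normalization in the spirit of LD.

    Instance normalization uses one mean/scale per instance; here every
    time step of the input and output windows gets its own parameters.
    """

    def __init__(self, in_len, out_len):
        super().__init__()
        self.in_shift = nn.Parameter(torch.zeros(in_len))
        self.in_scale = nn.Parameter(torch.ones(in_len))
        self.out_shift = nn.Parameter(torch.zeros(out_len))
        self.out_scale = nn.Parameter(torch.ones(out_len))

    def normalize(self, x):    # x: (batch, in_len)
        return (x - self.in_shift) / (self.in_scale.abs() + 1e-6)

    def denormalize(self, y):  # y: (batch, out_len)
        return y * (self.out_scale.abs() + 1e-6) + self.out_shift

norm = PointLevelNorm(in_len=96, out_len=24)
backbone = nn.Linear(96, 24)   # any forecasting backbone fits here
x = torch.randn(8, 96)
forecast = norm.denormalize(backbone(norm.normalize(x)))
print(forecast.shape)          # torch.Size([8, 24])
```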
[398] Provably Robust Adaptation for Language-Empowered Foundation Models
Yuni Lai, Xiaoyu Xue, Linghui Shen, Yulun Wu, Gaolei Li, Song Guo, Kai Zhou, Bin Xiao
Main category: cs.LG
TL;DR: LeFCert is the first provably robust few-shot classifier for language-empowered foundation models that defends against poisoning attacks through certified robustness with adaptive blending of textual and feature embeddings.
Details
Motivation: Language-empowered foundation models like CLIP are vulnerable to poisoning attacks on their support datasets, and existing defenses lack formal guarantees and remain vulnerable to adaptive attacks.
Method: Integrates textual and feature embeddings with adaptive blending, uses twofold trimmed mean prototype, derives provable bounds for classification scores, and extends with LeFCert-L (randomized smoothing with dual budget constraints) and LeFCert-C (collective certification for shared poisoning budget).
Result: Achieves state-of-the-art performance with significant improvements in both clean and certified accuracy compared to existing baselines, while maintaining computational efficiency.
Conclusion: LeFCert provides the first provably robust defense against poisoning attacks for few-shot classifiers based on language-empowered foundation models, offering practical and computationally efficient solutions for real-world applications.
Abstract: Language-empowered foundation models (LeFMs), such as CLIP and GraphCLIP, have transformed multimodal learning by aligning visual (or graph) features with textual representations, enabling powerful downstream capabilities like few-shot learning. However, the reliance on small, task-specific support datasets collected in open environments exposes these models to poisoning attacks, where adversaries manipulate the support samples to degrade performance. Existing defenses rely on empirical strategies, which lack formal guarantees and remain vulnerable to unseen and adaptive attacks. Certified robustness offers provable guarantees but has been largely unexplored for few-shot classifiers based on LeFMs. This study seeks to fill these critical gaps by proposing the first provably robust few-shot classifier that is tailored for LeFMs. We term our model Language-empowered Few-shot Certification (\textbf{LeFCert}). It integrates both textual and feature embeddings with an adaptive blending mechanism. To achieve provable robustness, we propose a twofold trimmed mean prototype and derive provable upper and lower bounds for classification scores, enabling certification under worst-case poisoning scenarios. To further enhance the performance, we extend LeFCert with two variants by considering a more realistic and tighter attack budget: LeFCert-L incorporates randomized smoothing to provide Lipschitz continuity and derive robustness under dual budget constraints, and LeFCert-C provides collective certification for scenarios where attackers distribute a shared poisoning budget across multiple samples. Experiments demonstrate that LeFCert achieves state-of-the-art performance, significantly improving both clean and certified accuracy compared to existing baselines. Despite its advanced robustness mechanisms, LeFCert is computationally efficient, making it practical for real-world applications.
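The robustness mechanism at LeFCert's core can be illustrated with a coordinate-wise trimmed mean over support embeddings: dropping the extremes in each dimension caps how far a few poisoned samples can drag the prototype. The trimming fraction below is arbitrary, and the paper's certified bounds are not reproduced here.

```python
import numpy as np

def trimmed_mean_prototype(embeddings, trim=2):
    """Coordinate-wise trimmed mean of support embeddings.

    Dropping the `trim` largest and smallest values in each dimension
    caps the pull a few poisoned support samples can exert on the
    class prototype.
    """
    e = np.sort(np.asarray(embeddings), axis=0)  # sort per dimension
    return e[trim:len(e) - trim].mean(axis=0)

rng = np.random.default_rng(0)
support = rng.normal(size=(10, 64))              # 10 support embeddings
support[0] += 50.0                               # one poisoned sample
plain = support.mean(axis=0)
robust = trimmed_mean_prototype(support, trim=2)
print(np.abs(plain).max(), np.abs(robust).max()) # the outlier's pull is capped
```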
[399] How Scale Breaks “Normalized Stress” and KL Divergence: Rethinking Quality Metrics
Kiran Smelser, Kaviru Gunaratne, Jacob Miller, Stephen Kobourov
Main category: cs.LG
TL;DR: The paper analyzes how common visualization quality metrics (normalized stress and KL divergence) are sensitive to uniform scaling of 2D projections, which doesn’t meaningfully change the visualization. The authors propose a simple technique to make these metrics scale-invariant.
Details
Motivation: High-dimensional data visualization using 2D scatter plots is common, but existing quality metrics are sensitive to uniform scaling, which doesn't affect the actual visualization quality. This scaling sensitivity can mislead evaluations of dimension reduction techniques.
Method: The authors investigate scaling effects analytically and empirically, then introduce a simple technique to make both normalized stress and KL divergence metrics scale-invariant.
Result: The study shows significant changes in metric values due to scaling, which affects dimension reduction evaluations. The proposed scale-invariant modification accurately captures expected behavior on benchmark tests.
Conclusion: Current visualization quality metrics are unnecessarily sensitive to scaling. The proposed scale-invariant versions provide more meaningful evaluations of dimension reduction techniques by focusing on actual visualization quality rather than arbitrary scaling factors.
Abstract: Complex, high-dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two-dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure the projection’s accuracy and faithfulness to the original data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling (stretching, shrinking) of the projection, despite this act not meaningfully changing anything about the projection. Another quality metric, the Kullback–Leibler (KL) divergence used in the popular t-Distributed Stochastic Neighbor Embedding (t-SNE) technique, is also susceptible to this scale sensitivity. We investigate the effect of scaling on stress and KL divergence analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make both metrics scale-invariant and show that it accurately captures expected behavior on a small benchmark.
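One way to realize the scale-invariance fix for normalized stress is to minimize the metric over all uniform scalings of the projection, for which the optimal scale has a closed form. The sketch below follows that route; the paper's exact construction may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist

def normalized_stress(d_high, d_low):
    return np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2)

def scale_invariant_stress(X_high, X_low):
    """Normalized stress minimized over uniform scalings of the projection.

    The optimal scale alpha is a least-squares fit with a closed form,
    so stretching or shrinking the 2D layout no longer moves the metric.
    """
    d_high, d_low = pdist(X_high), pdist(X_low)
    alpha = np.sum(d_high * d_low) / np.sum(d_low ** 2)
    return normalized_stress(d_high, alpha * d_low)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # original data
P = rng.normal(size=(100, 2))     # some 2D projection
print(normalized_stress(pdist(X), pdist(P)),
      normalized_stress(pdist(X), pdist(2 * P)))   # scale-sensitive
print(scale_invariant_stress(X, P),
      scale_invariant_stress(X, 2 * P))            # identical
```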
[400] CATS-Linear: Classification Auxiliary Linear Model for Time Series Forecasting
Zipo Jibao, Yingyi Fu, Xinyang Chen, Guoting Chen
Main category: cs.LG
TL;DR: CATS-Linear is a novel linear model that uses classification-based channel routing and trend-seasonal decomposition to achieve state-of-the-art forecasting performance with fixed hyperparameters.
Details
Motivation: Linear models show competitive forecasting performance but enhancement methodologies are underexplored. The hypothesis is that different time series instances may follow heterogeneous linear mappings, suggesting the need for adaptive routing.
Method: Proposes Classification Auxiliary Channel-Independence (CACI) that dynamically routes instances to dedicated predictors via classification. Also redesigns trend-seasonal decomposition with decoupling-linear mapping-recoupling framework for trends and complex-domain linear projections for seasonal components.
Result: Extensive experiments show that CATS-Linear with fixed hyperparameters achieves accuracy comparable to hyperparameter-tuned baselines and state-of-the-art accuracy against fixed-hyperparameter counterparts.
Conclusion: CATS-Linear demonstrates that enhanced linear models with proper channel routing and decomposition techniques can achieve competitive forecasting performance while maintaining simplicity and fixed hyperparameter settings.
Abstract: Recent research demonstrates that linear models achieve forecasting performance competitive with complex architectures, yet methodologies for enhancing linear models remain underexplored. Motivated by the hypothesis that distinct time series instances may follow heterogeneous linear mappings, we propose the Classification Auxiliary Trend-Seasonal Decoupling Linear Model (CATS-Linear), employing Classification Auxiliary Channel-Independence (CACI). CACI dynamically routes instances to dedicated predictors via classification, enabling supervised channel design. We further analyze the theoretical expected risks of different channel settings. Additionally, we redesign the trend-seasonal decomposition architecture by adding a decoupling – linear mapping – recoupling framework for trend components and complex-domain linear projections for seasonal components. Extensive experiments validate that CATS-Linear with fixed hyperparameters achieves accuracy comparable to hyperparameter-tuned baselines while delivering state-of-the-art accuracy against fixed-hyperparameter counterparts.
[401] DPCformer: An Interpretable Deep Learning Model for Genomic Prediction in Crops
Pengcheng Deng, Kening Liu, Mengxi Zhou, Mingxi Li, Rui Yang, Chuzhe Cao, Maojun Wang, Zeyu Zhang
Main category: cs.LG
TL;DR: DPCformer is a deep learning model combining CNNs and self-attention that significantly improves genomic selection accuracy across multiple crops compared to traditional methods.
Details
Motivation: Traditional Genomic Selection methods have limited accuracy for complex traits and large datasets, requiring more advanced approaches to better model genotype-phenotype relationships.
Method: DPCformer integrates convolutional neural networks with self-attention mechanism, uses 8-dimensional one-hot encoding for SNP data ordered by chromosome, and employs PMF algorithm for feature selection.
Result: DPCformer outperformed existing methods across 13 traits in 5 crops: maize (up to 2.92% improvement), cotton (up to 8.37% gain), tomato (up to 57.35% Pearson correlation increase), and chickpea (16.62% yield correlation boost).
Conclusion: DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.
Abstract: Genomic Selection (GS) uses whole-genome information to predict crop phenotypes and accelerate breeding. Traditional GS methods, however, struggle with prediction accuracy for complex traits and large datasets. We propose DPCformer, a deep learning model integrating convolutional neural networks with a self-attention mechanism to model complex genotype-phenotype relationships. We applied DPCformer to 13 traits across five crops (maize, cotton, tomato, rice, chickpea). Our approach uses an 8-dimensional one-hot encoding for SNP data, ordered by chromosome, and employs the PMF algorithm for feature selection. Evaluations show DPCformer outperforms existing methods. In maize datasets, accuracy for traits like days to tasseling and plant height improved by up to 2.92%. For cotton, accuracy gains for fiber traits reached 8.37%. On small-sample tomato data, the Pearson Correlation Coefficient for a key trait increased by up to 57.35%. In chickpea, the yield correlation was boosted by 16.62%. DPCformer demonstrates superior accuracy, robustness in small-sample scenarios, and enhanced interpretability, providing a powerful tool for precision breeding and addressing global food security challenges.
[402] FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching
Jiacheng Liu, Peiliang Cai, Qinming Zhou, Yuqi Lin, Deyang Kong, Benhao Huang, Yupei Pan, Haowen Xu, Chang Zou, Junshu Tang, Shikang Zheng, Linfeng Zhang
Main category: cs.LG
TL;DR: Frequency-aware Caching (FreqCa) improves diffusion transformer efficiency by reusing low-frequency features and predicting high-frequency ones with interpolation, reducing memory usage by 99% while maintaining performance.
Details
Motivation: Existing feature caching methods assume feature continuity across timesteps, but this doesn't always hold. Analysis reveals different frequency bands exhibit different dynamics: low-frequency components have high similarity but poor continuity, while high-frequency components show good continuity but poor similarity.
Method: Proposed Frequency-aware Caching (FreqCa) that reuses low-frequency features directly based on similarity and uses second-order Hermite interpolation to predict volatile high-frequency features based on continuity. Also introduced caching Cumulative Residual Features (CRF) instead of all layer features.
Result: Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate effectiveness in both generation and editing tasks.
Conclusion: FreqCa successfully addresses diffusion transformer inference costs by leveraging frequency domain insights, achieving significant memory reduction (99%) while maintaining performance across various models and tasks.
Abstract: The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in future timesteps. However, previous feature caching assumes that features in adjacent timesteps are similar or continuous, which does not always hold in all settings. To investigate this, this paper begins with an analysis from the frequency domain, which reveals that different frequency bands in the features of diffusion models exhibit different dynamics across timesteps. Concretely, low-frequency components, which decide the structure of images, exhibit higher similarity but poor continuity. In contrast, the high-frequency bands, which decode the details of images, show significant continuity but poor similarity. These interesting observations motivate us to propose Frequency-aware Caching (FreqCa), which directly reuses features of low-frequency components based on their similarity, while using a second-order Hermite interpolator to predict the volatile high-frequency ones based on their continuity. We further propose to cache Cumulative Residual Feature (CRF) instead of the features in all the layers, which reduces the memory footprint of feature caching by 99%. Extensive experiments on FLUX.1-dev, FLUX.1-Kontext-dev, Qwen-Image, and Qwen-Image-Edit demonstrate its effectiveness in both generation and editing. Codes are available in the supplementary materials and will be released on GitHub.
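A rough sketch of the frequency-split caching idea: FFT the cached features, reuse the low band as-is, and extrapolate the volatile high band from recent steps. For brevity the sketch uses linear extrapolation in place of the paper's second-order Hermite interpolator, and the cutoff is arbitrary.

```python
import torch

def split_bands(feat, cutoff=8):
    """Split features (seq, dim) into low-/high-frequency parts via FFT."""
    spec = torch.fft.rfft(feat, dim=0)
    low, high = spec.clone(), spec.clone()
    low[cutoff:] = 0
    high[:cutoff] = 0
    n = feat.shape[0]
    return torch.fft.irfft(low, n=n, dim=0), torch.fft.irfft(high, n=n, dim=0)

def cached_step(prev2, prev1, cutoff=8):
    """Approximate the next timestep's features from two cached ones.

    Low band: reused as-is (high similarity across timesteps).
    High band: extrapolated from the last two steps -- a first-order
    stand-in for the paper's second-order Hermite predictor.
    """
    low, high1 = split_bands(prev1, cutoff)
    _, high2 = split_bands(prev2, cutoff)
    return low + high1 + (high1 - high2)

f_t2, f_t1 = torch.randn(64, 128), torch.randn(64, 128)
print(cached_step(f_t2, f_t1).shape)  # torch.Size([64, 128])
```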
[403] Don’t Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
Main category: cs.LG
TL;DR: LENS improves RLVR by leveraging negative groups (where no response is correct) through confidence-weighted penalties on incorrect responses, making previously wasted samples informative for gradient updates.
Details
Motivation: GRPO wastes compute on negative groups where no sampled response is correct, yielding zero advantage and no gradient. The paper aims to leverage these negative groups without extra supervision.
Method: Starting from a maximum-likelihood objective in reward modeling, LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations by adding a confidence-weighted penalty on incorrect responses.
Result: On MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, LENS consistently outperforms GRPO baseline, with significant gains on harder items.
Conclusion: LENS provides a principled and practical way to rescue negative groups, improving efficiency and performance in RLVR by converting previously wasted samples into useful gradient updates.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as \textbf{L}ikelihood \textbf{E}stimation with \textbf{N}egative \textbf{S}amples (\textbf{LENS}). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to “rescue” negative groups, improving efficiency and performance in RLVR.
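The key reward modification is easy to see in a toy example: incorrect responses receive a negative reward scaled by the model's confidence, so an all-incorrect group still yields nonzero advantages. The penalty weight `beta` and the exact reward form below are illustrative, not the paper's.

```python
import numpy as np

def lens_rewards(correct, confidence, beta=1.0):
    """Confidence-weighted rewards in the spirit of LENS.

    Correct responses get reward 1; incorrect ones get a negative
    reward scaled by the model's confidence in them, so confident
    mistakes are penalized hardest. `beta` is an assumed weight.
    """
    correct = np.asarray(correct, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    return correct - beta * (1.0 - correct) * confidence

def group_advantages(rewards):
    """GRPO-style advantages: each reward minus the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# An all-incorrect group: plain GRPO sees identical rewards and hence
# zero advantage everywhere, while LENS still produces a gradient
# signal that pushes probability away from the most confident mistakes.
conf = np.array([0.9, 0.5, 0.2, 0.1])
print(group_advantages(np.zeros(4)))                        # all zeros
print(group_advantages(lens_rewards([0, 0, 0, 0], conf)))   # nonzero
```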
[404] In-Context Learning for Non-Stationary MIMO Equalization
Jiachen Jiang, Zhen Qin, Zhihui Zhu
Main category: cs.LG
TL;DR: The paper investigates in-context learning for non-stationary channel equalization, developing attention mechanisms inspired by classical adaptive algorithms to improve adaptability in time-varying wireless channels.
Details
Motivation: Standard supervised learning requires costly retraining for new channels, while existing ICL methods only handle static channels. There's a need for ICL approaches that can adapt to non-stationary, time-varying channels at inference time.
Method: Developed a principled framework for designing efficient attention mechanisms using insights from adaptive signal processing algorithms like Least Mean Square (LMS), Least Root Mean Square (LRMS), and multi-step gradient updates for better tracking.
Result: Experimental results show ICL has strong potential for non-stationary MIMO equalization, and attention mechanisms inspired by classical adaptive algorithms significantly enhance adaptability and performance in dynamic environments.
Conclusion: ICL holds promise for non-stationary channel equalization, and attention mechanisms derived from adaptive algorithms can provide critical insights for developing next-generation wireless foundation models with improved adaptability and robustness.
Abstract: Channel equalization is fundamental for mitigating distortions such as frequency-selective fading and inter-symbol interference. Unlike standard supervised learning approaches that require costly retraining or fine-tuning for each new task, in-context learning (ICL) adapts to new channels at inference time with only a few examples. However, existing ICL-based equalizers are primarily developed for and evaluated on static channels within the context window. Indeed, to our knowledge, prior principled analyses and theoretical studies of ICL focus exclusively on the stationary setting, where the function remains fixed within the context. In this paper, we investigate the ability of ICL to address non-stationary problems through the lens of time-varying channel equalization. We employ a principled framework for designing efficient attention mechanisms with improved adaptivity in non-stationary tasks, leveraging algorithms from adaptive signal processing to guide better designs. For example, new attention variants can be derived from the Least Mean Square (LMS) adaptive algorithm, a Least Root Mean Square (LRMS) formulation for enhanced robustness, or multi-step gradient updates for improved long-term tracking. Experimental results demonstrate that ICL holds strong promise for non-stationary MIMO equalization, and that attention mechanisms inspired by classical adaptive algorithms can substantially enhance adaptability and performance in dynamic environments. Our findings may provide critical insights for developing next-generation wireless foundation models with stronger adaptability and robustness.
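For readers unfamiliar with the classical algorithms the attention variants are derived from, here is the textbook LMS adaptive equalizer on a toy BPSK channel; this is the standard algorithm, not the paper's attention mechanism.

```python
import numpy as np

def lms_equalizer(x, d, n_taps=4, mu=0.05):
    """Textbook LMS adaptive equalizer.

    x: received samples, d: desired (pilot) symbols, mu: step size.
    The per-sample gradient update is the classical rule the paper's
    attention variants draw on.
    """
    w = np.zeros(n_taps)
    y = np.zeros(len(d))
    for n in range(n_taps - 1, len(d)):
        u = x[n - n_taps + 1: n + 1][::-1]  # u[0] = x[n], newest first
        y[n] = w @ u
        err = d[n] - y[n]
        w += mu * err * u                   # LMS weight update
    return w, y

rng = np.random.default_rng(0)
symbols = rng.choice([-1.0, 1.0], size=2000)            # BPSK pilots
channel = np.array([1.0, 0.4, -0.2])                    # unknown ISI channel
received = np.convolve(symbols, channel)[: len(symbols)]
received += 0.05 * rng.normal(size=len(symbols))
w, y = lms_equalizer(received, symbols)
print(np.mean(np.sign(y[1000:]) == symbols[1000:]))     # near 1.0 once converged
```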
[405] FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms
Atul Shree, Harshith Jupuru
Main category: cs.LG
TL;DR: FLToP CTC is a novel decoding algorithm that uses frame-level token pruning with relative threshold probability to reduce computational and memory bottlenecks in CTC-based ASR systems, achieving 10.5x speedup and 2.78x memory reduction with minimal WER degradation.
Details
Motivation: CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments, with traditional CTC decoders requiring up to 90% of processing time due to exhaustive token-level operations.
Method: FLToP CTC employs frame-level token pruning guided by a relative threshold probability, dynamically eliminating low-probability tokens per frame to reduce compute and memory demands.
Result: On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders while maintaining negligible WER degradation.
Conclusion: FLToP CTC effectively addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
Abstract: CTC-based ASR systems face computational and memory bottlenecks in resource-limited environments. Traditional CTC decoders, requiring up to 90% of processing time in systems (e.g., wav2vec2-large on L4 GPUs), face inefficiencies due to exhaustive token-level operations. This paper introduces Frame Level Token Pruning for Connectionist Temporal Classification (FLToP CTC), a novel decoding algorithm that employs frame-level token pruning guided by a relative threshold probability. By dynamically eliminating low-probability tokens per frame, FLToP CTC reduces compute and memory demands while maintaining negligible WER degradation. On LibriSpeech, FLToP CTC achieves a 10.5x runtime speedup and 2.78x memory reduction versus standard CTC decoders. Its simplicity enables seamless integration into CTC decoders across platforms (CPUs, GPUs, etc.). FLToP CTC addresses CTC bottlenecks, offering scalability for resource-limited environments and realtime applications, enhancing speech recognition accessibility and efficiency.
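The pruning rule is essentially a one-liner: per frame, keep only tokens whose probability clears a fraction of that frame's best token. The threshold value below is illustrative.

```python
import numpy as np

def prune_frame_tokens(probs, rel_threshold=0.01):
    """Per frame, keep only tokens whose probability is at least
    `rel_threshold` times that frame's best token.

    probs: (frames, vocab) array of per-frame posteriors.
    Returns the surviving token ids for each frame.
    """
    keep = probs >= rel_threshold * probs.max(axis=1, keepdims=True)
    return [np.flatnonzero(m) for m in keep]

rng = np.random.default_rng(0)
logits = 3.0 * rng.normal(size=(200, 5000))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
kept = prune_frame_tokens(probs)
print(f"avg tokens kept per frame: {np.mean([len(k) for k in kept]):.1f} of 5000")
```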
[406] Enhancing Self-Supervised Learning with Semantic Pairs: A New Dataset and Empirical Study
Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong
Main category: cs.LG
TL;DR: Instance discrimination treats each dataset instance as a unique class for self-supervised learning, using stochastic transformations to create different views that help models learn invariant representations of the underlying object.
Details
Motivation: To develop self-supervised learning methods that can learn meaningful representations without manual labels by treating each data instance as its own class and learning invariance to transformations.
Method: Generate two different views of each instance using stochastic transformations, then train the model to recognize that both views belong to the same instance class.
Result: The approach enables models to learn representations that are invariant to common transformations while capturing the essential characteristics of the underlying objects.
Conclusion: Instance discrimination provides an effective self-supervised learning framework that leverages instance-level discrimination to learn robust representations without requiring labeled data.
Abstract: Instance discrimination is a self-supervised representation learning paradigm wherein individual instances within a dataset are treated as distinct classes. This is typically achieved by generating two disparate views of each instance by applying stochastic transformations, which encourages the model to learn representations that are invariant to the common underlying object across these views.
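A minimal illustration of the two-view construction in NumPy: random crop and horizontal flip stand in for the full augmentation stack used in practice.

```python
import numpy as np

def random_view(img, rng):
    """One stochastic view: random crop (3/4 of each side) plus an
    optional horizontal flip -- a tiny stand-in for the full
    augmentation stack used in practice."""
    h, w = img.shape
    top, left = rng.integers(0, h // 4 + 1), rng.integers(0, w // 4 + 1)
    crop = img[top: top + 3 * h // 4, left: left + 3 * w // 4]
    return crop[:, ::-1] if rng.random() < 0.5 else crop

rng = np.random.default_rng(0)
instance = rng.random((32, 32))  # one unlabeled image = one "class"
view_a, view_b = random_view(instance, rng), random_view(instance, rng)
# A contrastive loss would now pull the embeddings of view_a and view_b
# together while pushing them away from views of other instances.
print(view_a.shape, view_b.shape)  # (24, 24) (24, 24)
```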
[407] Counterfactually Fair Conformal Prediction
Ozgur Guldogan, Neeraj Sarna, Yuanyuan Li, Michael Berger
Main category: cs.LG
TL;DR: CF-CP extends counterfactual fairness to prediction sets using conformal prediction, ensuring fair uncertainty quantification while maintaining coverage guarantees.
Details
Motivation: Counterfactual fairness for prediction sets is underexplored, while conformal prediction provides valid prediction sets but doesn't ensure counterfactual fairness. This gap needs to be closed for fair decision-making under uncertainty.
Method: Develop Counterfactually Fair Conformal Prediction (CF-CP) through symmetrization of conformity scores across protected-attribute interventions, maintaining marginal coverage while ensuring counterfactual fairness.
Result: CF-CP achieves counterfactual fairness and target coverage rates on both synthetic and real datasets across regression and classification tasks, with minimal increase in prediction set size.
Conclusion: CF-CP provides a simple, training-free approach to counterfactually fair uncertainty quantification, bridging the gap between conformal prediction and counterfactual fairness.
Abstract: While counterfactual fairness of point predictors is well studied, its extension to prediction sets–central to fair decision-making under uncertainty–remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP) that produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP results in counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.
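The symmetrization step can be sketched directly: average each calibration sample's conformity score with that of its protected-attribute counterfactual, then calibrate as usual. The toy score and data below are assumptions for illustration; the paper's formal construction and proofs are not reproduced.

```python
import numpy as np

def symmetrized_score(score_fn, x, x_cf, y):
    """CF-CP's core move, sketched: average a sample's conformity score
    with that of its protected-attribute counterfactual, so the score
    (and hence the prediction set) is identical for the pair."""
    return 0.5 * (score_fn(x, y) + score_fn(x_cf, y))

def conformal_quantile(scores, alpha=0.1):
    """Standard split-conformal threshold at level 1 - alpha."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

# Toy absolute-residual score; x[0] is the protected attribute.
score = lambda x, y: abs(y - (2.0 * x[1] + 0.5 * x[0]))
rng = np.random.default_rng(0)
cal_scores = []
for _ in range(200):
    x = np.array([float(rng.integers(0, 2)), rng.normal()])
    x_cf = np.array([1.0 - x[0], x[1]])        # intervene on the attribute
    y = 2.0 * x[1] + 0.5 * x[0] + 0.1 * rng.normal()
    cal_scores.append(symmetrized_score(score, x, x_cf, y))
q = conformal_quantile(np.array(cal_scores))
print(f"symmetric prediction sets: point prediction +/- {q:.3f}")
```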
[408] Transmuting prompts into weights
Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo
Main category: cs.LG
TL;DR: This paper provides a theoretical foundation for controlling LLM behavior through internal state modifications, showing how prompt information is represented as weight vectors/matrices and can be condensed into reusable thought vectors/matrices.
Details
Motivation: Existing techniques for controlling LLMs through internal state modifications are guided by empirical heuristics without theoretical foundation, creating a need for principled understanding of how these interventions work.
Method: Generalize the theory of implicit weight updates to deep transformers, showing how prompt information is represented internally, and derive principled methods for condensing this into token-independent thought vectors and matrices.
Result: Developed a theoretical explanation for existing vector- and matrix-based model editing techniques, providing a computationally-grounded method for converting textual input into reusable weight updates.
Conclusion: The work establishes a theoretical foundation for LLM control interventions, explaining how they emerge from transformer computations and offering principled methods for creating reusable steering mechanisms from textual inputs.
Abstract: A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt’s influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.
[409] SHAP-Based Supervised Clustering for Sample Classification and the Generalized Waterfall Plot
Justin Lin, Julia Fukuyama
Main category: cs.LG
TL;DR: This paper proposes clustering SHAP values to group samples that received the same prediction for similar reasons, providing insights into different pathways leading to the same prediction.
Details
Motivation: Large black-box models lack explainability, making them untrustworthy for high-stakes applications. SHAP analysis can explain predictions but needs methods to group similar prediction pathways.
Method: Clustering SHAP values to identify samples with similar prediction reasoning, plus a novel generalization of waterfall plots for multi-classification.
Result: Demonstrated methodology through simulated experiments and Alzheimer’s disease case study using ADNI data.
Conclusion: Clustering SHAP values provides valuable insights into data by revealing distinct pathways through which samples arrive at the same predictions.
Abstract: In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex input-output relationships. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, we associate a SHAP value that quantifies the contribution of that feature to the prediction of that sample. Clustering these SHAP values can provide insight into the data by grouping samples that not only received the same prediction, but received the same prediction for similar reasons. In doing so, we map the various pathways through which distinct samples arrive at the same prediction. To showcase this methodology, we present a simulated experiment in addition to a case study in Alzheimer’s disease using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We also present a novel generalization of the waterfall plot for multi-classification.
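The core move, clustering rows of a SHAP matrix so that samples with the same prediction but different reasons separate, can be shown on synthetic attributions; in practice the matrix would come from an explainer such as shap.TreeExplainer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic SHAP matrix (rows: samples, columns: features) for samples
# that all received the same prediction via two different "pathways":
# feature 0 drives half of them, feature 2 the other half.
rng = np.random.default_rng(0)
pathway_a = rng.normal([2.0, 0.0, 0.0, 0.0], 0.3, size=(50, 4))
pathway_b = rng.normal([0.0, 0.0, 2.0, 0.0], 0.3, size=(50, 4))
shap_matrix = np.vstack([pathway_a, pathway_b])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(shap_matrix)
for c in (0, 1):
    mean_attr = shap_matrix[clusters == c].mean(axis=0)
    print(f"cluster {c}: dominant feature {int(np.argmax(np.abs(mean_attr)))}")
```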
[410] Faithful and Interpretable Explanations for Complex Ensemble Time Series Forecasts using Surrogate Models and Forecastability Analysis
Yikai Zhao, Jiekai Ma
Main category: cs.LG
TL;DR: A dual-approach framework that enhances interpretability of complex time series ensemble models through surrogate-based explanations and spectral predictability analysis, validated on M5 dataset.
Details
Motivation: Address the transparency and interpretability challenges of complex AutoML ensemble models like AutoGluon in time series forecasting, which deliver superior accuracy but lack explainability.
Method: 1) Surrogate-based explanation using LightGBM to mimic AutoGluon forecasts for stable SHAP feature attributions; 2) Spectral predictability analysis to quantify series forecastability by comparing against noise benchmarks; 3) Per-item normalization for meaningful SHAP explanations across heterogeneous series.
Result: Higher spectral predictability strongly correlates with improved forecast accuracy and higher surrogate fidelity. Feature injection experiments showed high faithfulness between SHAP values and ground truth effects. Forecastability metrics serve as effective filtering mechanisms and confidence scores.
Conclusion: The framework provides interpretable instance-level explanations for ensemble forecasts while equipping users with forecastability metrics as reliability indicators for both predictions and their explanations.
Abstract: Modern time series forecasting increasingly relies on complex ensemble models generated by AutoML systems like AutoGluon, delivering superior accuracy but with significant costs to transparency and interpretability. This paper introduces a comprehensive, dual-approach framework that addresses both the explainability and forecastability challenges in complex time series ensembles. First, we develop a surrogate-based explanation methodology that bridges the accuracy-interpretability gap by training a LightGBM model to faithfully mimic AutoGluon’s time series forecasts, enabling stable SHAP-based feature attributions. We rigorously validated this approach through feature injection experiments, demonstrating remarkably high faithfulness between extracted SHAP values and known ground truth effects. Second, we integrated spectral predictability analysis to quantify each series’ inherent forecastability. By comparing each time series’ spectral predictability to its pure noise benchmarks, we established an objective mechanism to gauge confidence in forecasts and their explanations. Our empirical evaluation on the M5 dataset found that higher spectral predictability strongly correlates not only with improved forecast accuracy but also with higher fidelity between the surrogate and the original forecasting model. These forecastability metrics serve as effective filtering mechanisms and confidence scores, enabling users to calibrate their trust in both the forecasts and their explanations. We further demonstrated that per-item normalization is essential for generating meaningful SHAP explanations across heterogeneous time series with varying scales. The resulting framework delivers interpretable, instance-level explanations for state-of-the-art ensemble forecasts, while equipping users with forecastability metrics that serve as reliability indicators for both predictions and their explanations.
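One simple way to realize a spectral-predictability score is one minus the normalized spectral entropy: flat, noise-like spectra score near zero while concentrated, seasonal spectra score high. The formula below is an assumption for illustration; the paper's exact metric and noise benchmarking may differ.

```python
import numpy as np

def spectral_predictability(series):
    """One minus normalized spectral entropy: near 0 for a flat
    (noise-like) spectrum, approaching 1 for a pure tone."""
    x = np.asarray(series, dtype=float)
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2
    p = psd / psd.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return 1.0 - entropy / np.log(len(p))

rng = np.random.default_rng(0)
t = np.arange(365)
seasonal = np.sin(2 * np.pi * t / 7) + 0.3 * rng.normal(size=365)
noise = rng.normal(size=365)
print(spectral_predictability(seasonal))  # high: forecasts worth trusting
print(spectral_predictability(noise))     # low: barely forecastable
```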
[411] Graph Diffusion Transformers are In-Context Molecular Designers
Gang Liu, Jie Chen, Yihan Zhu, Michael Sun, Tengfei Luo, Nitesh V Chawla, Meng Jiang
Main category: cs.LG
TL;DR: DemoDiff introduces demonstration-conditioned diffusion models for molecular design using few-shot examples instead of text, achieving state-of-the-art performance across 33 tasks while being 100-1000x smaller than comparable language models.
Details
Motivation: To overcome the limitations of in-context learning in molecular design where labeled data for each property is scarce despite large databases like ChEMBL containing millions of biological assays.
Method: Uses demonstration-conditioned diffusion models with a novel molecular tokenizer (Node Pair Encoding) that represents molecules at motif level, requiring 5.5x fewer nodes. Pretrained on curated dataset with millions of context tasks using 0.7-billion-parameter model.
Result: Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000x larger and achieves average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches.
Conclusion: DemoDiff positions itself as a molecular foundation model for in-context molecular design, demonstrating superior performance with significantly smaller model size compared to existing approaches.
Abstract: In-context learning allows large models to adapt to new tasks from a few demonstrations, but it has shown limited success in molecular design. Existing databases such as ChEMBL contain molecular properties spanning millions of biological assays, yet labeled data for each property remain scarce. To address this limitation, we introduce demonstration-conditioned diffusion models (DemoDiff), which define task contexts using a small set of molecule-score examples instead of text descriptions. These demonstrations guide a denoising Transformer to generate molecules aligned with target properties. For scalable pretraining, we develop a new molecular tokenizer with Node Pair Encoding that represents molecules at the motif level, requiring 5.5$\times$ fewer nodes. We curate a dataset containing millions of context tasks from multiple sources covering both drugs and materials, and pretrain a 0.7-billion-parameter model on it. Across 33 design tasks in six categories, DemoDiff matches or surpasses language models 100-1000$\times$ larger and achieves an average rank of 3.63 compared to 5.25-10.20 for domain-specific approaches. These results position DemoDiff as a molecular foundation model for in-context molecular design. Our code is available at https://github.com/liugangcode/DemoDiff.
[412] RFOD: Random Forest-based Outlier Detection for Tabular Data
Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang
Main category: cs.LG
TL;DR: RFOD is a novel Random Forest-based outlier detection framework for tabular data that handles mixed data types through feature-wise conditional reconstruction and provides interpretable cell-level anomaly scores.
Details
Motivation: Existing outlier detection methods struggle with mixed-type tabular data, lose semantic information through encoding, and lack interpretability, making it difficult to identify which specific values cause anomalies in high-stakes domains like cybersecurity and healthcare.
Method: RFOD reframes anomaly detection as feature-wise conditional reconstruction, training dedicated random forests for each feature conditioned on others. It uses Adjusted Gower’s Distance for cell-level scoring and Uncertainty-Weighted Averaging to aggregate scores into row-level anomaly scores.
Result: Extensive experiments on 15 real-world datasets show RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability.
Conclusion: RFOD provides an effective solution for outlier detection in mixed-type tabular data that preserves semantic integrity, handles heterogeneous data types, and offers interpretable anomaly explanations.
Abstract: Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce \textsf{\textbf{RFOD}}, a novel \textsf{\textbf{R}}andom \textsf{\textbf{F}}orest-based \textsf{\textbf{O}}utlier \textsf{\textbf{D}}etection framework tailored for tabular data. Rather than modeling a global joint distribution, \textsf{RFOD} reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, \textsf{RFOD} combines Adjusted Gower’s Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that \textsf{RFOD} consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
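The feature-wise conditional reconstruction loop is straightforward to sketch with scikit-learn. Plain absolute error stands in for Adjusted Gower's Distance and a simple mean for Uncertainty-Weighted Averaging, so this captures the structure rather than the paper's exact scoring.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rfod_scores(X, n_trees=50):
    """Feature-wise conditional reconstruction, the structural core of RFOD.

    One forest per feature predicts that feature from all the others;
    large reconstruction errors mark anomalous cells. min_samples_leaf
    keeps the in-sample fit from memorizing the outlier itself.
    """
    n, d = X.shape
    cells = np.zeros((n, d))
    for j in range(d):
        rest = np.delete(X, j, axis=1)
        rf = RandomForestRegressor(n_estimators=n_trees,
                                   min_samples_leaf=10, random_state=0)
        rf.fit(rest, X[:, j])
        cells[:, j] = np.abs(rf.predict(rest) - X[:, j])
    return cells, cells.mean(axis=1)  # cell-level and row-level scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)  # feature 3 depends on 0
X[0, 3] = 10.0                                  # corrupt a single cell
cells, rows = rfod_scores(X)
print(int(np.argmax(rows)), int(np.argmax(cells[0])))  # row 0, cell 3 flagged
```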
[413] Conformal Risk Training: End-to-End Optimization of Conformal Risk Control
Christopher Yeh, Nicolas Christianson, Adam Wierman, Yisong Yue
Main category: cs.LG
TL;DR: A method for controlling tail risks in deep learning models using conformal risk training, which provides provable guarantees for risk measures beyond expected loss and improves average-case performance over post-hoc approaches.
Details
Motivation: Deep learning models lack provable risk guarantees, which is critical for high-stakes applications. Existing conformal risk control methods only handle expected loss and can degrade performance due to lack of feedback during training.
Method: Developed conformal risk training, an end-to-end approach that differentiates through conformal Optimized Certainty-Equivalent (OCE) risk control during model training or fine-tuning, allowing control of tail risks like CVaR.
Result: Achieves provable risk guarantees while significantly improving average-case performance over post-hoc approaches, demonstrated on applications including classifier false negative rate control and financial risk in battery storage.
Conclusion: The proposed conformal risk training method provides a practical solution for controlling tail risks in deep learning with provable guarantees and improved performance compared to standard post-hoc conformal risk control.
Abstract: While deep learning models often achieve high predictive accuracy, their predictions typically do not come with any provable guarantees on risk or reliability, which are critical for deployment in high-stakes applications. The framework of conformal risk control (CRC) provides a distribution-free, finite-sample method for controlling the expected value of any bounded monotone loss function and can be conveniently applied post-hoc to any pre-trained deep learning model. However, many real-world applications are sensitive to tail risks, as opposed to just expected loss. In this work, we develop a method for controlling the general class of Optimized Certainty-Equivalent (OCE) risks, a broad class of risk measures which includes as special cases the expected loss (generalizing the original CRC method) and common tail risks like the conditional value-at-risk (CVaR). Furthermore, standard post-hoc CRC can degrade average-case performance due to its lack of feedback to the model. To address this, we introduce “conformal risk training,” an end-to-end approach that differentiates through conformal OCE risk control during model training or fine-tuning. Our method achieves provable risk guarantees while demonstrating significantly improved average-case performance over post-hoc approaches on applications to controlling classifiers’ false negative rate and controlling financial risk in battery storage operation.
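As background for what the end-to-end method differentiates through, here is the standard post-hoc CRC step for a bounded monotone loss: pick the smallest threshold whose finite-sample-adjusted empirical risk stays below the target level. The false-negative-rate example and data are illustrative.

```python
import numpy as np

def crc_threshold(loss_fn, lambdas, cal_data, alpha=0.1):
    """Post-hoc conformal risk control for a loss bounded by 1 and
    non-increasing in lambda: return the smallest lambda whose
    finite-sample-adjusted empirical risk stays below alpha."""
    n = len(cal_data)
    for lam in sorted(lambdas):
        risk = np.mean([loss_fn(z, lam) for z in cal_data])
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return lam
    return max(lambdas)

# Example: control a classifier's false negative rate. Each calibration
# point is (score, label); predicting positive when score >= 1 - lambda
# makes the FNR non-increasing in lambda.
rng = np.random.default_rng(0)
pos_scores = np.clip(rng.normal(0.7, 0.2, size=500), 0.0, 1.0)
cal = [(s, 1) for s in pos_scores]
fnr = lambda z, lam: float(z[1] == 1 and z[0] < 1.0 - lam)
lam_hat = crc_threshold(fnr, np.linspace(0, 1, 101), cal, alpha=0.1)
print(f"predict positive whenever score >= {1.0 - lam_hat:.2f}")
```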
[414] Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning
Tinnakit Udsa, Can Udomcharoenchaikit, Patomporn Payoungkhamdee, Sarana Nutanong, Norrathep Rattanavipanon
Main category: cs.LG
TL;DR: This paper proposes a framework to quantify fine-grained cross-sample memorization in federated learning, addressing limitations of existing methods that only detect single-sample memorization.
Details
Motivation: Existing FL memorization detection techniques focus on one sample at a time and underestimate cross-sample memorization risks, while centralized learning methods cannot be directly applied to FL due to data access constraints.
Method: The authors develop a framework that quantifies both intra- and inter-client memorization using fine-grained cross-sample measurement across all clients in FL settings.
Result: FL models do memorize client data, with intra-client data being memorized more than inter-client data. Memorization is influenced by decoding strategies, prefix length, and FL algorithms.
Conclusion: The proposed framework successfully bridges the gap between centralized and federated learning memorization analysis, revealing subtle memorization patterns and key influencing factors in FL settings.
Abstract: Federated learning (FL) enables collaborative training without raw data sharing, but still risks training data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data, more than inter-client data, with memorization influenced by training and inferencing factors.
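As a rough illustration of fine-grained cross-sample measurement, the sketch below scores how much the model's continuation of one sample's prefix reproduces the suffix of every other sample; `model_generate`, the character-level prefix length, and the string-similarity choice are stand-ins, not the paper's actual metric.

```python
import difflib

def memorization_matrix(model_generate, samples, prefix_len=32):
    """scores[i][j]: similarity between the continuation generated from
    sample i's prefix and sample j's true suffix. The diagonal captures
    classic single-sample memorization; off-diagonal entries capture
    cross-sample memorization. `model_generate(prefix)` is assumed to
    return a generated string."""
    n = len(samples)
    scores = [[0.0] * n for _ in range(n)]
    for i, s_i in enumerate(samples):
        gen = model_generate(s_i[:prefix_len])
        for j, s_j in enumerate(samples):
            ref = s_j[prefix_len:prefix_len + len(gen)]
            scores[i][j] = difflib.SequenceMatcher(None, gen, ref).ratio()
    return scores
```

Averaging the off-diagonal entries within a client's own block versus across client blocks then separates intra-client from inter-client memorization.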
[415] LOTION: Smoothing the Optimization Landscape for Quantized Training
Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade
Main category: cs.LG
TL;DR: LOTION is a principled smoothing framework that optimizes neural networks for quantized objectives by approximating the quantized loss surface with its expectation under unbiased randomized-rounding noise, enabling guaranteed convergence to local minima while preserving global minima.
Details
Motivation: Quantized optimization is challenging because piece-wise constant quantizers have zero gradients everywhere except at thresholds, where derivatives are undefined. Existing methods like STE lack convergence guarantees.
Method: Introduces the LOTION framework, which replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise, inspired by Nesterov smoothing. Noise derived from stochastic rounding preserves the global minima.
Result: Empirically outperforms standard QAT on synthetic testbeds and 150M- and 300M-parameter language models. Provides guaranteed convergence to local minima.
Conclusion: LOTION offers a principled smoothing approach for quantized optimization with convergence guarantees and preservation of global minima, demonstrating superior performance over standard methods.
Abstract: Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piece-wise constant, yielding zero gradients everywhere except at quantization thresholds where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like Straight Through Estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, \textbf{L}ow-precision \textbf{O}ptimization via s\textbf{T}ochastic-no\textbf{I}se sm\textbf{O}othi\textbf{N}g, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
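The smoothing primitive is easy to state: unbiased stochastic rounding rounds a weight up with probability equal to its fractional position on the grid, so the rounded value equals the weight in expectation. Below is a minimal, forward-only Monte Carlo sketch of evaluating such a smoothed loss on a uniform grid; the paper's actual gradient estimator and framework details are not reproduced here, and the grid spacing and toy loss are assumptions.

```python
import torch

def stochastic_round(w: torch.Tensor, step: float) -> torch.Tensor:
    """Unbiased randomized rounding to a uniform grid of spacing `step`:
    round up with probability equal to the fractional part, so that
    E[stochastic_round(w)] == w."""
    scaled = w / step
    lo = torch.floor(scaled)
    return (lo + torch.bernoulli(scaled - lo)) * step

@torch.no_grad()
def smoothed_loss(loss_fn, w: torch.Tensor, step: float, n_samples: int = 8):
    """Monte Carlo estimate of the smoothed objective E[loss(Q(w))],
    where the expectation is over the rounding noise."""
    return torch.stack(
        [loss_fn(stochastic_round(w, step)) for _ in range(n_samples)]
    ).mean()

# Toy check: the smoothed loss varies continuously with w even though
# deterministic rounding of w is piece-wise constant.
w = torch.tensor([0.30, 0.72])
print(smoothed_loss(lambda q: (q - 0.5).pow(2).sum(), w, step=0.25))
```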
[416] Spatial Deconfounder: Interference-Aware Deconfounding for Spatial Causal Inference
Ayush Khot, Miruna Oprescu, Maresa Schröder, Ai Kagawa, Xihaier Luo
Main category: cs.LG
TL;DR: Spatial Deconfounder: A method that addresses both unmeasured spatial confounding and treatment interference by reconstructing latent confounders from local treatment vectors using CVAE with spatial prior, enabling nonparametric identification of direct and spillover effects.
Details
Motivation: Spatial causal inference faces two intertwined challenges: unmeasured spatial confounders and treatment interference. Existing methods typically address one while ignoring the other, but these issues are deeply connected as interference reveals structure in latent confounders.
Method: Two-stage approach: (1) reconstruct a substitute confounder from local treatment vectors using a conditional variational autoencoder (CVAE) with a spatial prior; (2) estimate causal effects via a flexible outcome model. Extends the SpaCE benchmark suite to include treatment interference.
Result: Consistently improves effect estimation across real-world datasets in environmental health and social science. Enables nonparametric identification of both direct and spillover effects under weak assumptions without requiring multiple treatment types or known model of latent field.
Conclusion: By turning interference into a multi-cause signal, the framework bridges spatial and deconfounding literatures to advance robust causal inference in structured data, addressing both spatial confounding and interference simultaneously.
Abstract: Causal inference in spatial domains faces two intertwined challenges: (1) unmeasured spatial factors, such as weather, air pollution, or mobility, that confound treatment and outcome, and (2) interference from nearby treatments that violate standard no-interference assumptions. While existing methods typically address one by assuming away the other, we show they are deeply connected: interference reveals structure in the latent confounder. Leveraging this insight, we propose the Spatial Deconfounder, a two-stage method that reconstructs a substitute confounder from local treatment vectors using a conditional variational autoencoder (CVAE) with a spatial prior, then estimates causal effects via a flexible outcome model. We show that this approach enables nonparametric identification of both direct and spillover effects under weak assumptions–without requiring multiple treatment types or a known model of the latent field. Empirically, we extend SpaCE, a benchmark suite for spatial confounding, to include treatment interference, and show that the Spatial Deconfounder consistently improves effect estimation across real-world datasets in environmental health and social science. By turning interference into a multi-cause signal, our framework bridges spatial and deconfounding literatures to advance robust causal inference in structured data.
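A bare-bones PyTorch sketch of the stage-1 idea: encode a site's local treatment vector (its own treatment plus those of its k spatial neighbours) into a low-dimensional latent that serves as the substitute confounder. The standard-normal prior, omitted covariate conditioning, and layer sizes are all placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TreatmentVAE(nn.Module):
    """Stage 1 (sketch): reconstruct the local treatment vector through a
    latent z_hat, which is later used as a substitute confounder."""
    def __init__(self, k_neighbors: int, latent_dim: int = 2):
        super().__init__()
        d = k_neighbors + 1   # own treatment + k neighbours' treatments
        self.enc = nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                                 nn.Linear(32, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                 nn.Linear(32, d))

    def forward(self, t_local):
        mu, logvar = self.enc(t_local).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(z), mu, logvar

# Stage 2 (sketch): plug the inferred z_hat into any flexible outcome model,
# e.g. y ~ g(treatment, covariates, z_hat), to estimate direct and spillover effects.
```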
[417] Reinforcement Learning-Based Optimization of CT Acquisition and Reconstruction Parameters Through Virtual Imaging Trials
David Fenwick, Navid NaderiAlizadeh, Vahid Tarokh, Nicholas Felice, Darin Clark, Jayasai Rajagopal, Anuj Kapadia, Benjamin Wildman-Tobriner, Ehsan Samei, Ehsan Abadi
Main category: cs.LG
TL;DR: Reinforcement learning combined with virtual imaging tools can optimize CT protocols more efficiently than exhaustive search, achieving the same image quality with 79.7% fewer computational steps.
Details
Motivation: Traditional CT protocol optimization relies on exhaustive testing of parameter combinations, which is impractical due to complex interdependencies among acquisition and reconstruction parameters.
Method: Used virtual imaging tools with human models containing liver lesions, combined with Proximal Policy Optimization (PPO) reinforcement learning to optimize parameters including tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size.
Result: The reinforcement learning approach achieved the global maximum detectability index (d’) across test cases while requiring 79.7% fewer steps than exhaustive search on a supercomputer.
Conclusion: The framework demonstrates both accuracy and computational efficiency, highlighting the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.
Abstract: Protocol optimization is critical in Computed Tomography (CT) to achieve high diagnostic image quality while minimizing radiation dose. However, due to the complex interdependencies among CT acquisition and reconstruction parameters, traditional optimization methods rely on exhaustive testing of combinations of these parameters, which is often impractical. This study introduces a novel methodology that combines virtual imaging tools with reinforcement learning to optimize CT protocols more efficiently. Human models with liver lesions were imaged using a validated CT simulator and reconstructed with a novel CT reconstruction toolkit. The optimization parameter space included tube voltage, tube current, reconstruction kernel, slice thickness, and pixel size. The optimization process was performed using a Proximal Policy Optimization (PPO) agent, which was trained to maximize an image quality objective, specifically the detectability index (d’) of liver lesions in the reconstructed images. Optimization performance was compared against an exhaustive search performed on a supercomputer. The proposed reinforcement learning approach achieved the global maximum d’ across test cases while requiring 79.7% fewer steps than the exhaustive search, demonstrating both accuracy and computational efficiency. The proposed framework is flexible and can accommodate various image quality objectives. The findings highlight the potential of integrating virtual imaging tools with reinforcement learning for CT protocol management.
[418] Zero-Shot Policy Transfer in Reinforcement Learning using Buckingham’s Pi Theorem
Francisco Pascoa, Ian Lalonde, Alexandre Girard
Main category: cs.LG
TL;DR: A zero-shot transfer method using Buckingham’s Pi Theorem enables RL policies to generalize to new robots, tasks, or environments by scaling inputs and outputs through dimensionless space without retraining.
Details
Motivation: RL policies often fail to generalize to new robots, tasks, or environments with different physical parameters, limiting real-world applicability.
Method: Based on Buckingham’s Pi Theorem, the method adapts pre-trained policies by scaling observations and actions through dimensionless space, requiring no retraining.
Result: The scaled transfer shows no performance loss on dynamically similar contexts and consistently outperforms naive transfer on non-similar contexts, significantly expanding effective policy contexts.
Conclusion: Dimensional analysis provides a powerful and practical tool to enhance RL policy robustness and generalization.
Abstract: Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham’s Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
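For the pendulum case the scaling is easy to write down: the angle is already dimensionless, angular velocity scales with the natural frequency sqrt(g/l), and torque with the characteristic torque m*g*l. A sketch of wrapping a trained policy this way; the context dictionaries and torque-limited-pendulum setup are illustrative, not the paper's code.

```python
import numpy as np

def pi_transfer(policy, src, dst):
    """Zero-shot transfer of a pendulum policy between contexts (dicts with
    mass m, length l, gravity g) by routing observations and actions
    through dimensionless space."""
    w_src = np.sqrt(src["g"] / src["l"])        # natural frequency, source
    w_dst = np.sqrt(dst["g"] / dst["l"])        # natural frequency, target
    tau_src = src["m"] * src["g"] * src["l"]    # characteristic torque, source
    tau_dst = dst["m"] * dst["g"] * dst["l"]    # characteristic torque, target

    def scaled_policy(obs):
        theta, theta_dot = obs
        # nondimensionalize the target observation, re-dimensionalize for
        # the source context the policy was trained on
        obs_as_src = np.array([theta, theta_dot / w_dst * w_src])
        # the policy outputs a source-context torque; rescale to the target
        return policy(obs_as_src) * tau_dst / tau_src
    return scaled_policy
```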
[419] Struc-EMB: The Potential of Structure-Aware Encoding in Language Embeddings
Shikun Liu, Haoyu Wang, Mufei Li, Pan Li
Main category: cs.LG
TL;DR: This paper introduces structure-aware text embeddings that integrate structural relations (like hyperlinks/citations) directly into LLM encoding, outperforming text-only and post-hoc methods across multiple tasks.
Details
Motivation: Current LLM text embeddings overlook rich structural information (hyperlinks, citations) that provides crucial context in real-world datasets, limiting their effectiveness.
Method: Two in-process methods: sequential concatenation (for noisy, moderate-length contexts) and parallel caching (for long, high-signal contexts), plus Context Distillation and Semantic Balancing techniques for handling noisy structural data.
Result: Structure-aware approaches consistently outperform text-only and post-hoc baselines in zero-shot experiments across retrieval, clustering, classification, and recommendation tasks.
Conclusion: This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models with identified trade-offs between different methods.
Abstract: Text embeddings from Large Language Models (LLMs) have become foundational for numerous applications. However, these models typically operate on raw text, overlooking the rich structural information, such as hyperlinks or citations, that provides crucial context in many real-world datasets. This paper introduces and systematically evaluates a new paradigm for generating structure-aware text embeddings by integrating these structural relations directly into the LLM’s internal encoding process, rather than relying on traditional post-hoc aggregation. We investigate two primary in-process methods: sequential concatenation and parallel caching. Through extensive zero-shot experiments across retrieval, clustering, classification, and recommendation tasks, we demonstrate that our structure-aware approaches consistently outperform both text-only and post-hoc baselines. Our analysis reveals critical trade-offs: sequential concatenation excels with noisy, moderate-length contexts, while parallel caching scales more effectively to long, high-signal contexts but is more susceptible to distractors. To address the challenge of noisy structural data, we also introduce and validate two effective techniques: Context Distillation and Semantic Balancing. This work provides the first comprehensive analysis of in-process structure-aware encoding, offering a blueprint for building more powerful and contextually aware embedding models.
[420] Guiding Exploration in Reinforcement Learning Through LLM-Augmented Observations
Vaibhav Jain, Gerrit Grossmann
Main category: cs.LG
TL;DR: A framework that uses LLM-generated action recommendations through augmented observation spaces to guide RL exploration in sparse-reward environments, achieving significant performance improvements without algorithm modifications.
Details
Motivation: RL agents struggle in sparse-reward environments where traditional exploration fails. LLMs possess procedural knowledge that could guide exploration, but existing approaches create rigid dependencies.
Method: Provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore guidance while maintaining flexibility through soft constraints.
Result: 71% relative improvement in final success rates in the most challenging environment, with agents reaching performance thresholds up to 9 times faster and substantial sample efficiency gains.
Conclusion: The approach effectively leverages LLM planning capabilities to accelerate RL training in challenging environments while requiring no modifications to existing RL algorithms.
Abstract: Reinforcement Learning (RL) agents often struggle in sparse-reward environments where traditional exploration strategies fail to discover effective action sequences. Large Language Models (LLMs) possess procedural knowledge and reasoning capabilities from text pretraining that could guide RL exploration, but existing approaches create rigid dependencies where RL policies must follow LLM suggestions or incorporate them directly into reward functions. We propose a framework that provides LLM-generated action recommendations through augmented observation spaces, allowing RL agents to learn when to follow or ignore this guidance. Our method leverages LLMs’ world knowledge and reasoning abilities while maintaining flexibility through soft constraints. We evaluate our approach on three BabyAI environments of increasing complexity and show that the benefits of LLM guidance scale with task difficulty. In the most challenging environment, we achieve 71% relative improvement in final success rates over baseline. The approach provides substantial sample efficiency gains, with agents reaching performance thresholds up to 9 times faster, and requires no modifications to existing RL algorithms. Our results demonstrate an effective method for leveraging LLM planning capabilities to accelerate RL training in challenging environments.
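One way to realize "guidance through the observation space" is an observation wrapper that appends a one-hot of the recommended action, so the policy can learn to weight or ignore it. A Gymnasium-style sketch, assuming a flat Box observation and a discrete action space (BabyAI's actual observations are dict-structured); `advisor` is a stand-in for the LLM call.

```python
import numpy as np
import gymnasium as gym

class LLMAdviceWrapper(gym.ObservationWrapper):
    """Appends a one-hot encoding of an advisor-recommended action to each
    observation; the RL agent remains free to follow or ignore the advice."""
    def __init__(self, env, advisor):
        super().__init__(env)
        self.advisor = advisor                   # advisor(obs) -> action index
        n = env.action_space.n
        low = np.concatenate([env.observation_space.low, np.zeros(n)])
        high = np.concatenate([env.observation_space.high, np.ones(n)])
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def observation(self, obs):
        advice = np.zeros(self.env.action_space.n, dtype=np.float32)
        advice[self.advisor(obs)] = 1.0
        return np.concatenate([obs.astype(np.float32), advice])
```

Because the advice enters only as extra input features, any existing RL algorithm trains on the wrapped environment unchanged.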
[421] Weights initialization of neural networks for function approximation
Xinwen Hu, Yunqing Huang, Nianyu Yi, Peimeng Yin
Main category: cs.LG
TL;DR: A reusable initialization framework using basis function pretraining to improve neural network training efficiency, generalization, and transferability for function approximation.
Details
Motivation: To address three challenges in neural network training: the need for a new model for each target function, sensitivity to architecture and hyperparameters, and poor generalization beyond training domains.
Method: Pretrain basis neural networks on polynomial families in a reference domain, then use the learned parameters to initialize networks for complex functions. Includes a domain mapping that transforms inputs to the reference domain.
Result: Extensive experiments in 1D and 2D settings show substantial improvements in training efficiency, generalization, and model transferability.
Conclusion: Initialization-based strategies show promise for scalable and modular neural function approximation, with code made publicly available.
Abstract: Neural network-based function approximation plays a pivotal role in the advancement of scientific computing and machine learning. Yet, training such models faces several challenges: (i) each target function often requires training a new model from scratch; (ii) performance is highly sensitive to architectural and hyperparameter choices; and (iii) models frequently generalize poorly beyond the training domain. To overcome these challenges, we propose a reusable initialization framework based on basis function pretraining. In this approach, basis neural networks are first trained to approximate families of polynomials on a reference domain. Their learned parameters are then used to initialize networks for more complex target functions. To enhance adaptability across arbitrary domains, we further introduce a domain mapping mechanism that transforms inputs into the reference domain, thereby preserving structural correspondence with the pretrained models. Extensive numerical experiments in one- and two-dimensional settings demonstrate substantial improvements in training efficiency, generalization, and model transferability, highlighting the promise of initialization-based strategies for scalable and modular neural function approximation. The full code is made publicly available on Gitee.
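A toy version of the two stages, assuming a small MLP, a fixed cubic target standing in for the polynomial family, and the affine map from an arbitrary interval [a, b] onto the reference domain [-1, 1]; all of these choices are illustrative.

```python
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

# Stage 1: pretrain a basis network on a polynomial over the reference domain.
basis = make_net()
opt = torch.optim.Adam(basis.parameters(), lr=1e-3)
x_ref = torch.linspace(-1, 1, 256).unsqueeze(1)
for _ in range(2000):
    loss = (basis(x_ref) - x_ref**3).pow(2).mean()    # cubic basis target
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: reuse the pretrained weights as initialization for a new target
# function on [a, b], mapping inputs back to the reference domain first.
a, b = 0.0, 5.0
net = make_net()
net.load_state_dict(basis.state_dict())               # reusable initialization
to_ref = lambda t: 2.0 * (t - a) / (b - a) - 1.0      # affine map [a,b] -> [-1,1]
x_new = torch.linspace(a, b, 256).unsqueeze(1)
y_pred = net(to_ref(x_new))          # fine-tune on the new target from here
```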
[422] Deceptive Exploration in Multi-armed Bandits
I. Arda Vurankaya, Mustafa O. Karabag, Wesley A. Suttle, Jesse Milzman, David Fridovich-Keil, Ufuk Topcu
Main category: cs.LG
TL;DR: This paper studies deceptive multi-armed bandits where an agent aims to identify the best private arm while appearing to follow Thompson Sampling based on public rewards, under KL divergence detectability constraints.
Details
Motivation: To model scenarios where agents have private objectives but must maintain plausible deniability by appearing to follow expected public behavior, such as in strategic decision-making or privacy-preserving systems.
Method: Formalizes detectability using stepwise KL divergence constraints, models public suboptimal arm pulls as Bernoulli processes with decreasing success probabilities, and proposes a maximin problem to characterize optimal error exponents for best private arm identification.
Result: Shows that deceptive pulls of public suboptimal arms can occur at most at Θ(√T) rate under KL constraints, and develops an algorithm that adapts exploration based on public suboptimality gaps.
Conclusion: The proposed algorithm effectively balances private arm identification with detectability constraints, achieving the theoretical Θ(√T) rate while maintaining plausible deniability.
Abstract: We consider a multi-armed bandit setting in which each arm has a public and a private reward distribution. An observer expects an agent to follow Thompson Sampling according to the public rewards; however, the deceptive agent aims to quickly identify the best private arm without being noticed. The observer can observe the public rewards and the pulled arms, but not the private rewards. The agent, on the other hand, observes both the public and private rewards. We formalize detectability as a stepwise Kullback-Leibler (KL) divergence constraint between the actual pull probabilities used by the agent and the anticipated pull probabilities by the observer. We model successful pulling of public suboptimal arms as a Bernoulli process where the success probability decreases with each successful pull, and show these pulls can happen at most at a $\Theta(\sqrt{T})$ rate under the KL constraint. We then formulate a maximin problem based on public and private means, whose solution characterizes the optimal error exponent for best private arm identification. We finally propose an algorithm inspired by top-two algorithms. This algorithm naturally adapts its exploration according to the hardness of pulling arms based on the public suboptimality gaps. We provide numerical examples illustrating the $\Theta(\sqrt{T})$ rate and the behavior of the proposed algorithm.
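In symbols, the stepwise detectability constraint compares the agent's actual pull distribution with the one the observer anticipates from Thompson Sampling on the public rewards (the notation below is illustrative):

```latex
% p_t: agent's actual arm-pull distribution at round t
% q_t: pull distribution anticipated by the observer (Thompson Sampling
%      on the public rewards); \epsilon: detectability budget
D_{\mathrm{KL}}\!\left(p_t \,\middle\|\, q_t\right)
  = \sum_{a} p_t(a)\,\log\frac{p_t(a)}{q_t(a)} \;\le\; \epsilon,
  \qquad t = 1, \dots, T.
```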
[423] PO-CKAN:Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure
Junyi Wu, Guang Lin
Main category: cs.LG
TL;DR: PO-CKAN is a physics-informed deep operator framework using Chunkwise Rational Kolmogorov-Arnold Networks (CKANs) for learning solution operators of parametric PDEs, achieving improved accuracy over existing methods.
Details
Motivation: To develop an efficient framework for learning physically consistent solution operators of parametric time-dependent PDEs with varying inputs, addressing limitations in existing operator learning methods.
Method: Combines DeepONet architecture with CKAN sub-networks and integrates PINN principles to enforce physical consistency through PDE residual loss, using a branch-trunk structure with rational KAN modules.
Result: On Burgers’ equation with ν=0.01, PO-CKAN reduces mean relative L² error by ~48% compared to PI-DeepONet, and achieves competitive accuracy on Eikonal and diffusion-reaction benchmarks.
Conclusion: PO-CKAN provides an effective framework for accurate and physically consistent operator learning of parametric PDEs, demonstrating significant improvements over existing approaches.
Abstract: We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov–Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov–Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch–trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers’ equation with $\nu=0.01$, PO-CKAN reduces the mean relative $L^2$ error by approximately 48% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion–reaction benchmarks.
[424] TAPAS: Datasets for Learning the Learning with Errors Problem
Eshika Saxena, Alberto Alfarano, François Charton, Emily Wenger, Kristin Lauter
Main category: cs.LG
TL;DR: The TAPAS datasets provide accessible LWE data for AI research on post-quantum cryptography attacks, addressing the current data scarcity issue.
Details
Motivation: AI attacks on LWE show promise but are limited by scarce, hard-to-create training data that requires domain expertise and significant compute resources.
Method: Created the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems, covering multiple LWE settings for off-the-shelf use by AI practitioners.
Result: Established attack performance baselines and provided ready-to-use datasets that enable prototyping of new LWE attack approaches without data creation overhead.
Conclusion: TAPAS accelerates AI research on LWE attacks by filling the data gap and provides foundation for future work in this domain.
Abstract: AI-powered attacks on Learning with Errors (LWE), an important hard math problem in post-quantum cryptography, rival or outperform “classical” attacks on LWE under certain parameter settings. Despite the promise of this approach, a dearth of accessible data limits AI practitioners’ ability to study and improve these attacks. Creating LWE data for AI model training is time- and compute-intensive and requires significant domain expertise. To fill this gap and accelerate AI research on LWE attacks, we propose the TAPAS datasets, a Toolkit for Analysis of Post-quantum cryptography using AI Systems. These datasets cover several LWE settings and can be used off-the-shelf by AI practitioners to prototype new approaches to cracking LWE. This work documents TAPAS dataset creation, establishes attack performance baselines, and lays out directions for future work.
[425] Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops
S M Rafiuddin
Main category: cs.LG
TL;DR: Edu-EmotionNet is a novel framework for robust emotion recognition in online education that dynamically models temporal emotion evolution and modality reliability through cross-modality attention, confidence-based weighting, and temporal feedback loops.
Details
Motivation: Existing emotion recognition methods use static fusion strategies and assume consistent modality reliability, which doesn't reflect real-world learning environments where modalities can be missing or noisy.
Method: Three key components: Cross-Modality Attention Alignment for dynamic context sharing, Modality Importance Estimator for confidence-based weighting per time step, and Temporal Feedback Loop for temporal consistency using previous predictions.
Result: Achieves state-of-the-art performance on educational subsets of IEMOCAP and MOSEI datasets, shows strong robustness to missing/noisy modalities, and captures emotional transitions while adaptively prioritizing reliable signals.
Conclusion: The framework is well-suited for deployment in real-time learning systems due to its robustness and ability to handle real-world educational scenarios with unreliable modality inputs.
Abstract: Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems.
[426] TinyGraphEstimator: Adapting Lightweight Language Models for Graph Structure Inference
Michal Podstawski
Main category: cs.LG
TL;DR: Small transformer models can infer graph parameters from textual representations, achieving improved performance through LoRA fine-tuning on the new TinyGraphEstimator dataset.
Details
Motivation: To explore whether compact, resource-efficient language models can perform structural reasoning on graph data, as this capability remains largely unexplored compared to larger models.
Method: Created the TinyGraphEstimator dataset with connected graphs and structural metadata, evaluated small open models on graph parameter prediction, and applied LoRA fine-tuning for efficient parameter adaptation.
Result: Small language models demonstrated non-trivial reasoning capacity over graph-structured data and showed consistent improvements across all metrics after LoRA fine-tuning.
Conclusion: Compact transformer models can effectively perform structural inference on graphs through efficient parameter tuning, showing promise for resource-constrained graph analysis applications.
Abstract: Graphs provide a universal framework for representing complex relational systems, and inferring their structural properties is a core challenge in graph analysis and reasoning. While large language models have recently demonstrated emerging abilities to perform symbolic and numerical reasoning, the potential of smaller, resource-efficient models in this context remains largely unexplored. This paper investigates whether compact transformer-based language models can infer graph-theoretic parameters directly from textual graph representations. To enable systematic evaluation, we introduce the TinyGraphEstimator dataset - a balanced collection of connected graphs generated from multiple random graph models and annotated with detailed structural metadata. We evaluate several small open models on their ability to predict key graph parameters such as density, clustering, and chromatic number. Furthermore, we apply lightweight fine-tuning using the Low-Rank Adaptation (LoRA) technique, achieving consistent improvements across all evaluated metrics. The results demonstrate that small language models possess non-trivial reasoning capacity over graph-structured data and can be effectively adapted for structural inference tasks through efficient parameter tuning.
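The adaptation step maps onto a standard PEFT recipe. A sketch using Hugging Face `peft`; the base model, target modules, and prompt format are assumptions, since the paper's exact textual graph encoding is not specified in the summary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B"                     # hypothetical small open model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)            # only low-rank adapters train
model.print_trainable_parameters()

# Illustrative prompt: textual graph representation -> structural parameter.
prompt = ("Edges: (0,1) (1,2) (2,0) (2,3)\n"
          "Nodes: 4\n"
          "Question: what is the graph density?\n"
          "Answer:")
inputs = tok(prompt, return_tensors="pt")
```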
[427] Titans Revisited: A Lightweight Reimplementation and Critical Analysis of a Test-Time Memory Model
Gavriel Di Nepi, Federico Siciliano, Fabrizio Silvestri
Main category: cs.LG
TL;DR: A reimplementation and evaluation of Google’s Titans model shows its neural memory component improves performance over attention-only models, but chunking limitations prevent it from consistently outperforming established baselines.
Details
Motivation: To address reproducibility issues with Google's Titans model due to lack of public code and ambiguous descriptions, and to conduct a comprehensive evaluation of its performance claims.
Method: Created a lightweight reimplementation of Titans and evaluated it on Masked Language Modeling, Time Series Forecasting, and Recommendation tasks, comparing against established baselines.
Result: Titans does not always outperform established baselines due to chunking limitations, but its Neural Memory component consistently improves performance compared to attention-only models.
Conclusion: The neural memory component shows innovative potential but has practical limitations that raise questions for future research, particularly regarding chunking issues.
Abstract: By the end of 2024, Google researchers introduced Titans: Learning at Test Time, a neural memory model achieving strong empirical results across multiple tasks. However, the lack of publicly available code and ambiguities in the original description hinder reproducibility. In this work, we present a lightweight reimplementation of Titans and conduct a comprehensive evaluation on Masked Language Modeling, Time Series Forecasting, and Recommendation tasks. Our results reveal that Titans does not always outperform established baselines due to chunking. However, its Neural Memory component consistently improves performance compared to attention-only models. These findings confirm the model’s innovative potential while highlighting its practical limitations and raising questions for future research.
[428] Long-Tailed Recognition via Information-Preservable Two-Stage Learning
Fudong Lin, Xu Yuan
Main category: cs.LG
TL;DR: A two-stage learning approach for long-tailed classification that uses information theory-based representation learning and informative instance sampling to address class imbalance bias.
Details
Motivation: Real-world data often has imbalanced distributions (long-tail), causing deep models to be biased toward frequent classes and perform poorly on tail classes.
Method: Two-stage approach: 1) Information theory-based representation learning to minimize intra-class distance and create a well-separated feature space; 2) Novel sampling strategy that selects mathematically informative instances to rectify biased decision boundaries.
Result: Achieves state-of-the-art performance across various long-tailed benchmark datasets.
Conclusion: The proposed approach effectively mitigates majority-biased tendency while preserving valuable dataset information, demonstrating superior performance in long-tailed classification tasks.
Abstract: The imbalance (or long-tail) is the nature of many real-world data distributions, which often induces the undesirable bias of deep classification models toward frequent classes, resulting in poor performance for tail classes. In this paper, we propose a novel two-stage learning approach to mitigate such a majority-biased tendency while preserving valuable information within datasets. Specifically, the first stage proposes a new representation learning technique from the information theory perspective. This approach is theoretically equivalent to minimizing intra-class distance, yielding an effective and well-separated feature space. The second stage develops a novel sampling strategy that selects mathematically informative instances, able to rectify majority-biased decision boundaries without compromising a model’s overall performance. As a result, our approach achieves the state-of-the-art performance across various long-tailed benchmark datasets, validated via extensive experiments. Our code is available at https://github.com/fudong03/BNS_IPDPP.
[429] The Boundaries of Fair AI in Medical Image Prognosis: A Causal Perspective
Thai-Hoang Pham, Jiayuan Chen, Seungyeon Lee, Yuanlong Wang, Sayoko Moroi, Xueru Zhang, Ping Zhang
Main category: cs.LG
TL;DR: FairTTE is the first comprehensive framework for assessing fairness in time-to-event prediction in medical imaging, addressing bias concerns in prognosis scenarios that were previously overlooked.
Details
Motivation: Existing fairness approaches in medical imaging focus only on diagnosis tasks (classification/segmentation) while overlooking prognosis scenarios involving time-to-event prediction, creating a critical gap in ensuring equitable ML models for medical outcomes.
Method: Developed the FairTTE framework, which integrates TTE prediction and fairness algorithms, uses causal analysis to uncover bias sources, and conducts large-scale evaluation across diverse imaging modalities and TTE outcomes.
Result: Bias is pervasive across different imaging modalities, current fairness methods offer limited mitigation, fairness deteriorates under distribution shifts, and there’s strong association between bias sources and model disparities.
Conclusion: There is a pressing need for more robust, equitable prognostic models that target all forms of bias holistically, as existing solutions have significant limitations particularly under distribution shifts.
Abstract: As machine learning (ML) algorithms are increasingly used in medical image analysis, concerns have emerged about their potential biases against certain social groups. Although many approaches have been proposed to ensure the fairness of ML models, most existing works focus only on medical image diagnosis tasks, such as image classification and segmentation, and overlooked prognosis scenarios, which involve predicting the likely outcome or progression of a medical condition over time. To address this gap, we introduce FairTTE, the first comprehensive framework for assessing fairness in time-to-event (TTE) prediction in medical imaging. FairTTE encompasses a diverse range of imaging modalities and TTE outcomes, integrating cutting-edge TTE prediction and fairness algorithms to enable systematic and fine-grained analysis of fairness in medical image prognosis. Leveraging causal analysis techniques, FairTTE uncovers and quantifies distinct sources of bias embedded within medical imaging datasets. Our large-scale evaluation reveals that bias is pervasive across different imaging modalities and that current fairness methods offer limited mitigation. We further demonstrate a strong association between underlying bias sources and model disparities, emphasizing the need for holistic approaches that target all forms of bias. Notably, we find that fairness becomes increasingly difficult to maintain under distribution shifts, underscoring the limitations of existing solutions and the pressing need for more robust, equitable prognostic models.
[430] On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Achleshwar Luthra, Priyadarsi Mishra, Tomer Galanti
Main category: cs.LG
TL;DR: This paper analyzes the representation alignment between self-supervised contrastive learning (CL) and its supervised surrogate (NSCL), showing their representations remain similar throughout training despite parameter divergence.
Details
Motivation: To understand whether CL and NSCL remain aligned at the representation level throughout training, not just in their objectives, given that previous theory showed loss-level similarity.
Method: Analyzed representation alignment under shared randomness (same initialization, batches, augmentations), proved bounds on similarity matrices, and validated empirically with scale and temperature experiments.
Result: CL and NSCL representations remain similar with high-probability guarantees on alignment metrics, alignment improves with more classes/higher temperatures, but parameter-space coupling is unstable with exponential divergence.
Conclusion: NSCL serves as a principled bridge between self-supervised and supervised learning, with CL-NSCL alignment strengthening under realistic conditions despite parameter divergence.
Abstract: Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL) loss, as the number of classes grows. Yet this loss-level similarity leaves an open question: {\em Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?} We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes, higher temperatures, and its dependence on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL-NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning. Our code and project page are available at [\href{https://github.com/DLFundamentals/understanding_ssl_v2}{code}, \href{https://dlfundamentals.github.io/cl-nscl-representation-alignment/}{project page}].
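Linear CKA, one of the alignment metrics the guarantees cover, is compact enough to state in code; a minimal sketch:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear centered kernel alignment between two representation matrices
    of shape (n_samples, dim_x) and (n_samples, dim_y); 1.0 means the
    centered similarity structures coincide."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(X.T @ Y, "fro") ** 2
    return (cross / (torch.linalg.norm(X.T @ X, "fro") *
                     torch.linalg.norm(Y.T @ Y, "fro"))).item()

# e.g. compare CL and NSCL encoders trained under shared randomness:
# score = linear_cka(feats_cl, feats_nscl)
```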
[431] Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training
T. Ed Li, Junyu Ren
Main category: cs.LG
TL;DR: ATM is a novel training method for sparse autoencoders that dynamically adjusts feature selection using activation tracking and probabilistic masking to reduce feature absorption while maintaining reconstruction quality.
Details
Motivation: Current sparse autoencoder training methods suffer from feature absorption, where features merge together to minimize the L1 penalty, making it difficult to consistently identify and analyze model behaviors for reliability and safety.
Method: Adaptive Temporal Masking (ATM) tracks activation magnitudes, frequencies, and reconstruction contributions over time to compute evolving importance scores, then applies probabilistic masking based on statistical thresholding of these scores.
Result: ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs while maintaining excellent reconstruction quality in experiments on Gemma-2-2b model.
Conclusion: ATM provides a principled solution for learning stable, interpretable features in neural networks, establishing a foundation for more reliable model analysis and understanding internal representations.
Abstract: Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, with sparse autoencoders (SAEs) emerging as a promising interpretability approach. However, current SAE training methods face feature absorption, where features (or neurons) are absorbed into each other to minimize $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking activation magnitudes, frequencies, and reconstruction contributions to compute importance scores that evolve over time. ATM applies a probabilistic masking mechanism based on statistical thresholding of these importance scores, creating a more natural feature selection process. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores compared to existing methods like TopK and JumpReLU SAEs, while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
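A schematic of the masking mechanism described above, with an exponential moving average standing in for the paper's importance statistics; the score composition and sigmoid keep-probability are illustrative choices, not ATM's exact formulas.

```python
import torch

class AdaptiveTemporalMask:
    """Tracks per-feature importance over training and stochastically masks
    low-importance features, making feature selection gradual rather than
    a hard top-k cut."""
    def __init__(self, n_features: int, decay: float = 0.99):
        self.importance = torch.zeros(n_features)
        self.decay = decay

    def update(self, acts: torch.Tensor) -> None:   # acts: (batch, n_features)
        # combine activation magnitude and firing frequency into one score
        score = acts.abs().mean(0) + (acts != 0).float().mean(0)
        self.importance = self.decay * self.importance + (1 - self.decay) * score

    def __call__(self, acts: torch.Tensor) -> torch.Tensor:
        # z-score the running importance; keep features with prob. sigmoid(z)
        z = (self.importance - self.importance.mean()) / (self.importance.std() + 1e-8)
        keep = torch.bernoulli(torch.sigmoid(z)).to(acts.dtype)
        return acts * keep
```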
[432] Deep Multimodal Subspace Clustering Networks
Mahdi Abavisani, Vishal M. Patel
Main category: cs.LG
TL;DR: Proposes CNN-based unsupervised multimodal subspace clustering with encoder-fusion-decoder architecture using spatial and affinity fusion techniques, achieving state-of-the-art performance.
Details
Motivation: To develop effective unsupervised methods for multimodal subspace clustering that can handle different fusion strategies and improve clustering performance across multiple datasets.
Method: Three-stage framework: multimodal encoder (early/late/intermediate fusion), self-expressive layer for the affinity matrix, and multimodal decoder. Uses reconstruction loss for training. Proposes spatial fusion and affinity fusion approaches.
Result: Extensive experiments on three datasets show significant performance improvements over state-of-the-art multimodal subspace clustering methods.
Conclusion: The proposed CNN-based multimodal subspace clustering framework with various fusion strategies effectively captures multimodal relationships and outperforms existing methods.
Abstract: We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages - multimodal encoder, self-expressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. The self-expressive layer is responsible for enforcing the self-expressiveness property and acquiring an affinity matrix corresponding to the data points. The decoder reconstructs the original input data. The network uses the distance between the decoder’s reconstruction and the original input in its training. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The self-expressive layers and multimodal decoders are essentially the same for different spatial fusion-based approaches. In addition to various spatial fusion-based methods, an affinity fusion-based network is also proposed in which the self-expressive layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the state-of-the-art multimodal subspace clustering methods.
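The self-expressive layer at the center of this pipeline is a single learnable coefficient matrix; a PyTorch sketch (regularization weights and the downstream spectral clustering step are omitted):

```python
import torch
import torch.nn as nn

class SelfExpressiveLayer(nn.Module):
    """Reconstructs every latent code as a linear combination of the others,
    Z_hat = C Z with diag(C) = 0; |C| + |C|^T then serves as the affinity
    matrix for spectral clustering."""
    def __init__(self, n_samples: int):
        super().__init__()
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, z):                    # z: (n_samples, latent_dim)
        C = self.C - torch.diag(torch.diag(self.C))  # forbid self-reconstruction
        return C @ z

# Training couples three terms: input reconstruction through the decoder,
# self-expression ||z - C z||^2, and a sparsity/norm penalty on C.
```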
[433] Sparse components distinguish visual pathways & their alignment to neural networks
Ammar I Marvi, Nancy G Kanwisher, Meenakshi Khosla
Main category: cs.LG
TL;DR: A novel sparse decomposition method reveals distinct functional components across visual streams and introduces SCA for better brain-machine representational alignment measurement.
Details
Motivation: To resolve the inconsistency between distinct functional processes in human visual streams and DNNs' ability to model the entire visual system with common computational principles.
Method: Applied sparse decomposition to identify the dominant visual components in each stream, then introduced the Sparse Component Alignment (SCA) method for measuring representational alignment between brains and machines.
Result: Found clear differences in component profiles across streams (faces, places, bodies in ventral; social interactions, motion in lateral; less interpretable in dorsal). SCA showed DNNs are more aligned with ventral than dorsal/lateral representations.
Conclusion: SCA provides higher-resolution measurement of representational alignment than conventional methods, capturing underlying neural tuning axes more sensitively.
Abstract: The ventral, dorsal, and lateral streams in high-level human visual cortex are implicated in distinct functional processes. Yet, deep neural networks (DNNs) trained on a single task model the entire visual system surprisingly well, hinting at common computational principles across these pathways. To explore this inconsistency, we applied a novel sparse decomposition approach to identify the dominant components of visual representations within each stream. Consistent with traditional neuroscience research, we find a clear difference in component response profiles across the three visual streams – identifying components selective for faces, places, bodies, text, and food in the ventral stream; social interactions, implied motion, and hand actions in the lateral stream; and some less interpretable components in the dorsal stream. Building on this, we introduce Sparse Component Alignment (SCA), a new method for measuring representational alignment between brains and machines that better captures the latent neural tuning of these two visual systems. Using SCA, we find that standard visual DNNs are more aligned with the ventral than either dorsal or lateral representations. SCA reveals these distinctions with greater resolution than conventional population-level geometry, offering a measure of representational alignment that is sensitive to a system’s underlying axes of neural tuning.
[434] Multi-fidelity Batch Active Learning for Gaussian Process Classifiers
Murray Cutforth, Yiming Yang, Tiffany Fan, Serge Guillas, Eric Darve
Main category: cs.LG
TL;DR: BPMI is a batch active learning algorithm for multi-fidelity GP classifiers that uses Bernoulli Parameter Mutual Information to efficiently allocate simulation budgets by approximating mutual information through Taylor expansion.
Details
Motivation: Many science and engineering problems rely on expensive computational simulations, and multi-fidelity approaches can accelerate parameter space exploration, particularly for binary simulation outputs.
Method: Introduces Bernoulli Parameter Mutual Information (BPMI), which circumvents the intractability of computing mutual information in probability space by using a first-order Taylor expansion of the link function for multi-fidelity GP classifiers.
Result: BPMI outperforms several baselines on two synthetic test cases and a real-world laser-ignited rocket combustor simulation, achieving higher predictive accuracy for fixed computational budgets.
Conclusion: BPMI demonstrates superior performance in all experiments, making it an effective approach for efficient simulation budget allocation in multi-fidelity settings.
Abstract: Many science and engineering problems rely on expensive computational simulations, where a multi-fidelity approach can accelerate the exploration of a parameter space. We study efficient allocation of a simulation budget using a Gaussian Process (GP) model in the binary simulation output case. This paper introduces Bernoulli Parameter Mutual Information (BPMI), a batch active learning algorithm for multi-fidelity GP classifiers. BPMI circumvents the intractability of calculating mutual information in the probability space by employing a first-order Taylor expansion of the link function. We evaluate BPMI against several baselines on two synthetic test cases and a complex, real-world application involving the simulation of a laser-ignited rocket combustor. In all experiments, BPMI demonstrates superior performance, achieving higher predictive accuracy for a fixed computational budget.
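The Taylor trick is the classical delta method: push the Gaussian uncertainty of the GP latent through the link function to get an approximate distribution over the Bernoulli parameter. The notation below is illustrative rather than the paper's own:

```latex
% f: GP latent function; \sigma: link function; p: Bernoulli parameter of
% the binary simulation output. First-order expansion about the mean:
p(x) = \sigma\big(f(x)\big), \qquad f(x) \sim \mathcal{N}\big(\mu(x), s^2(x)\big),
\quad
p(x) \approx \sigma\big(\mu(x)\big) + \sigma'\big(\mu(x)\big)\,\big(f(x) - \mu(x)\big),
\quad
\operatorname{Var}\big[p(x)\big] \approx \sigma'\big(\mu(x)\big)^{2}\, s^{2}(x).
```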
[435] An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs
Haolin Liu, Chen-Yu Wei, Julian Zimmert
Main category: cs.LG
TL;DR: The paper introduces Dig-DEC, a model-free decision-estimation coefficient that uses information gain for exploration instead of optimism. It achieves tighter regret bounds for hybrid MDPs with adversarial rewards and improves online function estimation procedures.
Details
Motivation: To close the gap between upper and lower bounds in DMSO and extend optimistic DEC to handle adversarial environments, as previous optimism-based approaches only worked in stochastic settings.
Method: Proposes Dig-DEC, which removes optimism and drives exploration purely by information gain. Also improves online function-estimation procedures for both average estimation error and squared error minimization.
Result: Dig-DEC achieves smaller bounds than optimistic DEC and handles adversarial environments. Improved regret bounds from T^{3/4} to T^{2/3} (on-policy) and from T^{5/6} to T^{7/9} (off-policy) for average estimation, and from T^{2/3} to √T for squared error in Bellman-complete MDPs.
Conclusion: Dig-DEC provides the first model-free regret bounds for hybrid MDPs with adversarial rewards and matches optimism-based performance in Bellman-complete MDPs, resolving open problems in the field.
Abstract: We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.
[436] Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning
Junxi Yin, Haisen Luo, Zhenyu Li, Yihua Liu, Dan Liu, Zequn Li, Xiaohang Xu
Main category: cs.LG
TL;DR: ACPO is a phased framework that improves RLVR by balancing exploration and exploitation through difficulty-aware curriculum, trajectory segmentation, and factorized rewards for better credit assignment.
Details
Motivation: Current RLVR methods struggle to balance exploration and exploitation, leading to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse that limit model performance.
Method: ACPO uses a phased framework with a difficulty-aware curriculum, trajectory semantic segmentation, an attribution-based representation to regulate policy entropy, and a factorized reward system for hierarchical contribution quantification.
Result: Extensive experiments on AIME, MATH, and AMC benchmarks show ACPO significantly outperforms existing state-of-the-art approaches.
Conclusion: ACPO effectively addresses exploration-exploitation balance in RLVR through its phased framework, improving reasoning performance on challenging benchmarks.
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
[437] A Frequency-Domain Analysis of the Multi-Armed Bandit Problem: A New Perspective on the Exploration-Exploitation Trade-off
Di Zhang
Main category: cs.LG
TL;DR: This paper introduces a frequency-domain analysis framework for multi-armed bandit problems, reformulating bandit algorithms as signal processing problems where reward estimates are spectral components and uncertainty corresponds to frequency.
Details
Motivation: Existing bandit analyses operate from time-domain and cumulative regret perspectives, struggling to characterize the dynamic learning process. A frequency-domain approach could provide better insights into the exploration-exploitation trade-off.
Method: The paper proposes a frequency-domain analysis framework that views bandit processes as signal processing problems. It constructs a Frequency-Domain Bandit Model and proves that UCB’s confidence bound term is equivalent to a time-varying gain applied to uncertain spectral components.
Result: The main theorem shows that UCB’s confidence bound corresponds to a gain inversely proportional to the square root of visit count. The paper also derives finite-time dynamic bounds for exploration rate decay.
Conclusion: The frequency-domain theory provides novel physical interpretation for classical bandit algorithms and lays theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
Abstract: The stochastic multi-armed bandit (MAB) problem is one of the most fundamental models in sequential decision-making, with the core challenge being the trade-off between exploration and exploitation. Although algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling, along with their regret theories, are well-established, existing analyses primarily operate from a time-domain and cumulative regret perspective, struggling to characterize the dynamic nature of the learning process. This paper proposes a novel frequency-domain analysis framework, reformulating the bandit process as a signal processing problem. Within this framework, the reward estimate of each arm is viewed as a spectral component, with its uncertainty corresponding to the component’s frequency, and the bandit algorithm is interpreted as an adaptive filter. We construct a formal Frequency-Domain Bandit Model and prove the main theorem: the confidence bound term in the UCB algorithm is equivalent in the frequency domain to a time-varying gain applied to uncertain spectral components, a gain inversely proportional to the square root of the visit count. Based on this, we further derive finite-time dynamic bounds concerning the exploration rate decay. This theory not only provides a novel and intuitive physical interpretation for classical algorithms but also lays a rigorous theoretical foundation for designing next-generation algorithms with adaptive parameter adjustment.
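The gain interpretation maps directly onto the familiar UCB rule. A minimal sketch of standard UCB1 (not the paper's frequency-domain machinery) shows the confidence bonus shrinking like one over the square root of the visit count:

```python
# Standard UCB1 sketch: the bonus sqrt(2 ln t / N_i) acts as a time-varying
# gain on poorly-visited (uncertain) arms and decays as 1/sqrt(visits).
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.3, 0.5, 0.7])          # true arm means, unknown to UCB
n = np.ones(3)                             # one initial pull per arm
s = (rng.random(3) < means).astype(float)  # initial Bernoulli rewards

for t in range(3, 3000):
    gain = np.sqrt(2 * np.log(t) / n)      # confidence bonus ~ 1/sqrt(N_i)
    arm = int(np.argmax(s / n + gain))     # empirical mean + uncertainty gain
    s[arm] += rng.random() < means[arm]    # Bernoulli reward
    n[arm] += 1

print("pulls per arm:", n)                 # most pulls go to the best arm
```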
[438] Velocity and Density-Aware RRI Analysis and Optimization for AoI Minimization in IoV SPS
Maoxin Ji, Tong Wang, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen
Main category: cs.LG
TL;DR: This paper proposes an LLM and DDPG-based optimization approach to reduce Age of Information deterioration in Semi-Persistent Scheduling for Internet of Vehicles, addressing packet collisions and vehicle speed-related channel uncertainties.
Details
Motivation: Address the problem of Age of Information deterioration caused by packet collisions and vehicle speed-related channel uncertainties in Semi-Persistent Scheduling for Internet of Vehicles.
Method: Establishes an AoI calculation model considering vehicle speed, density, and Resource Reservation Interval, then designs a dual-path optimization scheme using DDPG (guided by state space and reward function) and LLM (leveraging contextual learning to generate optimal parameters).
Result: LLM significantly reduces AoI after accumulating a small number of exemplars without model training, while DDPG achieves more stable performance after training.
Conclusion: The proposed dual-path optimization approach combining LLM and DDPG effectively addresses AoI deterioration in IoV SPS systems, with LLM providing rapid optimization and DDPG offering stable performance after training.
Abstract: Addressing the problem of Age of Information (AoI) deterioration caused by packet collisions and vehicle speed-related channel uncertainties in Semi-Persistent Scheduling (SPS) for the Internet of Vehicles (IoV), this letter proposes an optimization approach based on Large Language Models (LLM) and Deep Deterministic Policy Gradient (DDPG). First, an AoI calculation model influenced by vehicle speed, vehicle density, and Resource Reservation Interval (RRI) is established, followed by the design of a dual-path optimization scheme. The DDPG is guided by the state space and reward function, while the LLM leverages contextual learning to generate optimal parameter configurations. Experimental results demonstrate that LLM can significantly reduce AoI after accumulating a small number of exemplars without requiring model training, whereas the DDPG method achieves more stable performance after training.
[439] Simple and Robust Forecasting of Spatiotemporally Correlated Small Earth Data with A Tabular Foundation Model
Yuting Yang, Gang Mei, Zhengjing Ma, Nengxiong Xu, Jianbing Peng
Main category: cs.LG
TL;DR: A robust approach for spatiotemporal forecasting of small Earth data using tabular foundation models that characterizes spatiotemporal patterns to achieve accurate forecasting across different scenarios without task-specific training.
Details
Motivation: Small Earth data with limited short-term monitoring variability and spatiotemporal correlations require forecasting methods that don't need task-specific training and avoid the forecasting bias toward global mean that conventional foundation models exhibit.
Method: Characterize and quantify spatiotemporal patterns of small Earth data, then utilize tabular foundation models for accurate forecasting across different scenarios.
Result: Superior accuracy compared to a graph deep learning model (T-GCN) and a tabular foundation model (TabPFN) in the majority of instances across three typical scenarios, with stronger robustness.
Conclusion: The proposed approach provides a simple and robust solution for spatiotemporally correlated small Earth data forecasting that works effectively across different scenarios without requiring task-specific training.
Abstract: Small Earth data are geoscience observations with limited short-term monitoring variability, providing sparse but meaningful measurements, typically exhibiting spatiotemporal correlations. Spatiotemporal forecasting on such data is crucial for understanding geoscientific processes despite their small scale. However, conventional deep learning models for spatiotemporal forecasting require task-specific training for different scenarios. Foundation models do not need task-specific training, but they often exhibit forecasting bias toward the global mean of the pretraining distribution. Here we propose a simple and robust approach for spatiotemporally correlated small Earth data forecasting. The essential idea is to characterize and quantify spatiotemporal patterns of small Earth data and then utilize tabular foundation models for accurate forecasting across different scenarios. Comparative results across three typical scenarios demonstrate that our forecasting approach achieves superior accuracy compared to the graph deep learning model (T-GCN) and tabular foundation model (TabPFN) in the majority of instances, exhibiting stronger robustness.
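The general recipe (encode spatiotemporal patterns as tabular features, then apply a tabular model) can be sketched as below; a RandomForestRegressor stands in for the tabular foundation model used in the paper, purely so the example runs without extra dependencies.

```python
# Hedged sketch: lags and a crude neighbor average as tabular features,
# fed to a generic tabular regressor (stand-in for a TabPFN-style model).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
T, N = 200, 5                                  # timesteps, stations
series = np.cumsum(rng.normal(size=(T, N)), axis=0)

X, y = [], []
for t in range(3, T):
    for i in range(N):
        lags = series[t-3:t, i]                # temporal pattern
        nbr = series[t-1].mean()               # crude spatial pattern
        X.append([*lags, nbr]); y.append(series[t, i])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(np.array(X[:-N]), np.array(y[:-N])) # hold out the last step
print("1-step forecast:", model.predict(np.array(X[-N:])))
```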
[440] AB-PINNs: Adaptive-Basis Physics-Informed Neural Networks for Residual-Driven Domain Decomposition
Jonah Botvinick-Greenhouse, Wael H. Ali, Mouhacine Benosman, Saviz Mowlavi
Main category: cs.LG
TL;DR: AB-PINNs introduce adaptive domain decomposition for PINNs, where subdomains dynamically adjust to solution features and new subdomains are added in high-residual regions during training to handle multiscale problems.
Details
Motivation: To address limitations of static domain decomposition in PINNs for multiscale problems, where different regions require different expressive power and fixed subdomains may lead to poor convergence or extensive hyperparameter tuning.
Method: Dynamic domain decomposition with adaptive subdomains that adjust to solution features, plus on-the-fly introduction of new subdomains in regions with high residual loss, inspired by classical mesh refinement techniques.
Result: Comprehensive numerical results demonstrate effectiveness in solving various complex multiscale partial differential equations, with improved convergence and reduced hyperparameter tuning needs.
Conclusion: AB-PINNs provide a flexible domain decomposition approach well-suited for multiscale problems, enabling different subdomains to capture different solution scales and preventing convergence to local minima.
Abstract: We introduce adaptive-basis physics-informed neural networks (AB-PINNs), a novel approach to domain decomposition for training PINNs in which existing subdomains dynamically adapt to the intrinsic features of the unknown solution. Drawing inspiration from classical mesh refinement techniques, we also modify the domain decomposition on-the-fly throughout training by introducing new subdomains in regions of high residual loss, thereby providing additional expressive power where the solution of the differential equation is challenging to represent. Our flexible approach to domain decomposition is well-suited for multiscale problems, as different subdomains can learn to capture different scales of the underlying solution. Moreover, the ability to introduce new subdomains during training helps prevent convergence to unwanted local minima and can reduce the need for extensive hyperparameter tuning compared to static domain decomposition approaches. Throughout, we present comprehensive numerical results which demonstrate the effectiveness of AB-PINNs at solving a variety of complex multiscale partial differential equations.
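A minimal sketch of the residual-driven refinement step, assuming PDE residuals have already been evaluated on collocation points; the threshold, initial width, and function names are illustrative.

```python
# Hedged sketch: spawn a new subdomain centered where the residual peaks.
import numpy as np

def maybe_add_subdomain(x, residual, centers, widths, threshold=1e-2):
    """Add a subdomain at the collocation point of maximal residual."""
    worst = int(np.argmax(np.abs(residual)))
    if np.abs(residual[worst]) > threshold:
        centers.append(x[worst])            # new basis centered at hot spot
        widths.append(0.1)                  # fresh, narrow support
    return centers, widths

x = np.linspace(0, 1, 101)
residual = np.exp(-((x - 0.7) / 0.02) ** 2) # residual spike near x = 0.7
centers, widths = maybe_add_subdomain(x, residual, centers=[0.25, 0.75],
                                      widths=[0.5, 0.5])
print("subdomain centers:", centers)        # a new subdomain appears at ~0.7
```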
[441] MATT-CTR: Unleashing a Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths
Moyu Zhang, Yun Chen, Yujun Jin, Jinxin Hu, Yu Zhang, Xiaoyi Zeng
Main category: cs.LG
TL;DR: MATT is a model-agnostic test-time paradigm that improves CTR prediction by generating multiple inference paths based on feature combination confidence scores, mitigating the impact of low-confidence features.
Details
Motivation: Previous CTR research focused mainly on training phase optimizations, neglecting inference phase opportunities. Infrequent feature combinations degrade prediction performance, leading to unreliable outputs.
Method: Uses hierarchical probabilistic hashing to estimate feature combination frequencies as confidence scores, then generates multiple inference paths through iterative sampling and aggregates predictions for robust results.
Result: Extensive offline experiments and online A/B tests strongly validate MATT’s compatibility and effectiveness across existing CTR models.
Conclusion: MATT successfully unlocks the predictive potential of trained CTR models by addressing inference-phase optimization through confidence-guided multiple path generation.
Abstract: Recently, a growing body of research has focused on either optimizing CTR model architectures to better model feature interactions or refining training objectives to aid parameter learning, thereby achieving better predictive performance. However, previous efforts have primarily focused on the training phase, largely neglecting opportunities for optimization during the inference phase. Infrequently occurring feature combinations, in particular, can degrade prediction performance, leading to unreliable or low-confidence outputs. To unlock the predictive potential of trained CTR models, we propose a Model-Agnostic Test-Time paradigm (MATT), which leverages the confidence scores of feature combinations to guide the generation of multiple inference paths, thereby mitigating the influence of low-confidence features on the final prediction. Specifically, to quantify the confidence of feature combinations, we introduce a hierarchical probabilistic hashing method to estimate the occurrence frequencies of feature combinations at various orders, which serve as their corresponding confidence scores. Then, using the confidence scores as sampling probabilities, we generate multiple instance-specific inference paths through iterative sampling and subsequently aggregate the prediction scores from multiple paths to conduct robust predictions. Finally, extensive offline experiments and online A/B tests strongly validate the compatibility and effectiveness of MATT across existing CTR models.
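A toy sketch of confidence-guided multi-path inference follows; the hashed counter stands in for the paper's hierarchical probabilistic hashing, and all names, probabilities, and the stand-in CTR model are illustrative.

```python
# Hedged sketch: frequency-based confidence drives feature sampling, and
# predictions from several sampled paths are averaged for robustness.
import numpy as np

rng = np.random.default_rng(0)
counts = {}                                     # hashed frequency table
for f in ["user=7", "item=42"]:                 # pretend these were seen often
    counts[hash((f,)) % 10_000] = 80

def confidence(combo) -> float:
    """Frequency-based confidence in [0, 1] from a hashed counter."""
    return min(counts.get(hash(combo) % 10_000, 0) / 100.0, 1.0)

def matt_predict(model, features, n_paths=8):
    probs = np.array([0.1 + 0.9 * confidence((f,)) for f in features])
    preds = []
    for _ in range(n_paths):                    # multiple inference paths
        keep = rng.random(len(features)) < probs
        preds.append(model([f for f, k in zip(features, keep) if k]))
    return float(np.mean(preds))                # aggregate over paths

toy_model = lambda feats: 0.1 * len(feats)      # stand-in CTR model
print(matt_predict(toy_model, ["user=7", "item=42", "hour=3"]))
```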
[442] Bi-level Meta-Policy Control for Dynamic Uncertainty Calibration in Evidential Deep Learning
Zhen Yang, Yansong Ma, Lei Chen
Main category: cs.LG
TL;DR: Proposes Meta-Policy Controller (MPC), a dynamic meta-learning framework that adaptively adjusts KL divergence coefficient and Dirichlet prior strengths for better uncertainty calibration in Evidential Deep Learning, overcoming limitations of static hyperparameters.
Details
Motivation: Traditional EDL methods use static hyperparameters for uncertainty calibration, which limits adaptability in dynamic data distributions and leads to poor calibration and generalization in high-risk decision-making tasks.
Method: MPC uses bi-level optimization: inner loop updates model parameters with dynamically configured loss function; outer loop uses policy network to optimize KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing accuracy and uncertainty quality.
Result: Extensive experiments show MPC significantly enhances reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
Conclusion: The learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics, overcoming limitations of fixed priors in previous methods.
Abstract: Traditional Evidential Deep Learning (EDL) methods rely on static hyperparameters for uncertainty calibration, limiting their adaptability in dynamic data distributions, which results in poor calibration and generalization in high-risk decision-making tasks. To address this limitation, we propose the Meta-Policy Controller (MPC), a dynamic meta-learning framework that adjusts the KL divergence coefficient and Dirichlet prior strengths for optimal uncertainty modeling. Specifically, MPC employs a bi-level optimization approach: in the inner loop, model parameters are updated through a dynamically configured loss function that adapts to the current training state; in the outer loop, a policy network optimizes the KL divergence coefficient and class-specific Dirichlet prior strengths based on multi-objective rewards balancing prediction accuracy and uncertainty quality. Unlike previous methods with fixed priors, our learnable Dirichlet prior enables flexible adaptation to class distributions and training dynamics. Extensive experimental results show that MPC significantly enhances the reliability and calibration of model predictions across various tasks, improving uncertainty calibration, prediction accuracy, and performance retention after confidence-based sample rejection.
[443] Variability Aware Recursive Neural Network (VARNN): A Residual-Memory Model for Capturing Temporal Deviation in Sequence Regression Modeling
Haroon Gharwi, Kai Shu
Main category: cs.LG
TL;DR: VARNN is a neural network architecture that uses error memory from recent prediction residuals to improve time-series regression under non-stationary and heteroscedastic conditions.
Details
Motivation: Real-world time series data exhibit non-stationary behavior, regime shifts, and temporally varying noise that degrade standard regression models' robustness.
Method: VARNN augments a feed-forward predictor with a learned error-memory state updated from residuals over short context steps, using this as signal of variability and drift to recalibrate subsequent predictions.
Result: Across diverse domains (appliance energy, healthcare, environmental monitoring), VARNN achieves superior performance with lower test MSE and minimal computational overhead compared to static, dynamic, and recurrent baselines.
Conclusion: VARNN offers robust predictions under drift and volatility environments, showing potential as a promising framework for time-series learning.
Abstract: Real-world time series data exhibit non-stationary behavior, regime shifts, and temporally varying (heteroscedastic) noise that degrade the robustness of standard regression models. We introduce the Variability-Aware Recursive Neural Network (VARNN), a novel residual-aware architecture for supervised time-series regression that learns an explicit error memory from recent prediction residuals and uses it to recalibrate subsequent predictions. VARNN augments a feed-forward predictor with a learned error-memory state that is updated from residuals over a short window of context steps as a signal of variability and drift, and then conditions the final prediction at the current time step. Across diverse dataset domains (appliance energy, healthcare, and environmental monitoring), experimental results demonstrate that VARNN achieves superior performance and attains lower test MSE with minimal computational overhead compared to static, dynamic, and recurrent baselines. Our findings show that the VARNN model offers robust predictions under drift and volatility, highlighting its potential as a promising framework for time-series learning.
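A minimal numerical sketch of the error-memory mechanism, assuming a simple exponential-decay update and a linear correction rather than the paper's learned state; the decay and gain values are illustrative.

```python
# Hedged sketch: an exponentially decayed residual memory recalibrates the
# next prediction, helping a drift-unaware baseline track a drifting series.
import numpy as np

def varnn_like_forecast(y, base_pred, decay=0.8, gain=0.5):
    """base_pred: callable t -> raw prediction; returns corrected preds."""
    memory, out = 0.0, []
    for t in range(len(y)):
        pred = base_pred(t) + gain * memory     # condition on error memory
        out.append(pred)
        memory = decay * memory + (1 - decay) * (y[t] - pred)  # residual update
    return np.array(out)

t = np.arange(200)
y = np.sin(t / 10) + 0.002 * t                  # drifting series
raw = lambda i: np.sin(i / 10)                  # drift-unaware baseline
corrected = varnn_like_forecast(y, raw)
print("baseline  MSE:", np.mean((y - np.sin(t / 10)) ** 2))
print("corrected MSE:", np.mean((y - corrected) ** 2))
```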
[444] When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach
Zhihan Zhang, Xunkai Li, Yilong Zuo, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Main category: cs.LG
TL;DR: LAGA is a multi-agent framework that addresses graph quality issues in text-attributed graphs through automated detection, planning, action, and evaluation agents, achieving state-of-the-art performance across various scenarios.
Details
Motivation: Graph neural networks on text-attributed graphs are highly sensitive to input quality issues like sparsity, noise, and imbalance, but existing approaches focus on model architectures rather than systematic data optimization.
Method: LAGA integrates four collaborative agents (detection, planning, action, evaluation) in an automated closed loop, with a dual-encoder and tri-objective design in the action agent for holistic graph quality enhancement.
Result: Experiments across nine scenarios show LAGA improves graph quality and achieves state-of-the-art performance across various tasks and backbones.
Conclusion: Data-centric quality optimization is key to reliable text-attributed graphs and robust graph learning, with LAGA providing an effective unified framework for this purpose.
Abstract: Text-attributed graphs (TAGs) have emerged as a powerful representation that combines structural connections with fine-grained semantics, supporting a wide range of data-centric applications. However, the performance of graph neural networks (GNNs) on TAGs is highly sensitive to input quality. Our empirical study shows that both traditional GNNs and LLM-enhanced GNNs suffer significant degradation across nine representative scenarios of sparsity, noise, and imbalance, highlighting graph quality as a critical bottleneck. Existing approaches mainly focus on improving model architectures, while neglecting systematic optimization of TAG data itself, leading to limited effectiveness in practice. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework that treats graph quality control as a first-class, data-centric problem. LAGA integrates four collaborative agents (detection, planning, action, and evaluation) into an automated closed loop. At its core, the action agent employs a dual-encoder and tri-objective design to capture complementary information across modalities and perform holistic graph quality enhancement. Experiments across nine scenarios show that LAGA improves graph quality and achieves state-of-the-art performance across various tasks and backbones, validating data-centric quality optimization as key to reliable TAGs and robust graph learning.
[445] Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation
Xiaofeng Cao, Mingwei Xu, Xin Yu, Jiangchao Yao, Wei Ye, Shengjun Huang, Minling Zhang, Ivor W. Tsang, Yew Soon Ong, James T. Kwok, Heng Tao Shen
Main category: cs.LG
TL;DR: This survey analyzes learning with low-resource data using PAC framework and active sampling theory, investigates optimization strategies for low-resource learning, and explores various learning paradigms that benefit from limited data.
Details
Motivation: High-resource data learning is successful but expensive in annotation and training costs. The fundamental goal is to achieve robust generalization with limited-resource data.
Method: Uses agnostic active sampling theory within PAC framework to analyze generalization error and label complexity. Investigates optimization strategies including gradient-informed, meta-iteration, geometry-aware, and LLMs-powered optimization.
Result: Provides theoretical analysis of generalization error and label complexity for low-resource learning in both supervised and unsupervised settings. Identifies effective optimization strategies and learning paradigms for limited data scenarios.
Conclusion: Summarizes key findings and implications for learning with low-resource data, highlighting the importance of specialized optimization approaches and various learning paradigms that can effectively leverage limited data resources.
Abstract: Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI); however, the costs associated with data annotation and model training remain significant. A fundamental objective of AI research is to achieve robust generalization with limited-resource data. This survey employs agnostic active sampling theory within the Probably Approximately Correct (PAC) framework to analyze the generalization error and label complexity associated with learning from low-resource data in both model-agnostic supervised and unsupervised settings. Based on this analysis, we investigate a suite of optimization strategies tailored for low-resource data learning, including gradient-informed optimization, meta-iteration optimization, geometry-aware optimization, and LLMs-powered optimization. Furthermore, we provide a comprehensive overview of multiple learning paradigms that can benefit from low-resource data, including domain transfer, reinforcement feedback, and hierarchical structure modeling. Finally, we conclude our analysis and investigation by summarizing the key findings and highlighting their implications for learning with low-resource data.
[446] HiBBO: HiPPO-based Space Consistency for High-dimensional Bayesian Optimisation
Junyu Xuan, Wenlong Chen, Yingzhen Li
Main category: cs.LG
TL;DR: HiBBO is a Bayesian Optimization framework that uses HiPPO-enhanced VAE to reduce functional distribution mismatch between latent and original spaces, improving optimization in high-dimensional problems.
Details
Motivation: Standard VAE-based BO methods suffer from functional distribution mismatch between latent and original spaces due to reconstruction-only objectives, leading to suboptimal optimization performance in high-dimensional settings.
Method: Proposes HiBBO framework that incorporates space consistency into VAE latent space construction using HiPPO (long-term sequence modeling) to align functional distributions between latent and original spaces.
Result: Experiments on high-dimensional benchmark tasks show HiBBO outperforms existing VAE-BO methods in both convergence speed and solution quality.
Conclusion: HiBBO bridges the gap between high-dimensional sequence representation learning and efficient Bayesian Optimization, enabling broader applications in neural architecture search and materials science.
Abstract: Bayesian Optimisation (BO) is a powerful tool for optimising expensive black-box functions, but its effectiveness diminishes in high-dimensional spaces due to sparse data and poor surrogate model scalability. While Variational Autoencoder (VAE) based approaches address this by learning low-dimensional latent representations, the reconstruction-based objective function often brings a functional distribution mismatch between the latent space and the original space, leading to suboptimal optimisation performance. In this paper, we first analyse why a reconstruction-only loss may lead to distribution mismatch and then propose HiBBO, a novel BO framework that introduces space consistency into the latent space construction in VAE using HiPPO, a method for long-term sequence modelling, to reduce the functional distribution mismatch between the latent space and the original space. Experiments on high-dimensional benchmark tasks demonstrate that HiBBO outperforms existing VAE-BO methods in convergence speed and solution quality. Our work bridges the gap between high-dimensional sequence representation learning and efficient Bayesian Optimisation, enabling broader applications in neural architecture search, materials science, and beyond.
[447] Learning Regularizers: Learning Optimizers that can Regularize
Suraj Kumar Sahoo, Narayanan C Krishnan
Main category: cs.LG
TL;DR: Learned Optimizers can internalize regularization effects without explicit application, outperforming unregularized counterparts in generalization and test accuracy.
Details
Motivation: To investigate whether regularization can be learned by LOs rather than explicitly applied, challenging traditional regularization approaches.
Method: Train LOs with and without access to explicit regularizers (SAM, GAM, GSAM) and compare performance on standard benchmarks including MNIST, FMNIST, CIFAR with MLP, MLP-Relu and CNN architectures.
Result: Regularized LOs consistently outperform unregularized counterparts in test accuracy and generalization, and can transfer regularization effects to new tasks.
Conclusion: LOs can inherently learn regularization properties, questioning the necessity of explicit regularization in optimization.
Abstract: Learned Optimizers (LOs), a type of Meta-learning, have gained traction due to their ability to be parameterized and trained for efficient optimization. Traditional gradient-based methods incorporate explicit regularization techniques such as Sharpness-Aware Minimization (SAM), Gradient-norm Aware Minimization (GAM), and Gap-guided Sharpness-Aware Minimization (GSAM) to enhance generalization and convergence. In this work, we explore a fundamental question: Can regularizers be learned? We empirically demonstrate that LOs can be trained to learn and internalize the effects of traditional regularization techniques without explicitly applying them to the objective function. We validate this through extensive experiments on standard benchmarks (including MNIST, FMNIST, CIFAR and neural networks such as MLP, MLP-Relu and CNN), comparing LOs trained with and without access to explicit regularizers. Regularized LOs consistently outperform their unregularized counterparts in terms of test accuracy and generalization. Furthermore, we show that LOs retain and transfer these regularization effects to new optimization tasks by inherently seeking minima similar to those targeted by these regularizers. Our results suggest that LOs can inherently learn regularization properties, challenging the conventional necessity of explicit optimizee loss regularization.
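For reference, one step of SAM, the kind of explicit regularizer the LOs are shown to internalize, can be sketched as follows; this is a simplified version on a toy objective, not the paper's training setup.

```python
# Hedged sketch of one SAM step: perturb weights toward the worst-case
# direction, then descend using the gradient at the perturbed point.
import torch

def sam_step(params, loss_fn, lr=0.01, rho=0.05):
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        eps = [rho * g / norm for g in grads]
        for p, e in zip(params, eps):
            p.add_(e)                         # ascend to the sharp neighbor
    loss2 = loss_fn()
    grads2 = torch.autograd.grad(loss2, params)
    with torch.no_grad():
        for p, e, g in zip(params, eps, grads2):
            p.sub_(e)                         # undo perturbation
            p.sub_(lr * g)                    # descend with sharpness-aware grad
    return loss.item()

w = torch.tensor([2.0], requires_grad=True)
loss_fn = lambda: (w ** 2).sum()              # toy objective
for _ in range(5):
    print(round(sam_step([w], loss_fn), 4))   # loss decreases step by step
```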
[448] Diagnosing and Mitigating System Bias in Self-Rewarding RL
Chuyi Tan, Peiwen Yuan, Xinglin Wang, Yiwei Li, Shaoxiong Feng, Yueqi Zhang, Jiayi Shi, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
Main category: cs.LG
TL;DR: RLER improves reinforcement learning with intrinsic rewards by addressing system bias through ensemble methods, achieving performance close to verifiable reward approaches while enabling stable scaling on unlabeled data.
Details
Motivation: RLIR enables sustainable scaling in unlabeled settings but suffers from performance and stability issues due to system bias where models overestimate high-confidence rollouts, leading to biased reward estimation and unstable training.
Method: Proposed RLER (reinforcement learning with ensembled rewards) which aggregates diverse models and adapts reward interpolation and rollout selection to mitigate system bias characterized by three metrics: noise bias, self-bias, and symmetry bias.
Result: RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples with significant performance gains.
Conclusion: RLER effectively addresses the system bias problem in intrinsic reward reinforcement learning, making it highly applicable for stable scaling in unlabeled settings while approaching the performance of verifiable reward methods.
Abstract: Reinforcement learning with verifiable rewards (RLVR) scales the reasoning ability of large language models (LLMs) but remains bottlenecked by limited labeled samples for continued data scaling. Reinforcement learning with intrinsic rewards (RLIR), where the policy model assigns rewards to its own rollouts, enables sustainable scaling in unlabeled settings, yet its performance and stability lag behind RLVR. We trace this gap to a system bias: the model tends to overestimate its high-confidence rollouts, leading to biased and unstable reward estimation. This bias accumulates as training progresses, with deviations from the oracle drifting toward over-reward, causing unstable training. We characterize this bias using three metrics: $\rho_{\text{noise}}$, $\rho_{\text{selfbias}}$, and $\rho_{\text{symbias}}$. We find that $\rho_{\text{noise}}$ and $\rho_{\text{symbias}}$ impact convergence, while $\rho_{\text{selfbias}}$ amplifies both correct and incorrect updates, leading to instability. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models and adapts reward interpolation and rollout selection. Extensive experiments show that RLER improves by +13.6% over RLIR and is only 3.6% below RLVR, achieving stable scaling on unlabeled samples, making it highly applicable.
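A minimal sketch of the reward-ensembling idea, with an interpolation weight and stand-in scorer outputs that are purely illustrative of how ensemble averaging can damp self-bias.

```python
# Hedged sketch: blend the policy's own reward with an ensemble mean from
# diverse scorer models to pull over-rewarded rollouts back toward consensus.
import numpy as np

def ensembled_reward(self_reward: float,
                     ensemble_rewards: list[float],
                     alpha: float = 0.7) -> float:
    """Interpolate toward the ensemble mean to reduce over-reward drift."""
    return alpha * float(np.mean(ensemble_rewards)) + (1 - alpha) * self_reward

# A rollout the policy over-rates (0.9) but diverse scorers rate ~0.4:
print(ensembled_reward(0.9, [0.35, 0.45, 0.40]))  # pulled down to 0.55
```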
[449] FedL2T: Personalized Federated Learning with Two-Teacher Distillation for Seizure Prediction
Jionghao Lou, Jian Zhang, Zhongmei Li, Lanlan Chen, Enbo Feng
Main category: cs.LG
TL;DR: FedL2T is a personalized federated learning framework for seizure prediction that uses two-teacher knowledge distillation to handle data heterogeneity across patients while preserving privacy.
Details
Motivation: Training deep learning models for seizure prediction requires large EEG datasets, but acquiring labeled EEG data is challenging due to annotation costs and privacy constraints. Existing FL methods struggle with inter-patient variability in real-world scenarios.
Method: Proposes FedL2T with two-teacher knowledge distillation where each client learns from both a global model and a dynamically assigned peer model. Uses adaptive multi-level distillation aligning prediction outputs and feature representations, plus proximal regularization for training stability.
Result: Extensive experiments on two EEG datasets show FedL2T consistently outperforms state-of-the-art FL methods, especially under low-label conditions. Achieves rapid and stable convergence, reducing communication rounds and overhead.
Conclusion: FedL2T demonstrates strong potential as a reliable and personalized solution for seizure prediction in privacy-sensitive healthcare scenarios, effectively handling data heterogeneity while maintaining privacy.
Abstract: The training of deep learning models in seizure prediction requires large amounts of Electroencephalogram (EEG) data. However, acquiring sufficient labeled EEG data is difficult due to annotation costs and privacy constraints. Federated Learning (FL) enables privacy-preserving collaborative training by sharing model updates instead of raw data. However, due to the inherent inter-patient variability in real-world scenarios, existing FL-based seizure prediction methods struggle to achieve robust performance under heterogeneous client settings. To address this challenge, we propose FedL2T, a personalized federated learning framework that leverages a novel two-teacher knowledge distillation strategy to generate superior personalized models for each client. Specifically, each client simultaneously learns from a globally aggregated model and a dynamically assigned peer model, promoting more direct and enriched knowledge exchange. To ensure reliable knowledge transfer, FedL2T employs an adaptive multi-level distillation strategy that aligns both prediction outputs and intermediate feature representations based on task confidence. In addition, a proximal regularization term is introduced to constrain personalized model updates, thereby enhancing training stability. Extensive experiments on two EEG datasets demonstrate that FedL2T consistently outperforms state-of-the-art FL methods, particularly under low-label conditions. Moreover, FedL2T exhibits rapid and stable convergence toward optimal performance, thereby reducing the number of communication rounds and associated overhead. These results underscore the potential of FedL2T as a reliable and personalized solution for seizure prediction in privacy-sensitive healthcare scenarios.
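A hedged sketch of what a two-teacher distillation objective with a proximal term can look like; the loss weights, temperature, and function signature are assumptions for illustration, not the paper's exact adaptive formulation.

```python
# Hedged sketch: cross-entropy plus KL terms to a global teacher and a peer
# teacher, plus a proximal penalty keeping the personalized model near the
# global weights.
import torch
import torch.nn.functional as F

def fedl2t_like_loss(logits, y, global_logits, peer_logits,
                     params, global_params, w_g=0.5, w_p=0.5, mu=0.01, T=2.0):
    ce = F.cross_entropy(logits, y)
    kd_g = F.kl_div(F.log_softmax(logits / T, dim=-1),
                    F.softmax(global_logits / T, dim=-1),
                    reduction="batchmean") * T * T   # global-teacher term
    kd_p = F.kl_div(F.log_softmax(logits / T, dim=-1),
                    F.softmax(peer_logits / T, dim=-1),
                    reduction="batchmean") * T * T   # peer-teacher term
    prox = sum(((p - g.detach()) ** 2).sum()         # proximal regularization
               for p, g in zip(params, global_params))
    return ce + w_g * kd_g + w_p * kd_p + mu * prox
```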
[450] Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search
Kamel Alrashedy, Vriksha Srihari, Zulfiqar Zaidi, Ridam Srivastava, Pradyumna Tambwekar, Matthew Gombolay
Main category: cs.LG
TL;DR: Constraints-of-Thought (Const-o-T) is a framework that uses (intent, constraint) pairs to guide Monte Carlo Tree Search, improving LLM planning by focusing on semantically meaningful paths while satisfying symbolic constraints.
Details
Motivation: LLMs struggle to align multi-step plans with user intent and symbolic constraints in complex domains. Existing methods like CoT and ToT often produce infeasible actions or hallucinated steps.
Method: Proposes Const-o-T framework that represents reasoning steps as (intent, constraint) pairs, integrated with MCTS to prune infeasible branches and guide exploration toward semantically valid actions.
Result: Outperforms baselines across Risk game, CAD code generation, and arithmetic reasoning domains, achieving higher accuracy and stronger structural alignment.
Conclusion: Const-o-T provides a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.
Abstract: While researchers have made significant progress in enabling large language models (LLMs) to perform multi-step planning, LLMs struggle to ensure that those plans align with high-level user intent and satisfy symbolic constraints, especially in complex, multi-step domains. Existing reasoning approaches such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and verifier-augmented methods expand the search space but often yield infeasible actions or hallucinated steps. To overcome these limitations, we propose Constraints-of-Thought (Const-o-T), a framework that provides a structured prior enabling Monte Carlo Tree Search (MCTS) to focus its search on semantically meaningful paths. Each reasoning step is represented as an (intent, constraint) pair, which serves both to compress the search space and enforce validity. Unlike prior methods that merely generate reasoning traces or validate outputs post hoc, Const-o-T uses (intent, constraint) pairs to actively focus the search toward feasible and meaningful plans. We integrate Const-o-T into MCTS using a structured representation of intent-constraint pairs: constraints prune infeasible branches and guide exploration toward semantically valid actions, improving planning efficiency and verifiable decision-making. We demonstrate across three domains (the Risk game, CAD code generation, and arithmetic reasoning) that our approach outperforms baselines, yielding higher accuracy and stronger structural alignment. Our contribution is to demonstrate that Const-o-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning with LLMs.
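A toy sketch of constraint-pruned expansion; the constraint checker and the Risk-style moves are illustrative stand-ins, not the paper's representation.

```python
# Hedged sketch: candidate steps are (intent, constraint) pairs, and a
# branch enters the search tree only if its constraint check passes.
def expand(node_state, candidates, check):
    """Keep only candidates whose constraints hold in the current state."""
    children = []
    for intent, constraint in candidates:
        if check(node_state, constraint):      # prune infeasible branches
            children.append((intent, constraint))
    return children

state = {"troops": 3}
candidates = [("attack Kamchatka", "troops >= 2"),
              ("attack Irkutsk", "troops >= 5")]
check = lambda s, c: eval(c, {}, s)            # toy constraint evaluator
print(expand(state, candidates, check))        # only the feasible move survives
```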
[451] PlatformX: An End-to-End Transferable Platform for Energy-Efficient Neural Architecture Search
Xiaolong Tu, Dawei Chen, Kyungtae Han, Onur Altintas, Haoxin Wang
Main category: cs.LG
TL;DR: PlatformX is an automated HW-NAS framework that overcomes practical limitations of existing methods by integrating energy-driven search space, transferable energy prediction, multi-objective search, and automated energy profiling to efficiently design DNNs for edge devices.
Details
Motivation: Existing HW-NAS methods are impractical for real-world deployment due to high time cost, extensive manual profiling, and poor scalability across diverse hardware platforms with complex energy behavior.
Method: PlatformX integrates four components: energy-driven search space, transferable kernel-level energy predictor across devices, Pareto-based multi-objective search algorithm, and automated high-resolution runtime energy profiling system.
Result: PlatformX significantly reduces search overhead while preserving accuracy and energy fidelity, identifying models with up to 0.94 accuracy or as little as 0.16 mJ per inference, outperforming MobileNet-V2 in both accuracy and efficiency.
Conclusion: PlatformX provides a fully automated and transferable HW-NAS framework that enables efficient design of DNNs for edge devices across multiple mobile platforms.
Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) has emerged as a powerful tool for designing efficient deep neural networks (DNNs) tailored to edge devices. However, existing methods remain largely impractical for real-world deployment due to their high time cost, extensive manual profiling, and poor scalability across diverse hardware platforms with complex, device-specific energy behavior. In this paper, we present PlatformX, a fully automated and transferable HW-NAS framework designed to overcome these limitations. PlatformX integrates four key components: (i) an energy-driven search space that expands conventional NAS design by incorporating energy-critical configurations, enabling exploration of high-efficiency architectures; (ii) a transferable kernel-level energy predictor that generalizes across devices and is incrementally refined with minimal on-device samples; (iii) a Pareto-based multi-objective search algorithm that balances energy and accuracy to identify optimal trade-offs; and (iv) a high-resolution runtime energy profiling system that automates on-device power measurement using external monitors without human intervention. We evaluate PlatformX across multiple mobile platforms, showing that it significantly reduces search overhead while preserving accuracy and energy fidelity. It identifies models with up to 0.94 accuracy or as little as 0.16 mJ per inference, both outperforming MobileNet-V2 in accuracy and efficiency. Code and tutorials are available at github.com/amai-gsu/PlatformX.
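The Pareto filtering step in component (iii) can be sketched in a few lines; candidate names are illustrative, and the accuracy and energy figures simply echo those reported above.

```python
# Hedged sketch: keep architectures for which no other candidate is both
# more accurate and cheaper in energy (higher accuracy, lower mJ are better).
def pareto_front(candidates):
    """candidates: list of (name, accuracy, energy_mj)."""
    front = []
    for n, a, e in candidates:
        dominated = any(a2 >= a and e2 <= e and (a2, e2) != (a, e)
                        for _, a2, e2 in candidates)
        if not dominated:
            front.append((n, a, e))
    return front

archs = [("A", 0.94, 2.0), ("B", 0.90, 0.16), ("C", 0.89, 0.5)]
print(pareto_front(archs))   # C is dominated by B; A and B survive
```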
[452] SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions
Ziyi Wang, Nan Jiang, Guang Lin, Qifan Song
Main category: cs.LG
TL;DR: A unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS) that achieves higher compression rates than existing methods while maintaining comparable performance.
Details
Motivation: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices, but existing methods using pruning or quantization individually result in suboptimal compression rates.
Method: Uses Bayesian variational learning with spike-and-slab prior for sparsity and Gaussian Mixture Models (GMMs) for quantized weights, enabling simultaneous pruning and low-bit quantization.
Result: Achieves higher compression rates than prior baselines while maintaining comparable performance drops across ResNet, BERT-base, Llama3, and Qwen2.5 models.
Conclusion: The proposed SQS framework provides an effective unified approach for neural network compression through simultaneous pruning and quantization, with theoretical consistency guarantees.
Abstract: Compressing large-scale neural networks is essential for deploying models on resource-constrained devices. Most existing methods adopt weight pruning or low-bit quantization individually, often resulting in suboptimal compression rates to preserve acceptable performance drops. We introduce a unified framework for simultaneous pruning and low-bit quantization via Bayesian variational learning (SQS), which achieves higher compression rates than prior baselines while maintaining comparable performance. The key idea is to employ a spike-and-slab prior to induce sparsity and to model quantized weights using Gaussian Mixture Models (GMMs) to enable low-bit precision. In theory, we establish the consistency of our proposed variational approach to a sparse and quantized deep neural network. Extensive experiments on compressing ResNet, BERT-base, Llama3, and Qwen2.5 models show that our method achieves higher compression rates than a line of existing methods with comparable performance drops.
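A post-hoc toy sketch of the two ingredients, a sparsifying mask and a GMM codebook; the paper learns both jointly and variationally, so this only illustrates the representation, not the method.

```python
# Hedged sketch: a spike-and-slab style mask prunes small weights, and a
# small Gaussian mixture snaps survivors to a few codebook values.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
w = rng.normal(0, 0.5, size=1000)               # pretend layer weights

mask = np.abs(w) > 0.3                          # "spike" removes small weights
kept = w[mask].reshape(-1, 1)

gmm = GaussianMixture(n_components=4, random_state=0).fit(kept)  # 2-bit codebook
codebook = gmm.means_.ravel()
quantized = codebook[gmm.predict(kept)]         # snap to nearest component

print(f"sparsity: {1 - mask.mean():.2f}, codebook: {np.round(codebook, 3)}")
```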
[453] LLM Unlearning on Noisy Forget Sets: A Study of Incomplete, Rewritten, and Watermarked Data
Changsheng Wang, Yihua Zhang, Dennis Wei, Jinghan Jia, Pin-Yu Chen, Sijia Liu
Main category: cs.LG
TL;DR: LLM unlearning remains robust to noisy forget data when core semantic signals are preserved, as algorithms rely on deep semantic cues rather than surface patterns.
Details
Motivation: Address ethical and security concerns in LLMs by studying unlearning reliability when forget data is low-quality, synthetically rewritten, or watermarked rather than clean and well-defined.
Method: Systematically benchmark state-of-the-art LLM unlearning methods (RMU and NPO) on noisy forget sets with perturbations while preserving semantic signals.
Result: Unlearning shows surprising robustness to perturbations as long as core semantic components remain intact, with key semantic drivers of forgetting maintaining influence despite surface variations.
Conclusion: LLM unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns, making them resilient to noisy forget data when semantic signals are preserved.
Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but raise ethical and security concerns by memorizing sensitive data, reinforcing biases, and producing harmful content. These risks have spurred interest in LLM unlearning, the task of removing knowledge associated with undesirable data from pre-trained models. However, most existing methods assume access to clean, well-defined forget data samples, whereas real-world forget data could often be low-quality, synthetically rewritten, or watermarked, casting doubt on the reliability of unlearning. This work presents the first study of unlearning under perturbed or low-fidelity forget data, referred to as noisy forget sets. By systematically benchmarking state-of-the-art LLM unlearning methods, RMU and NPO, on such noisy forget sets, we find that unlearning remains surprisingly robust to perturbations, provided that core semantic signals are preserved. To explain this robustness, we propose a saliency-based interpretation: key semantic components that drive forgetting remain consistently influential despite substantial variation in surface form. This suggests that unlearning algorithms are primarily guided by deep semantic cues rather than shallow lexical patterns.
[454] Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers
Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li
Main category: cs.LG
TL;DR: Proposes Value-State Gated Attention (VGA) to address extreme-token phenomena like attention sinks and value-state drains in Transformers by introducing a learnable gate computed from value vectors to modulate attention output.
Details
Motivation: Transformer models suffer from extreme-token phenomena that degrade performance, quantization fidelity, and interpretability due to a problematic mutual reinforcement mechanism where models learn inefficient 'no-op' behavior.
Method: Introduces VGA with a learnable, data-dependent gate computed directly from value vectors (V) to modulate attention output, breaking the cycle of mutual reinforcement between attention scores and value states.
Result: VGA significantly mitigates attention sinks, stabilizes value-state norms, improves performance, enhances quantization fidelity, and provides better model interpretability.
Conclusion: VGA provides an effective architectural solution to extreme-token phenomena by directly regulating value-state contributions, offering improvements across multiple metrics without complex modifications.
Abstract: Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient ’no-op’ behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing ’no-op’ attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token’s contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
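One plausible reading of the mechanism in a single-head toy sketch: a sigmoid gate computed from the value states scales each token's value contribution before aggregation, giving the model a cheap 'no-op' path. The shapes and exact gate placement are simplifying assumptions, not the paper's precise formulation.

```python
# Hedged sketch: value-state gating in a single attention head.
import torch

def value_gated_attention(Q, K, V, Wg):
    """Q, K, V: (seq, d); Wg: (d, d) gate projection over value states."""
    scores = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    gate = torch.sigmoid(V @ Wg)          # data-dependent gate from V itself
    return scores @ (gate * V)            # near-zero gate suppresses a token

d, seq = 8, 5
Q, K, V = (torch.randn(seq, d) for _ in range(3))
out = value_gated_attention(Q, K, V, torch.randn(d, d))
print(out.shape)                          # torch.Size([5, 8])
```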
[455] Slim Scheduler: A Runtime-Aware RL and Scheduler System for Efficient CNN Inference
Ian Harshbarger, Calvin Chidambaram
Main category: cs.LG
TL;DR: Slim Scheduler is a hybrid framework combining PPO reinforcement learning with greedy schedulers to dynamically coordinate distributed inference for slimmable neural networks, achieving significant latency and energy reductions while maintaining accuracy trade-offs.
Details
Motivation: Most neural network scheduling research focuses on static, end-to-end models and overlooks dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions.
Method: Hierarchical design with PPO reinforcement learning policy for global routing (device selection, width ratio, batch configuration) and local greedy schedulers on each server for batching compatible requests and managing instance scaling based on VRAM/utilization constraints.
Result: 96.45% reduction in mean latency and 97.31% reduction in energy usage when dropping to the slimmest model (70.3% accuracy); overall reduction in average latency plus energy consumption with an accuracy increase, though with higher standard deviations.
Conclusion: The hybrid approach reduces search space complexity, mitigates hardware overfitting, and balances efficiency and throughput for dynamic neural network inference scheduling.
Abstract: Most neural network scheduling research focuses on optimizing static, end-to-end models of fixed width, overlooking dynamic approaches that adapt to heterogeneous hardware and fluctuating runtime conditions. We present Slim Scheduler, a hybrid scheduling framework that integrates a Proximal Policy Optimization (PPO) reinforcement learning policy with algorithmic, greedy schedulers to coordinate distributed inference for slimmable models. Each server runs a local greedy scheduler that batches compatible requests and manages instance scaling based on VRAM and utilization constraints, while the PPO router learns global routing policies for device selection, width ratio, and batch configuration. This hierarchical design reduces search space complexity, mitigates overfitting to specific hardware, and balances efficiency and throughput. Compared to a purely randomized task distribution baseline, Slim Scheduler can achieve various accuracy and latency trade-offs, such as a 96.45% reduction in mean latency and a 97.31% reduction in energy usage when dropping to the slimmest available model (70.3% accuracy). It can also achieve an overall reduction in average latency plus energy consumption together with an increase in accuracy, at the cost of higher standard deviations in latency and energy, which affects overall task throughput.
[456] MagicDock: Toward Docking-oriented De Novo Ligand Design via Gradient Inversion
Zekai Chen, Xunkai Li, Sirui Zhang, Henan Sun, Jia Li, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Main category: cs.LG
TL;DR: MagicDock is a framework for de novo ligand design that addresses limitations in existing methods through gradient inversion and differentiable surface modeling, achieving significant improvements over state-of-the-art baselines.
Details
Motivation: Existing de novo ligand design methods suffer from Pseudo De Novo approaches, limited docking modeling, and inflexible ligand types, which MagicDock aims to overcome.
Method: Uses a gradient inversion framework that incorporates general docking knowledge, then instantiates it through reverse gradient flows. Employs differentiable surface modeling with learnable 3D point-cloud representations and provides flexible triggers for different ligand types.
Result: Achieves average improvements of 27.1% and 11.7% over SOTA baselines for protein and molecule ligand design respectively across 9 scenarios.
Conclusion: MagicDock provides a comprehensive solution for de novo ligand design with theoretical guarantees and demonstrates superior performance across various scenarios.
Abstract: De novo ligand design is a fundamental task that seeks to generate protein or molecule candidates that can effectively dock with protein receptors and achieve strong binding affinity entirely from scratch. It holds paramount significance for a wide spectrum of biomedical applications. However, most existing studies are constrained by pseudo de novo generation, limited docking modeling, and inflexible ligand types. To address these issues, we propose MagicDock, a forward-looking framework grounded in the progressive pipeline and differentiable surface modeling. (1) We adopt a well-designed gradient inversion framework. To begin with, general docking knowledge of receptors and ligands is incorporated into the backbone model. Subsequently, the docking knowledge is instantiated as reverse gradient flows by binding prediction, which iteratively guide the de novo generation of ligands. (2) We emphasize differentiable surface modeling in the docking process, leveraging learnable 3D point-cloud representations to precisely capture binding details, thereby ensuring that the generated ligands preserve docking validity through direct and interpretable spatial fingerprints. (3) We introduce customized designs for different ligand types and integrate them into a unified gradient inversion framework with flexible triggers, thereby ensuring broad applicability. Moreover, we provide rigorous theoretical guarantees for each component of MagicDock. Extensive experiments across 9 scenarios demonstrate that MagicDock achieves average improvements of 27.1% and 11.7% over SOTA baselines specialized for protein or molecule ligand design, respectively.
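The gradient-inversion loop can be sketched generically: freeze a differentiable binding-score predictor and optimize a ligand representation by ascending the predicted score. The MLP scorer and flat ligand vector below are toy stand-ins for the paper's surface-based models.

```python
# Hedged sketch: reverse gradient flows from a frozen binding predictor
# into a learnable ligand representation.
import torch

predictor = torch.nn.Sequential(              # frozen stand-in docking scorer
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
for p in predictor.parameters():
    p.requires_grad_(False)

ligand = torch.randn(16, requires_grad=True)  # learnable ligand representation
opt = torch.optim.Adam([ligand], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = -predictor(ligand)                 # maximize predicted binding
    loss.backward()                           # gradients flow into the ligand
    opt.step()

print("final predicted score:", predictor(ligand).item())
```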
[457] The Environmental Impacts of Machine Learning Training Keep Rising Evidencing Rebound Effect
Clément Morand, Anne-Laure Ligozat, Aurélie Névéol
Main category: cs.LG
TL;DR: The paper analyzes environmental impacts of ML training over the last decade, finding that despite optimization strategies, energy consumption and environmental impacts have increased exponentially due to rebound effects.
Details
Motivation: To investigate whether current optimization strategies can lead to sustainable ML model training given escalating computational demands and environmental concerns.
Method: Estimate environmental impacts of training notable AI systems over the last decade, focusing on the life cycle of graphics cards and considering reduction strategies like location shifting.
Result: Two critical trends: graphics card production impacts have increased steadily, and training energy consumption/environmental impacts have increased exponentially despite optimization strategies, showing rebound effects.
Conclusion: Increasing efficiency alone cannot ensure ML sustainability; mitigating environmental impact requires reducing AI activities and questioning the scale/frequency of resource-intensive training, while considering hardware life cycle impacts beyond just use phase.
Abstract: Recent Machine Learning (ML) approaches have shown increased performance on benchmarks but at the cost of escalating computational demands. Hardware, algorithmic and carbon optimizations have been proposed to curb energy consumption and environmental impacts. Can these strategies lead to sustainable ML model training? Here, we estimate the environmental impacts associated with training notable AI systems over the last decade, including Large Language Models, with a focus on the life cycle of graphics cards. Our analysis reveals two critical trends: First, the impacts of graphics cards production have increased steadily over this period; Second, energy consumption and environmental impacts associated with training ML models have increased exponentially, even when considering reduction strategies such as location shifting to places with less carbon intensive electricity mixes. Optimization strategies do not mitigate the impacts induced by model training, evidencing rebound effect. We show that the impacts of hardware must be considered over the entire life cycle rather than the sole use phase in order to avoid impact shifting. Our study demonstrates that increasing efficiency alone cannot ensure sustainability in ML. Mitigating the environmental impact of AI also requires reducing AI activities and questioning the scale and frequency of resource-intensive training.
[458] The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections
Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr
Main category: cs.LG
TL;DR: Current language model defense evaluations are flawed because they use static attack sets or weak optimization methods. The paper shows that adaptive attackers with strong optimization techniques can bypass 12 recent defenses with over 90% success rates, even though most originally reported near-zero attack success.
Details
Motivation: To demonstrate that current evaluation methods for language model defenses are inadequate because they don't test against adaptive attackers who can modify their strategies to counter specific defenses.
Method: Systematically applied and scaled general optimization techniques including gradient descent, reinforcement learning, random search, and human-guided exploration to test against 12 recent defenses.
Result: Successfully bypassed all 12 defenses with attack success rates above 90% for most defenses, even though the majority originally reported near-zero attack success rates.
Conclusion: Future defense work must consider stronger adaptive attacks to make reliable claims of robustness, as current evaluation methods are insufficient.
Abstract: How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense’s design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rates above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
[459] Convergence of optimizers implies eigenvalues filtering at equilibrium
Jerome Bolte, Quoc-Tung Le, Edouard Pauwels
Main category: cs.LG
TL;DR: The paper argues that different optimizers act as eigenvalue filters, with gradient descent avoiding sharp minima and SAM favoring wider basins. The authors propose two new algorithms with enhanced eigenvalue filtering and provide theoretical analysis without requiring non-degeneracy conditions.
Details
Motivation: Empirical evidence shows various optimizers find nearly global optima, but the paper takes a reversed perspective - assuming convergence to analyze the consequences and understand how optimizers act as eigenvalue filters.
Method: Theoretical analysis using generalized Hadamard-Perron stable manifold theorem for general semialgebraic C² functions, without non-degeneracy conditions or global Lipschitz bounds. Proposed two novel algorithms with enhanced eigenvalue filtering. A one-dimensional numeric illustration of the filtering effect follows the abstract below.
Result: Different optimizers act as eigenvalue filters determined by hyperparameters: standard gradient descent avoids sharpest minima, while SAM favors wider basins. The proposed algorithms effectively promote wider minima.
Conclusion: Optimizers can be understood as eigenvalue filters, with implications for finding wider minima. The theoretical framework applies broadly without restrictive assumptions, and numerical experiments support the conclusions.
Abstract: Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard–Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
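The eigenvalue-filter claim has a textbook one-dimensional illustration: gradient descent with step size η on a quadratic of curvature λ iterates x ← (1 − ηλ)x, which contracts iff |1 − ηλ| < 1, i.e. λ < 2/η, so minima sharper than 2/η repel the iterates. A minimal numeric check of that threshold (not the paper's semialgebraic analysis):

```python
import numpy as np

def gd_converges(curvature: float, lr: float, steps: int = 200) -> bool:
    """Run GD on f(x) = 0.5 * curvature * x**2 from x0 = 1.0 and report
    whether the iterate ends up near the minimum at 0."""
    x = 1.0
    for _ in range(steps):
        x -= lr * curvature * x  # gradient of f is curvature * x
    return abs(x) < 1e-6

lr = 0.1  # the filter threshold is 2 / lr = 20
for lam in [1.0, 10.0, 19.0, 21.0, 50.0]:
    print(f"curvature {lam:5.1f}: converges = {gd_converges(lam, lr)}")
# Minima with curvature above 2/lr are filtered out: GD diverges from them.
```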
[460] Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach
Junchao Fan, Xiaolin Chang
Main category: cs.LG
TL;DR: IGCARL is a novel robust autonomous driving approach that uses strategic targeted adversaries and constrained reinforcement learning to improve resilience against adversarial attacks.
Details
Motivation: Current robust DRL methods for autonomous driving suffer from myopic adversarial attacks, inability to cause safety-critical events, and learning instability due to lack of robust constraints.
Method: IGCARL consists of a strategic targeted adversary that executes coordinated multi-step attacks with general-sum objectives, and a robust driving agent optimized under constrained formulation to ensure stable learning.
Result: IGCARL improves success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks.
Conclusion: IGCARL enhances the safety and reliability of DRL-based autonomous driving by effectively addressing key vulnerabilities in existing robust methods.
Abstract: Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability to adversarial attacks remains a critical barrier to real-world deployment. Although existing robust methods have achieved success, they still suffer from three key issues: (i) these methods are trained against myopic adversarial attacks, limiting their abilities to respond to more strategic threats, (ii) they have trouble causing truly safety-critical events (e.g., collisions), but instead often result in minor consequences, and (iii) these methods can introduce learning instability and policy drift during training due to the lack of robust constraints. To address these issues, we propose Intelligent General-sum Constrained Adversarial Reinforcement Learning (IGCARL), a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent. The strategic targeted adversary is designed to leverage the temporal decision-making capabilities of DRL to execute strategically coordinated multi-step attacks. In addition, it explicitly focuses on inducing safety-critical events by adopting a general-sum objective. The robust driving agent learns by interacting with the adversary to develop a robust autonomous driving policy against adversarial attacks. To ensure stable learning in adversarial environments and to mitigate policy drift caused by attacks, the agent is optimized under a constrained formulation. Extensive experiments show that IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks and enhancing the safety and reliability of DRL-based autonomous driving.
[461] Spatio-Temporal Graph Convolutional Networks for EV Charging Demand Forecasting Using Real-World Multi-Modal Data Integration
Jose Tupayachi, Mustafa C. Camur, Kevin Heaslip, Xueping Li
Main category: cs.LG
TL;DR: TW-GCN framework combines Graph Convolutional Networks with temporal models to predict EV charging demand using traffic, weather, and infrastructure data, achieving best performance with 3-hour forecasts and 1DCNN temporal model.
Details
Motivation: Transition to electric vehicles faces challenges from uneven charging infrastructure distribution and utilization, which affects power grid stability and investment planning.
Method: TW-GCN framework combining Graph Convolutional Networks with temporal architectures, using real-world traffic flows, weather conditions, and proprietary EV infrastructure data to capture spatial dependencies and temporal dynamics.
Result: Mid-horizon (3-hour) forecasts achieve best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional disparities in predictive accuracy reflect station density, population, and local demand variability.
Conclusion: TW-GCN framework advances integration of data-driven intelligence into EV infrastructure planning, supporting sustainable mobility transitions and resilient grid management.
Abstract: Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure companies in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.
[462] Improving Anomaly Detection in Industrial Time Series: The Role of Segmentation and Heterogeneous Ensemble
Emilio Mastriani, Alessandro Costa, Federico Incardona, Kevin Munari, Sebastiano Spinello
Main category: cs.LG
TL;DR: Integration of segmentation techniques with heterogeneous ensembles significantly improves anomaly detection in industrial production, boosting AUC-ROC from 0.8599 to 0.9760 by reducing temporal uncertainty.
Details
Motivation: To enhance anomaly detection in industrial production by addressing temporal uncertainty and improving detection of transitions between normal and anomalous conditions.
Method: Applied segmentation as pre-processing step using Change Point Detection (CPD) algorithms like ChangeFinder, combined with heterogeneous ensemble methods including Random Forest and XGBoost. A toy segment-then-ensemble sketch follows the abstract below.
Result: Significant improvement in AUC-ROC metric from 0.8599 (PCA and LSTM ensemble) to 0.9760 (Random Forest and XGBoost ensemble) in industrial case study.
Conclusion: Segmentation effectively reduces temporal ambiguity and facilitates supervised learning, providing substantial benefits for anomaly detection in industrial contexts.
Abstract: Concerning machine learning, segmentation models can identify state changes within time series, facilitating the detection of transitions between normal and anomalous conditions. Specific techniques such as Change Point Detection (CPD), particularly algorithms like ChangeFinder, have been successfully applied to segment time series and improve anomaly detection by reducing temporal uncertainty, especially in multivariate environments. In this work, we explored how the integration of segmentation techniques, combined with a heterogeneous ensemble, can enhance anomaly detection in an industrial production context. The results show that applying segmentation as a pre-processing step before selecting heterogeneous ensemble algorithms provided a significant advantage in our case study, improving the AUC-ROC metric from 0.8599 (achieved with a PCA and LSTM ensemble) to 0.9760 (achieved with Random Forest and XGBoost). This improvement is attributable to the ability of segmentation to reduce temporal ambiguity and facilitate the learning process of supervised algorithms. In our future work, we intend to assess the benefit of introducing weighted features derived from the study of change points, combined with segmentation and the use of heterogeneous ensembles, to further optimize model performance in early anomaly detection.
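A minimal sketch of the segment-then-ensemble pipeline, under loose assumptions: a crude mean-shift detector stands in for ChangeFinder, scikit-learn's RandomForestClassifier and GradientBoostingClassifier stand in for the paper's Random Forest and XGBoost pair, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def naive_change_points(x: np.ndarray, win: int = 50, thresh: float = 1.0):
    """Crude CPD stand-in for ChangeFinder: flag points where the means of
    adjacent windows differ by more than `thresh` pooled standard deviations."""
    cps = [0]
    for t in range(win, len(x) - win, win):
        left, right = x[t - win:t], x[t:t + win]
        pooled = np.sqrt((left.var() + right.var()) / 2) + 1e-9
        if abs(left.mean() - right.mean()) / pooled > thresh:
            cps.append(t)
    cps.append(len(x))
    return cps

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])
y = (np.arange(600) >= 300).astype(int)            # 1 = anomalous regime

# Segment, then describe each segment with summary features.
cps = naive_change_points(x)
feats = np.array([[x[a:b].mean(), x[a:b].std(), b - a]
                  for a, b in zip(cps[:-1], cps[1:])])
labels = np.array([int(y[a:b].mean() > 0.5) for a, b in zip(cps[:-1], cps[1:])])

# Heterogeneous ensemble: average the probabilities of two model families.
rf = RandomForestClassifier(random_state=0).fit(feats, labels)
gb = GradientBoostingClassifier(random_state=0).fit(feats, labels)
proba = (rf.predict_proba(feats)[:, 1] + gb.predict_proba(feats)[:, 1]) / 2
print("per-segment anomaly probability:", proba.round(2))
```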
[463] Neural Codecs as Biosignal Tokenizers
Kleanthis Avramidis, Tiantian Feng, Woojae Jeong, Jihwan Lee, Wenhui Cui, Richard M Leahy, Shrikanth Narayanan
Main category: cs.LG
TL;DR: BioCodec is a representation learning framework inspired by neural codecs that tokenizes biosignals like EEG into discrete tokens, enabling effective decoding across multiple downstream tasks in low-resource settings.
Details
Motivation: EEG and other biosignals produce high-dimensional, noisy time-series data that require extensive pre-processing and handcrafted feature extraction. There's a need for more efficient representation learning methods to decode meaningful information from these signals.
Method: BioCodec uses a neural codec-inspired framework to capture low-level signal characteristics as discrete tokens. It’s pre-trained on thousands of EEG hours and can handle various biosignal data including EEG and EMG. A minimal vector-quantization sketch follows the abstract below.
Result: BioCodec shows efficacy across multiple downstream tasks including clinical diagnostics, sleep physiology, speech decoding, and motor imagery. It performs competitively with state-of-the-art models, especially in low-resource settings.
Conclusion: BioCodec provides a versatile solution for biosignal tokenization that effectively captures signal characteristics and enables competitive performance across diverse applications, with demonstrated suitability for multiple biosignal types.
Abstract: Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.
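Codec-style discretization reduces, at its core, to vector quantization: each signal frame is replaced by the index of its nearest codebook vector. A minimal sketch with a random (untrained) codebook; BioCodec's learned encoder, codebook training, and pre-training are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize(signal: np.ndarray, codebook: np.ndarray, frame: int = 16):
    """Chop a 1-D signal into frames and map each frame to the index of
    its nearest codebook vector (a discrete 'token')."""
    n = len(signal) // frame
    frames = signal[:n * frame].reshape(n, frame)
    # squared Euclidean distance of each frame to each codeword
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

eeg = rng.standard_normal(1024)              # stand-in for one EEG channel
codebook = rng.standard_normal((256, 16))    # 256 codewords of width 16
tokens = tokenize(eeg, codebook)
print(tokens[:12])  # discrete token ids usable by downstream models
```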
[464] AdaPM: a Partial Momentum Algorithm for LLM Training
Yimu Zhang, Yuanshi Liu, Cong Fang
Main category: cs.LG
TL;DR: AdaPM is an adaptive training strategy that uses partial momentum with bias correction to reduce memory usage by over 90% while maintaining optimization performance for large language models.
Details
Motivation: Momentum optimization is widely used in training large language models but presents significant memory challenges due to storage requirements.
Method: Proposes AdaPM with non-uniform momentum design that applies partial momentum to most blocks, enhanced by bias correction technique to mitigate performance loss. A toy partial-momentum optimizer follows the abstract below.
Result: Reduces memory by over 90% in momentum and up to 95% in optimizer states, saving over 30% GPU hours for pretraining GPT-2 1.5B while maintaining performance across models from 60M to 1.5B parameters.
Conclusion: AdaPM provides an effective memory-efficient optimizer that maintains training efficiency and performance while significantly reducing memory footprint for large language model training.
Abstract: In the training of large language models, momentum is widely used and often demonstrated to achieve significant acceleration. However, storing momentum typically presents memory challenges. In this paper, we propose AdaPM, an adaptive training strategy that leverages partial momentum to implement a memory-efficient optimizer. To this end, AdaPM utilizes a non-uniform momentum design: for most blocks, full momentum is not necessary to preserve the performance of the optimization. In the momentum design of AdaPM, to mitigate the bias and performance loss caused by partial momentum, we enhance the partial momentum by a bias correction technique. Empirically, we verify that our approach reduces memory by over $90\%$ in momentum while maintaining both efficiency and performance for pretraining various language models ranging from 60M to 1.5B, as well as for supervised fine-tuning and RLHF. AdaPM can further reduce memory by up to $95\%$ in optimizer states by combining the memory-efficient technique on the second-order statistic, saving over $30\%$ GPU hours for pretraining GPT-2 1.5B.
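A minimal sketch of the partial-momentum idea: keep momentum buffers only for a chosen subset of parameter blocks and bias-correct the momentum path. The block selection and the Adam-style 1/(1 − βᵗ) correction below are illustrative choices, not the paper's exact design.

```python
import numpy as np

class PartialMomentumSGD:
    """Toy optimizer: momentum buffers exist only for parameters named in
    `momentum_blocks`; all other blocks take plain, memory-free SGD steps.
    The bias correction mirrors Adam's 1/(1 - beta**t) rescaling (our
    choice here; AdaPM's correction may differ)."""
    def __init__(self, params, momentum_blocks, lr=0.1, beta=0.9):
        self.params, self.lr, self.beta, self.t = params, lr, beta, 0
        self.buf = {k: np.zeros_like(v) for k, v in params.items()
                    if k in momentum_blocks}

    def step(self, grads):
        self.t += 1
        for k, g in grads.items():
            if k in self.buf:  # momentum path, bias-corrected
                self.buf[k] = self.beta * self.buf[k] + (1 - self.beta) * g
                update = self.buf[k] / (1 - self.beta ** self.t)
            else:              # memory-free path
                update = g
            self.params[k] -= self.lr * update

params = {"attn.w": np.ones(4), "mlp.w": np.ones(4)}
opt = PartialMomentumSGD(params, momentum_blocks={"attn.w"})
for _ in range(3):
    opt.step({k: 2.0 * v for k, v in params.items()})  # grads of ||w||^2
print(params)
```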
[465] MemLoss: Enhancing Adversarial Training with Recycling Adversarial Examples
Soroush Mahdi, Maryam Amirmazlaghani, Saeed Saravani, Zahra Dehghanian
Main category: cs.LG
TL;DR: MemLoss improves adversarial training by reusing previously generated adversarial examples across epochs to enhance both model robustness and natural accuracy.
Details
Motivation: To address the trade-off between natural accuracy and adversarial robustness in existing adversarial training methods, aiming for balanced improvements in both aspects.
Method: Leverages ‘Memory Adversarial Examples’ - previously generated adversarial examples that are reused across training epochs to enhance model training. A toy replay-buffer training loop follows the abstract below.
Result: Achieves better accuracy than existing adversarial training methods on datasets like CIFAR-10 while maintaining strong robustness against attacks.
Conclusion: MemLoss provides an effective approach for balanced improvement in both natural accuracy and adversarial robustness without compromising clean data performance.
Abstract: In this paper, we propose a new approach called MemLoss to improve the adversarial training of machine learning models. MemLoss leverages previously generated adversarial examples, referred to as ‘Memory Adversarial Examples,’ to enhance model robustness and accuracy without compromising performance on clean data. By using these examples across training epochs, MemLoss provides a balanced improvement in both natural accuracy and adversarial robustness. Experimental results on multiple datasets, including CIFAR-10, demonstrate that our method achieves better accuracy compared to existing adversarial training methods while maintaining strong robustness against attacks.
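A minimal sketch of recycling adversarial examples: a buffer of past epochs' one-step (FGSM-style) adversarial examples is replayed alongside the clean and freshly generated adversarial batches, on a toy logistic-regression model. The buffer size and equal loss weighting are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(2)                                   # linear model, logistic loss

def grad(w, X, y):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def fgsm(w, X, y, eps=0.3):
    """One-step adversarial examples: move inputs along the sign of the
    input-gradient of the per-sample loss, (p - y) * w."""
    p = 1 / (1 + np.exp(-X @ w))
    gx = (p - y)[:, None] * w[None, :]
    return X + eps * np.sign(gx)

X = rng.normal(size=(64, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
memory = []                                       # recycled adversarial examples
for epoch in range(5):
    X_adv = fgsm(w, X, y)
    batches = [(X, y), (X_adv, y)] + memory       # clean + fresh + memory terms
    g = sum(grad(w, Xb, yb) for Xb, yb in batches) / len(batches)
    w -= 0.5 * g
    memory = (memory + [(X_adv, y)])[-3:]         # keep last 3 epochs' examples
print("trained weights:", w)
```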
[466] On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning
Zhi Yang, Changwu Huang, Ke Tang, Xin Yao
Main category: cs.LG
TL;DR: This paper addresses fairness in privacy protection across groups by introducing a novel membership inference game to efficiently audit worst-case privacy risks and enhancing DP-SGD with adaptive group-specific gradient clipping to reduce privacy risk disparities.
Details
Motivation: Current methods for assessing group privacy risks are based on average-case privacy risks, which may underestimate disparities across groups. Existing worst-case assessment methods are time-consuming and impractical.
Method: 1) A novel membership inference game for efficient auditing of approximate worst-case privacy risks; 2) Enhanced DP-SGD algorithm with adaptive group-specific gradient clipping strategy inspired by differential privacy auditing canaries. A toy group-specific clipping sketch follows the abstract below.
Result: Experimental results show the method provides more stringent measurement of group privacy risks and reliable assessment of disparities. The enhanced algorithm effectively reduces disparity in group privacy risks.
Conclusion: The proposed approach enhances fairness of privacy protection in differentially private machine learning by providing better assessment of privacy risk disparities and reducing them through adaptive group-specific protection mechanisms.
Abstract: While significant progress has been made in conventional fairness-aware machine learning (ML) and differentially private ML (DPML), the fairness of privacy protection across groups remains underexplored. Existing studies have proposed methods to assess group privacy risks, but these are based on the average-case privacy risks of data records. Such approaches may underestimate the group privacy risks, thereby potentially underestimating the disparity across group privacy risks. Moreover, the current method for assessing the worst-case privacy risks of data records is time-consuming, limiting their practical applicability. To address these limitations, we introduce a novel membership inference game that can efficiently audit the approximate worst-case privacy risks of data records. Experimental results demonstrate that our method provides a more stringent measurement of group privacy risks, yielding a reliable assessment of the disparity in group privacy risks. Furthermore, to promote privacy protection fairness in DPML, we enhance the standard DP-SGD algorithm with an adaptive group-specific gradient clipping strategy, inspired by the design of canaries in differential privacy auditing studies. Extensive experiments confirm that our algorithm effectively reduces the disparity in group privacy risks, thereby enhancing the fairness of privacy protection in DPML.
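A minimal sketch of group-specific gradient clipping in a DP-SGD-style aggregation: each per-sample gradient is clipped with its group's norm before noisy averaging. How the paper adapts the per-group norms from auditing canaries is not reproduced, and the noise calibration below is a deliberately conservative placeholder rather than a privacy-accounted choice.

```python
import numpy as np

def dp_sgd_group_clip(per_sample_grads, groups, clip_norms,
                      noise_mult=1.0, seed=0):
    """DP-SGD-style aggregation with a different clipping norm per group.
    `clip_norms` maps group id -> clipping threshold; adapting these
    thresholds (the paper's contribution) is not modelled here."""
    rng = np.random.default_rng(seed)
    clipped = []
    for g, grp in zip(per_sample_grads, groups):
        c = clip_norms[grp]
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, c / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    sigma = noise_mult * max(clip_norms.values())  # conservative placeholder
    return (total + rng.normal(0, sigma, total.shape)) / len(clipped)

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4]), np.array([6.0, 8.0])]
groups = ["A", "B", "A"]
print(dp_sgd_group_clip(grads, groups, clip_norms={"A": 1.0, "B": 2.0}))
```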
[467] Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback
Orin Levy, Liad Erez, Alon Cohen, Yishay Mansour
Main category: cs.LG
TL;DR: Regret minimization algorithms for contextual multi-armed bandits with adversarial delayed feedback, achieving optimal bounds for finite policy classes and general function approximation settings.
Details
Motivation: Address the challenge of delayed feedback in contextual bandits where loss observations arrive with adversarial delays, which is common in real-world applications like online advertising and recommendation systems.
Method: Two approaches: (1) for finite policy classes, a method assuming direct access to the policy class; (2) for general function approximation, an online least-square regression oracle with FIFO order and a Hedge-based version of Vovk’s aggregating forecaster.
Result: Achieved expected regret bounds: O(√(KT log|Π|) + √(D log|Π|)) for finite policy classes, and O(√(KT R_T(O)) + √(d_max D β)) for general function approximation, with β ≤ log|F| for finite function classes.
Conclusion: The algorithms provide optimal regret bounds for delayed feedback CMAB problems, with the general function approximation approach achieving bounds close to the lower bound, differing by only a √d_max factor.
Abstract: We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT \mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle’s regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk’s aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
[468] Score-Based Density Estimation from Pairwise Comparisons
Petrus Mikkola, Luigi Acerbi, Arto Klami
Main category: cs.LG
TL;DR: This paper proposes a method to estimate probability densities from pairwise comparisons by relating the target density to a tempered winner density, using score-matching to learn the winner’s score and then ‘de-tempering’ it to recover the original density.
Details
Motivation: The research is motivated by expert knowledge elicitation and learning from human feedback, where direct density estimation is challenging but pairwise comparisons are more accessible.
Method: The method relates the unobserved target density to a tempered winner density, learns the winner’s score via score-matching, and then estimates the target by ‘de-tempering’ the estimated winner density’s score. It uses analytical formulas for the tempering field and proposes an estimator under the Bradley-Terry model, employing diffusion models trained on tempered samples generated via score-scaled annealed Langevin dynamics.
Result: The approach can learn complex multivariate belief densities of simulated experts using only hundreds to thousands of pairwise comparisons, demonstrating effective density estimation from limited comparison data.
Conclusion: The paper presents a novel framework for density estimation from pairwise comparisons that bridges the gap between the target density and observable winner density through tempering and de-tempering operations, enabling efficient learning of complex distributions from human feedback.
Abstract: We study density estimation from pairwise comparisons, motivated by expert knowledge elicitation and learning from human feedback. We relate the unobserved target density to a tempered winner density (marginal density of preferred choices), learning the winner’s score via score-matching. This allows estimating the target by ‘de-tempering’ the estimated winner density’s score. We prove that the score vectors of the belief and the winner density are collinear, linked by a position-dependent tempering field. We give analytical formulas for this field and propose an estimator for it under the Bradley-Terry model. Using a diffusion model trained on tempered samples generated via score-scaled annealed Langevin dynamics, we can learn complex multivariate belief densities of simulated experts, from only hundreds to thousands of pairwise comparisons.
[469] Cross-Representation Benchmarking in Time-Series Electronic Health Records for Clinical Outcome Prediction
Tianyi Chen, Mingcheng Zhu, Zhiyao Luo, Tingting Zhu
Main category: cs.LG
TL;DR: Systematic benchmark comparing EHR representation methods shows event stream models perform best, with CLMBR being sample-efficient in few-shot settings and feature selection strategies needing adaptation to clinical context.
Details
Motivation: Unclear optimal methods for representing patient data in EHRs due to inconsistent evaluation practices, necessitating standardized comparison of different representation paradigms.
Method: Standardized benchmark comparing three EHR representation paradigms: multivariate time-series, event streams, and textual event streams for LLMs, evaluated across MIMIC-IV (ICU tasks) and EHRSHOT (longitudinal care) datasets with appropriate modeling families.
Result: Event stream models consistently deliver strongest performance; CLMBR pre-trained models are highly sample-efficient in few-shot settings; feature selection must be adapted to clinical setting - pruning sparse features helps ICU predictions but retaining them is critical for longitudinal tasks.
Conclusion: Unified and reproducible pipeline provides practical guidance for selecting EHR representations based on clinical context and data regime, with event stream models generally performing best.
Abstract: Electronic Health Records (EHRs) enable deep learning for clinical predictions, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark to compare EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. This benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer). For each paradigm, we evaluate appropriate modelling families (Transformers, MLP, LSTMs and Retain for time-series; CLMBR and count-based models for event streams; 8-20B LLMs for textual streams) and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models like CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.
[470] Logits Replay + MoClip: Stabilized, Low-Cost Post-Training with Minimal Forgetting
Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, Zhijie Sun
Main category: cs.LG
TL;DR: Logits Replay + MoClip is a two-stage framework for efficient LLM domain adaptation that compresses supervision in logit space and stabilizes optimization, reducing training costs by over 40% while maintaining general capabilities.
Details
Motivation: Address the trade-off in LLM post-training where improvements on specialized domains come at the expense of general capabilities, and overcome limitations of existing methods that impose significant computational costs or data access requirements.
Method: Two-stage framework: Stage 0 records dynamic Top-K token subsets covering a probability threshold (including the gold label); Stage 1 replays compact subsets to compute exact renormalized losses. MoClip optimizer caps gradient-momentum rotation and applies arctan2-based rescaling for stability. A toy Top-K replay loss follows the abstract below.
Result: Improves domain performance on Communication Technology and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), with over 40% reduction in training cost.
Conclusion: Provides a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization, offering efficient adaptation through logit space compression and stable optimization.
Abstract: Large language models (LLMs) often face a trade-off in post-training: improvements on specialized domains frequently come at the expense of general capabilities. Existing solutions attempt to mitigate this tension via regularization, selective parameter updates, or data-centric replay, but each imposes significant costs in computation, data access, or adaptability. Recent work has shown that training signals can be compressed to subsets of logits without severe accuracy loss, suggesting a path toward efficient adaptation. However, naive truncation destabilizes optimization and exacerbates forgetting. We introduce Logits Replay + MoClip, a two-stage framework that compresses supervision in the logit space and stabilizes optimization at the update level. In Stage 0, we record dynamic Top-K token subsets that cover a probability threshold, always including the gold label. In Stage 1, we replay these compact subsets to compute exact renormalized losses, avoiding full softmax computation and implicitly regularizing. To ensure stability, we design MoClip, an optimizer that caps gradient-momentum rotation and applies an arctan2-based rescaling of updates. Empirically, our method improves domain performance on Communication Technology (CT) and NL2SQL tasks while mitigating forgetting on general benchmarks (MMLU, BBH, GPQA, MATH), and reduces training cost by over 40%. Together, these contributions offer a scalable, architecture-agnostic path for domain adaptation of LLMs without sacrificing generalization.
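A minimal sketch of the Stage-1 replay loss: cross-entropy restricted to a stored Top-K token subset and renormalized within it, so the full softmax over the vocabulary never has to be materialized. How Stage 0 chooses K and the probability threshold is not reproduced; here random candidate ids are recorded and the gold id is simply forced into the subset.

```python
import torch
import torch.nn.functional as F

def topk_replay_loss(student_logits, topk_idx, gold):
    """Cross-entropy over a stored Top-K token subset, renormalized within
    the subset. `topk_idx` (batch, K) holds the token ids recorded in
    Stage 0 and is assumed to contain the gold label."""
    subset_logits = student_logits.gather(1, topk_idx)   # (batch, K)
    log_probs = F.log_softmax(subset_logits, dim=1)      # renormalize over K
    gold_pos = (topk_idx == gold.unsqueeze(1)).float().argmax(dim=1)
    return F.nll_loss(log_probs, gold_pos)

vocab, K = 100, 8
logits = torch.randn(4, vocab, requires_grad=True)
gold = torch.tensor([3, 17, 42, 99])
# Stage 0 (offline): record K candidate ids per position, forcing gold in.
topk_idx = torch.topk(torch.randn(4, vocab), K).indices
topk_idx[:, 0] = gold                                    # ensure gold included
loss = topk_replay_loss(logits, topk_idx, gold)
loss.backward()                        # no full-vocabulary softmax required
print(float(loss))
```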
[471] On the Implicit Adversariality of Catastrophic Forgetting in Deep Continual Learning
Ze Peng, Jian Zhang, Jintao Guo, Lei Qi, Yang Gao, Yinghuan Shi
Main category: cs.LG
TL;DR: This paper reveals that catastrophic forgetting in continual learning occurs because new-task training acts as an adversarial attack on old-task knowledge, with gradients automatically aligning with sharp directions of the old-task loss landscape due to training’s low-rank bias.
Details
Motivation: To understand the underlying cause of catastrophic forgetting in deep networks, which has not been fully understood despite being a central challenge in continual learning.
Method: The authors theoretically analyze how training’s low-rank bias causes adversarial alignment between new-task gradients and old-task loss landscape sharp directions. They propose backGP to address alignment due to backward propagation that gradient projection methods cannot handle. A toy gradient-projection sketch follows the abstract below.
Result: The proposed backGP method reduces forgetting by 10.8% and improves accuracy by 12.7% on average over gradient projection methods.
Conclusion: Catastrophic forgetting is fundamentally caused by new-task training acting as an adversarial attack on old-task knowledge through automatic gradient alignment with sharp loss landscape directions, and backGP effectively addresses this issue.
Abstract: Continual learning seeks the human-like ability to accumulate new skills in machine intelligence. Its central challenge is catastrophic forgetting, whose underlying cause has not been fully understood for deep networks. In this paper, we demystify catastrophic forgetting by revealing that the new-task training is implicitly an adversarial attack against the old-task knowledge. Specifically, the new-task gradients automatically and accurately align with the sharp directions of the old-task loss landscape, rapidly increasing the old-task loss. This adversarial alignment is intriguingly counter-intuitive because the sharp directions are too sparsely distributed to align with by chance. To understand it, we theoretically show that it arises from training’s low-rank bias, which, through forward and backward propagation, confines the two directions into the same low-dimensional subspace, facilitating alignment. Gradient projection (GP) methods, a representative family of forgetting-mitigating methods, reduce adversarial alignment caused by forward propagation, but cannot address the alignment due to backward propagation. We propose backGP to address it, which reduces forgetting by 10.8% and improves accuracy by 12.7% on average over GP methods.
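For context, the gradient-projection family the paper builds on removes from each new-task gradient its components along a stored old-task subspace. A minimal sketch of that projection step; backGP's additional treatment of backward-propagated alignment is not reproduced here.

```python
import numpy as np

def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from `grad` its components along the old-task subspace
    spanned by the orthonormal columns of `basis` (classic gradient
    projection; backGP's handling of backward-propagated alignment
    is not shown)."""
    return grad - basis @ (basis.T @ grad)

rng = np.random.default_rng(0)
old_task_dirs = np.linalg.qr(rng.normal(size=(10, 3)))[0]  # orthonormal basis
g_new = rng.normal(size=10)
g_proj = project_out(g_new, old_task_dirs)
print("overlap before:", np.linalg.norm(old_task_dirs.T @ g_new).round(3))
print("overlap after :", np.linalg.norm(old_task_dirs.T @ g_proj).round(3))
```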
[472] Agentic-KGR: Co-evolutionary Knowledge Graph Construction through Multi-Agent Reinforcement Learning
Jing Li, Zhijie Sun, Zhicheng Zhou, Suming Qiu, Junjie Huang, Haijia Sun, Linyuan Qiu
Main category: cs.LG
TL;DR: Agentic-KGR is a framework that enables co-evolution between LLMs and knowledge graphs through multi-round RL, addressing limitations of static knowledge bases with dynamic schema expansion, retrieval-augmented memory, and learnable prompt compression.
Details
Motivation: Current knowledge-enhanced LLMs rely on static knowledge bases with coverage gaps and temporal obsolescence, limiting effectiveness in dynamic information environments.
Method: Multi-round reinforcement learning framework with three innovations: dynamic schema expansion, retrieval-augmented memory system, and learnable multi-scale prompt compression for adaptive sequence optimization.
Result: Substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, achieves superior performance in downstream QA tasks with significant gains in accuracy and knowledge coverage.
Conclusion: Agentic-KGR enables effective co-evolution between LLMs and knowledge graphs, overcoming limitations of static knowledge bases and demonstrating superior performance in dynamic information environments.
Abstract: Current knowledge-enhanced large language models (LLMs) rely on static, pre-constructed knowledge bases that suffer from coverage gaps and temporal obsolescence, limiting their effectiveness in dynamic information environments. We present Agentic-KGR, a novel framework enabling co-evolution between LLMs and knowledge graphs (KGs) through multi-round reinforcement learning (RL). Our approach introduces three key innovations: (1) a dynamic schema expansion mechanism that systematically extends graph ontologies beyond pre-defined boundaries during training; (2) a retrieval-augmented memory system enabling synergistic co-evolution between model parameters and knowledge structures through continuous optimization; (3) a learnable multi-scale prompt compression approach that preserves critical information while reducing computational complexity through adaptive sequence optimization. Experimental results demonstrate substantial improvements over supervised baselines and single-round RL approaches in knowledge extraction tasks. When integrated with GraphRAG, our method achieves superior performance in downstream QA tasks, with significant gains in both accuracy and knowledge coverage compared to existing methods.
[473] Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang
Main category: cs.LG
TL;DR: The paper introduces multimodal prompt optimization, extending text-only prompt optimization to handle multimodal inputs (text + images/videos/molecules) for MLLMs, and proposes MPO framework that jointly optimizes multimodal prompts while using Bayesian selection strategies.
Details
Motivation: Current prompt optimization methods are limited to text-only inputs, which restricts the full potential of multimodal large language models (MLLMs) that can handle images, videos, and other modalities beyond text.
Method: Proposed Multimodal Prompt Optimizer (MPO) - a unified framework that performs joint optimization of multimodal prompts through alignment-preserving updates and uses Bayesian-based selection strategy to guide candidate prompt selection by leveraging earlier evaluations as priors.
Result: Extensive experiments across diverse modalities (images, videos, molecules) show that MPO outperforms leading text-only optimization methods, demonstrating the effectiveness of multimodal prompt optimization.
Conclusion: Multimodal prompt optimization is a crucial step to fully realize the potential of MLLMs, and MPO provides an effective solution that extends prompt optimization capabilities beyond text-only approaches.
Abstract: Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
[474] Efficient Resource-Constrained Training of Vision Transformers via Subspace Optimization
Le-Trung Nguyen, Enzo Tartaglione, Van-Tam Nguyen
Main category: cs.LG
TL;DR: WASI applies subspace-based training to transformers, reducing memory usage by 62x and computational cost by 2x while maintaining accuracy comparable to vanilla training.
Details
Motivation: Address energy consumption and data privacy concerns in AI by enabling on-device learning, overcoming memory bottlenecks in large transformer models.
Method: Weight-Activation Subspace Iteration (WASI) restricts training to a fixed subspace where essential model information resides, mitigating backpropagation memory bottlenecks. A toy subspace-restricted update follows the abstract below.
Result: Achieved comparable accuracy to vanilla training with 62x memory reduction, 2x FLOPs reduction, and 1.5x faster training/inference on Raspberry Pi 5.
Conclusion: WASI enables efficient on-device transformer training while maintaining performance, addressing key challenges in edge AI deployment.
Abstract: As AI increasingly shapes daily life, energy consumption and data privacy have become pressing concerns. On-device learning trains models directly on edge devices, cutting energy consumption and safeguarding data privacy. However, the expanding scale of modern neural networks creates a major obstacle for on-device training. Although prior work has concentrated on compact convolutional architectures, we instead apply subspace-based training to transformer models. Motivated by the idea that a model’s essential information lies in a fixed subspace, we introduce Weight-Activation Subspace Iteration (WASI), a method that mitigates the memory bottleneck of backpropagation and boosts inference efficiency in transformer models by restricting training to this subspace. Our results demonstrate that WASI maintains accuracy comparable to vanilla training while reducing memory usage by up to $62\times$ and computational cost (FLOPs) by up to $2\times$. On a Raspberry Pi 5, WASI achieves roughly $1.5\times$ faster training and inference than vanilla training.
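A minimal sketch of a subspace-restricted update: project the gradient onto the span of a weight matrix's top-k singular vectors before stepping. WASI's actual iteration scheme for maintaining the subspace (and its activation-side handling) is not reproduced; recomputing an SVD every step, as below, would defeat the efficiency goal and is for illustration only.

```python
import numpy as np

def subspace_update(W, G, k=4, lr=0.1):
    """Apply only the component of gradient G that lies in the rank-k
    dominant subspace of W: project G onto the top-k left/right singular
    vectors before updating."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T
    G_sub = Uk @ (Uk.T @ G @ Vk) @ Vk.T     # projected gradient
    return W - lr * G_sub

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
G = rng.normal(size=(16, 16))
W_new = subspace_update(W, G, k=4)
# Only a rank<=4 component of the gradient was applied:
print("update rank:", np.linalg.matrix_rank(W - W_new))
```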
[475] Robustness and Regularization in Hierarchical Re-Basin
Benedikt Franke, Florian Heinrich, Markus Lange, Arne Raul
Main category: cs.LG
TL;DR: This paper analyzes Git Re-Basin model merging and proposes a hierarchical merging scheme that outperforms MergeMany, but finds Re-Basin causes larger performance drops than originally reported while inducing robustness benefits.
Details
Motivation: To examine Git Re-Basin's model merging approach more closely and improve upon the standard MergeMany algorithm with a hierarchical scheme.
Method: Proposed a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. A toy hierarchical-merge sketch follows the abstract below.
Result: Re-Basin induces adversarial and perturbation robustness in merged models, with stronger effects from more models in hierarchical merging, but causes larger performance drops than originally reported.
Conclusion: While hierarchical Re-Basin merging provides robustness benefits, it comes with greater performance degradation than previously indicated, requiring careful consideration of this trade-off.
Abstract: This paper takes a closer look at Git Re-Basin, an interesting new approach to merge trained models. We propose a hierarchical model merging scheme that significantly outperforms the standard MergeMany algorithm. With our new algorithm, we find that Re-Basin induces adversarial and perturbation robustness into the merged models, with the effect becoming stronger the more models participate in the hierarchical merging scheme. However, in our experiments Re-Basin induces a much bigger performance drop than reported by the original authors.
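A minimal sketch of hierarchical merging: models are combined pairwise up a binary tree rather than averaged all at once as in MergeMany. The `align` step, where Git Re-Basin's permutation matching of hidden units would go, is left as an identity placeholder.

```python
import numpy as np

def align(reference, model):
    """Placeholder for Re-Basin permutation alignment: a real
    implementation permutes `model`'s units to match `reference`
    before averaging. Identity here, for brevity."""
    return model

def merge_pair(a, b):
    b_aligned = align(a, b)
    return {k: (a[k] + b_aligned[k]) / 2 for k in a}

def hierarchical_merge(models):
    """Merge models pairwise, level by level, up a binary tree
    (a flat MergeMany-style average would combine all models in
    a single step instead)."""
    level = list(models)
    while len(level) > 1:
        nxt = [merge_pair(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd model carried up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

rng = np.random.default_rng(0)
models = [{"w": rng.normal(size=3)} for _ in range(4)]
print(hierarchical_merge(models))
```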
[476] Beyond Pairwise Connections: Extracting High-Order Functional Brain Network Structures under Global Constraints
Ling Zhan, Junjie Huang, Xiaoyao Yu, Wenyu Chen, Tao Jia
Main category: cs.LG
TL;DR: The paper proposes GCM, a Global Constraints oriented Multi-resolution framework for functional brain network structure learning that incorporates 4 types of global constraints to capture high-order dependencies across 4 modeling resolutions, achieving significant improvements in accuracy and computational efficiency.
Details
Motivation: Current functional brain network modeling relies on local pairwise interactions which have limitations in capturing high-order dependencies, and existing hypergraph approaches suffer from computational burden and heuristic nature that prevent end-to-end learning from data distributions.
Method: The GCM framework extracts high-order FBN structures under global constraints, incorporating 4 types of constraints (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning at 4 distinct levels (sample/subject/group/project) of modeling resolution.
Result: GCM achieves up to 30.6% improvement in relative accuracy and 96.3% reduction in computational time across 5 datasets and 2 task settings, outperforming 9 baselines and 10 state-of-the-art methods. Extensive experiments validate individual component contributions and highlight interpretability.
Conclusion: This work provides a novel perspective on FBN structure learning and establishes a foundation for interdisciplinary applications in cognitive neuroscience, with publicly available code for implementation.
Abstract: Functional brain network (FBN) modeling often relies on local pairwise interactions, whose limitation in capturing high-order dependencies is theoretically analyzed in this paper. Meanwhile, the computational burden and heuristic nature of current hypergraph modeling approaches hinder end-to-end learning of FBN structures directly from data distributions. To address this, we propose to extract high-order FBN structures under global constraints, and implement this as a Global Constraints oriented Multi-resolution (GCM) FBN structure learning framework. It incorporates 4 types of global constraint (signal synchronization, subject identity, expected edge numbers, and data labels) to enable learning FBN structures for 4 distinct levels (sample/subject/group/project) of modeling resolution. Experimental results demonstrate that GCM achieves up to a 30.6% improvement in relative accuracy and a 96.3% reduction in computational time across 5 datasets and 2 task settings, compared to 9 baselines and 10 state-of-the-art methods. Extensive experiments validate the contributions of individual components and highlight the interpretability of GCM. This work offers a novel perspective on FBN structure learning and provides a foundation for interdisciplinary applications in cognitive neuroscience. Code is publicly available on https://github.com/lzhan94swu/GCM.
[477] RepDL: Bit-level Reproducible Deep Learning Training and Inference
Peichen Xie, Xian Zhang, Shuo Chen
Main category: cs.LG
TL;DR: RepDL is an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments by addressing floating-point computation inconsistencies.
Details
Motivation: Non-determinism and non-reproducibility in deep learning cause inconsistent results across runs and platforms, primarily due to random number generation and floating-point computation issues.
Method: RepDL enforces correct rounding and order invariance in floating-point computation to achieve deterministic and bitwise-reproducible results. A toy order-invariant reduction sketch follows the abstract below.
Result: The library successfully addresses floating-point inconsistencies that remain unresolved by traditional deterministic configurations.
Conclusion: RepDL provides a solution for ensuring reproducible deep learning results across different computing environments, with source code available as an open-source project.
Abstract: Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at https://github.com/microsoft/RepDL .
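One ingredient of bitwise reproducibility is order invariance: floating-point addition is not associative, so the reduction tree must be fixed independently of how work happens to be scheduled. A minimal sketch of a fixed-tree pairwise reduction; RepDL's correctly rounded kernels are a separate ingredient not modelled here.

```python
import random

def pairwise_sum(xs):
    """Reduce in a fixed binary-tree order. Any runtime that evaluates
    this same tree gets bit-identical results regardless of how many
    workers it uses (correct rounding, which RepDL also enforces, is
    not something plain Python floats let us control)."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]

# Two naive reductions that differ only in chunking (as two different
# parallel schedules would) typically disagree in the low-order bits:
s7 = sum(sum(xs[i:i + 7]) for i in range(0, len(xs), 7))
s13 = sum(sum(xs[i:i + 13]) for i in range(0, len(xs), 13))
print(s7 == s13)                                   # usually False
print(pairwise_sum(xs) == pairwise_sum(list(xs)))  # fixed tree: True
```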
[478] Rate optimal learning of equilibria from data
Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, Giorgia Ramponi
Main category: cs.LG
TL;DR: This paper closes theoretical gaps in Multi-Agent Imitation Learning by proving BC is optimal in non-interactive settings and introducing MAIL-WARM algorithm that achieves near-optimal sample complexity in interactive settings.
Details
Motivation: To address theoretical gaps in Multi-Agent Imitation Learning, particularly characterizing the limits of non-interactive approaches and developing optimal interactive algorithms.
Method: Proved statistical lower bounds for non-interactive MAIL, identified all-policy deviation concentrability coefficient as key complexity measure, and introduced MAIL-WARM algorithm combining reward-free RL with interactive MAIL.
Result: Showed BC is rate-optimal in non-interactive settings, and MAIL-WARM improves sample complexity from O(ε⁻⁸) to O(ε⁻²), matching the lower bound. Numerical results demonstrate BC failures in grid worlds.
Conclusion: The paper provides complete theoretical characterization of MAIL, establishes BC’s optimality in non-interactive settings, and presents MAIL-WARM as near-optimal interactive algorithm with significant sample complexity improvements.
Abstract: We close open theoretical gaps in Multi-Agent Imitation Learning (MAIL) by characterizing the limits of non-interactive MAIL and presenting the first interactive algorithm with near-optimal sample complexity. In the non-interactive setting, we prove a statistical lower bound that identifies the all-policy deviation concentrability coefficient as the fundamental complexity measure, and we show that Behavior Cloning (BC) is rate-optimal. For the interactive setting, we introduce a framework that combines reward-free reinforcement learning with interactive MAIL and instantiate it with an algorithm, MAIL-WARM. It improves the best previously known sample complexity from $\mathcal{O}(\varepsilon^{-8})$ to $\mathcal{O}(\varepsilon^{-2}),$ matching the dependence on $\varepsilon$ implied by our lower bound. Finally, we provide numerical results that support our theory and illustrate, in environments such as grid worlds, where Behavior Cloning fails to learn.
[479] FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning
Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang
Main category: cs.LG
TL;DR: This paper proposes a student-teacher framework where a simple MLP student policy explores online while being guided by a teacher Flow Matching model, overcoming FM’s limitations in online interaction and enabling efficient RL-based policy improvement.
Details
Motivation: Flow Matching policies excel at behavioral cloning but lack environmental interaction and exploration, leading to poor generalization. Online optimization of FM policies is challenging due to gradient instability and high inference costs.
Method: Use a simple MLP student policy for online exploration and RL updates, guided by a reward model derived from a teacher FM model. The teacher FM model also regularizes the student’s behavior to stabilize learning.
Result: The approach significantly enhances learning efficiency, generalization, and robustness, particularly when learning from suboptimal expert data.
Conclusion: The student-teacher framework effectively leverages FM’s expressiveness while avoiding its online optimization challenges, enabling efficient policy improvement through online interaction.
Abstract: Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy’s behavior to stabilize policy learning. Due to the student’s simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.
[480] Prime Implicant Explanations for Reaction Feasibility Prediction
Klaus Weinbauer, Tieu-Long Phan, Peter F. Stadler, Thomas Gärtner, Sagar Malhotra
Main category: cs.LG
TL;DR: A novel formulation of prime implicant explanations for chemical reaction feasibility prediction models, providing interpretable explanations that capture essential molecular attributes.
Details
Motivation: Machine learning models for chemical reaction prediction lack transparency and interpretability, making it difficult to understand why certain reactions are predicted as feasible.
Method: Developed a novel formulation of prime implicant explanations (minimally sufficient reasons) tailored to chemical reaction prediction, with an algorithm for computing such explanations in small-scale tasks. A toy greedy prime-implicant search follows the abstract below.
Result: Preliminary experiments show that prime implicant explanations conservatively capture ground truth explanations - they often contain redundant bonds/atoms but consistently capture essential molecular attributes for reaction feasibility prediction.
Conclusion: The proposed prime implicant explanation framework provides interpretable insights into reaction prediction models while reliably identifying the molecular features that are essential for determining reaction feasibility.
Abstract: Machine learning models that predict the feasibility of chemical reactions have become central to automated synthesis planning. Despite their predictive success, these models often lack transparency and interpretability. We introduce a novel formulation of prime implicant explanations (also known as minimally sufficient reasons) tailored to this domain, and propose an algorithm for computing such explanations in small-scale reaction prediction tasks. Preliminary experiments demonstrate that our notion of prime implicant explanations conservatively captures the ground truth explanations. That is, such explanations often contain redundant bonds and atoms but consistently capture the molecular attributes that are essential for predicting reaction feasibility.
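A minimal sketch of computing a sufficient, deletion-minimal explanation by greedy search on a toy binary classifier: attributes are dropped one at a time as long as every completion of the dropped attributes leaves the prediction unchanged. The paper's molecular encoding and exact algorithm are not reproduced, and exhaustive completion-checking as below only scales to small inputs.

```python
from itertools import product

def predict(x):
    """Toy 'feasibility' classifier over 6 binary attributes."""
    return int(2 * x[0] + 2 * x[1] + x[2] - x[3] >= 2)

def is_sufficient(x, kept, domain=(0, 1)):
    """True iff fixing the attributes in `kept` to their values in x
    forces predict() to its original output for every completion of
    the remaining attributes."""
    target = predict(x)
    free = [i for i in range(len(x)) if i not in kept]
    for values in product(domain, repeat=len(free)):
        z = list(x)
        for i, v in zip(free, values):
            z[i] = v
        if predict(z) != target:
            return False
    return True

def greedy_prime_implicant(x):
    """Drop attributes one by one while the prediction stays forced;
    the result is sufficient and minimal w.r.t. single deletions."""
    kept = set(range(len(x)))
    for i in range(len(x)):
        if is_sufficient(x, kept - {i}):
            kept -= {i}
    return sorted(kept)

x = [1, 1, 0, 0, 1, 0]
print("prime implicant attributes:", greedy_prime_implicant(x))
```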
[481] Incentivizing Time-Aware Fairness in Data Sharing
Jiangwei Chen, Kieu Thao Nguyen Pham, Rachael Hwee Ling Sim, Arun Verma, Zhaoxuan Wu, Chuan-Sheng Foo, Bryan Kian Hsiang Low
Main category: cs.LG
TL;DR: Proposes a time-aware data sharing framework with novel incentives that reward parties who join earlier, addressing the real-world scenario where parties join collaborations at different times rather than simultaneously.
Details
Motivation: Existing collaborative ML frameworks assume all parties join simultaneously, but in reality parties join at different times due to data cleaning, legal barriers, or unawareness. Early joiners incur higher risk and encourage others, so they deserve higher rewards.
Method: Developed a fair and time-aware data sharing framework with time-aware incentives, including methods for deciding reward values and generating model rewards that realize these values.
Result: Empirical demonstration on synthetic and real-world datasets shows the framework’s properties work effectively in practice.
Conclusion: The proposed time-aware framework successfully addresses the practical issue of staggered participation in collaborative ML, providing appropriate incentives for early contributors while maintaining fairness and individual rationality.
Abstract: In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.
[482] The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Natalie Abreu, Nikhil Vyas, Sham Kakade, Depen Morwani
Main category: cs.LG
TL;DR: Full Gauss-Newton preconditioning achieves 5.4x reduction in training iterations compared to strong baselines, with layerwise approximation nearly matching full performance.
Details
Motivation: To determine how much performance is lost by computationally-efficient approximations in LLM pretraining and establish a practical upper bound on iteration complexity.
Method: Applied full Gauss-Newton preconditioning to transformer models up to 150M parameters and compared it with a layerwise GN preconditioner that ignores cross-layer information.
Result: Full GN updates yield substantial gains over existing optimizers (SOAP, Muon), achieving 5.4x reduction in training iterations. Layerwise GN preconditioner nearly matches full GN performance.
Conclusion: GN approximation is highly effective for preconditioning, layerwise Hessian structure contains sufficient information for most gains, and significant performance gap exists between current approximate methods and idealized layerwise oracle.
Abstract: Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
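For readers unfamiliar with the optimizer being upper-bounded here, a toy numpy example of a damped full Gauss-Newton step is sketched below; the paper applies the preconditioner to transformer pretraining, which this least-squares toy does not attempt to reproduce.

```python
import numpy as np

def gauss_newton_step(J, r, damping=1e-4):
    """Solve (J^T J + damping * I) dp = J^T r for the update dp."""
    G = J.T @ J + damping * np.eye(J.shape[1])  # Gauss-Newton matrix
    return np.linalg.solve(G, J.T @ r)

# Toy usage: fit y ~ X @ p, with residuals r(p) = X @ p - y (so J = X).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
p = np.zeros(5)
for _ in range(3):
    p -= gauss_newton_step(X, X @ p - y)
```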
[483] A PCA-based Data Prediction Method
Peteris Daugulis, Vija Vagale, Emiliano Mancini, Filippo Castiglione
Main category: cs.LG
TL;DR: A novel hybrid method combining traditional mathematics and machine learning for imputing missing data using distance between shifted linear subspaces.
Details
Motivation: The frequent problem of handling missing data in data science requires effective imputation methods.
Method: Uses distance between shifted linear subspaces representing existing data and candidate sets, with the data represented by subspaces spanned by first principal components. Euclidean metric solutions are provided.
Result: The method provides a mathematical framework for missing data imputation using subspace distances.
Conclusion: The proposed hybrid approach offers a principled way to impute missing values by leveraging subspace geometry and principal components.
Abstract: The problem of choosing appropriate values for missing data is often encountered in data science. We describe a novel method containing both traditional mathematics and machine learning elements for prediction (imputation) of missing data. This method is based on the notion of distance between shifted linear subspaces representing the existing data and candidate sets. The existing data set is represented by the subspace spanned by its first principal components. Solutions for the case of the Euclidean metric are given.
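A minimal sketch in the spirit of the method (our simplification, not the paper's algorithm): fit projection coefficients on a row's observed coordinates, then fill the missing ones from the affine subspace spanned by the top principal components.

```python
import numpy as np

def pca_impute_row(x, mean, components):
    """x: 1-D array with np.nan at missing positions;
    mean (d,) and components (k, d) come from a PCA fit on complete rows."""
    obs = ~np.isnan(x)
    # Least-squares subspace coefficients using only observed coordinates.
    c, *_ = np.linalg.lstsq(components[:, obs].T, x[obs] - mean[obs], rcond=None)
    filled = x.copy()
    filled[~obs] = mean[~obs] + components[:, ~obs].T @ c
    return filled
```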
[484] Task-Level Insights from Eigenvalues across Sequence Models
Rahel Rickenbach, Jelena Trisovic, Alexandre Didier, Jerome Sieber, Melanie N. Zeilinger
Main category: cs.LG
TL;DR: The paper analyzes softmax attention and linear alternatives like state space models using dynamical systems framework, showing how eigenvalue spectra influence memory and long-range dependency modeling capabilities.
Details
Motivation: To understand fundamental differences in information processing between softmax attention and linear alternatives like SSMs, and to provide principled analysis of sequence model capabilities.
Method: Represent softmax, norm and linear attention as dynamical systems and analyze their eigenvalue spectra across diverse sequence models and benchmarks.
Result: Eigenvalues influence memory and long-range dependency modeling, with spectral signatures aligning with task requirements. Architectural modifications impact both eigenvalues and performance.
Conclusion: Eigenvalue analysis serves as a principled metric for interpreting, understanding, and improving sequence model capabilities.
Abstract: Although softmax attention drives state-of-the-art performance for sequence models, its quadratic complexity limits scalability, motivating linear alternatives such as state space models (SSMs). While these alternatives improve efficiency, their fundamental differences in information processing remain poorly understood. In this work, we leverage the recently proposed dynamical systems framework to represent softmax, norm and linear attention as dynamical systems, enabling a structured comparison with SSMs by analyzing their respective eigenvalue spectra. Since eigenvalues capture essential aspects of dynamical system behavior, we conduct an extensive empirical analysis across diverse sequence models and benchmarks. We first show that eigenvalues influence essential aspects of memory and long-range dependency modeling, revealing spectral signatures that align with task requirements. Building on these insights, we then investigate how architectural modifications in sequence models impact both eigenvalue spectra and task performance. This correspondence further strengthens the position of eigenvalue analysis as a principled metric for interpreting, understanding, and ultimately improving the capabilities of sequence models.
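A small numerical illustration of the quantity under study (not the paper's framework): for a linear recurrence $h_{t+1} = A h_t$, the eigenvalue magnitudes of $A$ set how quickly past information decays, which is the memory property compared across architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
for radius in (0.5, 0.9, 0.99):
    # Scale an orthogonal matrix so all eigenvalues have magnitude `radius`.
    A = radius * np.linalg.qr(rng.normal(size=(8, 8)))[0]
    h = np.ones(8)
    for _ in range(100):
        h = A @ h                      # autonomous linear dynamics
    print(f"spectral radius {radius}: ||h_100|| = {np.linalg.norm(h):.2e}")
```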
[485] Mitigating Model Drift in Developing Economies Using Synthetic Data and Outliers
Ilyas Varshavskiy, Bonu Boboeva, Shuhrat Khalilbekov, Azizjon Azimi, Sergey Shulgin, Akhlitdin Nizamitdinov, Haitz Saez de Ocariz Borde
Main category: cs.LG
TL;DR: This paper investigates using synthetic outliers to mitigate model drift in financial ML models for developing economies in Central Asia and the Caucasus, introducing a two-level evaluation framework.
Details
Motivation: Machine learning models in finance are highly susceptible to model drift, especially in developing economies like Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan where frequent macroeconomic shocks destabilize financial data distributions.
Method: The study investigates using synthetic outliers to improve model stability against unforeseen shocks, and introduces a two-level framework to measure both performance degradation and shock severity.
Result: Experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model.
Conclusion: Synthetic outliers show promise as a drift mitigation method for financial ML models in developing economies, with effectiveness depending on dataset and model characteristics.
Abstract: Machine Learning models in finance are highly susceptible to model drift, where predictive performance declines as data distributions shift. This issue is especially acute in developing economies such as those in Central Asia and the Caucasus - including Tajikistan, Uzbekistan, Kazakhstan, and Azerbaijan - where frequent and unpredictable macroeconomic shocks destabilize financial data. To the best of our knowledge, this is among the first studies to examine drift mitigation methods on financial datasets from these regions. We investigate the use of synthetic outliers, a largely unexplored approach, to improve model stability against unforeseen shocks. To evaluate effectiveness, we introduce a two-level framework that measures both the extent of performance degradation and the severity of shocks. Our experiments on macroeconomic tabular datasets show that adding a small proportion of synthetic outliers generally improves stability compared to baseline models, though the optimal amount varies by dataset and model.
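The augmentation itself can be as simple as the sketch below (details such as the inflation factor are our assumptions, not the paper's recipe): draw a small fraction of synthetic points from an inflated covariance and append them to the training set.

```python
import numpy as np

def add_synthetic_outliers(X, frac=0.05, scale=5.0, seed=0):
    """Append frac*len(X) synthetic outliers drawn from an inflated Gaussian."""
    rng = np.random.default_rng(seed)
    n = max(1, int(frac * len(X)))
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    outliers = rng.multivariate_normal(mu, scale * cov, size=n)
    return np.vstack([X, outliers])
```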
[486] Design Principles for Sequence Models via Coefficient Dynamics
Jerome Sieber, Antonio Orvieto, Melanie N. Zeilinger, Carmen Amo Alonso
Main category: cs.LG
TL;DR: The paper develops a unified framework that views deep sequence models as computing outputs through linear combinations of past value vectors, with coefficients governed by autonomous linear dynamical systems.
Details
Motivation: To systematically compare and understand diverse sequence model architectures (Transformers, SSMs, gated RNNs) by revealing their common mathematical structure and deriving design principles rather than just evaluating on benchmarks.
Method: Develops a framework that explicitly models output operations as linear combinations with coefficients from autonomous linear dynamical systems driven by impulse inputs, capturing attention mechanisms and various RNN/SSM architectures.
Result: The framework reveals common mathematical themes across architectures, captures softmax attention, identifies tradeoffs between expressivity and efficiency, geometric constraints on input selectivity, and stability conditions for training.
Conclusion: The unified framework explains empirical successes of recent designs and provides systematic principles for designing new sequence model architectures by connecting insights from recent literature.
Abstract: Deep sequence models, ranging from Transformers and State Space Models (SSMs) to more recent approaches such as gated linear RNNs, fundamentally compute outputs as linear combinations of past value vectors. To draw insights and systematically compare such architectures, we develop a unified framework that makes this output operation explicit, by casting the linear combination coefficients as the outputs of autonomous linear dynamical systems driven by impulse inputs. This viewpoint, in spirit substantially different from approaches focusing on connecting linear RNNs with linear attention, reveals a common mathematical theme across diverse architectures and crucially captures softmax attention, on top of RNNs, SSMs, and related models. In contrast to new model proposals that are commonly evaluated on benchmarks, we derive design principles linking architectural choices to model properties, thereby identifying tradeoffs between expressivity and efficient implementation, geometric constraints on input selectivity, and stability conditions for numerically stable training and information retention. By connecting several insights and observations from recent literature, the framework both explains empirical successes of recent designs and provides guiding principles for systematically designing new sequence model architectures.
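To make the framework's viewpoint concrete, the toy below exposes the linear-combination coefficients for single-head causal softmax attention: each output row is exactly a coefficient-weighted sum of past value vectors (sizes and data are made up).

```python
import numpy as np

def attention_coefficients(Q, K):
    """Rows of the returned matrix are the coefficients over past value vectors."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores = np.where(np.tri(len(Q), dtype=bool), scores, -np.inf)  # causal mask
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    return a / a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))    # three (seq_len=4, dim=8) matrices
A = attention_coefficients(Q, K)
out = A @ V                             # output = coefficient-weighted sum of values
```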
[487] Large Language Model Prompt Datasets: An In-depth Analysis and Insights
Yuanming Zhang, Yan Lin, Arijit Khan, Huaiyu Wan
Main category: cs.LG
TL;DR: This paper compiles and analyzes diverse prompt datasets for LLMs, proposes a prompt optimization method using syntactic embeddings, and improves model output meaningfulness.
Details
Motivation: With the growing deployment of LLMs, diverse prompt datasets are emerging from various platforms, facilitating broader LLM utilization and improved prompt engineering. The authors aim to systematically analyze these datasets and develop optimization methods.
Method: Compiled extensive prompt datasets from various channels, selected key representative datasets for systematic analysis, and proposed a prompt optimization approach using syntactic embeddings of part-of-speech and dependency structures to guide LLMs in rewriting prompts toward a centroid representation.
Result: The analysis revealed commonalities and differences in prompt construction across categories, distinguishing them from other text corpora. The proposed optimization method successfully improved the meaningfulness of model outputs.
Conclusion: The work provides comprehensive prompt datasets and an effective optimization approach that enhances LLM performance through better prompt engineering, with datasets and code made publicly available.
Abstract: A prompt is a natural language instruction that defines a specific task for a large language model (LLM) and serves as the primary interface for human-LLM interaction. With the growing deployment of LLMs, diverse prompt datasets are emerging from platforms such as GitHub and social media. These datasets span a wide array of applications and content types, facilitating both broader LLM utilization and improved prompt engineering. In this work, we have, for the first time, compiled an extensive list of prompt datasets sourced from various channels, representing a spectrum of downstream tasks, languages, engineering techniques, attributes, and modalities. We select key representative datasets for systematic analysis, revealing commonalities and differences in prompt construction across categories, distinguishing them from other text corpora such as literature and the web. We further propose a prompt optimization approach that leverages syntactic embeddings of part-of-speech and dependency structures. By identifying a centroid representation of prompts and guiding LLMs to rewrite prompts toward this centroid, our method improves the meaningfulness of model outputs. We have made our datasets and code available.
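A schematic reading of the centroid idea (our simplification; the paper also uses dependency structures): embed each prompt as a bag of part-of-speech frequencies, compute the centroid, and rank prompts by distance to it. `pos_tags` is a hypothetical tagger stand-in.

```python
import numpy as np
from collections import Counter

TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "OTHER"]

def syntactic_embedding(prompt, pos_tags):
    counts = Counter(pos_tags(prompt))              # pos_tags: hypothetical tagger
    v = np.array([counts.get(t, 0) for t in TAGS], dtype=float)
    return v / max(v.sum(), 1.0)

def centroid_distances(prompts, pos_tags):
    E = np.stack([syntactic_embedding(p, pos_tags) for p in prompts])
    return np.linalg.norm(E - E.mean(axis=0), axis=1)   # smaller = closer to centroid
```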
[488] Residual-Informed Learning of Solutions to Algebraic Loops
Felix Brandt, Andreas Heuermann, Philip Hannebohm, Bernhard Bachmann
Main category: cs.LG
TL;DR: Residual-informed machine learning replaces algebraic loops in Modelica models with neural network surrogates, achieving 60% faster simulation while maintaining accuracy.
Details
Motivation: To overcome computational bottlenecks from algebraic loops in equation-based Modelica models and eliminate the need for supervised training datasets.
Method: Feedforward neural network trained using algebraic loop residual directly in loss function, enabling unsupervised learning and resolving solution ambiguity.
Result: 60% reduction in simulation time for IEEE 14-Bus system compared to conventional methods, with maintained accuracy through error control.
Conclusion: The residual-informed approach effectively replaces algebraic loops with neural surrogates, providing significant speedup without compromising accuracy.
Abstract: This paper presents a residual-informed machine learning approach for replacing algebraic loops in equation-based Modelica models with neural network surrogates. A feedforward neural network is trained using the residual (error) of the algebraic loop directly in its loss function, eliminating the need for a supervised dataset. This training strategy also resolves the issue of ambiguous solutions, allowing the surrogate to converge to a consistent solution rather than averaging multiple valid ones. Applied to the large-scale IEEE 14-Bus system, our method achieves a 60% reduction in simulation time compared to conventional simulations, while maintaining the same level of accuracy through error control mechanisms.
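A self-contained toy of the residual-in-the-loss idea (not the paper's Modelica pipeline; the loop y = cos(x*y) is invented): because the loss is the residual itself, no labelled solutions are needed.

```python
import torch

def residual(x, y):                      # toy algebraic loop: find y with y = cos(x * y)
    return y - torch.cos(x * y)

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    x = torch.rand(256, 1) * 2 - 1       # sample loop inputs
    loss = residual(x, net(x)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```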
[489] Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
Tuan Nguyen, Long Tran-Thanh
Main category: cs.LG
TL;DR: A black-box framework for LLM safety alignment using game theory and linear programming, enabling safety enforcement without model access or retraining.
Details
Motivation: Existing alignment methods require costly retraining or model access, making them impractical for third-party stakeholders who lack model access.
Method: Formulates safety-helpfulness trade-off as a two-player zero-sum game, uses linear programming solver at inference time to compute equilibrium strategies.
Result: Demonstrates feasibility of black-box safety alignment that balances safety and helpfulness without requiring model internals.
Conclusion: Provides scalable, accessible safety enforcement for stakeholders in resource-constrained settings across evolving LLM ecosystems.
Abstract: Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical and unsuitable for third-party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.
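The game-theoretic core is the textbook zero-sum LP; a minimal scipy version is sketched below with an invented 2x2 payoff matrix (the paper's payoffs and action sets are not reproduced).

```python
import numpy as np
from scipy.optimize import linprog

def minimax_strategy(A):
    """Row player's equilibrium mixed strategy for zero-sum payoff matrix A."""
    m, n = A.shape
    c = np.r_[np.zeros(m), -1.0]               # maximize the game value v
    A_ub = np.c_[-A.T, np.ones(n)]             # v <= x^T A[:, j] for every column j
    A_eq = np.r_[np.ones(m), 0.0][None, :]     # probabilities sum to one
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], -res.fun                 # (strategy, game value)

# Toy payoff: rows = {safe reply, helpful reply}, columns = query types (invented).
strategy, value = minimax_strategy(np.array([[0.9, 0.2], [0.4, 0.8]]))
```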
[490] Bandits with Single-Peaked Preferences and Limited Resources
Gur Keinan, Rotem Torkan, Omer Ben-Porat
Main category: cs.LG
TL;DR: The paper presents online algorithms for stochastic matching with budget constraints, leveraging single-peaked preferences to achieve efficient regret bounds.
Details
Motivation: Online stochastic matching with budget constraints is computationally challenging (NP-hard) without structural assumptions, motivating the use of single-peaked preferences from social choice theory to enable efficient learning.
Method: Developed two algorithms: 1) a novel PQ tree-based order approximation method for unknown single-peaked structure with $\tilde{O}(UKT^{2/3})$ regret, and 2) an efficient UCB-like algorithm for known structure with $\tilde{O}(U\sqrt{TK})$ regret.
Result: Achieved efficient regret bounds: $\tilde{O}(UKT^{2/3})$ for unknown single-peaked structure and $\tilde{O}(U\sqrt{TK})$ for known structure, overcoming the computational infeasibility of optimal matching.
Conclusion: Single-peaked preferences enable computationally efficient online learning for stochastic matching with budget constraints, with provable regret guarantees that scale favorably with problem parameters.
Abstract: We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on \emph{single-peaked preferences}, a well-established structure in social choice theory in which users’ preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem and leverage it to obtain an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
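For reference, the generic index computation that "UCB-like" refers to is sketched below; the paper's algorithm additionally exploits the known single-peaked order and the budget constraints, neither of which is reproduced here.

```python
import numpy as np

def ucb_indices(sums, counts, t, c=2.0):
    """Optimistic value estimates for K arms after t rounds (generic UCB1-style)."""
    means = sums / np.maximum(counts, 1)
    bonus = np.sqrt(c * np.log(max(t, 2)) / np.maximum(counts, 1))
    return np.where(counts == 0, np.inf, means + bonus)  # unplayed arms come first
```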
[491] Efficient Bayesian Inference from Noisy Pairwise Comparisons
Till Aczel, Lucas Theis, Roger Wattenhofer
Main category: cs.LG
TL;DR: BBQ is a Bayesian Bradley-Terry model that improves human evaluation of generative models by modeling rater quality and providing guaranteed convergence, resulting in more robust and interpretable rankings.
Details
Motivation: Standard evaluation metrics for generative models don't reflect human preferences well, and human evaluations are costly and noisy due to variability in rater expertise and attention.
Method: BBQ uses a Bayesian Bradley-Terry variant with Expectation-Maximization algorithm to model rater quality, downweight unreliable participants, and provide guaranteed monotonic likelihood convergence.
Result: BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters.
Conclusion: BBQ enables more reliable and cost-effective human evaluation of generative models by addressing rater variability and providing convergence guarantees.
Abstract: Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ achieves faster convergence, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
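A textbook-style EM variant conveys the flavor of rater-aware Bradley-Terry modeling (the update equations below are our simplification, not BBQ itself): each rater is either reliable or guesses at random, and responsibilities reweight both the rater-quality and score updates.

```python
import numpy as np

def em_bradley_terry(comps, n_items, n_raters, iters=100, lr=0.1):
    """comps: (n, 3) int array of (winner, loser, rater) triples."""
    s = np.zeros(n_items)                  # latent quality scores
    q = np.full(n_raters, 0.8)             # P(rater reliable); noise raters guess 50/50
    w, l, r = comps[:, 0], comps[:, 1], comps[:, 2]
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(s[l] - s[w]))        # BT prob of the observed outcome
        # E-step: responsibility that a comparison came from a reliable rater.
        resp = q[r] * p / (q[r] * p + (1.0 - q[r]) * 0.5)
        # M-step: rater quality = mean responsibility of that rater's comparisons.
        votes = np.maximum(np.bincount(r, minlength=n_raters), 1)
        q = np.bincount(r, weights=resp, minlength=n_raters) / votes
        # Responsibility-weighted gradient ascent on the BT log-likelihood.
        g = resp * (1.0 - p)
        s += lr * (np.bincount(w, weights=g, minlength=n_items)
                   - np.bincount(l, weights=g, minlength=n_items))
    return s, q
```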
[492] Deep Learning to Identify the Spatio-Temporal Cascading Effects of Train Delays in a High-Density Network
Vu Duc Anh Nguyen, Ziyue Li
Main category: cs.LG
TL;DR: A novel XGeoAI framework using Graph Attention Networks for live, explainable multi-step train delay forecasting that achieves higher precision in classifying delay events despite challenges on pure error metrics.
Details
Motivation: Railway network efficiency is undermined by cascading train delays, and existing GNN approaches lack multi-step autoregressive forecasting with live interpretable explanations needed for decision support.
Method: Two-stage autoregressive Graph Attention Network (GAT) model trained on Dutch railway data, representing the system as a spatio-temporal graph of operational events (arrivals and departures) with granular features including platform and station congestion, evaluated using a sequential k-step-ahead forecasting protocol.
Result: While challenged on MAE metrics by a simpler Persistence baseline, the GATv2 model achieves consistently higher precision in classifying delay events, which is crucial for a reliable decision support tool.
Conclusion: The proposed framework is viable for live deployment in train delay forecasting, offering improved delay-event classification precision and addressing the need for interpretable multi-step forecasting in railway management.
Abstract: The operational efficiency of railway networks, a cornerstone of modern economies, is persistently undermined by the cascading effects of train delays. Accurately forecasting this delay propagation is a critical challenge for real-time traffic management. While recent research has leveraged Graph Neural Networks (GNNs) to model the network structure of railways, a significant gap remains in developing frameworks that provide multi-step autoregressive forecasts at a network-wide scale, while simultaneously offering the live, interpretable explanations needed for decision support. This paper addresses this gap by developing and evaluating a novel XGeoAI framework for live, explainable, multi-step train delay forecasting. The core of this work is a two-stage, autoregressive Graph Attention Network (GAT) model, trained on a real-world dataset covering over 40% of the Dutch railway network. The model represents the system as a spatio-temporal graph of operational events (arrivals and departures) and is enriched with granular features, including platform and station congestion. To test its viability for live deployment, the model is rigorously evaluated using a sequential, k-step-ahead forecasting protocol that simulates real-world conditions where prediction errors can compound. The results demonstrate that while the proposed GATv2 model is challenged on pure error metrics (MAE) by a simpler Persistence baseline, it achieves consistently higher precision in classifying delay events, a crucial advantage for a reliable decision support tool.
[493] Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols
Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou, Caglar Gulcehre, Maksym Andriushchenko, Ameya Prabhu, Jonas Geiping
Main category: cs.LG
TL;DR: AI control protocols using LLM monitors are vulnerable to adaptive attacks where malicious models embed prompt injections to evade detection, making current monitoring-based defenses ineffective.
Details
Motivation: Current AI control protocols rely on LLM monitors as a central point of failure, but these can be compromised by adaptive attacks from untrusted models that know the protocol and monitor details.
Method: The study instantiates adaptive attacks where attackers embed publicly known or zero-shot prompt injections in model outputs to evade diverse monitors and complete malicious tasks on AI control benchmarks.
Result: Frontier models consistently evade diverse monitors and complete malicious tasks. The attack works universally against current monitor-based protocols, and the Defer-to-Resample protocol backfires by amplifying prompt injections.
Conclusion: Adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.
Abstract: AI control protocols serve as a defense mechanism to stop untrusted LLM agents from causing harm in autonomous settings. Prior work treats this as a security problem, stress testing with exploits that use the deployment context to subtly complete harmful side tasks, such as backdoor insertion. In practice, most AI control protocols are fundamentally based on LLM monitors, which can become a central point of failure. We study adaptive attacks by an untrusted model that knows the protocol and the monitor model, which is plausible if the untrusted model was trained with a later knowledge cutoff or can search for this information autonomously. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs. Using this tactic, frontier models consistently evade diverse monitors and complete malicious tasks on two main AI control benchmarks. The attack works universally against current protocols that rely on a monitor. Furthermore, the recent Defer-to-Resample protocol even backfires, as its resampling amplifies the prompt injection and effectively reframes it as a best-of-$n$ attack. In general, adaptive attacks on monitor models represent a major blind spot in current control protocols and should become a standard component of evaluations for future AI control mechanisms.
[494] CHUCKLE – When Humans Teach AI To Learn Emotions The Easy Way
Ankush Pratap Singh, Houwei Cao, Yong Liu
Main category: cs.LG
TL;DR: CHUCKLE is a perception-driven curriculum learning framework for emotion recognition that uses human annotator agreement to define sample difficulty, improving model accuracy and training efficiency.
Details
Motivation: Existing curriculum learning approaches for emotion recognition rely on heuristic or model-based difficulty definitions, ignoring human perception difficulty, which is crucial for subjective tasks like emotion recognition.
Method: Leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, assuming clips challenging for humans are similarly hard for machine learning models.
Result: Increases relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines while reducing gradient updates.
Conclusion: CHUCKLE enhances both training efficiency and model robustness by incorporating human perception difficulty into curriculum learning for emotion recognition.
Abstract: Curriculum learning (CL) structures training from simple to complex samples, facilitating progressive learning. However, existing CL approaches for emotion recognition often rely on heuristic, data-driven, or model-based definitions of sample difficulty, neglecting the difficulty for human perception, a critical factor in subjective tasks like emotion recognition. We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework that leverages annotator agreement and alignment in crowd-sourced datasets to define sample difficulty, under the assumption that clips challenging for humans are similarly hard for machine learning models. Empirical results suggest that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates, thereby enhancing both training efficiency and model robustness.
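Our reading of the difficulty definition, as a sketch: agreement is the majority share among annotators, difficulty is its complement, and the curriculum orders samples from easiest to hardest.

```python
import numpy as np

def agreement_difficulty(label_matrix):
    """label_matrix: (n_samples, n_annotators) categorical labels."""
    diffs = []
    for row in label_matrix:
        _, counts = np.unique(row, return_counts=True)
        diffs.append(1.0 - counts.max() / len(row))   # 1 - majority share
    return np.array(diffs)

def curriculum_order(label_matrix):
    return np.argsort(agreement_difficulty(label_matrix))  # easiest first
```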
[495] HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
Xinyi Wang, Jinyi Han, Zishang Jiang, Tingyun Li, Jiaqing Liang, Sihang Jiang, Zhaoqian Dai, Shuguang Ma, Fei Yu, Yanghua Xiao
Main category: cs.LG
TL;DR: HINT is an adaptive hinting framework that improves RL training for LLMs by providing heuristic hints instead of direct answers, addressing reward sparsity and training instability issues in complex reasoning tasks.
Details
Motivation: Existing RL methods like GRPO fail when task difficulty exceeds model capacity, causing reward sparsity and inefficient training. Prior approaches using off-policy data often misguide policy updates due to distributional mismatches.
Method: Proposes HINT framework that supplies heuristic hints to guide models toward discovering solutions autonomously, preserving reasoning capabilities. Introduces Affinity metric to monitor exploration efficiency and training stability.
Result: Extensive experiments on mathematical reasoning tasks show HINT consistently outperforms existing methods, achieving state-of-the-art results across various model scales with more stable learning and greater data efficiency.
Conclusion: HINT effectively addresses low training affinity in RL for LLMs, providing a robust solution for enhancing long chain-of-thought reasoning while maintaining model autonomy and improving training efficiency.
Abstract: Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model’s capacity, leading to reward sparsity and inefficient training. While prior work attempts to mitigate this using off-policy data, such as mixing RL with Supervised Fine-Tuning (SFT) or using hints, they often misguide policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity. This condition arises from a large distributional mismatch between external guidance and the model’s policy. To diagnose this, we introduce Affinity, the first quantitative metric for monitoring exploration efficiency and training stability. To improve Affinity, we propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework. Instead of providing direct answers, HINT supplies heuristic hints that guide the model to discover solutions on its own, preserving its autonomous reasoning capabilities. Extensive experiments on mathematical reasoning tasks show that HINT consistently outperforms existing methods, achieving state-of-the-art results with models of various scales, while also demonstrating significantly more stable learning and greater data efficiency. Code is available on GitHub.
[496] Performance Analysis of Machine Learning Algorithms in Chronic Kidney Disease Prediction
Iftekhar Ahmed, Tanzil Ebad Chowdhury, Biggo Bushon Routh, Nafisa Tasmiya, Shadman Sakib, Adil Ahmed Chowdhury
Main category: cs.LG
TL;DR: This paper proposes machine learning models for Chronic Kidney Disease (CKD) diagnosis, comparing eight ML algorithms and finding Random Forest and Logistic Regression achieve 99% accuracy.
Details
Motivation: Chronic Kidney Disease affects 10% of the global population, requiring effective risk evaluation and monitoring. Machine learning can provide quick and precise detection to help practitioners diagnose CKD efficiently.
Method: Used the UCI Machine Learning Repository CKD dataset with missing values filled using ‘mean-mode’ and ‘random sampling method’ strategies. Evaluated eight ML techniques: Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost.
Result: Random Forest and Logistic Regression showed outstanding 99% accuracy. AdaBoost, XGBoost, Naive Bayes, Decision Tree, and SVM performed well, while KNN classifier had the lowest accuracy at 73%.
Conclusion: Machine learning models, particularly Random Forest and Logistic Regression, can effectively predict CKD with high accuracy, providing valuable tools for healthcare practitioners in disease diagnosis and monitoring.
Abstract: Kidneys are the filter of the human body. About 10% of the global population is thought to be affected by Chronic Kidney Disease (CKD), which causes kidney function to decline. To protect at-risk patients from additional kidney damage, effective risk evaluation of CKD and appropriate CKD monitoring are crucial. Because of their quick and precise detection capabilities, machine learning models can help practitioners accomplish this goal efficiently; as a result, a large number of diagnosis systems and processes in the healthcare sector now rely on machine learning for disease prediction. In this study, we designed and evaluated disease-predictive computer-aided systems for the diagnosis of CKD. The CKD dataset is obtained from the UCI Machine Learning Repository and contains a few missing values, which are filled in using “mean-mode” and “random sampling” strategies. After the missing values are imputed, eight ML techniques (Random Forest, SVM, Naive Bayes, Logistic Regression, KNN, XGBoost, Decision Tree, and AdaBoost) are used to build models, and their accuracies are compared to identify the best-performing ones. Among them, Random Forest and Logistic Regression show an outstanding 99% accuracy, followed by AdaBoost, XGBoost, Naive Bayes, Decision Tree, and SVM, whereas the KNN classifier ranks last with an accuracy of 73%.
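A condensed sklearn rendering of the described pipeline (dataset loading and the full eight-model grid are elided; column names are placeholders):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def make_preprocessor(num_cols, cat_cols):
    return ColumnTransformer([
        ("num", SimpleImputer(strategy="mean"), num_cols),                    # mean fill
        ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),  # mode fill
                          ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ])

def evaluate(X, y, num_cols, cat_cols):
    for name, clf in [("RandomForest", RandomForestClassifier()),
                      ("LogisticRegression", LogisticRegression(max_iter=1000))]:
        pipe = Pipeline([("prep", make_preprocessor(num_cols, cat_cols)),
                         ("clf", clf)])
        print(name, cross_val_score(pipe, X, y, cv=5).mean())
```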
[497] Cross-Receiver Generalization for RF Fingerprint Identification via Feature Disentanglement and Adversarial Training
Yuhao Pan, Xiucheng Wang, Nan Cheng, Wenchao Xu
Main category: cs.LG
TL;DR: Proposes an RFFI framework using adversarial training and style transfer to handle cross-receiver variability by disentangling transmitter and receiver features, achieving up to 10% accuracy improvement.
Details
Motivation: Real-world RFFI deployment is hindered by receiver-induced variability, causing models to overfit to receiver-specific patterns and degrade performance when receivers change.
Method: Integrates adversarial training and style transfer to explicitly disentangle transmitter and receiver features, enforcing domain-invariant representation learning.
Result: Extensive experiments show the approach consistently outperforms state-of-the-art baselines with up to 10% improvement in average accuracy across diverse receiver settings.
Conclusion: The proposed framework successfully isolates genuine hardware signatures from receiver artifacts, ensuring robustness against receiver changes in RFFI systems.
Abstract: Radio frequency fingerprint identification (RFFI) is a critical technique for wireless network security, leveraging intrinsic hardware-level imperfections introduced during device manufacturing to enable precise transmitter identification. While deep neural networks have shown remarkable capability in extracting discriminative features, their real-world deployment is hindered by receiver-induced variability. In practice, RF fingerprint signals comprise transmitter-specific features as well as channel distortions and receiver-induced biases. Although channel equalization can mitigate channel noise, receiver-induced feature shifts remain largely unaddressed, causing the RFFI models to overfit to receiver-specific patterns. This limitation is particularly problematic when training and evaluation share the same receiver, as replacing the receiver in deployment can cause substantial performance degradation. To tackle this challenge, we propose an RFFI framework robust to cross-receiver variability, integrating adversarial training and style transfer to explicitly disentangle transmitter and receiver features. By enforcing domain-invariant representation learning, our method isolates genuine hardware signatures from receiver artifacts, ensuring robustness against receiver changes. Extensive experiments on multi-receiver datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving up to a 10% improvement in average accuracy across diverse receiver settings.
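One standard way to implement the disentanglement described above is a gradient-reversal layer; the sketch below is generic domain-adversarial training with invented sizes and loss weight, not the paper's architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad          # flip gradients flowing back to the feature extractor

feat = nn.Sequential(nn.Linear(256, 128), nn.ReLU())   # invented sizes
tx_head = nn.Linear(128, 10)   # transmitter classes (the signal we want to keep)
rx_head = nn.Linear(128, 4)    # receiver classes (the signal we want to remove)
ce = nn.CrossEntropyLoss()

def loss_fn(x, tx_label, rx_label, lam=0.5):
    z = feat(x)
    # The receiver head trains normally, but the reversed gradients push `feat`
    # toward receiver-invariant (domain-invariant) features.
    return ce(tx_head(z), tx_label) + lam * ce(rx_head(GradReverse.apply(z)), rx_label)
```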
[498] What Do Temporal Graph Learning Models Learn?
Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
Main category: cs.LG
TL;DR: Systematic evaluation of 7 temporal graph learning models reveals they capture some fundamental graph attributes well but fail at others, exposing important limitations and calling for more interpretability-driven evaluations.
Details
Motivation: Address concerns about reliability of temporal graph learning benchmarks and investigate which properties models actually use for predictions, given surprising competitiveness of simple heuristics.
Method: Systematically evaluate 7 models on their ability to capture 8 fundamental attributes related to the link structure of temporal graphs, including structural characteristics, temporal patterns, and edge formation mechanisms, using both synthetic and real-world datasets.
Result: Mixed performance - models capture some attributes well but fail to reproduce others, exposing important limitations in current temporal graph learning approaches.
Conclusion: Results provide practical insights for applying temporal graph learning models and motivate more interpretability-driven evaluations in temporal graph learning research.
Abstract: Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.
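As an example of the kind of attribute probe involved, here is a pure recency heuristic for temporal link scoring, one of the simple baselines such studies compare against (the scoring function is our choice).

```python
def recency_scores(edges, query_pairs, now):
    """edges: iterable of (u, v, t); score each queried pair by last-interaction recency."""
    last = {}
    for u, v, t in edges:
        last[(u, v)] = max(last.get((u, v), t), t)
    # Unseen pairs get score 0; recently active pairs get scores near 1.
    return [1.0 / (1.0 + now - last.get(p, float("-inf"))) for p in query_pairs]
```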
[499] STaTS: Structure-Aware Temporal Sequence Summarization via Statistical Window Merging
Disharee Bhowmick, Ranjith Ramanathan, Sathyanarayanan N. Aakur
Main category: cs.LG
TL;DR: STaTS is an unsupervised framework for structure-aware temporal summarization that compresses time series into compact token sequences while preserving temporal dynamics, achieving up to 30x compression with minimal performance loss.
Details
Motivation: Standard time series models treat all time steps equally, leading to inefficiencies and poor robustness. There's a need to leverage latent temporal structure like regime transitions and repeated motifs for more efficient representation learning.
Method: Detects change points across multiple temporal resolutions using BIC-based statistical divergence, then summarizes each segment using functions like the mean or generative models like GMMs. Works as model-agnostic preprocessor.
Result: Achieves 85-90% of full-model performance with dramatic computational cost reduction. Enables 30x sequence compression while retaining core temporal dynamics. Improves robustness under noise and outperforms uniform/clustering baselines.
Conclusion: STaTS provides a principled, general-purpose solution for efficient, structure-aware time series modeling that can be integrated with existing encoders without retraining.
Abstract: Time series data often contain latent temporal structure, such as transitions between locally stationary regimes, repeated motifs, and bursts of variability, that is rarely leveraged in standard representation learning pipelines. Existing models typically operate on raw or fixed-window sequences, treating all time steps as equally informative, which leads to inefficiencies, poor robustness, and limited scalability in long or noisy sequences. We propose STaTS, a lightweight, unsupervised framework for Structure-Aware Temporal Summarization that adaptively compresses both univariate and multivariate time series into compact, information-preserving token sequences. STaTS detects change points across multiple temporal resolutions using a BIC-based statistical divergence criterion, then summarizes each segment using simple functions like the mean or generative models such as GMMs. This process achieves up to 30x sequence compression while retaining core temporal dynamics. STaTS operates as a model-agnostic preprocessor and can be integrated with existing unsupervised time series encoders without retraining. Extensive experiments on 150+ datasets, including classification tasks on the UCR-85, UCR-128, and UEA-30 archives, and forecasting on ETTh1 and ETTh2, ETTm1, and Electricity, demonstrate that STaTS enables 85-90% of the full-model performance while offering dramatic reductions in computational cost. Moreover, STaTS improves robustness under noise and preserves discriminative structure, outperforming uniform and clustering-based compression baselines. These results position STaTS as a principled, general-purpose solution for efficient, structure-aware time series modeling.
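A simplified sketch of the two stages as we read them (the BIC criterion below is a generic Gaussian version, and segment summaries use plain means rather than GMMs):

```python
import numpy as np

def bic_gauss(x):
    n = len(x)
    return n * np.log(x.var() + 1e-8) + 2 * np.log(n)   # 2 params: mean, variance

def split_points(x, min_len=8):
    if len(x) <= 2 * min_len:
        return []
    scores = [bic_gauss(x[:i]) + bic_gauss(x[i:])
              for i in range(min_len, len(x) - min_len)]
    i = int(np.argmin(scores)) + min_len
    if min(scores) < bic_gauss(x):                       # split only if BIC improves
        return split_points(x[:i]) + [i] + [i + j for j in split_points(x[i:])]
    return []

def summarize(x):
    cuts = [0] + split_points(x) + [len(x)]
    return np.array([x[a:b].mean() for a, b in zip(cuts, cuts[1:])])
```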
[500] Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
Yankun Han
Main category: cs.LG
TL;DR: This paper studies weight initialization effects in ReLU MLPs and GPT-2 transformers, identifying stable initialization ranges and showing Kaiming initialization outperforms Xavier for ReLU networks. It also analyzes layerwise weight variance evolution during transformer pretraining.
Details
Motivation: To understand how weight initialization governs signal propagation and gradient flow at training start, and to connect classic initialization principles with modern transformer behavior.
Method: Conducted logarithmic sweep of initial standard deviations, compared Kaiming vs Xavier initialization in ReLU networks, and tracked layerwise Q/K/V weight variance through pretraining in a 12-layer GPT-2-style model.
Result: Found stable initialization band between 1e-2 and 1e-1 standard deviation, Kaiming initialization converges faster and more stably than Xavier under ReLU, and observed depth-dependent weight variance equilibration in transformers where shallow layers expand rapidly while deeper layers change gradually.
Conclusion: The results connect classic initialization principles with modern transformer behavior and provide practical recipes for robust training, including using Kaiming initialization for ReLU networks and understanding layer-specific initialization needs in transformers.
Abstract: Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.
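The Kaiming-vs-Xavier comparison can be reproduced in a few lines of forward-pass arithmetic; the depth, width, and batch size below are arbitrary.

```python
import numpy as np

def forward_variance(std_fn, depth=20, width=512, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(128, width))
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(width), size=(width, width))
        h = np.maximum(h @ W, 0.0)                     # ReLU layer
    return h.var()

xavier = lambda n: np.sqrt(1.0 / n)    # variance-preserving for linear units
kaiming = lambda n: np.sqrt(2.0 / n)   # corrects for ReLU halving the variance
print(f"xavier: {forward_variance(xavier):.3e}, kaiming: {forward_variance(kaiming):.3e}")
```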
[501] Cross-attention Secretly Performs Orthogonal Alignment in Recommendation Models
Hyunin Lee, Yong Zhang, Hoang Vu Nguyen, Xiaoyi Liu, Namyong Park, Christopher Jung, Rong Jin, Yang Wang, Zhigang Wang, Somayeh Sojoudi, Xue Feng
Main category: cs.LG
TL;DR: Cross-attention in CDSR can function as Orthogonal Alignment, discovering novel information not present in query inputs, which naturally emerges and improves model performance and scaling efficiency.
Details
Motivation: To better understand the underlying mechanisms of cross-attention in cross-domain sequential recommendation, beyond the prevailing residual alignment interpretation.
Method: Analysis of cross-attention mechanisms through over 300 experiments, examining when query inputs and cross-attention outputs are orthogonal.
Result: Orthogonal Alignment improves model performance and achieves superior accuracy per model parameter compared to parameter-matched baselines, emerging naturally without explicit constraints.
Conclusion: Orthogonal Alignment naturally emerges in cross-attention and improves scaling laws, offering new directions for parameter-efficient scaling in multi-modal research.
Abstract: Cross-domain sequential recommendation (CDSR) aims to align heterogeneous user behavior sequences collected from different domains. While cross-attention is widely used to enhance alignment and improve recommendation performance, its underlying mechanism is not fully understood. Most researchers interpret cross-attention as residual alignment, where the output is generated by removing redundant and preserving non-redundant information from the query input, referencing data from another domain supplied as the key and value inputs. Beyond the prevailing view, we introduce Orthogonal Alignment, a phenomenon in which cross-attention discovers novel information that is not present in the query input, and further argue that these two contrasting alignment mechanisms can co-exist in recommendation models. We find that when the query input and output of cross-attention are orthogonal, model performance improves, a pattern observed across over 300 experiments. Notably, Orthogonal Alignment emerges naturally, without any explicit orthogonality constraints. Our key insight is that Orthogonal Alignment emerges naturally because it improves the scaling law. We show that baselines that additionally incorporate a cross-attention module outperform parameter-matched baselines, achieving superior accuracy per model parameter. We hope these findings offer new directions for parameter-efficient scaling in multi-modal research.
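The orthogonality condition is cheap to probe with a generic measurement like the one below (not the authors' code): average absolute cosine similarity between cross-attention query inputs and outputs, where values near zero indicate the Orthogonal Alignment regime.

```python
import torch
import torch.nn.functional as F

def orthogonality_score(query_states, cross_attn_out):
    """Both tensors: (batch, seq, dim); near-zero output suggests orthogonal alignment."""
    cos = F.cosine_similarity(query_states, cross_attn_out, dim=-1)
    return cos.abs().mean()
```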
[502] BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards
Sangyun Lee, Brandon Amos, Giulia Fanti
Main category: cs.LG
TL;DR: BaNEL is a Bayesian algorithm that improves generative models using only failed attempts, minimizing expensive reward evaluations by learning from negative evidence.
Details
Motivation: Current generative models struggle when reward signals are near-zero and reward evaluations are expensive, creating a different learning challenge than standard reward-based training.
Method: Cast learning from failures as another generative modeling problem, use Bayesian approach to assess if new data resembles previous failures and steer generation away from them.
Result: BaNEL improves model performance without observing successful samples, outperforms novelty-bonus approaches by orders of magnitude in success rate while using fewer reward evaluations.
Conclusion: BaNEL effectively addresses the challenge of learning from sparse rewards and expensive evaluations by leveraging negative evidence from failed attempts.
Abstract: Today’s generative models thrive with large amounts of supervised data and informative reward functions characterizing the quality of the generation. They work under the assumptions that the supervised data provides knowledge to pre-train the model, and the reward function provides dense information about how to further improve the generation quality and correctness. However, in the hardest instances of important problems, two problems arise: (1) the base generative model attains a near-zero reward signal, and (2) calls to the reward oracle are expensive. This setting poses a fundamentally different learning challenge than standard reward-based post-training. To address this, we propose BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the model using failed attempts only, while minimizing the number of reward evaluations (NREs). Our method is based on the idea that the problem of learning regularities underlying failures can be cast as another, in-loop generative modeling problem. We then leverage this model to assess whether new data resembles previously seen failures and steer the generation away from them. We show that BaNEL can improve model performance without observing a single successful sample on several sparse-reward tasks, outperforming existing novelty-bonus approaches by up to several orders of magnitude in success rate, while using fewer reward evaluations.
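A fully self-contained toy of steering away from failures (BaNEL itself trains a generative model of failures and is Bayesian; here both the generator and the failure model are one-dimensional Gaussians, and the threshold is invented):

```python
import torch
from torch.distributions import Normal

generator = Normal(0.0, 1.0)                             # stand-in pretrained generator
failures = generator.sample((500,)) * 0.3 - 1.0          # pretend failures cluster near -1
failure_model = Normal(failures.mean(), failures.std())  # fit a density to the failures

samples = generator.sample((1000,))
keep = failure_model.log_prob(samples) < -4.0            # reject failure-like samples
steered = samples[keep]
```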
[503] On Uniformly Scaling Flows: A Density-Aligned Approach to Deep One-Class Classification
Faried Abu Zaid, Tim Katzke, Emmanuel Müller, Daniel Neider
Main category: cs.LG
TL;DR: Uniformly scaling flows (USFs) connect deep one-class classification and density estimation approaches for anomaly detection, showing that USF training reduces to Deep SVDD with collapse prevention, and replacing non-USFs with USFs improves performance and stability.
Details
Motivation: To bridge the gap between two major unsupervised anomaly detection paradigms: deep one-class classification (like Deep SVDD) and density estimation (like normalizing flows), and show their theoretical connection through uniformly scaling flows.
Method: Prove that training uniformly scaling flows (USFs) via maximum-likelihood reduces to a Deep SVDD objective with inherent regularization against representational collapse. Use USFs as drop-in replacements for non-USFs in anomaly detection architectures.
Result: USFs induce tighter alignment between negative log-likelihood and latent norm than Deep SVDD or non-USFs. Empirical results show consistent performance gains and improved training stability across multiple benchmarks for both image-level and pixel-level anomaly detection.
Conclusion: USFs unify two major anomaly detection paradigms, providing both theoretical understanding and practical performance improvements, making them recommended replacements for non-USFs in modern anomaly detection architectures.
Abstract: Unsupervised anomaly detection is often framed around two widely studied paradigms. Deep one-class classification, exemplified by Deep SVDD, learns compact latent representations of normality, while density estimators realized by normalizing flows directly model the likelihood of nominal data. In this work, we show that uniformly scaling flows (USFs), normalizing flows with a constant Jacobian determinant, precisely connect these approaches. Specifically, we prove how training a USF via maximum-likelihood reduces to a Deep SVDD objective with a unique regularization that inherently prevents representational collapse. This theoretical bridge implies that USFs inherit both the density faithfulness of flows and the distance-based reasoning of one-class methods. We further demonstrate that USFs induce a tighter alignment between negative log-likelihood and latent norm than either Deep SVDD or non-USFs, and how recent hybrid approaches combining one-class objectives with VAEs can be naturally extended to USFs. Consequently, we advocate using USFs as a drop-in replacement for non-USFs in modern anomaly detection architectures. Empirically, this substitution yields consistent performance gains and substantially improved training stability across multiple benchmarks and model backbones for both image-level and pixel-level detection. These results unify two major anomaly detection paradigms, advancing both theoretical understanding and practical performance.
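A volume-preserving additive coupling layer is the simplest USF instance; the toy below shows why the negative log-likelihood then reduces (up to a constant) to the squared latent norm, i.e., a Deep SVDD-style distance score. Sizes are arbitrary.

```python
import torch
from torch import nn

class AdditiveCoupling(nn.Module):
    """Additive coupling has Jacobian determinant 1, hence a uniformly scaling flow."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d // 2, 64), nn.ReLU(),
                                 nn.Linear(64, d // 2))
    def forward(self, x):
        a, b = x.chunk(2, dim=-1)
        return torch.cat([a, b + self.net(a)], dim=-1)

def nll(z):
    # Standard-normal base + constant log-det term: the NLL is the squared
    # latent norm up to an additive constant, which doubles as the anomaly score.
    return 0.5 * z.pow(2).sum(dim=-1)

flow = AdditiveCoupling(8)
scores = nll(flow(torch.randn(4, 8)))   # per-sample anomaly scores
```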
[504] Interpretable Machine Learning for Predicting Startup Funding, Patenting, and Exits
Saeid Mashhadi, Amirhossein Saghezchi, Vesal Ghassemzadeh Kashani
Main category: cs.LG
TL;DR: An interpretable ML framework predicts startup outcomes (funding, patenting, exit) using Crunchbase and USPTO data, achieving high AUROC scores of 0.921, 0.817, and 0.872 respectively.
Details
Motivation: To develop a transparent and reproducible machine learning framework for forecasting startup success metrics including funding acquisition, patent growth, and exit events (IPO/acquisition).
Method: Constructed firm-quarter panel from Crunchbase and USPTO data (2010-2023), used preprocessing on 2010-2019 development window, addressed class imbalance with inverse-prevalence weights and SMOTE-NC, compared logistic regression with tree ensembles (Random Forest, XGBoost, LightGBM, CatBoost) using PR-AUC and AUROC metrics.
Result: Achieved AUROC values of 0.921 for patent predictions, 0.817 for funding predictions, and 0.872 for exit predictions, demonstrating strong predictive performance across all three startup outcome categories.
Conclusion: The framework provides transparent and reproducible rankings for innovation finance, successfully forecasting startup outcomes with high accuracy using interpretable machine learning methods.
Abstract: This study develops an interpretable machine learning framework to forecast startup outcomes, including funding, patenting, and exit. A firm-quarter panel for 2010-2023 is constructed from Crunchbase and matched to U.S. Patent and Trademark Office (USPTO) data. Three horizons are evaluated: next funding within 12 months, patent-stock growth within 24 months, and exit through an initial public offering (IPO) or acquisition within 36 months. Preprocessing is fit on a development window (2010-2019) and applied without change to later cohorts to avoid leakage. Class imbalance is addressed using inverse-prevalence weights and the Synthetic Minority Oversampling Technique for Nominal and Continuous features (SMOTE-NC). Logistic regression and tree ensembles, including Random Forest, XGBoost, LightGBM, and CatBoost, are compared using the area under the precision-recall curve (PR-AUC) and the area under the receiver operating characteristic curve (AUROC). Patent, funding, and exit predictions achieve AUROC values of 0.921, 0.817, and 0.872, providing transparent and reproducible rankings for innovation finance.
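Of the two imbalance remedies mentioned, the inverse-prevalence weights are a one-liner (SMOTE-NC would come from the imblearn package):

```python
import numpy as np

def inverse_prevalence_weights(y):
    """y: integer class labels; rare classes receive proportionally larger weights."""
    prevalence = np.bincount(y) / len(y)
    return 1.0 / prevalence[y]
```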
[505] Geodesic Calculus on Latent Spaces
Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth
Main category: cs.LG
TL;DR: The paper develops a framework for geometric analysis of autoencoder latent manifolds using discrete Riemannian calculus and implicit representations, enabling computation of geodesics and Riemannian exponential maps.
Details
Motivation: To provide robust geometric tools for studying latent manifolds of autoencoders, whose implicit representations are often inaccurate in practical applications, and to enable Riemannian geometric operations on these manifolds.
Method: Proposes describing latent manifolds as implicit submanifolds, develops discrete Riemannian calculus tools, and learns approximate projections via denoising objectives to handle representation inaccuracies.
Result: The framework successfully computes geodesic paths and Riemannian exponential maps on various autoencoder latent manifolds trained on both synthetic and real data.
Conclusion: The approach provides robust geometric analysis tools for autoencoder latent manifolds, supporting different Riemannian geometries and enabling practical computation of geodesics despite representation inaccuracies.
Abstract: Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.
[506] CRPS-LAM: Regional ensemble weather forecasting from matching marginals
Erik Larsson, Joel Oskarsson, Tomas Landelius, Fredrik Lindsten
Main category: cs.LG
TL;DR: CRPS-LAM is a probabilistic regional weather forecasting model that uses CRPS-based training to generate ensemble members in a single forward pass, achieving 39x faster sampling than diffusion models while maintaining accuracy.
Details
Motivation: Diffusion-based models for weather prediction show strong performance but are computationally expensive at sampling time, making them impractical for operational use.
Method: Train a probabilistic Limited-Area Modeling (LAM) forecasting model using a Continuous Ranked Probability Score (CRPS) objective, injecting a single latent noise vector to generate ensemble members in one forward pass.
Result: CRPS-LAM achieves sampling speeds up to 39 times faster than diffusion-based models, matches the low errors of diffusion models on MEPS regional dataset, and retains fine-scale forecast details.
Conclusion: CRPS-LAM provides an effective approach for probabilistic regional weather forecasting with significantly improved computational efficiency while maintaining forecast quality.
Abstract: Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained with the Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By also retaining fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting.
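For reference, ensemble CRPS training typically uses the "fair" estimator below; in the CRPS-LAM setup each member would come from one forward pass with its own latent noise vector, though this generic estimator is only a sketch of the objective, not the paper's implementation.

```python
import torch

def fair_crps(ens: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
    """Unbiased CRPS estimate from an ensemble: E|X - y| - 0.5 * E|X - X'|.
    ens: (m, ...) forecasts; obs: (...) verifying observation."""
    m = ens.shape[0]
    skill = (ens - obs).abs().mean(dim=0)
    spread = (ens.unsqueeze(0) - ens.unsqueeze(1)).abs().sum(dim=(0, 1)) / (m * (m - 1))
    return (skill - 0.5 * spread).mean()

# hypothetical usage: each member from a single forward pass with its own noise
# ens = torch.stack([model(inputs, torch.randn(latent_dim)) for _ in range(m)])
```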
[507] Locally Optimal Private Sampling: Beyond the Global Minimax
Hrad Ghoukasian, Bonwoo Lee, Shahab Asoodeh
Main category: cs.LG
TL;DR: The paper studies local differential privacy (LDP) sampling, focusing on minimax risk around a fixed distribution P0 rather than global optimality. It extends prior work to functional LDP and shows that globally optimal samplers yield optimal local performance near P0.
Details
Motivation: To address the fundamental challenge of generating realistic-looking data under strong LDP guarantees, moving beyond global minimax-optimality to examine local performance around specific distributions, which better models practical scenarios like private sampling with public data.
Method: Extend prior work from pure LDP to functional LDP framework, prove that globally optimal functional LDP samplers yield optimal local samplers when constrained to distributions near P0, and derive closed-form expressions for locally minimax-optimal samplers.
Result: Characterized the exact value of local minimax risk around fixed distribution P0, showing it depends on both P0 and privacy level. Derived simple closed-form expressions for locally optimal samplers that don’t depend on f-divergence choice.
Conclusion: The local framework provides better performance than global methods for private sampling with public data, and the locally optimal sampler consistently outperforms global minimax samplers in empirical comparisons.
Abstract: We study the problem of sampling from a distribution under local differential privacy (LDP). Given a private distribution $P \in \mathcal{P}$, the goal is to generate a single sample from a distribution that remains close to $P$ in $f$-divergence while satisfying the constraints of LDP. This task captures the fundamental challenge of producing realistic-looking data under strong privacy guarantees. While prior work by Park et al. (NeurIPS'24) focuses on global minimax-optimality across a class of distributions, we take a local perspective. Specifically, we examine the minimax risk in a neighborhood around a fixed distribution $P_0$, and characterize its exact value, which depends on both $P_0$ and the privacy level. Our main result shows that the local minimax risk is determined by the global minimax risk when the distribution class $\mathcal{P}$ is restricted to a neighborhood around $P_0$. To establish this, we (1) extend previous work from pure LDP to the more general functional LDP framework, and (2) prove that the globally optimal functional LDP sampler yields the optimal local sampler when constrained to distributions near $P_0$. Building on this, we also derive a simple closed-form expression for the locally minimax-optimal samplers which does not depend on the choice of $f$-divergence. We further argue that this local framework naturally models private sampling with public data, where the public data distribution is represented by $P_0$. In this setting, we empirically compare our locally optimal sampler to existing global methods, and demonstrate that it consistently outperforms global minimax samplers.
[508] Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
Shangzhe Li, Dongruo Zhou, Weitong Zhang
Main category: cs.LG
TL;DR: The paper introduces MB-AIL, a model-based adversarial imitation learning algorithm that achieves horizon-free, second-order sample complexity guarantees for online interaction with limited expert demonstrations, matching minimax-optimal bounds.
Details
Motivation: To address the poorly understood benefits of online interaction and impact of stochasticity in adversarial imitation learning, where agents learn from offline demonstrations without reward access.
Method: Proposed MB-AIL (model-based adversarial imitation learning) algorithm that uses general function approximations for both expert data and reward-free interactions, establishing second-order sample complexity guarantees.
Result: MB-AIL achieves horizon-free, second-order sample complexity that scales with policy variance, tightens in deterministic systems, and matches minimax-optimal bounds for online interaction with limited expert demonstrations.
Conclusion: The algorithm attains optimal sample complexity, validated by experiments showing it matches or surpasses existing methods’ sample efficiency, providing theoretical foundation for online adversarial imitation learning.
Abstract: We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
[509] Limitations of Normalization in Attention Mechanism
Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State
Main category: cs.LG
TL;DR: The paper analyzes limitations of softmax normalization in attention mechanisms, showing that as more tokens are selected, the model’s ability to distinguish informative tokens decreases and converges toward uniform selection, with gradient sensitivity issues at low temperatures.
Details
Motivation: To investigate the limitations of normalization in attention mechanisms and understand how softmax scaling affects token selection capabilities and geometric separation in transformer models.
Method: Developed a theoretical framework for analyzing selective ability and geometric separation in token selection, conducted experiments with pre-trained GPT-2 model to empirically validate theoretical results, and analyzed attention mechanism behaviors including distance bounds and separation criteria.
Result: Empirical validation showed that as the number of selected tokens increases, the model’s ability to distinguish informative tokens declines and converges toward uniform selection patterns. Gradient sensitivity under softmax normalization presents training challenges, especially at low temperature settings.
Conclusion: The findings advance understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures to address the identified limitations.
Abstract: This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model’s selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model’s ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance the current understanding of the softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.
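Both reported effects are easy to reproduce in a toy setting: softmax weights flatten toward 1/n as more near-tied tokens compete, and a saturated low-temperature softmax passes back vanishing gradients.

```python
import torch

# flattening: with many near-tied logits, softmax approaches the uniform 1/n
scores = torch.linspace(2.0, 1.0, steps=50)
for n in (5, 20, 50):
    w = torch.softmax(scores[:n], dim=0)
    print(f"n={n}: max weight {w.max():.3f} vs uniform {1 / n:.3f}")

# saturation: at low temperature the softmax Jacobian diag(w) - w w^T vanishes
s = torch.tensor([2.0, 1.0, 0.0], requires_grad=True)
torch.softmax(s / 0.01, dim=0)[0].backward()
print("gradient at temperature 0.01:", s.grad)  # effectively zero
```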
[510] Geo-Aware Models for Stream Temperature Prediction across Different Spatial Regions and Scales
Shiyuan Luo, Runlong Yu, Shengyu Chen, Yingda Fan, Yiqun Xie, Yanhua Li, Xiaowei Jia
Main category: cs.LG
TL;DR: Geo-STARS is a geo-aware spatio-temporal modeling framework that uses geographic embeddings to improve generalization of stream water temperature prediction across different watersheds and spatial scales, addressing data heterogeneity and limited training samples.
Details
Motivation: Existing physics-based and data-driven models fail to generalize across varying spatial regions and scales due to data heterogeneity and limited observation samples in environmental ecosystems.
Method: Proposes geo-aware embedding that leverages geographic information to capture shared patterns across regions and scales, integrated into a gated spatio-temporal graph neural network to learn complex patterns guided by geographic and hydrological context.
Result: Geo-STARS demonstrates superior generalization performance across regions and scales in predicting stream water temperature using 37 years of real-world data from multiple watersheds along the eastern US coast, outperforming state-of-the-art baselines.
Conclusion: The framework shows promise for scalable, data-efficient environmental monitoring and decision-making by effectively handling data sparsity and heterogeneity through geographic-aware modeling.
Abstract: Understanding environmental ecosystems is vital for the sustainable management of our planet. However, existing physics-based and data-driven models often fail to generalize to varying spatial regions and scales due to the inherent data heterogeneity present in real environmental ecosystems. This generalization issue is further exacerbated by the limited observation samples available for model training. To address these issues, we propose Geo-STARS, a geo-aware spatio-temporal modeling framework for predicting stream water temperature across different watersheds and spatial scales. The major innovation of Geo-STARS is the introduction of geo-aware embedding, which leverages geographic information to explicitly capture shared principles and patterns across spatial regions and scales. We further integrate the geo-aware embedding into a gated spatio-temporal graph neural network. This design enables the model to learn complex spatial and temporal patterns guided by geographic and hydrological context, even with sparse or no observational data. We evaluate Geo-STARS’s efficacy in predicting stream water temperature, which is a master factor for water quality. Using real-world datasets spanning 37 years across multiple watersheds along the eastern coast of the United States, Geo-STARS demonstrates its superior generalization performance across both regions and scales, outperforming state-of-the-art baselines. These results highlight the promise of Geo-STARS for scalable, data-efficient environmental monitoring and decision-making.
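The abstract does not detail the gating, but one simple reading of a geo-aware embedding is a learned gate derived from geographic coordinates and attributes that modulates node features before the spatio-temporal GNN; the module below is an illustrative sketch, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GeoGate(nn.Module):
    """Sketch: a geographic embedding gates node features so that patterns
    shared across regions and scales can be reused (illustrative only)."""
    def __init__(self, geo_dim: int, feat_dim: int):
        super().__init__()
        self.geo_embed = nn.Sequential(nn.Linear(geo_dim, feat_dim), nn.Tanh())
        self.gate = nn.Linear(feat_dim, feat_dim)

    def forward(self, x: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(self.geo_embed(geo)))  # per-node gate in (0, 1)
        return g * x                                       # geography-modulated features
```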
[511] Automated Evolutionary Optimization for Resource-Efficient Neural Network Training
Ilia Revin, Leon Strelkov, Vadim A. Potemkin, Ivan Kireev, Andrey Savchenko
Main category: cs.LG
TL;DR: PETRA is an AutoML framework that uses evolutionary optimization for model architecture and training strategy, achieving significant model compression (75% size reduction) and performance improvements (33% latency reduction, 13% throughput increase) without metric degradation.
Details
Motivation: Addressing critical challenges in neural network optimization including distributed computing, compression techniques, and efficient training to meet the increasing need for scalable and resource-efficient models.
Method: Developed PETRA framework that applies evolutionary optimization to model architecture and training strategy, incorporating pruning, quantization, and loss regularization techniques.
Result: Experimental results show PETRA achieves up to 75% reduction in model size, 33% reduction in latency, and 13% increase in throughput without noticeable degradation in target metrics on financial event sequences, image, and time-series benchmarks.
Conclusion: PETRA successfully addresses neural network optimization challenges through automated evolutionary optimization, demonstrating significant improvements in model efficiency and scalability across various data types.
Abstract: There are many critical challenges in optimizing neural network models, including distributed computing, compression techniques, and efficient training, regardless of their application to specific tasks. Solving such problems is crucial because the need for scalable and resource-efficient models is increasing. To address these challenges, we have developed a new automated machine learning (AutoML) framework, Parameter Efficient Training with Robust Automation (PETRA). It applies evolutionary optimization to model architecture and training strategy. PETRA includes pruning, quantization, and loss regularization. Experimental studies on real-world data with financial event sequences, as well as image and time-series benchmarks, demonstrate PETRA’s ability to improve neural model performance and scalability – namely, a significant decrease in model size (up to 75%) and latency (up to 33%), and an increase in throughput (by 13%) without noticeable degradation in the target metric.
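The abstract leaves the search loop unspecified; a generic evolutionary search over compression configurations of the kind PETRA automates might look like the sketch below, where the genome fields and the fitness function (which would train, compress, and score a model) are illustrative assumptions.

```python
import random

def mutate(cfg: dict) -> dict:
    """Perturb one compression genome: pruning ratio, bit-width, reg weight."""
    return {"prune": min(0.9, max(0.0, cfg["prune"] + random.gauss(0, 0.1))),
            "bits": random.choice([4, 8, 16]),
            "reg": cfg["reg"] * random.choice([0.5, 1.0, 2.0])}

def evolve(fitness, generations: int = 10, pop_size: int = 8) -> dict:
    """Truncation-selection evolutionary loop over configs (sketch)."""
    pop = [{"prune": 0.5, "bits": 8, "reg": 1e-4} for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(pop, key=fitness)
```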
[512] MODE: Learning compositional representations of complex systems with Mixtures Of Dynamical Experts
Nathan Quiblier, Roy Friedman, Matthew Ricci
Main category: cs.LG
TL;DR: MODE (Mixture Of Dynamical Experts) is a graphical modeling framework that decomposes complex biological dynamics into sparse, interpretable components using neural gating, enabling unsupervised discovery of behavioral regimes and accurate long-term forecasting across noisy transitions.
Details
Motivation: Traditional flow-based modeling techniques struggle with complex biological systems that have overlapping behavioral regimes and noisy transitions between them, such as cellular subpopulations shifting between cycling, equilibrium, and developmental branching dynamics.
Method: Proposed MODE framework with neural gating mechanism that decomposes dynamics into sparse components, allowing agents to jump between different governing laws. Evaluated on synthetic and real datasets including single-cell RNA sequencing data.
Result: Successfully benchmarked on unsupervised classification with synthetic data, achieved accurate forecasting on cycling and branching processes, and distinguished proliferation from differentiation dynamics in human single-cell data while predicting cell fate commitment timing.
Conclusion: MODE effectively addresses the challenge of modeling complex biological systems with noisy regime transitions, providing both interpretable regime discovery and accurate forecasting capabilities that outperform traditional methods.
Abstract: Dynamical systems in the life sciences are often composed of complex mixtures of overlapping behavioral regimes. Cellular subpopulations may shift from cycling to equilibrium dynamics or branch towards different developmental fates. The transitions between these regimes can appear noisy and irregular, posing a serious challenge to traditional, flow-based modeling techniques which assume locally smooth dynamics. To address this challenge, we propose MODE (Mixture Of Dynamical Experts), a graphical modeling framework whose neural gating mechanism decomposes complex dynamics into sparse, interpretable components, enabling both the unsupervised discovery of behavioral regimes and accurate long-term forecasting across regime transitions. Crucially, because agents in our framework can jump to different governing laws, MODE is especially tailored to the aforementioned noisy transitions. We evaluate our method on a battery of synthetic and real datasets from computational biology. First, we systematically benchmark MODE on an unsupervised classification task using synthetic dynamical snapshot data, including in noisy, few-sample settings. Next, we show how MODE succeeds on challenging forecasting tasks which simulate key cycling and branching processes in cell biology. Finally, we deploy our method on human, single-cell RNA sequencing data and show that it can not only distinguish proliferation from differentiation dynamics but also predict when cells will commit to their ultimate fate, a key outstanding challenge in computational biology.
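The core architectural idea, a gating network that softly assigns each state to one of K expert vector fields, can be sketched in a few lines; this is a minimal illustration, not the paper's exact graphical model.

```python
import torch
import torch.nn as nn

class MODESketch(nn.Module):
    """Minimal mixture of dynamical experts: a gate softly assigns each
    state to one of K expert vector fields (hypothetical architecture)."""
    def __init__(self, dim: int, n_experts: int, hidden: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
             for _ in range(n_experts)])
        self.gate = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(x), dim=-1)             # (B, K) regime weights
        fields = torch.stack([e(x) for e in self.experts])  # (K, B, D) expert dynamics
        return torch.einsum("bk,kbd->bd", w, fields)        # gated dx/dt
```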
[513] ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection
Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang
Main category: cs.LG
TL;DR: A novel theoretical framework using Bregman divergence for OOD detection, introducing ConjNorm method that searches for optimal norm coefficients and achieves state-of-the-art performance.
Details
Motivation: Existing OOD detection methods based on logits, distances, or data distribution assumptions may fail to accurately reflect true data density or impose impractical constraints, requiring a unified perspective on density-based score design.
Method: Proposed ConjNorm method grounded in Bregman divergence framework, reframing density function design as search for optimal norm coefficient p, with Monte Carlo-based importance sampling for partition function estimation.
Result: Established new state-of-the-art in OOD detection, outperforming current best method by up to 13.25% (FPR95) on CIFAR-100 and 28.19% (FPR95) on ImageNet-1K across various setups.
Conclusion: The Bregman divergence framework provides unified perspective for density-based OOD detection, and ConjNorm method effectively addresses computational challenges while achieving superior performance.
Abstract: Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimated scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a ConjNorm method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed ConjNorm has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25% and 28.19% (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
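The search for the norm coefficient $p$ can be pictured as a one-dimensional validation sweep. The l_p-norm-of-logits score below is a stand-in for the paper's conjugate-norm density score, so treat this as a schematic of the line search rather than ConjNorm itself.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def lp_score(logits: np.ndarray, p: float) -> np.ndarray:
    """Hypothetical density proxy: the l_p norm of the logits."""
    return np.linalg.norm(logits, ord=p, axis=-1)

def search_p(id_logits, ood_logits, grid=np.linspace(1.1, 5.0, 40)) -> float:
    """Line search: pick the p that best separates held-out ID from OOD."""
    y = np.r_[np.ones(len(id_logits)), np.zeros(len(ood_logits))]
    def auroc(p):
        return roc_auc_score(y, np.r_[lp_score(id_logits, p), lp_score(ood_logits, p)])
    return max(grid, key=auroc)
```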
[514] E-ICL: Enhancing Fine-Grained Emotion Recognition through the Lens of Prototype Theory
Zhaochun Ren, Zhou Yang, Chenglong Ye, Yufeng Wang, Haizhou Sun, Chao Chen, Xiaofei Zhu, Yunbing Wu, Xiangwen Liao
Main category: cs.LG
TL;DR: ICL performs poorly on fine-grained emotion detection due to reliance on semantically similar but emotionally inaccurate prototypes and interference from irrelevant categories. The proposed E-ICL method uses emotionally accurate prototypes and exclusionary prediction to improve performance without additional training.
Details
Motivation: ICL achieves remarkable performance in various domains but significantly deteriorates for emotion detection tasks, especially fine-grained emotion recognition. The underlying reasons for this poor performance remain unclear.
Method: Proposed Emotion Context Learning (E-ICL) method that uses emotionally accurate prototypes with dynamic labels and employs an exclusionary emotion prediction strategy to avoid interference from irrelevant categories. The process is accomplished with a plug-and-play emotion auxiliary model without additional training.
Result: Experiments on fine-grained emotion datasets (EDOS, Empathetic-Dialogues, EmpatheticIntent, GoEmotions) show E-ICL achieves superior emotion prediction performance. Even with emotion auxiliary models smaller than 10% of LLMs, E-ICL boosts LLM performance by over 4% on multiple datasets.
Conclusion: E-ICL effectively addresses ICL’s deficiencies in fine-grained emotion recognition by leveraging prototype theory insights, using emotionally accurate prototypes, and implementing exclusionary prediction strategies, achieving significant performance improvements without requiring additional training.
Abstract: In-context learning (ICL) achieves remarkable performance in various domains such as knowledge acquisition, commonsense reasoning, and semantic understanding. However, its performance significantly deteriorates for emotion detection tasks, especially fine-grained emotion recognition. The underlying reasons for this remain unclear. In this paper, we identify the reasons behind ICL’s poor performance from the perspective of prototype theory and propose a method to address this issue. Specifically, we conduct extensive pilot experiments and find that ICL conforms to the prototype theory on fine-grained emotion recognition. Based on this theory, we uncover the following deficiencies in ICL: (1) It relies on prototypes (example-label pairs) that are semantically similar but emotionally inaccurate to predict emotions. (2) It is prone to interference from irrelevant categories, affecting the accuracy and robustness of the predictions. To address these issues, we propose an Emotion Context Learning method (E-ICL) on fine-grained emotion recognition. E-ICL relies on more emotionally accurate prototypes to predict categories by referring to emotionally similar examples with dynamic labels. Simultaneously, E-ICL employs an exclusionary emotion prediction strategy to avoid interference from irrelevant categories, thereby increasing its accuracy and robustness. Note that the entire process is accomplished with the assistance of a plug-and-play emotion auxiliary model, without additional training. Experiments on the fine-grained emotion datasets EDOS, Empathetic-Dialogues, EmpatheticIntent, and GoEmotions show that E-ICL achieves superior emotion prediction performance. Furthermore, even when the emotion auxiliary model used is lower than 10% of the LLMs, E-ICL can still boost the performance of LLMs by over 4% on multiple datasets.
[515] Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions
Ayush Jain, Norio Kosaka, Xinhu Li, Kyung-Min Kim, Erdem Bıyık, Joseph J. Lim
Main category: cs.LG
TL;DR: SAVO is an actor architecture for reinforcement learning that generates multiple action proposals and selects the best one to overcome local optima in Q-functions, improving performance in complex tasks.
Details
Motivation: In complex RL tasks like dexterous manipulation and restricted locomotion, deterministic policy gradient methods struggle because Q-functions have many local optima, causing gradient ascent to get stuck.
Method: SAVO generates multiple action proposals and selects the one with highest Q-value, while also approximating the Q-function by truncating poor local optima to guide gradient ascent more effectively.
Result: SAVO finds optimal actions more frequently and outperforms alternate actor architectures in tasks including restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems.
Conclusion: The proposed SAVO actor architecture effectively addresses the local optima problem in deterministic policy gradients, leading to better performance in complex reinforcement learning tasks.
Abstract: In reinforcement learning, off-policy actor-critic methods like DDPG and TD3 use deterministic policy gradients: the Q-function is learned from environment data, while the actor maximizes it via gradient ascent. We observe that in complex tasks such as dexterous manipulation and restricted locomotion with mobility constraints, the Q-function exhibits many local optima, making gradient ascent prone to getting stuck. To address this, we introduce SAVO, an actor architecture that (i) generates multiple action proposals and selects the one with the highest Q-value, and (ii) approximates the Q-function repeatedly by truncating poor local optima to guide gradient ascent more effectively. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds optimal actions more frequently and outperforms alternate actor architectures.
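Step (i), proposal generation and argmax-Q selection, is straightforward to sketch; the names below are illustrative, and the truncated Q-surrogates of step (ii) are omitted.

```python
import torch

def savo_select(actor_heads, q_fn, state: torch.Tensor) -> torch.Tensor:
    """Evaluate K action proposals and return the one with the highest Q-value.
    actor_heads: list of policies mapping a state to an action proposal."""
    proposals = torch.stack([head(state) for head in actor_heads])  # (K, act_dim)
    states = state.unsqueeze(0).expand(len(actor_heads), -1)        # (K, obs_dim)
    return proposals[q_fn(states, proposals).argmax()]
```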
[516] Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo
Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
Main category: cs.LG
TL;DR: The paper introduces a sequential Monte Carlo method for solving linear-Gaussian inverse problems using decoupled diffusion, which allows larger updates to samples and can be extended to discrete data.
Details
Motivation: To contribute to research using pre-trained generative diffusion models as priors for Bayesian inverse problems by developing a more efficient method.
Method: Decoupled Diffusion Sequential Monte Carlo (DDSMC) method that builds on decoupled diffusion, where the generative process enables larger sample updates.
Result: The method is asymptotically exact and demonstrates effectiveness on synthetic, protein, and image data, with extension capability to discrete data.
Conclusion: DDSMC provides an effective approach for solving linear-Gaussian inverse problems using diffusion models as priors, with demonstrated performance across various data types.
Abstract: A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on “decoupled diffusion”, where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.
[517] DeepOHeat-v1: Efficient Operator Learning for Fast and Trustworthy Thermal Simulation and Optimization in 3D-IC Design
Xinling Yu, Ziyue Liu, Hai Li, Yixing Li, Xin Ai, Zhiyu Zeng, Ian Young, Zheng Zhang
Main category: cs.LG
TL;DR: DeepOHeat-v1 is an enhanced physics-informed operator learning framework for 3D-IC thermal analysis that addresses limitations in multi-scale pattern prediction, training efficiency, and result trustworthiness through three key innovations: adaptive multi-scale representation, separable training method, and confidence-based hybrid optimization.
Details
Motivation: Existing operator learning frameworks like DeepOHeat face critical limitations in predicting multi-scale thermal patterns, training efficiency, and trustworthiness of results during 3D-IC design optimization, where thermal analysis is crucial due to increased power density and complex heat dissipation paths.
Method: Three key innovations: 1) Integration of Kolmogorov-Arnold Networks with learnable activation functions for adaptive multi-scale thermal pattern representation; 2) Separable training method that decomposes basis functions along coordinate axes for efficiency; 3) Confidence score evaluation and hybrid optimization workflow combining operator learning with finite difference using GMRES method for incremental refinement.
Result: Achieves 1.25x and 6.29x error reduction in test cases, 62x training speedup and 31x GPU memory reduction, enables thermal analysis at previously infeasible resolutions, and speeds up entire optimization process by 70.6x while maintaining accuracy comparable to high-fidelity finite difference solvers.
Conclusion: DeepOHeat-v1 effectively addresses key limitations in thermal analysis for 3D-IC design, providing efficient and trustworthy thermal optimization through optimal placement of heat-generating components while significantly reducing computational costs.
Abstract: Thermal analysis is crucial in 3D-IC design due to increased power density and complex heat dissipation paths. Although operator learning frameworks such as DeepOHeat have demonstrated promising preliminary results in accelerating thermal simulation, they face critical limitations in prediction capability for multi-scale thermal patterns, training efficiency, and trustworthiness of results during design optimization. This paper presents DeepOHeat-v1, an enhanced physics-informed operator learning framework that addresses these challenges through three key innovations. First, we integrate Kolmogorov-Arnold Networks with learnable activation functions as trunk networks, enabling an adaptive representation of multi-scale thermal patterns. This approach achieves a 1.25x and 6.29x reduction in error in two representative test cases. Second, we introduce a separable training method that decomposes the basis function along the coordinate axes, achieving 62x training speedup and 31x GPU memory reduction in our baseline case, and enabling thermal analysis at resolutions previously infeasible due to GPU memory constraints. Third, we propose a confidence score to evaluate the trustworthiness of the predicted results, and further develop a hybrid optimization workflow that combines operator learning with finite difference (FD) using Generalized Minimal Residual (GMRES) method for incremental solution refinement, enabling efficient and trustworthy thermal optimization. Experimental results demonstrate that DeepOHeat-v1 achieves accuracy comparable to optimization using high-fidelity finite difference solvers, while speeding up the entire optimization process by $70.6\times$ in our test cases, effectively minimizing the peak temperature through optimal placement of heat-generating components. Open source code is available at https://github.com/xlyu0127/DeepOHeat-v1.
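The confidence-gated hybrid workflow can be sketched as: accept the operator-learning prediction when its confidence score is high, otherwise use it as a warm start for GMRES on the finite-difference system A u = b. Function names and the threshold below are assumptions.

```python
import numpy as np
from scipy.sparse.linalg import gmres

def refine(A, b: np.ndarray, u_pred: np.ndarray,
           confidence: float, threshold: float = 0.9) -> np.ndarray:
    """Incremental refinement: trust the surrogate or polish it with GMRES."""
    if confidence >= threshold:
        return u_pred                                 # high confidence: keep prediction
    u, info = gmres(A, b, x0=u_pred, maxiter=50)      # warm-started FD solve
    return u if info == 0 else u_pred
```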
[518] Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features
Alex Heyman, Joel Zylberberg
Main category: cs.LG
TL;DR: RLLMs hallucinate graph edges not in prompts, causing significant failure rates in constraint-satisfaction problems like graph coloring and stable matching.
Details
Motivation: To understand failure modes of reasoning large language models (RLLMs) in constraint-satisfaction tasks and identify causes of incorrect reasoning.
Method: Tested multiple RLLMs on graph coloring problems with variable complexity, analyzed error rates and chain-of-thought explanations, and validated findings with stable matching experiments.
Result: RLLMs consistently hallucinate graph edges not specified in prompts, accounting for significant fractions of incorrect answers across all tested models and complexity levels.
Conclusion: RLLMs have broader issues with misrepresenting problem specifics, requiring design improvements to mitigate input-conflicting hallucinations.
Abstract: Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these “reasoning large language models” (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate graph edges not specified in the prompt. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. We also validate the generalizability of this input-conflicting hallucination phenomenon with smaller-scale experiments on a type of stable matching problem. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.
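Measuring this failure mode reduces to comparing edges mentioned in the chain-of-thought against the prompt's edge set. A toy checker (the edge-mention regex is an assumption about the CoT format):

```python
import re

def hallucinated_edges(prompt_edges, cot_text: str):
    """Return undirected edges mentioned in the CoT but absent from the prompt."""
    mentioned = {tuple(sorted(m))
                 for m in re.findall(r"edge \((\w+), ?(\w+)\)", cot_text)}
    given = {tuple(sorted(e)) for e in prompt_edges}
    return mentioned - given

print(hallucinated_edges({("a", "b"), ("b", "c")},
                         "Since edge (a, b) and edge (a, c) share vertex a..."))
# -> {('a', 'c')}: an edge the model invented
```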
[519] Collaborative Unlabeled Data Optimization
Xinyi Shang, Peng Sun, Fengyuan Liu, Tao Lin
Main category: cs.LG
TL;DR: CoOpt is a data-centric framework that optimizes unlabeled data by encoding knowledge directly into the data itself, enabling reusable and scalable training pipelines with significant performance improvements and training speedups.
Details
Motivation: Existing model-centric approaches have limitations where knowledge extracted from data is locked to model parameters, hindering reusability and scalability. The paper aims to enhance deep learning training efficiency and sustainability by optimizing the data itself rather than just models.
Method: Proposes CoOpt, a highly efficient parallelized framework for collaborative unlabeled data optimization. It distributes unlabeled data and leverages publicly available task-agnostic models to encode knowledge directly into the data, facilitating scalable and reusable training pipelines.
Result: Extensive experiments show CoOpt achieves 13.6% improvement on Tiny-ImageNet and 6.8% improvement on ImageNet-1K, with training speedups of 1.94× and 1.2× respectively across diverse datasets and architectures.
Conclusion: CoOpt successfully demonstrates a data-centric paradigm that maximizes utility of unlabeled data by encoding knowledge directly into data, enabling more efficient, reusable, and sustainable deep learning training pipelines.
Abstract: This paper pioneers a novel data-centric paradigm to maximize the utility of unlabeled data, tackling a critical question: How can we enhance the efficiency and sustainability of deep learning training by optimizing the data itself? We begin by identifying three key limitations in existing model-centric approaches, all rooted in a shared bottleneck: knowledge extracted from data is locked to model parameters, hindering its reusability and scalability. To this end, we propose CoOpt, a highly efficient, parallelized framework for collaborative unlabeled data optimization, thereby effectively encoding knowledge into the data itself. By distributing unlabeled data and leveraging publicly available task-agnostic models, CoOpt facilitates scalable, reusable, and sustainable training pipelines. Extensive experiments across diverse datasets and architectures demonstrate its efficacy and efficiency, achieving 13.6% and 6.8% improvements on Tiny-ImageNet and ImageNet-1K, respectively, with training speedups of $1.94 \times $ and $1.2 \times$.
[520] Sequential Monte Carlo for Policy Optimization in Continuous POMDPs
Hany Abdulsamad, Sahel Iqbal, Simo Särkkä
Main category: cs.LG
TL;DR: A novel policy optimization framework for continuous POMDPs that balances exploration and exploitation through probabilistic inference in a Feynman-Kac model, using nested SMC for policy gradient estimation.
Details
Motivation: Optimal decision-making under partial observability requires balancing uncertainty reduction (exploration) with immediate objectives (exploitation), which existing methods struggle with in continuous POMDPs.
Method: Policy learning is cast as probabilistic inference in a non-Markovian Feynman-Kac model that captures information value, optimized using nested sequential Monte Carlo (SMC) for history-dependent policy gradients.
Result: The method demonstrates effectiveness across standard continuous POMDP benchmarks where existing approaches fail to act effectively under uncertainty.
Conclusion: The framework provides a principled approach for exploration-exploitation trade-offs in continuous POMDPs without requiring approximations or heuristics, enabling better decision-making under partial observability.
Abstract: Optimal decision-making under partial observability requires agents to balance reducing uncertainty (exploration) against pursuing immediate objectives (exploitation). In this paper, we introduce a novel policy optimization framework for continuous partially observable Markov decision processes (POMDPs) that explicitly addresses this challenge. Our method casts policy learning as probabilistic inference in a non-Markovian Feynman–Kac model that inherently captures the value of information gathering by anticipating future observations, without requiring suboptimal approximations or handcrafted heuristics. To optimize policies under this model, we develop a nested sequential Monte Carlo (SMC) algorithm that efficiently estimates a history-dependent policy gradient under samples from the optimal trajectory distribution induced by the POMDP. We demonstrate the effectiveness of our algorithm across standard continuous POMDP benchmarks, where existing methods struggle to act under uncertainty.
[521] Robustness in Both Domains: CLIP Needs a Robust Text Encoder
Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
Main category: cs.LG
TL;DR: LEAF is an efficient adversarial finetuning method for CLIP text encoders that improves robustness against adversarial attacks while maintaining vision performance and enhancing downstream applications like text-to-image generation and multimodal retrieval.
Details
Motivation: Adversarial attacks can significantly shift CLIP embeddings, affecting downstream models that use CLIP. While some work has been done on robust image encoders, text encoder robustness remains unexplored.
Method: Proposed LEAF: an efficient adversarial finetuning method for the text domain that scales to large CLIP models.
Result: Significantly improved zero-shot adversarial accuracy in text domain while maintaining vision performance. Enhanced generation quality in text-to-image diffusion models under adversarial noise. Improved recall in multimodal retrieval tasks under adversarial noise. Better reconstruction of input text from embeddings via direct optimization.
Conclusion: LEAF successfully addresses the gap in text encoder robustness for CLIP models, providing improved adversarial robustness while maintaining performance across various downstream applications.
Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been done towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we cover this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. In multimodal retrieval tasks, LEAF improves the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization. We open-source our code ( https://github.com/LIONS-EPFL/LEAF ) and models ( https://huggingface.co/LEAF-CLIP ).
[522] CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
Aneesh Komanduri, Karuna Bhaila, Xintao Wu
Main category: cs.LG
TL;DR: The paper introduces CausalVLBench, a comprehensive benchmark for evaluating large vision-language models (LVLMs) on visual causal reasoning tasks including causal structure inference, intervention target prediction, and counterfactual prediction.
Details
Motivation: While LLMs have shown strong causal reasoning abilities and LVLMs perform well on recognition and VQA tasks, there's limited work on evaluating LVLMs' visual causal reasoning capabilities, creating a gap in understanding their multimodal causal reasoning abilities.
Method: Created CausalVLBench with three causal reasoning tasks evaluated across three causal representation learning datasets, testing state-of-the-art open-source LVLMs on their multimodal in-context learning capabilities.
Result: The evaluation revealed fundamental strengths and weaknesses of existing LVLMs in visual causal reasoning, providing insights into their current limitations.
Conclusion: The benchmark highlights drawbacks of current vision-language models and aims to motivate new research directions for improving LVLMs’ visual causal reasoning abilities.
Abstract: Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.
[523] From Contextual Data to Newsvendor Decisions: On the Actual Performance of Data-Driven Algorithms
Omar Besbes, Will Ma, Omar Mouchtaki
Main category: cs.LG
TL;DR: This paper provides exact worst-case regret analysis for Weighted Empirical Risk Minimization (WERM) policies in contextual Newsvendor problems, revealing fundamental insights about how data quantity and quality affect performance.
Details
Motivation: To understand how the relevance/quality and quantity of past data influence performance in contextual decision-making problems, particularly in the Newsvendor problem where past literature focused only on upper bounds via concentration inequalities.
Method: The authors analyze the broad class of WERM policies that weigh past data by contextual similarity, including ERM, k-NN, and kernel-based methods. They use an optimization approach to reduce the infinite-dimensional worst-case distribution problem to a simple line search by exploiting the structure of the Newsvendor loss function.
Result: The paper characterizes exactly the worst-case regret of any WERM policy for any given context configuration, providing the first tight performance guarantees in contextual decision-making problems. This reveals granular insights about learning curves and actual guaranteed performance as a function of contexts.
Conclusion: The optimization approach unveils fundamental insights that were previously obfuscated by general-purpose bounds, enabling exact performance characterization and better understanding of how data quality and quantity impact contextual decision-making algorithms.
Abstract: In this work, we study how the relevance/quality and quantity of past data influence performance by analyzing a contextual Newsvendor problem, in which a decision-maker trades off between underage and overage costs under uncertain demand. We consider a setting in which past demands observed under “close by” contexts come from close by distributions and analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret. We analyze the broad class of Weighted Empirical Risk Minimization (WERM) policies which weigh past data according to their similarity in the contextual space. This class includes classical policies such as ERM, k-Nearest Neighbors and kernel-based policies. Our main methodological contribution is to characterize exactly the worst-case regret of any WERM policy on any given configuration of contexts. To the best of our knowledge, this provides the first understanding of tight performance guarantees in any contextual decision-making problem, with past literature focusing on upper bounds via concentration inequalities. We instead take an optimization approach, and isolate a structure in the Newsvendor loss function that allows us to reduce the infinite-dimensional optimization problem over worst-case distributions to a simple line search. This in turn allows us to unveil fundamental insights that were obfuscated by previous general-purpose bounds. We characterize actual guaranteed performance as a function of the contexts, and derive granular insights on the learning curve of algorithms.
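For the newsvendor loss specifically, every WERM policy admits a closed form: the order quantity is a weighted quantile of past demands at the critical ratio cu/(cu + co). A sketch with illustrative cost parameters; k-NN and kernel policies differ only in how the weights are built.

```python
import numpy as np

def werm_newsvendor(demands, weights, cu: float = 4.0, co: float = 1.0) -> float:
    """WERM order quantity: the weighted tau-quantile of past demands,
    tau = cu / (cu + co); weights encode contextual similarity."""
    tau = cu / (cu + co)
    order = np.argsort(demands)
    d = np.asarray(demands, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / w.sum()
    return d[np.searchsorted(cdf, tau)]

# k-NN weights: 1 for the k nearest contexts, 0 otherwise;
# kernel weights: w_i = K(||x_i - x0|| / h). Both are WERM instances.
```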
[524] Symmetry in Neural Network Parameter Spaces
Bo Zhao, Robin Walters, Rose Yu
Main category: cs.LG
TL;DR: This survey explores parameter space symmetries in deep learning - transformations that leave network functions unchanged despite changing parameters, explaining model redundancy and influencing learning dynamics.
Details
Motivation: To understand the significant redundancy in overparameterized deep learning models by examining symmetries in parameter space that leave network outputs unchanged, providing new insights into optimization, generalization, and model complexity.
Method: The paper conducts a comprehensive survey of existing literature on parameter space symmetry, summarizing research findings and uncovering connections between symmetry and learning theory.
Result: The survey identifies that symmetries in parameter space shape the loss landscape and constrain learning dynamics, offering a complementary perspective to existing deep learning theory.
Conclusion: Parameter space symmetry provides a valuable framework for understanding deep learning models, and the survey identifies gaps and opportunities for future research in this emerging field.
Abstract: Modern deep learning models are highly overparameterized, resulting in large sets of parameter configurations that yield the same outputs. A significant portion of this redundancy is explained by symmetries in the parameter space–transformations that leave the network function unchanged. These symmetries shape the loss landscape and constrain learning dynamics, offering a new lens for understanding optimization, generalization, and model complexity that complements existing theory of deep learning. This survey provides an overview of parameter space symmetry. We summarize existing literature, uncover connections between symmetry and learning theory, and identify gaps and opportunities in this emerging field.
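A concrete instance of such a symmetry is positive rescaling in ReLU networks: scaling one layer's weights and biases by c > 0 and dividing the next layer's weights by c leaves the function unchanged, as this quick check verifies.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 5), nn.ReLU(), nn.Linear(5, 2))
x = torch.randn(4, 3)
y_before = net(x)

c = 3.7  # any positive constant works, since ReLU(c*z) = c*ReLU(z) for c > 0
with torch.no_grad():
    net[0].weight *= c
    net[0].bias *= c
    net[2].weight /= c

assert torch.allclose(y_before, net(x), atol=1e-5)  # same function, new parameters
```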
[525] Fair Graph Machine Learning under Adversarial Missingness Processes
Debolina Halder Lina, Arlei Silva
Main category: cs.LG
TL;DR: BFtS is a fair missing data imputation model that addresses adversarial missingness in sensitive attributes by approximating worst-case fairness scenarios through a 3-player adversarial scheme.
Details
Motivation: Existing fair GNNs assume sensitive attributes are fully observed or missing completely at random, but adversarial missingness can disguise fair models through imputation, leading to overestimated fairness.
Method: Proposes BFtS using a 3-player adversarial scheme where two adversaries collaborate against a GNN classifier, and the classifier minimizes maximum bias to approximate worst-case fairness scenarios.
Result: Experiments on synthetic and real datasets show BFtS achieves better fairness-accuracy trade-off than existing alternatives under adversarial missingness processes.
Conclusion: BFtS effectively addresses the challenge of adversarial missingness in sensitive attributes by designing imputations that consider worst-case fairness scenarios, outperforming current methods.
Abstract: Graph Neural Networks (GNNs) have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs often assumes that either sensitive attributes are fully observed or they are missing completely at random. We show that an adversarial missingness process can inadvertently disguise a fair model through the imputation, leading the model to overestimate the fairness of its predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for sensitive attributes. The key principle behind BFtS is that imputations should approximate the worst-case scenario for fairness – i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against a GNN classifier, and the classifier minimizes the maximum bias. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness-accuracy trade-off than existing alternatives under an adversarial missingness process.
[526] Bures-Wasserstein Flow Matching for Graph Generation
Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni
Main category: cs.LG
TL;DR: BWFlow is a flow-matching framework for graph generation that addresses limitations in existing methods by modeling joint evolution of nodes and edges using Markov random fields and optimal transport to create smooth probability paths.
Details
Motivation: Current graph generation methods like diffusion and flow-based models model node/edge evolution independently with linear interpolations, which breaks graph interconnected patterns and causes irregular probability paths, leading to poor training and sampling convergence.
Method: Represent graphs as connected systems using Markov random fields, leverage optimal transport displacement between MRF objects to design smooth probability paths that ensure co-evolution of graph components, and implement BWFlow flow-matching framework.
Result: Experimental evaluations show BWFlow achieves competitive performance in plain graph and molecule generation, with better training convergence and efficient sampling compared to existing methods.
Conclusion: BWFlow provides a theoretically grounded framework for graph generation that addresses the limitations of independent node/edge modeling by ensuring smooth probability paths through joint evolution modeling, leading to improved training dynamics and sampling convergence.
Abstract: Graph generation has emerged as a critical task in fields ranging from drug discovery to circuit design. Contemporary approaches, notably diffusion and flow-based models, have achieved solid graph generative performance through constructing a probability path that interpolates between reference and data distributions. However, these methods typically model the evolution of individual nodes and edges independently and use linear interpolations to build the path. This disentangled interpolation breaks the interconnected patterns of graphs, making the constructed probability path irregular and non-smooth, which causes poor training dynamics and faulty sampling convergence. To address the limitation, this paper first presents a theoretically grounded framework for probability path construction in graph generative models. Specifically, we model the joint evolution of the nodes and edges by representing graphs as connected systems parameterized by Markov random fields (MRF). We then leverage the optimal transport displacement between MRF objects to design a smooth probability path that ensures the co-evolution of graph components. Based on this, we introduce BWFlow, a flow-matching framework for graph generation that utilizes the derived optimal probability path to benefit the training and sampling algorithm design. Experimental evaluations in plain graph generation and molecule generation validate the effectiveness of BWFlow with competitive performance, better training convergence, and efficient sampling.
[527] FREE: The Foundational Semantic Recognition for Modeling Environmental Ecosystems
Shiyuan Luo, Juntong Ni, Shengyu Chen, Runlong Yu, Yiqun Xie, Licheng Liu, Zhenong Jin, Huaxiu Yao, Xiaowei Jia
Main category: cs.LG
TL;DR: FREE framework converts environmental modeling into semantic recognition by mapping diverse environmental data into text space, enabling universal modeling with varying features across space and time.
Details
Motivation: Environmental ecosystem modeling is challenging due to complex processes and limited measurements. Existing approaches are region/time-specific, lacking a general framework for modeling diverse environmental variables across different scales.
Method: FREE maps available environmental data into text space and converts traditional predictive modeling into semantic recognition problems, allowing training of universal models with varying features.
Result: Evaluation on stream water temperature and crop yield prediction shows FREE outperforms multiple baselines, even in data-sparse scenarios.
Conclusion: FREE provides a generalizable framework for environmental modeling that works across different applications and data availability conditions, advancing ecosystem modeling capabilities.
Abstract: Modeling environmental ecosystems is critical for the sustainability of our planet, but is extremely challenging due to the complex underlying processes driven by interactions amongst a large number of physical variables. As many variables are difficult to measure at large scales, existing works often utilize a combination of observable features and locally available measurements or modeled values as input to build models for a specific study region and time period. This raises a fundamental question in advancing the modeling of environmental ecosystems: how to build a general framework for modeling the complex relationships among diverse environmental variables over space and time? In this paper, we introduce a framework, FREE, that enables the use of varying features and available information to train a universal model. The core idea is to map available environmental data into a text space and then convert the traditional predictive modeling task in environmental science to a semantic recognition problem. Our evaluation on two societally important real-world applications, stream water temperature prediction and crop yield prediction, demonstrates the superiority of FREE over multiple baselines, even in data-sparse scenarios.
[528] EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework
Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang
Main category: cs.LG
TL;DR: EFRame is an Exploration-Filter-Replay framework that enhances GRPO (Group Relative Policy Optimization) for LLM reasoning tasks by improving exploration, filtering low-quality samples, and replaying informative trajectories, achieving significant performance gains.
Details
Motivation: GRPO, while efficient, suffers from limited exploration and training instability on complex reasoning tasks, limiting its effectiveness for advancing LLM reasoning capabilities.
Method: EFRame augments GRPO with three components: additional rollouts for deeper exploration, online filtering to remove low-quality samples and stabilize gradients, and experience replay to amplify rare informative trajectories for stable convergence.
Result: EFRame achieves consistent gains across diverse reasoning benchmarks, including a 37.9% relative improvement on Geometry3K over GRPO, and supports fine-grained sample categorization and precise entropy control.
Conclusion: EFRame establishes a principled training cycle that balances exploration, efficiency, and stability, serving as a robust solution for advancing deeper reasoning capabilities in large language models.
Abstract: Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, limiting its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks demonstrate that EFRame achieves consistent gains, including a 37.9% relative improvement on Geometry3K over GRPO. EFRame further supports fine-grained sample categorization and precise entropy control, highlighting it as a robust solution for advancing deeper reasoning in LLMs. Our code is available at https://github.com/597358816/EFRame.
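The three EFRame components compose naturally into a training loop. The following is a hypothetical sketch of how extra rollouts, online filtering, and replay could wrap a GRPO-style update; `policy.generate`, `policy.update`, the reward function, and the filtering threshold are all placeholders rather than the paper's code.

```python
import random
from collections import deque

replay = deque(maxlen=1024)  # buffer of rare but informative trajectories

def efr_step(policy, prompts, reward_fn, n_rollouts=8, n_extra=8):
    """One illustrative Exploration-Filter-Replay cycle around a GRPO-style update."""
    batch = []
    for prompt in prompts:
        # Exploration: extra rollouts deepen the search beyond plain GRPO.
        samples = [policy.generate(prompt) for _ in range(n_rollouts + n_extra)]
        rewards = [reward_fn(prompt, s) for s in samples]
        mean_r = sum(rewards) / len(rewards)
        std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        for s, r in zip(samples, rewards):
            adv = (r - mean_r) / std_r            # group-relative advantage
            # Filter: drop low-signal samples that would destabilize gradients.
            if abs(adv) < 0.1:
                continue
            batch.append((prompt, s, adv))
            # Replay: keep successes that are rare within their group.
            if r > 0 and mean_r < 0.2:
                replay.append((prompt, s, adv))
    # Mix replayed trajectories back into the update.
    batch.extend(random.sample(list(replay), min(len(replay), len(batch) // 4)))
    policy.update(batch)  # standard clipped policy-gradient step
```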
[529] Towards Natural Machine Unlearning
Zhengbao He, Tao Li, Xinwen Cheng, Zhehao Huang, Xiaolin Huang
Main category: cs.LG
TL;DR: The paper proposes a natural machine unlearning method that injects correct information from remaining data into forgetting samples, avoiding the unnatural reinforcement of incorrect information in existing approaches.
Details
Motivation: Current machine unlearning methods modify forgetting data with incorrect labels and fine-tune the model, which is unnatural as it reinforces incorrect information and leads to over-forgetting.
Method: Inject correct information from remaining data into forgetting samples when changing their labels, allowing the model to naturally suppress forgotten information by using the injected correct information.
Result: Significantly outperforms state-of-the-art approaches, substantially reduces over-forgetting, and shows strong robustness to hyperparameters.
Conclusion: This approach represents a promising candidate for practical machine unlearning by making the unlearning process more natural and effective.
Abstract: Machine unlearning (MU) aims to eliminate information that has been learned from specific training data, namely forgetting data, from a pre-trained model. Currently, the mainstream of existing MU methods involves modifying the forgetting data with incorrect labels and subsequently fine-tuning the model. While learning such incorrect information can indeed remove knowledge, the process is quite unnatural as the unlearning process undesirably reinforces the incorrect information and leads to over-forgetting. Towards more natural machine unlearning, we inject correct information from the remaining data into the forgetting samples when changing their labels. Through pairing these adjusted samples with their labels, the model will tend to use the injected correct information and naturally suppress the information meant to be forgotten. Albeit straightforward, this first step towards natural machine unlearning can significantly outperform current state-of-the-art approaches. In particular, our method substantially reduces over-forgetting and exhibits strong robustness to hyperparameters, making it a promising candidate for practical machine unlearning.
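One plausible reading of "injecting correct information" is a mixup-style blend: when a forgetting sample is relabeled, blend in a remaining-data sample of the new class so the label is actually supported by the input. The sketch below is our guess at such a construction, not the paper's exact procedure.

```python
import torch

def build_natural_unlearning_batch(x_forget, x_remain, y_remain, alpha=0.5):
    """Illustrative sketch: instead of pairing forgetting samples with purely
    incorrect labels, blend in a remaining-data sample so the new label is
    supported by genuinely correct information in the input."""
    idx = torch.randperm(x_remain.size(0))[: x_forget.size(0)]
    x_injected = alpha * x_forget + (1 - alpha) * x_remain[idx]
    y_new = y_remain[idx]  # the label is correct for the injected content
    return x_injected, y_new  # fine-tune on (x_injected, y_new) as usual
```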
[530] Tuning without Peeking: Provable Privacy and Generalization Bounds for LLM Post-Training
Ismail Labiad, Mathurin Videau, Matthieu Kowalski, Marc Schoenauer, Alessandro Leite, Julia Kempe, Olivier Teytaud
Main category: cs.LG
TL;DR: BBoxER is an evolutionary black-box optimization method for LLM post-training that provides privacy and security benefits by avoiding gradient exposure, with theoretical guarantees for differential privacy and robustness to attacks.
Details
Motivation: Gradient-based optimization can leak sensitive data and is vulnerable to attacks like data poisoning, while black-box methods offer privacy protection in restricted or adversarial environments.
Method: BBoxER uses evolutionary black-box optimization that creates an information bottleneck through implicit data compression, relying only on function evaluations without gradient access.
Result: BBoxER improves LLM performance with few iterations, generalizes well on reasoning benchmarks, and demonstrates robustness to membership inference attacks despite computational challenges.
Conclusion: BBoxER serves as a valuable add-on to gradient-based methods, suitable for privacy-sensitive deployments with strong theoretical guarantees for privacy and generalization.
Abstract: Gradient-based optimization is the workhorse of deep learning, offering efficient and scalable training via backpropagation. However, exposing gradients during training can leak sensitive information about the underlying data, raising privacy and security concerns such as susceptibility to data poisoning attacks. In contrast, black-box optimization methods, which treat the model as an opaque function and rely solely on function evaluations to guide optimization, offer a promising alternative in scenarios where data access is restricted, adversarial risks are high, or overfitting is a concern. This paper introduces BBoxER, an evolutionary black-box method for LLM post-training that induces an information bottleneck via implicit compression of the training data. Leveraging the tractability of the information flow, we provide non-vacuous generalization bounds and strong theoretical guarantees for differential privacy, robustness to data poisoning attacks, and robustness to extraction attacks. In experiments with LLMs, we demonstrate empirically that black-box optimization methods, despite the scalability and computational challenges inherent to black-box approaches, are able to learn, showing how a few iterations of BBoxER improve performance, generalize well on a benchmark of reasoning datasets, and are robust to membership inference attacks. This positions BBoxER as an attractive add-on to gradient-based optimization, offering suitability for deployment in restricted or privacy-sensitive environments while also providing non-vacuous generalization guarantees.
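The privacy argument rests on the optimizer seeing only scalar scores. A minimal gradient-free loop of the kind BBoxER builds on is a (1+1) evolution strategy, applied here to a generic parameter vector (e.g., a soft prompt or an adapter scaling) with a hypothetical `score_fn`.

```python
import numpy as np

def one_plus_one_es(score_fn, theta0, sigma=0.05, iters=100, seed=0):
    """Minimal (1+1) evolution strategy. `score_fn` sees only parameters in,
    scalar score out, so no gradients (and hence no per-example gradient
    information) ever leave the model."""
    rng = np.random.default_rng(seed)
    theta, best = theta0.copy(), score_fn(theta0)
    for _ in range(iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        s = score_fn(candidate)
        if s > best:                 # keep the mutation only if it helps
            theta, best = candidate, s
            sigma *= 1.5             # 1/5th-rule style step-size adaptation
        else:
            sigma *= 0.9
    return theta, best
```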
[531] An Imitative Reinforcement Learning Framework for Pursuit-Lock-Launch Missions
Siyuan Li, Rongchang Zuo, Bofei Liu, Yaoyu He, Peng Liu, Yingnan Zhao
Main category: cs.LG
TL;DR: A novel imitative reinforcement learning framework for UCAV WVR engagement that combines expert imitation with autonomous exploration to achieve high success rates in complex aerial combat tasks.
Details
Motivation: Autonomous WVR engagement faces challenges like weak exploration, low learning efficiency, and unrealistic simulations, requiring a solution that leverages expert knowledge while enabling adaptability.
Method: Proposed imitative reinforcement learning framework that uses expert data for efficient learning and reinforcement learning for autonomous exploration in dynamic environments, implemented in a Harfang3D sandbox environment.
Result: Achieved up to 100% success rate in ‘pursuit-lock-launch’ policy learning, significantly outperforming state-of-the-art reinforcement learning and imitation learning methods with excellent robustness.
Conclusion: The framework successfully combines expert imitation and autonomous exploration to overcome challenges in UCAV WVR engagement, enabling efficient learning of complex aerial combat policies.
Abstract: Unmanned Combat Aerial Vehicle (UCAV) Within-Visual-Range (WVR) engagement, referring to a fight between two or more UCAVs at close quarters, plays a decisive role on the aerial battlefield. With the development of artificial intelligence, WVR engagement is progressively advancing towards intelligent and autonomous modes. However, autonomous WVR engagement policy learning is hindered by challenges such as weak exploration capabilities, low learning efficiency, and unrealistic simulated environments. To overcome these challenges, we propose a novel imitative reinforcement learning framework, which efficiently leverages expert data while enabling autonomous exploration. The proposed framework not only enhances learning efficiency through expert imitation, but also ensures adaptability to dynamic environments via autonomous exploration with reinforcement learning. Therefore, the proposed framework can learn a successful 'pursuit-lock-launch' policy for UCAVs. To support data-driven learning, we establish an environment based on the Harfang3D sandbox. Extensive experimental results indicate that the proposed framework excels in this multistage task and significantly outperforms state-of-the-art reinforcement learning and imitation learning methods. Thanks to its ability to imitate experts and explore autonomously, our framework can quickly learn the critical knowledge in complex aerial combat tasks, achieving up to a 100% success rate and demonstrating excellent robustness.
[532] Direct Quantized Training of Language Models with Stochastic Rounding
Kaiyan Zhao, Tsuguchika Tabaru, Kenichi Kobayashi, Takumi Honda, Masafumi Yamazaki, Yoshimasa Tsuruoka
Main category: cs.LG
TL;DR: This paper proposes a method to train quantized LLMs directly with low-precision weights without maintaining high-precision copies, using stochastic rounding to reduce memory usage during training while maintaining performance.
Details
Motivation: Current quantized LLMs like BitNet reduce memory during deployment but still require substantial memory for training due to maintaining high-precision weights for straight-through estimation.
Method: Directly update quantized low-precision weights during backpropagation without straight-through estimation, using stochastic rounding technique to minimize information loss from low-bit weights.
Result: Successfully trained LLaMA-structured models with only low-precision weights (even ternary values), achieved performance comparable to BitNet b1.58 with 8-bit weights, and maintained robustness across precision scaling from FP32 to BF16/FP8.
Conclusion: Training quantized LLMs with only low-precision weights is feasible and effective, enabling significant memory reduction during training while supporting flexible deployment with ternary weights for inference.
Abstract: Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
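Stochastic rounding itself is worth spelling out: a value is rounded up with probability equal to its fractional distance to the upper grid point, so the rounding is unbiased in expectation. The ternary example at the end is illustrative, not the paper's exact quantizer.

```python
import torch

def stochastic_round(x, step):
    """Round x to a multiple of `step`, stochastically and unbiasedly: a value
    30% of the way between two grid points rounds up with probability 0.3,
    so E[stochastic_round(x)] == x."""
    scaled = x / step
    low = torch.floor(scaled)
    p_up = scaled - low                            # fractional distance to the upper point
    rounded = low + (torch.rand_like(x) < p_up).float()
    return rounded * step

# Ternary example: quantize weights onto {-1, 0, +1} after a gradient step.
w = torch.randn(4) * 0.4
w_q = stochastic_round(w, step=1.0).clamp(-1, 1)
```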
[533] A Digital Twin for Diesel Engines: Operator-infused Physics-Informed Neural Networks with Transfer Learning for Engine Health Monitoring
Kamaljyoti Nath, Varun Kumar, Daniel J. Smith, George Em Karniadakis
Main category: cs.LG
TL;DR: A hybrid framework combining physics-informed neural networks (PINNs) with deep operator networks (DeepONet) for efficient parameter identification in diesel engine models, with transfer learning strategies to reduce computational costs.
Details
Motivation: To improve diesel engine efficiency, reduce emissions, and enable robust health monitoring by addressing limitations of existing neural network methods that lack generalizability and physical interpretability.
Method: Proposes a hybrid framework using PINNs with DeepONet, incorporating offline-trained DeepONets for actuator dynamics prediction and two transfer learning strategies: multi-stage TL and few-shot TL with frozen shared network body.
Result: Significantly lowers online computation costs compared to existing PINN framework, offers better runtime efficiency than full online training, and provides computationally inexpensive physics-based approach for engine dynamics prediction.
Conclusion: The framework combines interpretability of physics-based models with flexibility of deep learning, offering substantial gains in generalization, accuracy, and deployment efficiency for diesel engine diagnostics.
Abstract: Improving diesel engine efficiency, reducing emissions, and enabling robust health monitoring have been critical research topics in engine modelling. While recent advancements in the use of neural networks for system monitoring have shown promising results, such methods often focus on component-level analysis, lack generalizability, and physical interpretability. In this study, we propose a novel hybrid framework that combines physics-informed neural networks (PINNs) with deep operator networks (DeepONet) to enable accurate and computationally efficient parameter identification in mean-value diesel engine models. Our method leverages physics-based system knowledge in combination with data-driven training of neural networks to enhance model applicability. Incorporating offline-trained DeepONets to predict actuator dynamics significantly lowers the online computation cost when compared to the existing PINN framework. To address the re-training burden typical of PINNs under varying input conditions, we propose two transfer learning (TL) strategies: (i) a multi-stage TL scheme offering better runtime efficiency than full online training of the PINN model and (ii) a few-shot TL scheme that freezes a shared multi-head network body and computes physics-based derivatives required for model training outside the training loop. The second strategy offers a computationally inexpensive and physics-based approach for predicting engine dynamics and parameter identification, offering computational efficiency over the existing PINN framework. Compared to existing health monitoring methods, our framework combines the interpretability of physics-based models with the flexibility of deep learning, offering substantial gains in generalization, accuracy, and deployment efficiency for diesel engine diagnostics.
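The few-shot TL strategy hinges on freezing the shared body so that only a small head is refit per operating condition. Below is a hedged sketch of that idea, with the refit reduced to ordinary least squares on frozen features; the paper additionally moves the physics-based derivative computation outside the training loop, which this sketch omits.

```python
import torch
import torch.nn as nn

def few_shot_transfer(body: nn.Module, x_new: torch.Tensor, y_new: torch.Tensor):
    """Hedged sketch: keep the shared network body frozen and refit only a
    linear head on data from the new operating condition. With a linear head,
    the refit reduces to a closed-form least-squares solve."""
    body.requires_grad_(False)
    with torch.no_grad():
        phi = body(x_new)                        # frozen features, shape (N, d)
    # Closed-form least-squares head: phi @ W ~= y_new.
    W = torch.linalg.lstsq(phi, y_new).solution
    return W                                     # new head for this condition
```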
[534] Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance
Yufei He, Ruoyu Li, Alex Chen, Yue Liu, Yulin Chen, Yuan Sui, Cheng Chen, Yi Zhu, Luca Luo, Frank Yang, Bryan Hooi
Main category: cs.LG
TL;DR: ARIA is an LLM agent framework that enables continuous learning of updated domain knowledge during operation by assessing uncertainty, requesting human guidance, and maintaining a timestamped knowledge repository.
Details
Motivation: Current LLM agents struggle in dynamic environments where rules and domain knowledge frequently change, making offline fine-tuning and standard prompting insufficient for real-time adaptation.
Method: ARIA uses structured self-dialogue to assess uncertainty, proactively identifies knowledge gaps, requests targeted human explanations, and systematically updates an internal timestamped knowledge repository while resolving conflicts.
Result: ARIA significantly improves adaptability and accuracy on customer due diligence tasks and dynamic knowledge tasks compared to offline fine-tuning and existing self-improving agents.
Conclusion: ARIA is practically deployed in TikTok Pay serving 150M+ users, confirming its effectiveness for operational use in rapidly evolving environments.
Abstract: Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
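The timestamped repository is the most mechanical piece and easy to picture in code. A toy version with recency-based conflict resolution, hypothetical names throughout:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class KnowledgeEntry:
    topic: str
    content: str
    timestamp: datetime

class KnowledgeRepository:
    """Illustrative timestamped store of the kind ARIA maintains: new human
    guidance on a topic supersedes older, conflicting entries."""
    def __init__(self):
        self.entries: dict = {}

    def update(self, topic: str, content: str) -> None:
        now = datetime.now(timezone.utc)
        old = self.entries.get(topic)
        if old is None or old.content != content:
            # Conflict (or gap) detected: the fresher expert guidance wins.
            self.entries[topic] = KnowledgeEntry(topic, content, now)

    def retrieve(self, topic: str) -> Optional[str]:
        entry = self.entries.get(topic)
        return entry.content if entry else None
```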
[535] Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning
Dipam Goswami, Simone Magistri, Kai Wang, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Main category: cs.LG
TL;DR: A training-free federated learning method that uses only first-order statistics (class means) to estimate class covariance matrices, enabling better classifier initialization without sharing second-order statistics.
Details
Motivation: To reduce communication overhead in federated learning while maintaining performance, avoiding the need to share second-order statistics or perform extensive training.
Method: Uses an unbiased estimator of class covariance matrices based only on first-order statistics (class means) communicated by clients. Focuses on within-class covariances for better classifier initialization.
Result: Improves performance by 4-26% with same communication cost compared to methods sharing only class means, and achieves competitive or superior performance to methods sharing second-order statistics with dramatically less communication overhead.
Conclusion: The proposed FedCOF method provides a communication-efficient approach for federated learning that outperforms prompt-tuning methods and serves as an effective initialization for further fine-tuning.
Abstract: Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have explored training-free methods using first- and second-order statistics to aggregate local client data distributions at the server and achieve high performance without any training. In this work, we propose a training-free method based on an unbiased estimator of class covariance matrices which only uses first-order statistics in the form of class means communicated by clients to the server. We show how these estimated class covariances can be used to initialize the global classifier, thus exploiting the covariances without actually sharing them. We also show that using only within-class covariances results in a better classifier initialization. Our approach improves performance in the range of 4-26% with exactly the same communication cost when compared to methods sharing only class means and achieves performance competitive or superior to methods sharing second-order statistics with dramatically less communication overhead. The proposed method is much more communication-efficient than federated prompt-tuning methods and still outperforms them. Finally, using our method to initialize classifiers and then performing federated fine-tuning or linear probing again yields better performance. Code is available at https://github.com/dipamgoswami/FedCOF.
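The statistical trick is that client class means are themselves sample means, so their count-weighted scatter recovers the within-class covariance (the one-way ANOVA identity: under equal true class means across clients, the between-client mean square is an unbiased estimate of the covariance). A numpy sketch of one such estimator and an LDA-style classifier initialization follows; the paper's exact FedCOF estimator may differ in details.

```python
import numpy as np

def estimate_class_covariance(client_means, client_counts):
    """Estimate a class's covariance from per-client class means only.
    If client k averages n_k iid samples, Cov(mean_k) = Sigma / n_k, so the
    count-weighted scatter of client means recovers Sigma (ANOVA-style)."""
    means = np.asarray(client_means)           # (K, d)
    counts = np.asarray(client_counts, float)  # (K,)
    global_mean = counts @ means / counts.sum()
    centered = means - global_mean
    scatter = (counts[:, None] * centered).T @ centered
    return scatter / (len(counts) - 1), global_mean

def init_classifier(cov, class_means, ridge=1e-3):
    """LDA-style training-free initialization from the estimates."""
    d = cov.shape[0]
    prec = np.linalg.inv(cov + ridge * np.eye(d))
    return np.stack([prec @ m for m in class_means])  # one weight row per class
```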
[536] TriP-LLM: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection
Yuan-Cheng Yu, Yen-Chieh Ouyang, Chun-An Lin
Main category: cs.LG
TL;DR: TriP-LLM is a novel unsupervised time-series anomaly detection framework that uses a tri-branch design with frozen pretrained LLM to process patch-wise representations, achieving state-of-the-art performance with lower memory consumption.
Details
Motivation: Traditional statistical methods struggle with the high heterogeneity and complexity of modern time-series data from IoT and smart manufacturing. Inspired by LLMs' success in multimodal tasks, the authors aim to leverage LLMs for time-series anomaly detection.
Method: TriP-LLM uses a triple-branch design (Patching, Selecting, and Global modules) to encode time-series into patch-wise representations processed by a frozen pretrained LLM. A lightweight patch-wise decoder reconstructs inputs to derive anomaly scores.
Result: TriP-LLM consistently outperforms recent SOTA methods across all benchmark datasets using PATE metric. It achieves significantly lower memory consumption compared to CI-based LLM approaches, making it suitable for GPU memory-constrained environments.
Conclusion: The framework demonstrates strong detection capabilities and validates the substantial contribution of LLMs to the architecture. All code and model checkpoints are publicly available.
Abstract: Time-series anomaly detection plays a central role across a wide range of application domains. With the increasing proliferation of the Internet of Things (IoT) and smart manufacturing, time-series data has dramatically increased in both scale and dimensionality. This growth has exposed the limitations of traditional statistical methods in handling the high heterogeneity and complexity of such data. Inspired by the recent success of large language models (LLMs) in multimodal tasks across language and vision domains, we propose a novel unsupervised anomaly detection framework: A Tri-Branch Patch-wise Large Language Model Framework for Time-Series Anomaly Detection (TriP-LLM). TriP-LLM integrates local and global temporal features through a triple-branch design comprising Patching, Selecting, and Global modules, to encode the input time-series into patch-wise representations, which are then processed by a frozen, pretrained LLM. A lightweight patch-wise decoder reconstructs the input, from which anomaly scores are derived. We evaluate TriP-LLM on several public benchmark datasets using PATE, a recently proposed threshold-free evaluation metric, and conduct all comparisons within a unified open-source framework to ensure fairness. Experimental results show that TriP-LLM consistently outperforms recent state-of-the-art (SOTA) methods across all datasets, demonstrating strong detection capabilities. Furthermore, through extensive ablation studies, we verify the substantial contribution of the LLM to the overall architecture. Compared to LLM-based approaches using Channel Independence (CI) patch processing, TriP-LLM achieves significantly lower memory consumption, making it more suitable for GPU memory-constrained environments. All code and model checkpoints of TriP-LLM are publicly available on https://github.com/YYZStart/TriP-LLM.git
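At inference time the pipeline reduces to reconstruction-based scoring. A generic sketch, with `model` standing in for the tri-branch encoder, frozen LLM, and patch-wise decoder:

```python
import torch

def anomaly_scores(model, series):
    """Generic reconstruction-based scoring of the kind TriP-LLM uses:
    reconstruct the series and score each timestep by its reconstruction
    error. `model` is a placeholder for the full encode-decode pipeline."""
    with torch.no_grad():
        recon = model(series)              # same shape as `series`
    err = (series - recon).pow(2)          # pointwise squared error
    return err.mean(dim=-1)                # per-timestep anomaly score
```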
[537] Network Dynamics-Based Framework for Understanding Deep Neural Networks
Yuchen Lin, Yong Zhang, Sihan Feng, Hong Zhao
Main category: cs.LG
TL;DR: A theoretical framework using dynamical systems theory to analyze neural network learning dynamics, introducing order-preserving vs non-order-preserving transformations at neuron level, which lead to different learning phases and explain phenomena like grokking.
Details
Motivation: Advancements in AI require deeper understanding of deep learning mechanisms; need theoretical framework to analyze learning dynamics through dynamical systems perspective.
Method: Introduce two fundamental transformation units at neuron level: order-preserving and non-order-preserving transformations; analyze collective behaviors in weight organization; introduce attraction basins in sample and weight spaces; use hyperparameters as control variables.
Result: Different transformation modes lead to distinct collective behaviors, information extraction modes, and learning phases; transitions between phases explain grokking; framework provides metrics for analyzing model performance.
Conclusion: Framework reveals intrinsic advantages of deep learning and provides novel perspective for optimizing network architectures and training strategies.
Abstract: Advancements in artificial intelligence call for a deeper understanding of the fundamental mechanisms underlying deep learning. In this work, we propose a theoretical framework to analyze learning dynamics through the lens of dynamical systems theory. We redefine the notions of linearity and nonlinearity in neural networks by introducing two fundamental transformation units at the neuron level: order-preserving transformations and non-order-preserving transformations. Different transformation modes lead to distinct collective behaviors in weight vector organization, different modes of information extraction, and the emergence of qualitatively different learning phases. Transitions between these phases may occur during training, accounting for key phenomena such as grokking. To further characterize generalization and structural stability, we introduce the concept of attraction basins in both sample and weight spaces. The distribution of neurons with different transformation modes across layers, along with the structural characteristics of the two types of attraction basins, forms a set of core metrics for analyzing the performance of learning models. Hyperparameters such as depth, width, learning rate, and batch size act as control variables for fine-tuning these metrics. Our framework not only sheds light on the intrinsic advantages of deep learning, but also provides a novel perspective for optimizing network architectures and training strategies.
[538] VAGPO: Vision-augmented Asymmetric Group Preference Optimization for Graph Routing Problems
Shiyan Liu, Bohan Tan, Zhiguang Cao, Yan Jin
Main category: cs.LG
TL;DR: VAGPO is a novel vision-augmented asymmetric group preference optimization approach for graph routing problems that combines ResNet-based visual encoding with Transformer-based sequential modeling to achieve efficient training and strong generalization to large-scale instances.
Details
Motivation: Classic routing problems like TSP and CVRP are fundamental for web networks, but current data-driven methods face limitations in training efficiency and generalization to large-scale instances.
Method: Proposes VAGPO approach using ResNet-based visual encoding for spatial structure and Transformer-based sequential modeling for temporal dependencies, with an asymmetric group preference optimization strategy for faster convergence.
Result: Achieves competitive solution quality on generated TSP/CVRP instances and real-world datasets, with strong generalization to instances up to 1000 nodes without re-training.
Conclusion: VAGPO demonstrates effectiveness in both learning efficiency and scalability for graph routing optimization problems.
Abstract: Graph routing problems play a vital role in web-related networks, where finding optimal paths across graphs is essential for efficient data transmission and content delivery. Classic routing formulations such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) represent fundamental graph optimization challenges. Recent data-driven optimization methods have made significant progress, yet they often face limitations in training efficiency and generalization to large-scale instances. In this paper, we propose a novel Vision-augmented Asymmetric Group Preference Optimization (VAGPO) approach. By leveraging ResNet-based visual encoding and Transformer-based sequential modeling, VAGPO captures both spatial structure and temporal dependencies. Furthermore, we introduce an asymmetric group preference optimization strategy that significantly accelerates convergence compared to commonly used policy gradient methods. Experimental results on generated TSP and CVRP instances, as well as real-world datasets, demonstrate that the proposed VAGPO approach achieves highly competitive solution quality. Additionally, VAGPO exhibits strong generalization to larger instances (up to 1000 nodes) without re-training, highlighting its effectiveness in both learning efficiency and scalability.
[539] Filtering out mislabeled training instances using black-box optimization and quantum annealing
Makoto Otsuka, Kento Kodama, Keisuke Morita, Masayuki Ohzeki
Main category: cs.LG
TL;DR: A method combining surrogate model-based black-box optimization with quantum annealing to remove mislabeled instances from contaminated training datasets, improving model generalization.
Details
Motivation: Mislabeled training instances in real-world datasets degrade model generalization, requiring efficient noise-removal strategies.
Method: Combines surrogate model-based BBO with postprocessing and quantum annealing to evaluate filtered training subsets based on validation loss, iteratively refine loss estimates, and efficiently sample diverse training subsets with low validation error.
Result: Experiments on noisy majority bit task show the method prioritizes removal of high-risk mislabeled instances. D-Wave’s physical quantum annealer achieves faster optimization and higher-quality training subsets compared to simulated annealing methods.
Conclusion: The proposed method effectively enhances dataset quality for supervised learning tasks, with potential for extension to unsupervised learning, real-world datasets, and large-scale implementations.
Abstract: This study proposes an approach for removing mislabeled instances from contaminated training datasets by combining surrogate model-based black-box optimization (BBO) with postprocessing and quantum annealing. Mislabeled training instances, a common issue in real-world datasets, often degrade model generalization, necessitating robust and efficient noise-removal strategies. The proposed method evaluates filtered training subsets based on validation loss, iteratively refines loss estimates through surrogate model-based BBO with postprocessing, and leverages quantum annealing to efficiently sample diverse training subsets with low validation error. Experiments on a noisy majority bit task demonstrate the method’s ability to prioritize the removal of high-risk mislabeled instances. Integrating D-Wave’s clique sampler running on a physical quantum annealer achieves faster optimization and higher-quality training subsets compared to OpenJij’s simulated quantum annealing sampler or Neal’s simulated annealing sampler, offering a scalable framework for enhancing dataset quality. This work highlights the effectiveness of the proposed method for supervised learning tasks, with future directions including its application to unsupervised learning, real-world datasets, and large-scale implementations.
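Swapping the quantum annealer for classical simulated annealing makes the search loop easy to show: binary keep/drop masks over the training set are scored by validation loss. `val_loss` is the expensive oracle that the paper's surrogate model approximates; this sketch queries it directly.

```python
import math
import random

def anneal_filter(val_loss, n, iters=2000, t0=1.0, seed=0):
    """Classical simulated-annealing stand-in for the annealer: search over
    binary keep/drop masks, scoring each mask by the validation loss of a
    model trained on the kept subset."""
    rng = random.Random(seed)
    mask = [1] * n                                 # start by keeping everything
    cur = val_loss(mask)
    best_mask, best = list(mask), cur
    for step in range(iters):
        t = t0 * (1 - step / iters)                # linear temperature schedule
        i = rng.randrange(n)
        mask[i] ^= 1                               # flip one keep/drop bit
        loss = val_loss(mask)
        if loss < cur or rng.random() < math.exp(-(loss - cur) / max(t, 1e-9)):
            cur = loss
            if loss < best:
                best, best_mask = loss, list(mask)
        else:
            mask[i] ^= 1                           # revert the flip
    return best_mask, best
```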
[540] Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling
Bingguang Hao, Zengzhuang Xu, Maolin Wang, Yuntao Wen, Yicheng Chen, Cunyin Peng, Long Chen, Dong Wang, Xiangyu Zhao, Jinjie Gu, Chenyi Zhuang, Ji Zhang
Main category: cs.LG
TL;DR: EGPO is a new RL framework that enhances function calling in LLMs by integrating entropy of Chain-of-Thought into policy gradients, enabling diverse reasoning while maintaining stable optimization.
Details
Motivation: Standard methods like SFT fail to instill robust reasoning, and traditional RL struggles with inefficient exploration in training LLMs for function calling.
Method: EGPO uses an entropy-enhanced advantage function that integrates CoT entropy into policy gradients with a clipping mechanism, complemented by binary reward signals.
Result: A 4B-parameter model trained with EGPO sets new state-of-the-art on Berkeley Function Calling Leaderboard, surpassing GPT-4o and Gemini-2.5.
Conclusion: EGPO effectively balances exploration of reasoning paths with stable policy optimization for improved function calling in LLMs.
Abstract: The effective training of Large Language Models (LLMs) for function calling faces a critical challenge: balancing exploration of complex reasoning paths with stable policy optimization. Standard methods like Supervised Fine-Tuning (SFT) fail to instill robust reasoning, and traditional Reinforcement Learning (RL) struggles with inefficient exploration. We propose EGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO), designed to address this challenge directly. The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model’s Chain-of-Thought (CoT) into the policy gradient computation. This encourages the generation of diverse reasoning strategies. To maintain optimization direction, the entropy bonus is carefully constrained by a clipping mechanism. Complemented by a strict, binary reward signal, EGPO effectively guides the model towards discovering structured and accurate tool invocation patterns. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size, surpassing a range of strong competitors, including GPT-4o and Gemini-2.5.
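One simple way to realize an entropy-enhanced, group-relative advantage is to add a clipped entropy bonus to each rollout's reward before group normalization. The sketch below is our reading of the idea; the exact way EGPO folds CoT entropy into the policy gradient may differ.

```python
import torch

def egpo_style_advantages(rewards, cot_entropies, beta=0.1, clip=0.5):
    """Group-relative advantages with a clipped Chain-of-Thought entropy bonus:
    the bonus rewards diverse reasoning, while the clip keeps it from
    overwhelming the binary task reward."""
    r = torch.as_tensor(rewards, dtype=torch.float32)        # binary rewards per rollout
    h = torch.as_tensor(cot_entropies, dtype=torch.float32)  # mean token entropy of each CoT
    bonus = torch.clamp(beta * h, max=clip)                  # bounded exploration bonus
    shaped = r + bonus
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)  # group-relative normalization
```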
[541] Orthogonal Representation Learning for Estimating Causal Quantities
Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel
Main category: cs.LG
TL;DR: This paper bridges representation learning with Neyman-orthogonal learners, showing that combining them (OR-learners) improves estimation under low-dimensional manifold assumptions, while balancing constraints alone cannot compensate for lack of Neyman-orthogonality.
Details
Motivation: Address the tension between practical end-to-end representation learning (lacking asymptotic optimality) and theoretically optimal Neyman-orthogonal learners (not benefiting from representation learning strengths).
Method: Introduce a unifying framework connecting representation learning with Neyman-orthogonal learners (OR-learners), with theoretical and empirical analysis under low-dimensional manifold hypothesis.
Result: OR-learners strictly improve estimation error of standard Neyman-orthogonal learners under low-dimensional manifold hypothesis; balancing constraints require additional inductive bias and cannot generally compensate for lack of Neyman-orthogonality.
Conclusion: Provides guidelines for effectively combining representation learning with classical Neyman-orthogonal learners to achieve both practical performance and theoretical guarantees.
Abstract: End-to-end representation learning has become a powerful tool for estimating causal quantities from high-dimensional observational data, but its efficiency has remained unclear. Here, we face a central tension: end-to-end representation learning methods often work well in practice but lack asymptotic optimality in the form of quasi-oracle efficiency. In contrast, two-stage Neyman-orthogonal learners provide such a theoretical optimality property but do not explicitly benefit from the strengths of representation learning. In this work, we step back and ask two research questions: (1) When do representations strengthen existing Neyman-orthogonal learners? and (2) Can a balancing constraint, a technique commonly proposed in the representation learning literature, provide improvements to Neyman-orthogonality? We address these two questions through our theoretical and empirical analysis, where we introduce a unifying framework that connects representation learning with Neyman-orthogonal learners (namely, OR-learners). In particular, we show that, under the low-dimensional manifold hypothesis, OR-learners can strictly improve the estimation error of standard Neyman-orthogonal learners. At the same time, we find that the balancing constraint requires an additional inductive bias and cannot generally compensate for the lack of Neyman-orthogonality of the end-to-end approaches. Building on these insights, we offer guidelines for how users can effectively combine representation learning with classical Neyman-orthogonal learners to achieve both practical performance and theoretical guarantees.
[542] AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance
Lixuan He, Jie Feng, Yong Li
Main category: cs.LG
TL;DR: AMFT is a single-stage fine-tuning algorithm that dynamically balances SFT and RL using meta-gradient adaptive weight control, achieving state-of-the-art performance on reasoning tasks.
Details
Motivation: Traditional two-stage SFT+RL pipeline suffers from catastrophic forgetting and suboptimal trade-offs between imitation and exploration, while existing single-stage methods lack principled mechanisms for balancing SFT and RL.
Method: AMFT treats SFT and RL as complementary reward signals and uses a meta-gradient adaptive weight controller to dynamically learn the optimal balance between them, regularized by policy entropy for stability.
Result: AMFT achieves state-of-the-art performance on mathematical reasoning, abstract visual reasoning, and vision-language navigation benchmarks, with superior generalization on out-of-distribution tasks.
Conclusion: AMFT provides a more principled and effective paradigm for LLM alignment through its meta-learning controller that enables stable, sample-efficient training and autonomous curriculum discovery.
Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT’s implicit, path-level reward and RL’s explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamics analysis confirm that the meta-learning controller is crucial for AMFT’s stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM alignment. Our codes are open-sourced via https://github.com/hlxtsyj/AMFT.
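The blended objective that the meta-learned controller parameterizes can be written in a few lines. This sketch shows only the weighted combination; in the paper the balance parameter is updated by a meta-gradient through long-term task performance, which we do not reproduce here.

```python
import torch

log_w = torch.zeros((), requires_grad=True)   # learnable SFT/RL balance parameter
meta_opt = torch.optim.SGD([log_w], lr=1e-2)  # the outer (meta) loop would step this
                                              # on a long-term reward proxy

def amft_style_loss(sft_loss, rl_loss, entropy, ent_coef=0.01):
    """A learnable weight w blends the implicit SFT (imitation) reward with the
    explicit RL (outcome) reward; policy entropy regularizes for stability."""
    w = torch.sigmoid(log_w)
    return w * sft_loss + (1 - w) * rl_loss - ent_coef * entropy
```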
[543] Causal Additive Models with Unobserved Causal Paths and Backdoor Paths
Thong Pham, Takashi Nicholas Maeda, Shohei Shimizu
Main category: cs.LG
TL;DR: This paper establishes sufficient conditions for identifying causal directions in causal additive models with hidden variables, particularly addressing the challenging case of ‘bow’ structures where observed variables share hidden common parents.
Details
Motivation: Causal relationships often become unidentifiable when unobserved backdoor or causal paths exist between variables, especially in the difficult case of 'bow' structures where adjacent observed variables share hidden common parents.
Method: The authors develop new characterizations of regression sets and use a hybrid approach combining independence among regression residuals with conditional independencies among observed variables. They provide a sound and complete algorithm based on these insights.
Result: The paper establishes identifiability conditions for parent-child relationships in bow structures, representing a significant advancement as no prior work had achieved such identifiability without imposing assumptions on hidden variables. Empirical evaluations show competitive performance with state-of-the-art methods.
Conclusion: The research provides a theoretical foundation and practical algorithm for causal discovery in the presence of hidden variables, successfully addressing the challenging case of bow structures where causal directions were previously unidentifiable.
Abstract: Causal additive models provide a tractable yet expressive framework for causal discovery in the presence of hidden variables. However, when unobserved backdoor or causal paths exist between two variables, their causal relationship is often unidentifiable under existing theories. We establish sufficient conditions under which causal directions can be identified in many such cases. In particular, we derive conditions that enable identification of the parent-child relationship in a bow, an adjacent pair of observed variables sharing a hidden common parent. This represents a notoriously difficult case in causal discovery, and, to our knowledge, no prior work has established such identifiability in any causal model without imposing assumptions on the hidden variables. Our conditions rely on new characterizations of regression sets and a hybrid approach that combines independence among regression residuals with conditional independencies among observed variables. We further provide a sound and complete algorithm that incorporates these insights, and empirical evaluations demonstrate competitive performance with state-of-the-art methods.
[544] Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data
Harsh Poonia, Felix Divo, Kristian Kersting, Devendra Singh Dhami
Main category: cs.LG
TL;DR: Proposes GC-xLSTM, a Granger causality method using Extended LSTM with dynamic sparsity enforcement to capture long-range dependencies in time series.
Details
Motivation: Granger causal methods struggle with capturing long-range relations between variables in time series, especially with non-linear dependencies.
Method: Leverages xLSTM architecture with novel dynamic loss penalty for sparsity enforcement, adaptive model improvement, and joint optimization to robustly recover Granger causal relations.
Result: Experimental evaluation on six diverse datasets demonstrates overall efficacy of GC-xLSTM.
Conclusion: GC-xLSTM effectively addresses limitations of traditional Granger causal methods in capturing long-range dependencies through xLSTM architecture and dynamic sparsity enforcement.
Abstract: Causality in time series can be challenging to determine, especially in the presence of non-linear dependencies. Granger causality helps analyze potential relationships between variables, thereby offering a method to determine whether one time series can predict, i.e., Granger-cause, future values of another. Although successful, Granger causal methods still struggle with capturing long-range relations between variables. To this end, we leverage the recently successful Extended Long Short-Term Memory (xLSTM) architecture and propose Granger causal xLSTMs (GC-xLSTM). GC-xLSTM first enforces sparsity between the time series components by using a novel dynamic loss penalty on the initial projection. Specifically, we adaptively improve the model and identify sparsity candidates. Our joint optimization procedure then ensures that the Granger causal relations are recovered robustly. Our experimental evaluation on six diverse datasets demonstrates the overall efficacy of GC-xLSTM.
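The sparsity mechanism is a group penalty on the initial projection: input columns are grouped by source series, and a group driven to zero indicates no Granger-causal influence from that series. A sketch with a fixed penalty weight (the paper's penalty is dynamic):

```python
import torch

def granger_sparsity_penalty(in_proj_weight, n_series, lam=0.01):
    """Group-lasso-style penalty on an input projection of shape
    (out_dim, n_series * lag): one norm per candidate cause, so small norms
    read off as 'this series Granger-causes nothing here'."""
    groups = in_proj_weight.view(in_proj_weight.size(0), n_series, -1)
    group_norms = groups.norm(dim=(0, 2))        # one norm per candidate cause
    return lam * group_norms.sum(), group_norms  # norms ~ Granger-causal strengths
```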
[545] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, Jingren Zhou
Main category: cs.LG
TL;DR: CHORD is a framework that unifies SFT and RL through dynamic weighting, treating SFT as an auxiliary objective within RL to prevent disruption of learned patterns and overfitting.
Details
Motivation: Existing approaches integrating SFT and RL risk disrupting established response patterns and causing overfitting to expert data, requiring a more harmonized approach.
Method: CHORD uses a dual-control mechanism with global coefficient for holistic transition from off-policy imitation to on-policy exploration, and token-wise weighting for granular learning from expert data.
Result: Extensive experiments on mathematical reasoning and tool-use tasks show CHORD achieves stable and efficient learning with significant improvements over baselines.
Conclusion: CHORD effectively harmonizes off-policy expert data with on-policy exploration, demonstrating a unified approach to SFT and RL integration.
Abstract: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. To address this, we present a novel investigation into the unified view of SFT and RL through an off-policy versus on-policy lens. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting, which reframes SFT not as a separate stage but as a dynamically weighted auxiliary objective within the on-policy RL process. Based on an analysis of off-policy expert data’s influence at both holistic and granular levels, we incorporate a dual-control mechanism in CHORD. Specifically, the framework first employs a global coefficient to holistically guide the transition from off-policy imitation to on-policy exploration, and then applies a token-wise weighting function that enables granular learning from the expert, which promotes on-policy exploration and mitigates disruption from off-policy data. We conduct extensive experiments on mathematical reasoning problems and practical tool-use tasks, providing empirical evidence that CHORD achieves a stable and efficient learning process. By effectively harmonizing off-policy expert data with on-policy exploration, CHORD demonstrates significant improvements over baselines. We release the implementation at https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord to inspire further research.
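The dual-control mechanism can be pictured as a global schedule times a token-wise weight. In the sketch below, the global coefficient anneals linearly and expert tokens the policy already predicts confidently are down-weighted; both choices are plausible stand-ins, not the paper's exact functions.

```python
import torch

def chord_style_loss(expert_token_logps, rl_loss, step, total_steps):
    """Dual control: a global coefficient mu shifts training from off-policy
    imitation toward on-policy exploration, while a token-wise weight focuses
    the SFT signal on expert tokens the policy has not yet learned."""
    mu = max(0.0, 1.0 - step / total_steps)         # global off- to on-policy schedule
    p = expert_token_logps.exp()                    # policy probability of expert tokens
    token_w = 1.0 - p                               # down-weight already-learned tokens
    sft_loss = -(token_w * expert_token_logps).mean()
    return mu * sft_loss + (1 - mu) * rl_loss
```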
[546] Detecting and Filtering Unsafe Training Data via Data Attribution with Denoised Representation
Yijun Pan, Taiwei Shi, Jieyu Zhao, Jiaqi W. Ma
Main category: cs.LG
TL;DR: DRA improves unsafe data detection by denoising representations to focus on critical unsafe tokens, outperforming moderation classifiers.
Details
Motivation: LLMs are sensitive to unsafe training data, but current moderation classifiers are computationally expensive and limited by predefined taxonomies. Existing data attribution methods suffer from noisy representations due to neutral tokens in unsafe texts.
Method: Proposed Denoised Representation Attribution (DRA) that denoises training and target representations by focusing on critical tokens that make texts unsafe, filtering out neutral tokens like stop words.
Result: DRA significantly improves data attribution methods across jailbreak filtering and gender bias detection tasks, outperforming state-of-the-art moderation classifier approaches.
Conclusion: DRA provides an effective alternative to moderation classifiers for unsafe data detection, addressing the noise problem in representation-based attribution methods.
Abstract: Large language models (LLMs) are highly sensitive to even small amounts of unsafe training data, making effective detection and filtering essential for trustworthy model development. Current state-of-the-art (SOTA) detection approaches primarily rely on moderation classifiers, which require significant computation overhead for training and are limited to predefined taxonomies. In this work, we explore data attribution approaches that measure the similarity between individual training samples and a small set of unsafe target examples, based on data representations such as hidden states or gradients. We identify a key limitation in existing methods: unsafe target texts contain both critical tokens that make them unsafe and neutral tokens (e.g., stop words or benign facts) that are necessary to form fluent language, and the latter of which makes the overall representations "noisy" for the purpose of detecting unsafe training data. To address this challenge, we propose Denoised Representation Attribution (DRA), a novel representation-based data attribution approach that denoises training and target representations for unsafe data detection. Across tasks of filtering jailbreaks and detecting gender bias, the proposed approach leads to significant improvement for data attribution methods, outperforming SOTA methods that are mostly based on moderation classifiers.
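The denoising step amounts to dropping neutral tokens before pooling the target representation. A sketch where a hypothetical `target_token_scores` (e.g., gradient norms or classifier saliency) picks out the safety-critical tokens:

```python
import torch
import torch.nn.functional as F

def denoised_similarity(train_hidden, target_hidden, target_token_scores, k=16):
    """Keep only the k most safety-critical target tokens, pool their hidden
    states, and attribute training samples by cosine similarity to this
    denoised target representation. Shapes: train_hidden (N, T, d),
    target_hidden (T', d), target_token_scores (T',)."""
    keep = target_token_scores.topk(k).indices       # drop neutral/stop-word tokens
    target_vec = target_hidden[keep].mean(dim=0)     # denoised target representation
    train_vecs = train_hidden.mean(dim=1)            # (N, d) pooled training reps
    return F.cosine_similarity(train_vecs, target_vec.unsqueeze(0))  # higher = more unsafe-like
```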
[547] On Task Vectors and Gradients
Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D’Inverno, Fabrizio Silvestri, Emanuele Rodolà
Main category: cs.LG
TL;DR: Task arithmetic works because task vectors approximate negative gradients of task losses, with first-epoch gradients dominating the finetuning trajectory, making single-epoch finetuning sufficient for effective model merging.
Details
Motivation: Despite empirical success of task arithmetic for model merging, there was no clear theoretical explanation for why and when it works effectively.
Method: Established theoretical connection between task vectors and gradients of task losses, proved equivalence under gradient descent, and bounded second-order error terms for multi-epoch finetuning in feed-forward networks.
Result: Empirical analysis across seven vision benchmarks confirmed that first-epoch gradients dominate finetuning in both norm and direction, and single-epoch finetuning yields comparable performance to fully converged models.
Conclusion: Task arithmetic can be reframed as approximate multitask learning, with early training dynamics playing a critical role in effective model merging.
Abstract: Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
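The core identity is worth writing out. For one epoch of full-batch gradient descent, the task vector is exactly a scaled negative gradient, and merging task vectors becomes one joint multitask step:

```latex
% One epoch of full-batch gradient descent from pretrained weights \theta_0
% with learning rate \eta:
\theta_1 = \theta_0 - \eta \, \nabla L(\theta_0)
\quad\Longrightarrow\quad
\tau = \theta_1 - \theta_0 = -\eta \, \nabla L(\theta_0).
% Adding task vectors is therefore one joint (multitask) gradient step,
% which is why task arithmetic behaves like approximate multitask learning:
\theta_0 + \tau_A + \tau_B
  = \theta_0 - \eta \bigl( \nabla L_A(\theta_0) + \nabla L_B(\theta_0) \bigr).
% Multi-epoch finetuning adds the second-order error term the paper bounds.
```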
[548] COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao
Main category: cs.LG
TL;DR: COSMOS is a hybrid optimizer for LLMs that combines SOAP for important gradient subspaces and MUON for less critical ones, achieving memory efficiency without performance loss.
Details
Motivation: Existing optimizers like AdamW fail to capture coordinate interdependencies, while SOAP has high memory overhead and low-dimensional projection methods have approximation errors. COSMOS addresses these limitations.
Method: COSMOS leverages gradient matrix eigensubspaces, applying SOAP to the leading subspace (primary optimization dynamics) and MUON to the remaining subspace, creating a memory-efficient hybrid approach.
Result: COSMOS significantly reduces memory consumption while maintaining robust optimization performance, making it suitable for massive LLMs across various datasets and transformer architectures.
Conclusion: COSMOS provides an effective solution for LLM optimization by balancing memory efficiency and performance through strategic subspace handling, enabling better scalability for large models.
Abstract: Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.
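A schematic of the hybrid split, not the paper's exact update rules: the gradient matrix is divided into its leading rank-k eigensubspace and the remainder, with an Adam-style adaptive step standing in for SOAP and an SVD-based orthogonalization standing in for MUON's Newton-Schulz iteration:

```python
import numpy as np

def hybrid_update(G, k=4, lr=1e-2, eps=1e-8, state=None):
    """Schematic COSMOS-style split of a gradient matrix G (m x n).

    The leading rank-k subspace gets an adaptive (Adam-like) update,
    standing in for SOAP; the residual gets an orthogonalized update,
    standing in for MUON. Both stand-ins are simplifications."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    G_lead = (U[:, :k] * s[:k]) @ Vt[:k]   # leading eigensubspace component
    G_rest = G - G_lead                    # remaining subspace component

    if state is None:
        state = {"v": np.zeros_like(G)}
    state["v"] = 0.99 * state["v"] + 0.01 * G_lead**2
    step_lead = G_lead / (np.sqrt(state["v"]) + eps)   # adaptive step

    # Orthogonalize the residual (MUON uses a Newton-Schulz iteration;
    # SVD-based orthogonalization is the exact version of the same idea).
    Ur, _, Vtr = np.linalg.svd(G_rest, full_matrices=False)
    step_rest = Ur @ Vtr

    return lr * (step_lead + step_rest), state

G = np.random.default_rng(1).normal(size=(16, 8))
update, _ = hybrid_update(G)
print(update.shape)  # (16, 8)
```

The memory saving comes from keeping adaptive second-moment statistics only for the small leading subspace rather than the full gradient.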
[549] On the Interpolation Effect of Score Smoothing in Diffusion Models
Zhengdao Chen
Main category: cs.LG
TL;DR: Score-based diffusion models generate creative samples through score smoothing effects in neural networks, enabling interpolation between training data points without full memorization.
Details
Motivation: To understand how diffusion models generate novel samples not present in training data, specifically investigating the hypothesis that creativity arises from interpolation effects caused by smoothing of the empirical score function.
Method: Theoretical analysis of regularized two-layer ReLU neural networks learning smoothed empirical score functions, combined with analytical solutions and numerical experiments in one-dimensional subspace settings.
Result: Demonstrated that smoothed score functions enable generation of interpolated samples along training data subspaces while avoiding full memorization, with experimental evidence showing score smoothing occurs even without explicit regularization.
Conclusion: Score smoothing in neural networks is a key mechanism enabling diffusion models’ creative generation capabilities, allowing interpolation between training data points rather than simple memorization.
Abstract: Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we study the hypothesis that such creativity arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional subspace, we show theoretically how regularized two-layer ReLU neural networks tend to learn approximately a smoothed version of the empirical score function, and further probe the interplay between score smoothing and the denoising dynamics with analytical solutions and numerical experiments. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate the training data along their subspace while avoiding full memorization. Moreover, we present experimental evidence that learning score functions with neural networks indeed induces a score smoothing effect, including in simple nonlinear settings and without explicit regularization.
[550] Latent Variable Modeling in Multi-Agent Reinforcement Learning via Expectation-Maximization for UAV-Based Wildlife Protection
Mazyar Taghavi, Rahman Farnoosh
Main category: cs.LG
TL;DR: EM-based latent variable modeling in MARL for UAV coordination in wildlife protection, showing superior performance over PPO and DDPG in detecting poachers.
Details
Motivation: Protecting endangered wildlife from illegal poaching in vast, partially observable environments where real-time response is critical.
Method: Expectation-Maximization (EM) based latent variable modeling approach in Multi-Agent Reinforcement Learning (MARL) for UAV coordination, modeling hidden environmental factors and inter-agent dynamics.
Result: Superior performance in detection accuracy, adaptability, and policy convergence compared to standard algorithms (PPO and DDPG) in simulations with 10 UAVs protecting Iranian leopard habitats.
Conclusion: Combining EM inference with MARL improves decentralized decision-making in complex conservation scenarios, with full implementation publicly available.
Abstract: Protecting endangered wildlife from illegal poaching presents a critical challenge, particularly in vast and partially observable environments where real-time response is essential. This paper introduces a novel Expectation-Maximization (EM) based latent variable modeling approach in the context of Multi-Agent Reinforcement Learning (MARL) for Unmanned Aerial Vehicle (UAV) coordination in wildlife protection. By modeling hidden environmental factors and inter-agent dynamics through latent variables, our method enhances exploration and coordination under uncertainty. We implement and evaluate our EM-MARL framework using a custom simulation involving 10 UAVs tasked with patrolling protected habitats of the endangered Iranian leopard. Extensive experimental results demonstrate superior performance in detection accuracy, adaptability, and policy convergence when compared to standard algorithms such as Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradient (DDPG). Our findings underscore the potential of combining EM inference with MARL to improve decentralized decision-making in complex, high-stakes conservation scenarios. The full implementation, simulation environment, and training scripts are publicly available on GitHub.
[551] Aggregation on Learnable Manifolds for Asynchronous Federated Optimization
Archie Licudi, Anshul Thakur, Soheila Molaei, Danielle Belgrave, David Clifton
Main category: cs.LG
TL;DR: AsyncBezier and OrthoDC: A geometric framework for asynchronous federated learning that uses Bezier curves to avoid loss barriers and orthogonal projections to reduce interference from stale updates.
Details
Motivation: Address two key issues in asynchronous FL with heterogeneous clients: curvature-induced loss barriers from linear parameter interpolation and interference from stale updates misaligned with server optimization state.
Method: Proposes geometric framework casting aggregation as curve learning in Riemannian model space, using Bezier trajectories for aggregation and OrthoDC for orthogonal projection of delayed updates to reduce interference.
Result: Consistently improves accuracy and client fairness over strong asynchronous baselines on three datasets including LEAF Shakespeare and FEMNIST, with gains preserved even when other methods get higher compute budget.
Conclusion: The geometric framework with AsyncBezier and OrthoDC effectively addresses key challenges in asynchronous FL, providing convergence guarantees and practical performance improvements across diverse domains.
Abstract: Asynchronous federated learning (FL) with heterogeneous clients faces two key issues: curvature-induced loss barriers encountered by standard linear parameter interpolation techniques (e.g. FedAvg) and interference from stale updates misaligned with the server’s current optimisation state. To alleviate these issues, we introduce a geometric framework that casts aggregation as curve learning in a Riemannian model space and decouples trajectory selection from update conflict resolution. Within this, we propose AsyncBezier, which replaces linear aggregation with low-degree polynomial (Bezier) trajectories to bypass loss barriers, and OrthoDC, which projects delayed updates via inner product-based orthogonality to reduce interference. We establish framework-level convergence guarantees covering each variant given simple assumptions on their components. On three datasets spanning general-purpose and healthcare domains, including LEAF Shakespeare and FEMNIST, our approach consistently improves accuracy and client fairness over strong asynchronous baselines; finally, we show that these gains are preserved even when other methods are allocated a higher local compute budget.
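A minimal sketch of the two ingredients under simplifying assumptions: the Bezier control point is given rather than learned, and the server's optimization state is summarized by a single direction vector:

```python
import numpy as np

def bezier_point(theta_a, theta_b, control, t):
    """Quadratic Bezier curve between two parameter vectors. This replaces
    linear interpolation (1-t)*a + t*b with a curved path through a control
    point, the idea behind AsyncBezier (the paper learns the control point;
    here it is simply given)."""
    return (1 - t) ** 2 * theta_a + 2 * (1 - t) * t * control + t ** 2 * theta_b

def orthogonal_deconflict(stale_update, server_direction, eps=1e-12):
    """Project a stale client update onto the orthogonal complement of the
    server's current update direction, removing the interfering component
    (a simple reading of OrthoDC's inner-product-based projection)."""
    d = server_direction / (np.linalg.norm(server_direction) + eps)
    return stale_update - (stale_update @ d) * d

rng = np.random.default_rng(0)
theta_server, theta_clients, ctrl = rng.normal(size=(3, 10))
print(bezier_point(theta_server, theta_clients, ctrl, t=0.5).shape)  # (10,)

u = orthogonal_deconflict(rng.normal(size=10), rng.normal(size=10))
print(u.shape)  # (10,)
```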
[552] From Federated Learning to X-Learning: Breaking the Barriers of Decentrality Through Random Walks
Allan Salihovic, Payam Abdisarabshali, Michael Langberg, Seyyedali Hosseinalipour
Main category: cs.LG
TL;DR: X-Learning (XL) is a novel distributed learning architecture that extends decentralization concepts, with connections to graph theory and Markov chains, and presents open research directions.
Details
Motivation: To introduce and explore the design considerations and degrees of freedom in X-Learning, a generalized distributed learning architecture that goes beyond traditional decentralization approaches.
Method: The paper provides a perspective on XL by examining its connections to graph theory and Markov chains, and identifies unexplored design considerations and degrees of freedom in the architecture.
Result: The analysis reveals intuitive but non-trivial connections between XL, graph theory, and Markov chains, providing a foundation for understanding the architecture’s theoretical underpinnings.
Conclusion: XL represents a promising distributed learning architecture that generalizes decentralization concepts, with identified open research directions to stimulate further investigation and development in this area.
Abstract: We provide our perspective on X-Learning (XL), a novel distributed learning architecture that generalizes and extends the concept of decentralization. Our goal is to present a vision for XL, introducing its unexplored design considerations and degrees of freedom. To this end, we shed light on the intuitive yet non-trivial connections between XL, graph theory, and Markov chains. We also present a series of open research directions to stimulate further research.
[553] MergeBench: A Benchmark for Merging Domain-Specialized LLMs
Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao
Main category: cs.LG
TL;DR: MergeBench is a comprehensive evaluation suite for model merging methods, testing 8 techniques across 5 domains using Llama and Gemma models (2B-9B scale) to provide practical guidelines and insights.
Details
Motivation: Existing model merging evaluations are limited in scale and task diversity, leaving questions about applicability to large, domain-specialized LLMs. MergeBench addresses this gap by providing standardized evaluation protocols.
Method: Built on state-of-the-art open-source LLMs (Llama and Gemma families, 2B-9B scale), covering 5 domains: instruction following, mathematics, multilingual understanding, coding, and safety. Standardized finetuning and evaluation protocols to assess 8 merging methods.
Result: Model merging performs better on stronger base models. Techniques like merging coefficient tuning and sparsification improve knowledge retention. However, challenges remain including computational costs, performance gap compared to multi-task models, and unexplored role in standard LLM pipelines.
Conclusion: MergeBench provides a foundation for future research to advance understanding and practical application of model merging, with identified challenges pointing to areas for improvement in computational efficiency and integration with standard training pipelines.
Abstract: Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at https://yifei-he.github.io/mergebench/.
[554] What’s Inside Your Diffusion Model? A Score-Based Riemannian Metric to Explore the Data Manifold
Simone Azeglio, Arianna Di Bernardo
Main category: cs.LG
TL;DR: The paper introduces a score-based Riemannian metric using diffusion model Stein scores to characterize data manifold geometry, enabling geodesic computation for interpolation and extrapolation that outperforms baselines on perceptual metrics.
Details
Motivation: To understand the geometric properties of data manifolds learned by diffusion models, which remain poorly characterized despite their ability to capture complex image distributions.
Method: Develop a score-based Riemannian metric using Stein score functions from diffusion models to define a metric tensor that stretches distances perpendicular to the manifold while preserving tangential distances, with efficient algorithms for computing geodesics.
Result: Experiments on synthetic data, Rotated MNIST, and Stable Diffusion show score-based geodesics capture meaningful transformations respecting data distribution, consistently outperforming baselines on LPIPS, FID, and KID metrics with smoother, more realistic image transitions.
Conclusion: The method reveals implicit geometric structure learned by diffusion models and provides a principled Riemannian geometry approach to navigate natural image manifolds.
Abstract: Recent advances in diffusion models have demonstrated their remarkable ability to capture complex image distributions, but the geometric properties of the learned data manifold remain poorly understood. We address this gap by introducing a score-based Riemannian metric that leverages the Stein score function from diffusion models to characterize the intrinsic geometry of the data manifold without requiring explicit parameterization. Our approach defines a metric tensor in the ambient space that stretches distances perpendicular to the manifold while preserving them along tangential directions, effectively creating a geometry where geodesics naturally follow the manifold’s contours. We develop efficient algorithms for computing these geodesics and demonstrate their utility for both interpolation between data points and extrapolation beyond the observed data distribution. Through experiments on synthetic data with known geometry, Rotated MNIST, and complex natural images via Stable Diffusion, we show that our score-based geodesics capture meaningful transformations that respect the underlying data distribution. Our method consistently outperforms baseline approaches on perceptual metrics (LPIPS) and distribution-level metrics (FID, KID), producing smoother, more realistic image transitions. These results reveal the implicit geometric structure learned by diffusion models and provide a principled way to navigate the manifold of natural images through the lens of Riemannian geometry.
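One metric tensor with the behavior the abstract describes: it stretches distances along the normalized score direction (roughly perpendicular to the data manifold, where density falls off) and leaves tangential directions untouched. This illustrative form is not necessarily the paper's exact construction:

```python
import numpy as np

def metric_tensor(score, lam=10.0, eps=1e-12):
    """A metric of the form g = I + lam * s_hat s_hat^T, which penalizes
    motion along the (normalized) score direction and leaves tangential
    directions unchanged. Illustrative only; the paper's tensor may differ."""
    d = score.shape[0]
    n = np.linalg.norm(score)
    if n < eps:
        return np.eye(d)
    s_hat = score / n
    return np.eye(d) + lam * np.outer(s_hat, s_hat)

def curve_length(points, score_fn):
    """Discretized Riemannian length: sum of sqrt(dx^T g(x) dx)."""
    total = 0.0
    for x0, x1 in zip(points[:-1], points[1:]):
        dx = x1 - x0
        g = metric_tensor(score_fn(0.5 * (x0 + x1)))
        total += np.sqrt(dx @ g @ dx)
    return total

# Toy score for a density concentrated on the x-axis: points back toward y=0.
score_fn = lambda x: np.array([0.0, -x[1]])
path = np.stack([np.linspace(-1, 1, 50), np.zeros(50)], axis=1)
print(curve_length(path, score_fn))  # tangential path: close to Euclidean 2.0
```

Geodesics under such a metric prefer to stay on the manifold, which is why they yield on-distribution interpolations.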
[555] Robust LLM Training Infrastructure at ByteDance
Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
Main category: cs.LG
TL;DR: ByteRobust is a GPU infrastructure management system designed to ensure robust and stable training of large language models by enabling high-capacity fault tolerance, prompt fault detection, and efficient recovery mechanisms.
Details
Motivation: The expansion of LLM training to tens of thousands of GPUs has increased failure prevalence (CUDA errors, NaN values, job hangs), posing significant challenges to training stability and requiring minimal interruptions, efficient diagnosis, and effective failure tolerance.
Method: ByteRobust exploits the uniqueness of LLM training processes, prioritizing routine failure detection and recovery. It leverages parallelisms and characteristics of LLM training with a data-driven approach for fault demarcation and localization.
Result: Deployed on a production GPU platform with over 200,000 GPUs, ByteRobust achieved 97% ETTR (Effective Training Time Ratio) for a three-month training job on 9,600 GPUs.
Conclusion: ByteRobust comprehensively ensures continuous and efficient training of LLM tasks through its robust infrastructure management capabilities tailored for large-scale GPU environments.
Abstract: The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
[556] CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models
Benjamin Herdeanu, Juan Nathaniel, Carla Roesch, Jatan Buch, Gregor Ramien, Johannes Haux, Pierre Gentine
Main category: cs.LG
TL;DR: CausalDynamics is a large-scale benchmark and data generation framework for causal discovery in dynamical systems, featuring true causal graphs from thousands of differential equations and climate models.
Details
Motivation: Address limitations in current causal discovery methods that are tailored to deterministic, low-dimensional, weakly nonlinear time-series data, and enable research in systems where active interventions are infeasible.
Method: Developed a benchmark with true causal graphs from linearly/nonlinearly coupled ordinary/stochastic differential equations and idealized climate models, featuring plug-and-play workflow for building physical system hierarchies.
Result: Comprehensive evaluation of state-of-the-art causal discovery algorithms on systems with noisy, confounded, and lagged dynamics.
Conclusion: The framework facilitates development of robust causal discovery algorithms applicable across domains while addressing unique challenges, with user-friendly implementation available.
Abstract: Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of both linearly and nonlinearly coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges. We provide a user-friendly implementation and documentation on https://kausable.github.io/CausalDynamics.
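A toy version of the data-generation recipe, assuming a linear SDE whose coupling matrix doubles as the ground-truth causal graph (the benchmark's own generator also covers nonlinear couplings, lags, and climate models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth causal graph: A[i, j] != 0 means variable j drives variable i.
A = np.array([[-1.0, 0.0, 0.0],
              [0.8, -1.0, 0.0],
              [0.0, 0.5, -1.0]])
true_graph = (A != 0) & ~np.eye(3, dtype=bool)

def simulate_sde(A, T=2000, dt=0.01, noise=0.1):
    """Euler-Maruyama simulation of dX = A X dt + noise dW, the kind of
    linearly coupled stochastic system the benchmark generates."""
    d = A.shape[0]
    X = np.zeros((T, d))
    for t in range(1, T):
        drift = A @ X[t - 1]
        X[t] = X[t - 1] + drift * dt + noise * np.sqrt(dt) * rng.normal(size=d)
    return X

X = simulate_sde(A)
print(X.shape, true_graph.astype(int))
# A causal discovery method is then scored by how well its estimated graph
# recovers `true_graph` from the time series X alone.
```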
[557] Automated Capability Evaluation of Foundation Models
Arash Afkanpour, Omkar Dige, Fatemeh Tavakoli, Negin Baghbanzadeh, Farnaz Kohankhaki, Elham Dolatabadi
Main category: cs.LG
TL;DR: ACE is an automated framework that uses frontier models to decompose domains into capabilities and generate evaluation tasks, reducing human effort while providing comprehensive model assessment through active learning.
Details
Motivation: Current evaluation frameworks for foundation models rely on static, manually curated benchmarks, which limit their ability to fully capture model capabilities and require significant human effort.
Method: ACE leverages frontier models to decompose domains into semantically meaningful capabilities, generates diverse evaluation tasks, and uses active learning with a capability model in latent semantic space to efficiently approximate performance by evaluating only a subset of capabilities.
Result: In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills while introducing novel ones. It came within 0.01 RMSE of exhaustive evaluation while evaluating fewer than half of the capabilities, providing more balanced coverage than static datasets.
Conclusion: ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models, uncovering fine-grained differences that aggregate metrics miss.
Abstract: Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generates diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model’s performance by evaluating only a subset of capabilities via active learning. It reaches within 0.01 RMSE of exhaustive evaluation by evaluating less than half of capabilities. Compared to static datasets, ACE provides more balanced coverage and uncovers fine-grained differences that aggregate metrics fail to capture. Our results demonstrate that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.
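A minimal sketch of the active-learning loop, with random embeddings and a Gaussian-process capability model standing in for the paper's latent semantic space and model choice (all data here is synthetic):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Stand-ins: 200 capabilities embedded in a latent space, each with a hidden
# "true" model score that exhaustive evaluation would reveal.
Z = rng.normal(size=(200, 16))                            # capability embeddings
true_score = 1 / (1 + np.exp(-Z[:, 0] - 0.5 * Z[:, 1]))   # unknown in practice

evaluated = list(rng.choice(200, size=10, replace=False))
gp = GaussianProcessRegressor()

for _ in range(40):  # evaluate 50 of 200 capabilities in total
    gp.fit(Z[evaluated], true_score[evaluated])
    mean, std = gp.predict(Z, return_std=True)
    std[evaluated] = -np.inf                # never re-evaluate
    evaluated.append(int(np.argmax(std)))   # query the most uncertain capability

mean, _ = gp.predict(Z, return_std=True)
rmse = np.sqrt(np.mean((mean - true_score) ** 2))
print(f"RMSE vs. exhaustive evaluation: {rmse:.3f} after {len(evaluated)} evals")
```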
[558] Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm
Yang Chen, Menglin Zou, Jiaqi Zhang, Yitan Zhang, Junyi Yang, Gael Gendron, Libo Zhang, Jiamou Liu, Michael J. Witbrock
Main category: cs.LG
TL;DR: TRRO is a new IRL framework that guarantees monotonic improvement in expert behavior likelihood via Minorization-Maximization, addressing training instability in adversarial IRL methods.
Details
Motivation: Modern adversarial IRL methods suffer from unstable training, while recent non-adversarial approaches lack formal guarantees despite improved stability.
Method: Proposed Trust Region Reward Optimization (TRRO) framework using Minorization-Maximization process, instantiated as Proximal Inverse Reward Optimization (PIRO) algorithm.
Result: PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation with high sample efficiency on MuJoCo, Gym-Robotics benchmarks, and real-world animal behavior modeling.
Conclusion: TRRO provides IRL counterpart to TRPO’s stability guarantees in forward RL, offering both theoretical guarantees and practical performance.
Abstract: Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often leads to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery and policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
[559] Partition Generative Modeling: Masked Modeling Without Masks
Justin Deschenaux, Lan Tran, Caglar Gulcehre
Main category: cs.LG
TL;DR: PGM combines AR and MGM strengths by partitioning tokens with sparse attention, enabling parallel generation while avoiding wasted computation on masked tokens.
Details
Motivation: MGMs waste computation on masked tokens early in sampling, while AR models are slower due to sequential processing. PGM aims to get the best of both approaches.
Method: Partition tokens into groups and use sparse attention to block information flow between partitions, allowing parallel generation while processing only relevant tokens.
Result: 5x+ sampling improvements on OpenWebText with better Generative Perplexity; 7.5x higher throughput on ImageNet than MaskGIT with slight FID increase (5.54 vs 5.35); 3.9x faster than MaskGIT with 2x steps and FID 4.56.
Conclusion: PGM successfully combines AR and MGM advantages, achieving significant speedups while maintaining quality, and integrates well with MGM distillation for further improvements.
Abstract: Masked generative models (MGMs) are widely used to capture complex data and enable faster generation than autoregressive models (AR) through parallel decoding. However, MGMs typically operate on fixed-length inputs, which can be inefficient: early in sampling, most tokens are masked and carry no information, leading to wasted computation. In contrast, AR models process only tokens generated previously, making early iterations faster. In this work, we introduce the Partition Generative Model (PGM), a novel approach that combines the strengths of AR and MGMs. Rather than masking, PGM partitions tokens into two groups and employs sparse attention to block information flow between them. Since there is no information flow between partitions, the model can process the previously-generated tokens only during sampling, while retaining the ability to generate tokens in parallel and in any order. On OpenWebText, PGMs offer at least $5\times$ improvements in sampling latency and throughput, while producing samples with superior Generative Perplexity, compared to Masked Diffusion Language Models. On ImageNet, PGMs achieve a $7.5\times$ higher throughput than MaskGIT, with only a slight increase in FID (5.54 vs. 5.35). With twice as many sampling steps, the FID reduces to 4.56 while being $3.9\times$ faster than MaskGIT. Finally, PGMs integrate seamlessly with MGM distillation, providing further inference speedups.
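The masking idea can be sketched in a few lines. This is a simplified reading in which attention is simply blocked across the two partitions; the model's actual sparse-attention pattern may be more involved:

```python
import numpy as np

def partition_mask(groups):
    """Boolean attention mask that blocks information flow between the two
    partitions: token i may attend to token j only if both sit in the same
    group."""
    g = np.asarray(groups)
    return g[:, None] == g[None, :]

# 8 tokens, randomly split into partitions 0 and 1.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=8)
mask = partition_mask(groups)
print(groups)
print(mask.astype(int))
# Because no information crosses partitions, only the already-generated
# partition needs to be processed during sampling, which is where the
# latency savings come from.
```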
[560] ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation
Mohammadreza Bakhtyari, Bogdan Mazoure, Renato Cordeiro de Amorim, Guillaume Rabusseau, Vladimir Makarenkov
Main category: cs.LG
TL;DR: ClustRecNet is a deep learning framework that recommends the best clustering algorithm for a given dataset, outperforming traditional methods and AutoML approaches.
Details
Motivation: Address the challenge of clustering algorithm selection in unsupervised learning by providing an automated recommendation system.
Method: Built a data repository with 34,000 synthetic datasets processed by 10 clustering algorithms, using ARI for ground truth. Network integrates convolutional, residual, and attention mechanisms to capture dataset patterns.
Result: Outperforms conventional CVIs and state-of-the-art AutoML approaches, achieving 0.497 ARI improvement over Calinski-Harabasz on synthetic data and 15.3% ARI gain over best AutoML on real-world data.
Conclusion: Deep learning-based approach effectively automates clustering algorithm selection, reducing reliance on handcrafted features and traditional validation indices.
Abstract: We introduce ClustRecNet, a novel deep learning (DL)-based recommendation framework for determining the most suitable clustering algorithms for a given dataset, addressing the long-standing challenge of clustering algorithm selection in unsupervised learning. To enable supervised learning in this context, we construct a comprehensive data repository comprising 34,000 synthetic datasets with diverse structural properties. Each of them was processed using 10 popular clustering algorithms. The resulting clusterings were assessed via the Adjusted Rand Index (ARI) to establish ground truth labels, used for training and evaluation of our DL model. The proposed network architecture integrates convolutional, residual, and attention mechanisms to capture both local and global structural patterns from the input data. This design supports end-to-end training to learn compact representations of datasets and enables direct recommendation of the most suitable clustering algorithm, reducing reliance on handcrafted meta-features and traditional Cluster Validity Indices (CVIs). Comprehensive experiments across synthetic and real-world benchmarks demonstrate that our DL model consistently outperforms conventional CVIs (e.g. Silhouette, Calinski-Harabasz, Davies-Bouldin, and Dunn) as well as state-of-the-art AutoML clustering recommendation approaches (e.g. ML2DAC, AutoCluster, and AutoML4Clust). Notably, the proposed model achieves a 0.497 ARI improvement over the Calinski-Harabasz index on synthetic data and a 15.3% ARI gain over the best-performing AutoML approach on real-world data.
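A sketch of how the supervision might be constructed, with one synthetic dataset and three scikit-learn algorithms standing in for the paper's 34,000 datasets and ten algorithms:

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# One synthetic dataset with known generating labels.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

algorithms = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=1.0),
}

# Ground-truth label for the recommender: the algorithm with the best ARI.
scores = {name: adjusted_rand_score(y_true, algo.fit_predict(X))
          for name, algo in algorithms.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
# The deep model is then trained to map the raw dataset directly to `best`,
# skipping handcrafted meta-features and CVIs at recommendation time.
```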
[561] PrivATE: Differentially Private Confidence Intervals for Average Treatment Effects
Maresa Schröder, Justin Hartenstein, Stefan Feuerriegel
Main category: cs.LG
TL;DR: PrivATE is a differentially private framework for computing valid confidence intervals for average treatment effects from observational data while protecting sensitive medical information.
Details
Motivation: In medical applications, estimating treatment effects requires reliable uncertainty quantification but often involves sensitive patient data that must be kept private under differential privacy constraints.
Method: Three-step framework: (1) estimate differentially private ATE via output perturbation, (2) estimate differentially private variance using doubly robust methods, (3) construct CIs accounting for both estimation and privatization uncertainty.
Result: The framework is model-agnostic, doubly robust, and produces valid confidence intervals while maintaining differential privacy, as demonstrated on synthetic and real-world medical datasets.
Conclusion: PrivATE provides the first general, doubly robust framework for valid confidence intervals of average treatment effects under differential privacy, enabling reliable treatment effect evaluation while protecting sensitive medical data.
Abstract: The average treatment effect (ATE) is widely used to evaluate the effectiveness of drugs and other medical interventions. In safety-critical applications like medicine, reliable inferences about the ATE typically require valid uncertainty quantification, such as through confidence intervals (CIs). However, estimating treatment effects in these settings often involves sensitive data that must be kept private. In this work, we present PrivATE, a novel machine learning framework for computing CIs for the ATE under differential privacy. Specifically, we focus on deriving valid privacy-preserving CIs for the ATE from observational data. Our PrivATE framework consists of three steps: (i) estimating the differentially private ATE through output perturbation; (ii) estimating the differentially private variance in a doubly robust manner; and (iii) constructing the CIs while accounting for the uncertainty from both the estimation and privatization steps. Our PrivATE framework is model agnostic, doubly robust, and ensures valid CIs. We demonstrate the effectiveness of our framework using synthetic and real-world medical datasets. To the best of our knowledge, we are the first to derive a general, doubly robust framework for valid CIs of the ATE under ($\varepsilon,\delta$)-differential privacy.
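A numeric sketch of the three steps under pure ε-DP with Laplace output perturbation; the sensitivities below are assumed for illustration, whereas the paper derives them and works under (ε, δ)-DP with doubly robust estimators:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical inputs: a non-private ATE estimate, its estimated per-sample
# variance, and assumed (illustration-only) L1 sensitivities.
ate_hat, var_hat, n = 0.42, 2.5, 1000
sens_ate, sens_var = 0.01, 0.05     # assumed sensitivities
eps_ate, eps_var = 0.5, 0.5         # privacy budget split

# Step (i): differentially private ATE via Laplace output perturbation.
ate_dp = ate_hat + rng.laplace(scale=sens_ate / eps_ate)

# Step (ii): differentially private variance (clipped to stay positive).
var_dp = max(var_hat + rng.laplace(scale=sens_var / eps_var), 1e-6)

# Step (iii): CI accounting for estimation AND privatization uncertainty;
# the Laplace(b) noise added in step (i) has variance 2 b^2.
noise_var = 2 * (sens_ate / eps_ate) ** 2
half_width = stats.norm.ppf(0.975) * np.sqrt(var_dp / n + noise_var)
print(f"95% CI: [{ate_dp - half_width:.3f}, {ate_dp + half_width:.3f}]")
```

The key point is the widened interval in step (iii): ignoring the privatization noise would yield anti-conservative, invalid CIs.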
[562] Large Language Model Agent for Modular Task Execution in Drug Discovery
Janghoon Ock, Radheesh Sharma Meda, Srivathsan Badrinarayanan, Neha S. Aluru, Achuth Chandrasekhar, Amir Barati Farimani
Main category: cs.LG
TL;DR: A modular LLM-powered framework automates early-stage computational drug discovery, performing tasks from data retrieval to 3D structure generation, with demonstrated success in BCL-2 targeting for lymphocytic leukemia.
Details
Motivation: To streamline and automate the complex, multi-step process of early-stage computational drug discovery by leveraging LLM reasoning combined with domain-specific tools.
Method: Modular framework combining LLMs with domain-specific tools for biomedical data retrieval, question answering, molecular generation, ADMET property prediction, molecular refinement, and 3D protein-ligand structure generation using Boltz-2.
Result: In BCL-2 case study: improved contextual accuracy in mechanistic questions; generated chemically diverse molecules; increased QED > 0.6 molecules from 34 to 55; improved Ghose filter compliance from 32 to 55; generated 3D complexes with binding affinity estimates.
Conclusion: The framework effectively supports molecular screening, prioritization, and structure evaluation, with modular design enabling flexible integration of evolving tools for scalable AI-assisted therapeutic discovery.
Abstract: We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information, including FASTA sequences, SMILES representations, and literature, and answered mechanistic questions with improved contextual accuracy compared to standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55. The number of molecules satisfying empirical drug-likeness filters also rose; for example, compliance with the Ghose filter increased from 32 to 55 within a pool of 100 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
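The two drug-likeness checks tracked across refinement rounds can be reproduced with RDKit. This sketch assumes RDKit is available and uses the standard Ghose bounds; the agent's own predictor covers 67 ADMET-related properties:

```python
# Requires RDKit (pip install rdkit).
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, QED

def passes_ghose(mol):
    """Standard Ghose filter bounds on molecular weight, logP, molar
    refractivity, and total atom count (hydrogens included)."""
    return (160 <= Descriptors.MolWt(mol) <= 480
            and -0.4 <= Crippen.MolLogP(mol) <= 5.6
            and 40 <= Crippen.MolMR(mol) <= 130
            and 20 <= Chem.AddHs(mol).GetNumAtoms() <= 70)

smiles_pool = ["CC(=O)Oc1ccccc1C(=O)O",   # aspirin
               "CN1CCC[C@H]1c1cccnc1"]    # nicotine
mols = [Chem.MolFromSmiles(s) for s in smiles_pool]

n_qed = sum(QED.qed(m) > 0.6 for m in mols)
n_ghose = sum(passes_ghose(m) for m in mols)
print(f"QED > 0.6: {n_qed}/{len(mols)}, Ghose-compliant: {n_ghose}/{len(mols)}")
```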
[563] G$^2$RPO: Granular GRPO for Precise Reward in Flow Models
Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai
Main category: cs.LG
TL;DR: G²RPO is a novel framework that improves preference alignment in reinforcement learning for flow models by introducing granular reward assessments and multi-scale advantage integration.
Details
Motivation: Existing methods for integrating RL into diffusion models suffer from sub-optimal preference alignment due to sparse and narrow reward signals, limiting their effectiveness in exploring high-value samples.
Method: Proposes Granular-GRPO with two key components: 1) Singular Stochastic Sampling for step-wise exploration with high reward-noise correlation, and 2) Multi-Granularity Advantage Integration that aggregates advantages across multiple diffusion scales.
Result: Experiments on various reward models show G²RPO significantly outperforms existing flow-based GRPO baselines in both in-domain and out-of-domain evaluations, demonstrating effectiveness and robustness.
Conclusion: The G²RPO framework achieves precise and comprehensive reward assessments, enabling better preference alignment in reinforcement learning of flow models through granular sampling direction evaluation.
Abstract: The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via Stochastic Differential Equations (SDE) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby facilitating a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
[564] Epistemic Errors of Imperfect Multitask Learners When Distributions Shift
Sabina J. Sloman, Michele Caprio, Samuel Kaski
Main category: cs.LG
TL;DR: The paper introduces a principled framework to characterize and eliminate epistemic errors in uncertainty-aware machine learning, providing decompositional error bounds for imperfect multitask learning under distribution shift.
Details
Motivation: To provide uncertainty-aware learners with a systematic way to identify and reduce errors from reducible (epistemic) uncertainty in complex learning scenarios.
Method: Introduces a principled definition of epistemic error and provides a decompositional epistemic error bound for imperfect multitask learning under distribution shift, where source data comes from multiple tasks and target data may differ systematically.
Result: Developed epistemic error bounds that separately attribute errors to different aspects of learning, with specialized bounds for Bayesian transfer learning and distribution shift within ε-neighborhoods, plus a novel definition of negative transfer.
Conclusion: The framework provides a comprehensive approach to understanding and mitigating epistemic errors in uncertainty-aware learning across various challenging scenarios including multitask learning and distribution shift.
Abstract: Uncertainty-aware machine learners, such as Bayesian neural networks, output a quantification of uncertainty instead of a point prediction. In this work, we provide uncertainty-aware learners with a principled framework to characterize, and identify ways to eliminate, errors that arise from reducible (epistemic) uncertainty. We introduce a principled definition of epistemic error, and provide a decompositional epistemic error bound which operates in the very general setting of imperfect multitask learning under distribution shift. In this setting, the training (source) data may arise from multiple tasks, the test (target) data may differ systematically from the source data tasks, and/or the learner may not arrive at an accurate characterization of the source data. Our bound separately attributes epistemic errors to each of multiple aspects of the learning procedure and environment. As corollaries of the general result, we provide epistemic error bounds specialized to the settings of Bayesian transfer learning and distribution shift within $\epsilon$-neighborhoods. We additionally leverage the terms in our bound to provide a novel definition of negative transfer.
[565] Language Model Guided Reinforcement Learning in Quantitative Trading
Adam Darmanin, Vince Vella
Main category: cs.LG
TL;DR: A hybrid framework combining LLMs for high-level strategy generation with RL agents for tactical trading decisions improves both returns and risk metrics compared to standard RL approaches.
Details
Motivation: RL in algorithmic trading suffers from myopic behavior and opaque policies, while LLMs offer strategic reasoning and multi-modal signal interpretation capabilities that can complement RL.
Method: Proposes a hybrid framework where LLMs generate high-level trading strategies to guide RL agents, evaluated through expert review of economic rationale and performance metrics (Sharpe Ratio, Maximum Drawdown).
Result: LLM guidance improves both return and risk metrics relative to standard RL baselines.
Conclusion: The hybrid LLM-RL framework effectively addresses limitations of pure RL approaches in algorithmic trading by leveraging LLMs’ strategic reasoning capabilities.
Abstract: Algorithmic trading requires short-term tactical decisions consistent with long-term financial objectives. Reinforcement Learning (RL) has been applied to such problems, but adoption is limited by myopic behaviour and opaque policies. Large Language Models (LLMs) offer complementary strategic reasoning and multi-modal signal interpretation when guided by well-structured prompts. This paper proposes a hybrid framework in which LLMs generate high-level trading strategies to guide RL agents. We evaluate (i) the economic rationale of LLM-generated strategies through expert review, and (ii) the performance of LLM-guided agents against unguided RL baselines using Sharpe Ratio (SR) and Maximum Drawdown (MDD). Empirical results indicate that LLM guidance improves both return and risk metrics relative to standard RL.
[566] Post-training quantization of vision encoders needs prefixing registers
Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
Main category: cs.LG
TL;DR: RegCache is a training-free method that mitigates outliers in vision encoders by adding prefix tokens, enabling more effective quantization with minimal accuracy loss.
Details
Motivation: Transformer-based vision encoders like CLIP are crucial for multimodal applications but face challenges in quantization due to outlier activations, especially at 8-bit precision.
Method: Introduces outlier-prone but semantically meaningless prefix tokens to prevent other tokens from having outliers, with innovations including middle-layer prefixing and token deletion.
Result: Consistently improves accuracy of quantized models across both text-supervised and self-supervised vision encoders.
Conclusion: RegCache effectively addresses outlier issues in vision encoder quantization without requiring retraining, making it practical for real-time applications.
Abstract: Transformer-based vision encoders – such as CLIP – are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.
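A schematic of middle-layer prefixing and token deletion, with cached zero registers and an arbitrary insertion layer standing in for the paper's recipe:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Schematic RegCache-style wrapper: cached register tokens are prepended
    at a middle layer and deleted before the output, so activation outliers
    can concentrate in tokens that carry no semantics. Illustrative only."""

    def __init__(self, blocks, registers, insert_at):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.registers = registers          # (n_reg, dim), precomputed cache
        self.insert_at = insert_at

    def forward(self, x):                   # x: (batch, n_tokens, dim)
        n_reg = self.registers.shape[0]
        for i, block in enumerate(self.blocks):
            if i == self.insert_at:         # middle-layer prefixing
                reg = self.registers.expand(x.shape[0], -1, -1)
                x = torch.cat([reg, x], dim=1)
            x = block(x)
        return x[:, n_reg:]                 # token deletion before the head

blocks = [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
          for _ in range(6)]
model = ViTWithRegisters(blocks, torch.zeros(4, 64), insert_at=3)
print(model(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```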
[567] Learning Robust Diffusion Models from Imprecise Supervision
Dong-Dong Wu, Jiacheng Cui, Wei Wang, Zhiqiang Shen, Masashi Sugiyama
Main category: cs.LG
TL;DR: DMIS is a unified framework for training robust diffusion models from imprecise supervision, addressing noisy, ambiguous, or incomplete labels in conditional inputs.
Details
Motivation: Conditional diffusion models rely on large datasets that often contain imprecise information in conditional inputs, causing condition mismatch and degraded generation quality.
Method: Derived from likelihood maximization, DMIS decomposes the objective into generative and classification components: generative component models imprecise-label distributions, while classification component uses a diffusion classifier with optimized timestep sampling.
Result: Extensive experiments on diverse imprecise supervision tasks (image generation, weakly supervised learning, noisy dataset condensation) show DMIS consistently produces high-quality, class-discriminative samples.
Conclusion: DMIS is the first systematic study for training robust diffusion models from imprecise supervision and demonstrates effectiveness across various tasks with noisy labels.
Abstract: Conditional diffusion models have achieved remarkable success in various generative tasks recently, but their training typically relies on large-scale datasets that inevitably contain imprecise information in conditional inputs. Such supervision, often stemming from noisy, ambiguous, or incomplete labels, will cause condition mismatch and degrade generation quality. To address this challenge, we propose DMIS, a unified framework for training robust Diffusion Models from Imprecise Supervision, which is the first systematic study within diffusion models. Our framework is derived from likelihood maximization and decomposes the objective into generative and classification components: the generative component models imprecise-label distributions, while the classification component leverages a diffusion classifier to infer class-posterior probabilities, with its efficiency further improved by an optimized timestep sampling strategy. Extensive experiments on diverse forms of imprecise supervision, covering tasks of image generation, weakly supervised learning, and noisy dataset condensation, demonstrate that DMIS consistently produces high-quality and class-discriminative samples.
[568] A Constructive Framework for Nondeterministic Automata via Time-Shared, Depth-Unrolled Feedforward Networks
Sahil Rajesh Dhayalkar
Main category: cs.LG
TL;DR: A formal framework that simulates nondeterministic finite automata (NFAs) using time-shared, depth-unrolled feedforward networks (TS-FFNs), achieving exact recognition of regular languages with parameter count independent of input length.
Details
Motivation: To establish a constructive equivalence between NFAs and neural networks, enabling precise symbolic computation with modern neural architectures while avoiding explicit recurrent structures or post hoc extraction methods.
Method: Symbolically encodes automaton states as binary vectors, transitions as sparse matrix transformations, and nondeterministic branching (including ε-closures) as compositions of shared thresholded updates using TS-FFNs.
Result: Proves every regular language can be recognized exactly by shared-parameter unrolled feedforward networks, with empirical learnability demonstrated through gradient descent training achieving perfect or near-perfect agreement on acceptance, state propagation, and closure dynamics.
Conclusion: The work clarifies the correspondence between automata theory and neural architectures, showing that unrolled feedforward networks can perform precise, interpretable, and trainable symbolic computation.
Abstract: We present a formal and constructive simulation framework for nondeterministic finite automata (NFAs) using time-shared, depth-unrolled feedforward networks (TS-FFNs), i.e., acyclic unrolled computations with shared parameters that are functionally equivalent to unrolled recurrent or state-space models. Unlike prior approaches that rely on explicit recurrent architectures or post hoc extraction methods, our formulation symbolically encodes automaton states as binary vectors, transitions as sparse matrix transformations, and nondeterministic branching (including $\varepsilon$-closures) as compositions of shared thresholded updates. We prove that every regular language can be recognized exactly by such a shared-parameter unrolled feedforward network, with parameter count independent of input length. Our construction yields a constructive equivalence between NFAs and neural networks and demonstrates empirical learnability: these networks can be trained via gradient descent on supervised acceptance data to recover the target automaton behavior. This learnability, formalized in Proposition 5.1, is the crux of this work. Extensive experiments validate the theoretical results, achieving perfect or near-perfect agreement on acceptance, state propagation, and closure dynamics. This work clarifies the correspondence between automata theory and modern neural architectures, showing that unrolled feedforward networks can perform precise, interpretable, and trainable symbolic computation.
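The construction is concrete enough to state directly: states are binary vectors, per-symbol transitions are 0/1 matrices, and one shared thresholded update is unrolled per input symbol. A small NFA accepting strings containing "ab":

```python
import numpy as np

# NFA over {a, b} accepting strings that contain the substring "ab".
# M[sym][i, j] = 1 iff state j is reachable from state i on symbol sym.
M = {
    "a": np.array([[1, 1, 0],    # q0 -a-> {q0, q1}
                   [0, 0, 0],    # q1 -a-> {}  (q0's self-loop re-spawns q1)
                   [0, 0, 1]]),  # q2 -a-> {q2}
    "b": np.array([[1, 0, 0],    # q0 -b-> {q0}
                   [0, 0, 1],    # q1 -b-> {q2}
                   [0, 0, 1]]),  # q2 -b-> {q2}
}
start = np.array([1, 0, 0])      # binary vector of currently-active states
accept = np.array([0, 0, 1])

def step(v, sym):
    """One shared thresholded update, v' = 1[v M_sym > 0]: the feedforward
    layer that the construction unrolls once per input symbol."""
    return (v @ M[sym] > 0).astype(int)

def accepts(s):
    v = start
    for sym in s:
        v = step(v, sym)
    return bool(v @ accept)

print([w for w in ["ab", "ba", "aab", "bb"] if accepts(w)])  # ['ab', 'aab']
```

The parameter count depends only on the automaton (state count and alphabet), not on input length, since the same matrices are reused at every unrolled step.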
[569] CAFL-L: Constraint-Aware Federated Learning with Lagrangian Dual Optimization for On-Device Language Models
Dongqi Zheng, Wenjin Fu
Main category: cs.LG
TL;DR: CAFL-L extends FedAvg with Lagrangian dual optimization to handle device resource constraints (energy, communication, memory, thermal) by dynamically adapting training hyperparameters while maintaining stability through gradient accumulation.
Details
Motivation: Standard federated learning approaches like FedAvg do not explicitly address device-level resource constraints, making them impractical for deployment on resource-constrained edge devices.
Method: Uses Lagrangian dual optimization to dynamically adapt training hyperparameters (freezing depth, local steps, batch size, communication compression) while preserving training stability through token-budget preservation via gradient accumulation.
Result: Achieves superior constraint satisfaction compared to standard FedAvg, reducing memory usage by 20% and communication by 95% while maintaining competitive validation performance.
Conclusion: CAFL-L is practical for deployment on resource-constrained edge devices by effectively managing resource constraints while preserving model performance.
Abstract: We introduce Constraint-Aware Federated Learning with Lagrangian Dual Optimization (CAFL-L), a principled extension of FedAvg that explicitly incorporates device-level resource constraints including energy, communication, memory, and thermal budgets. CAFL-L employs Lagrangian dual optimization to dynamically adapt training hyperparameters – freezing depth, local steps, batch size, and communication compression – while preserving training stability through token-budget preservation via gradient accumulation. Experiments on a character-level language model demonstrate that CAFL-L achieves superior constraint satisfaction compared to standard FedAvg (reducing memory usage by 20% and communication by 95%) while maintaining competitive validation performance, making it practical for deployment on resource-constrained edge devices.
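A toy version of the dual loop with a made-up telemetry model: multipliers rise while a budget is violated, and the training knobs respond. Names and the usage model are illustrative, not the paper's:

```python
budgets = {"energy": 1.0, "comm": 1.0, "memory": 1.0}  # normalized budgets
lam = {k: 0.0 for k in budgets}                        # Lagrangian multipliers
knobs = {"local_steps": 8, "batch_size": 32, "compress": 0.1}

def measured_usage(knobs):
    """Made-up device telemetry: usage grows with per-round work and shrinks
    with communication compression (an illustrative model only)."""
    work = knobs["local_steps"] * knobs["batch_size"] / 256
    return {"energy": work,
            "comm": (1 - knobs["compress"]) * 1.5,
            "memory": knobs["batch_size"] / 32}

eta = 0.5  # dual step size
for _ in range(20):
    usage = measured_usage(knobs)
    # Dual ascent: a multiplier grows while its constraint is violated.
    for k in budgets:
        lam[k] = max(0.0, lam[k] + eta * (usage[k] - budgets[k]))
    # Primal response: back off whichever knob the multipliers now price.
    if lam["comm"] > 0:
        knobs["compress"] = min(0.95, knobs["compress"] + 0.1)
    if lam["energy"] > 0 or lam["memory"] > 0:
        # A real run would also raise gradient-accumulation steps here to
        # keep the token budget constant, as CAFL-L does.
        knobs["local_steps"] = max(1, knobs["local_steps"] - 1)

print(lam, knobs, measured_usage(knobs))
```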
[570] High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization
Masih Aminbeidokhti, Heitor Rapela Medeiros, Srikanth Muralidharan, Eric Granger, Marco Pedersoli
Main category: cs.LG
TL;DR: Mixout is a regularization technique that swaps fine-tuned weights with pre-trained ones during training, achieving domain generalization comparable to ensembles while reducing computational costs.
Details
Motivation: Ensembling multiple fine-tuned models improves robustness under distribution shifts but is computationally expensive. Dropout over-regularizes pre-trained models, disrupting critical representations.
Method: Mixout probabilistically swaps fine-tuned weights with pre-trained counterparts during training, using high masking probabilities (0.9 for ViTs, 0.8 for ResNets) to maintain balance between adaptation and prior knowledge retention.
Result: High-rate Mixout achieves domain generalization accuracy comparable to ensemble methods while reducing gradient computation by up to 45% and gradient memory usage by up to 90% across five benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, DomainNet).
Conclusion: Mixout with high masking rates provides an effective and computationally efficient alternative to ensembles for domain generalization, balancing adaptation with preservation of pre-trained knowledge.
Abstract: Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
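A single Mixout draw, with the expectation-preserving rescaling from the original Mixout formulation (applied per training step; p = 0.9 is the high rate the paper recommends for ViTs):

```python
import torch

def mixout(w_ft, w_pre, p=0.9):
    """One stochastic Mixout draw: each coordinate of the fine-tuned tensor
    is swapped for its pre-trained counterpart with probability p, then the
    result is rescaled so its expectation matches w_ft."""
    mask = (torch.rand_like(w_ft) < p).float()   # 1 = use pre-trained weight
    mixed = mask * w_pre + (1 - mask) * w_ft
    # E[mixed] = p*w_pre + (1-p)*w_ft; undo the shift toward w_pre:
    return (mixed - p * w_pre) / (1 - p)

torch.manual_seed(0)
w_pre = torch.randn(4, 4)
w_ft = w_pre + 0.1 * torch.randn(4, 4)           # fine-tuned weights
draws = torch.stack([mixout(w_ft, w_pre) for _ in range(2000)])
print(torch.allclose(draws.mean(0), w_ft, atol=0.05))  # expectation ~ w_ft
```

Higher p both penalizes drift from the pre-trained parameters more strongly and means gradients are needed for fewer coordinates per step, which is where the reported compute and memory savings come from.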
[571] Prompt-Aware Scheduling for Low-Latency LLM Serving
Yiheng Tao, Yihe Zhang, Matthew T. Dearing, Xin Wang, Yuping Fan, Zhiling Lan
Main category: cs.LG
TL;DR: PARS is a prompt-aware LLM task scheduler that reduces latency by approximating shortest-job-first scheduling through pairwise ranking, integrated into vLLM serving system.
Details
Motivation: Traditional FCFS scheduling suffers from Head-of-Line blocking where long tasks delay shorter ones, leading to inefficient LLM inference with high latency.
Method: Uses pairwise ranking with margin ranking loss to predict response-length-based task ordering, focusing on impactful scheduling decisions and integrating with vLLM.
Result: Significantly improves performance across multiple LLMs and real-world datasets, including reasoning workloads, with minimal overhead and good cross-model generalization.
Conclusion: PARS effectively addresses HOL blocking in LLM inference scheduling, achieving better efficiency and latency reduction while generalizing well across different models.
Abstract: Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
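A rough sketch of the pairwise-ranking objective; the linear scorer, embedding size, and batch construction below are placeholders, not the paper's setup:

```python
import torch
import torch.nn as nn

# Toy scorer over fixed-size prompt embeddings (stand-in for a real encoder).
scorer = nn.Linear(64, 1)
loss_fn = nn.MarginRankingLoss(margin=1.0)

emb_a = torch.randn(32, 64)  # prompts whose true responses are longer
emb_b = torch.randn(32, 64)  # prompts whose true responses are shorter
s_a = scorer(emb_a).squeeze(-1)
s_b = scorer(emb_b).squeeze(-1)

# target = +1 asks the scorer to rank the longer-response prompt higher;
# at serving time, lower-scored (predicted-shorter) prompts are dispatched first.
loss = loss_fn(s_a, s_b, torch.ones_like(s_a))
loss.backward()
```

Ranking pairs rather than regressing exact lengths is what lets the scheduler focus on the ordering decisions that actually matter.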
[572] Training-free AI for Earth Observation Change Detection using Physics Aware Neuromorphic Networks
Stephen Smith, Cormac Purcell, Zdenka Kuncic
Main category: cs.LG
TL;DR: A Physics Aware Neuromorphic Network (PANN) is proposed for on-board satellite change detection in natural disasters, enabling data prioritization for downlinking without requiring training.
Details
Motivation: To overcome bandwidth limitations in satellite data downlinking by processing data on-board and prioritizing relevant disaster information for first responders.
Method: Uses a neuromorphic network inspired by memristor-based physical neural networks with dynamic weights that update according to physical laws, making it training-free.
Result: PANN achieved comparable or better performance than a state-of-the-art AI model across all natural disaster categories.
Conclusion: PANN presents a promising solution for resource-constrained on-board satellite processing of disaster data.
Abstract: Earth observations from low Earth orbit satellites provide vital information for decision makers to better manage time-sensitive events such as natural disasters. For the data to be most effective for first responders, low latency is required between data capture and its arrival to decision makers. A major bottleneck is in the bandwidth-limited downlinking of the data from satellites to ground stations. One approach to overcome this challenge is to process at least some of the data on-board and prioritise pertinent data to be downlinked. In this work we propose a Physics Aware Neuromorphic Network (PANN) to detect changes caused by natural disasters from a sequence of multi-spectral satellite images and produce a change map, enabling relevant data to be prioritised for downlinking. The PANN used in this study is motivated by physical neural networks comprised of nano-electronic circuit elements known as “memristors” (nonlinear resistors with memory). The weights in the network are dynamic and update in response to varying input signals according to memristor equations of state and electrical circuit conservation laws. The PANN thus generates physics-constrained dynamical output features which are used to detect changes in a natural disaster detection task by applying a distance-based metric. Importantly, this makes the whole model training-free, allowing it to be implemented with minimal computing resources. The PANN was benchmarked against a state-of-the-art AI model and achieved comparable or better results in each natural disaster category. It thus presents a promising solution to the challenge of resource-constrained on-board processing.
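The final decision step is a simple distance threshold. A sketch of that step alone, assuming per-pixel feature maps produced by some upstream dynamics (the PANN's memristor-derived features are not reproduced here):

```python
import numpy as np

def change_map(feat_before: np.ndarray, feat_after: np.ndarray, tau: float) -> np.ndarray:
    """Per-pixel Euclidean distance between (H, W, C) feature maps;
    flag a change wherever the distance exceeds the threshold tau."""
    dist = np.linalg.norm(feat_after - feat_before, axis=-1)
    return dist > tau
```

Because the features come from fixed physical dynamics rather than learned weights, this thresholding is the only "inference" required on-board.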
[573] NIMO: a Nonlinear Interpretable MOdel
Shijian Xu, Marcello Massimo Negri, Volker Roth
Main category: cs.LG
TL;DR: NIMO is a framework that combines neural networks’ expressive power with linear regression’s inherent interpretability, providing flexible and intelligible feature effects while maintaining good predictive performance.
Details
Motivation: Address the dilemma between complex neural networks' high performance but lack of interpretability versus simple linear models' interpretability but lower performance.
Method: Builds on linear regression and uses parameter elimination optimization with adaptive ridge regression to combine NN parameters and linear coefficients effectively, incorporating sparsity.
Result: Empirical results show the model provides faithful and intelligible feature effects while maintaining good predictive performance.
Conclusion: NIMO successfully bridges the gap between interpretability and performance by combining neural networks with inherent interpretability of linear models.
Abstract: Deep learning has achieved remarkable success across many domains, but it has also created a growing demand for interpretability in model predictions. Although many explainable machine learning methods have been proposed, post-hoc explanations lack guaranteed fidelity and are sensitive to hyperparameter choices, highlighting the appeal of inherently interpretable models. For example, linear regression provides clear feature effects through its coefficients. However, such models are often outperformed by more complex neural networks (NNs) that usually lack inherent interpretability. To address this dilemma, we introduce NIMO, a framework that combines inherent interpretability with the expressive power of neural networks. Building on the simple linear regression, NIMO is able to provide flexible and intelligible feature effects. Relevantly, we develop an optimization method based on parameter elimination, that allows for optimizing the NN parameters and linear coefficients effectively and efficiently. By relying on adaptive ridge regression we can easily incorporate sparsity as well. We show empirically that our model can provide faithful and intelligible feature effects while maintaining good predictive performance.
[574] Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
Main category: cs.LG
TL;DR: Reinforce-Ada is an adaptive sampling framework for RL post-training of LLMs that dynamically allocates sampling effort to prompts with highest uncertainty, improving convergence and performance over uniform sampling methods.
Details
Motivation: Standard RL for LLMs suffers from unstable gradient estimates due to uniform sampling across prompts, which doesn't account for varying learning potential and uncertainty.
Method: Online successive elimination process that interleaves estimation and sampling, automatically stopping sampling for prompts once sufficient signal is collected. Uses fixed-size groups with reward diversity and global advantage baselines.
Result: Empirical results show accelerated convergence and improved final performance compared to GRPO, especially with balanced sampling variant, across multiple model architectures and reasoning benchmarks.
Conclusion: Adaptive, variance-aware data curation is crucial for efficient and reliable reinforcement learning in reasoning-capable LLMs.
Abstract: Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs. Code is available at https://github.com/RLHFlow/Reinforce-Ada.
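A schematic of the successive-elimination loop; the stopping rule below (stop once both reward outcomes have been observed, so the group yields a nonzero advantage) is an illustrative stand-in for the paper's criterion, and all function arguments are placeholders:

```python
def adaptive_sample(prompts, generate, reward, max_rounds=4, group_size=4):
    """Successive elimination: keep sampling only the prompts whose
    reward signal is still uninformative."""
    active = list(prompts)
    responses = {p: [] for p in prompts}
    for _ in range(max_rounds):
        still_active = []
        for p in active:
            responses[p] += [generate(p) for _ in range(group_size)]
            rewards = [reward(p, r) for r in responses[p]]
            # Illustrative stopping rule: both correct and incorrect
            # responses seen, so an advantage estimate is nonzero.
            if len(set(rewards)) <= 1:
                still_active.append(p)
        active = still_active
        if not active:
            break
    return responses
```

Prompts that are trivially solved (all rewards 1) or hopeless (all rewards 0) drop out early, concentrating the sampling budget on the uncertain middle.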
[575] Multi-Scale Finetuning for Encoder-based Time Series Foundation Models
Zhongzheng Qiao, Chenghao Liu, Yiming Zhang, Ming Jin, Quang Pham, Qingsong Wen, P. N. Suganthan, Xudong Jiang, Savitha Ramasamy
Main category: cs.LG
TL;DR: MSFT is a multi-scale finetuning framework for time series foundation models that addresses overfitting and suboptimal performance in naive finetuning by explicitly modeling multiple temporal scales.
Details
Motivation: Current finetuning approaches for time series foundation models fail to fully leverage their multi-scale forecasting capabilities, often leading to overfitting and suboptimal performance due to diverse temporal patterns across different sampling scales.
Method: Proposed Multiscale finetuning (MSFT) framework that explicitly integrates multi-scale modeling into the finetuning process for encoder-based time series foundation models, using a causal perspective to analyze the finetuning process.
Result: Experimental results on three backbones (Moirai, Moment and Units) show that MSFT outperforms naive finetuning, parameter efficient finetuning methods, and state-of-the-art deep learning methods.
Conclusion: MSFT provides an effective finetuning framework that better leverages the multi-scale capabilities of time series foundation models, achieving superior performance across different model architectures.
Abstract: Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. However, an important yet underexplored challenge is how to effectively finetune TSFMs on specific downstream tasks. While naive finetuning can yield performance gains, we argue that it falls short of fully leveraging TSFMs’ capabilities, often resulting in overfitting and suboptimal performance. Given the diverse temporal patterns across sampling scales and the inherent multi-scale forecasting capabilities of TSFMs, we adopt a causal perspective to analyze finetuning process, through which we highlight the critical importance of explicitly modeling multiple scales and reveal the shortcomings of naive approaches. Focusing on encoder-based TSFMs, we propose Multiscale finetuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process. Experimental results on three different backbones (Moirai, Moment and Units) demonstrate that TSFMs finetuned with MSFT not only outperform naive and typical parameter efficient finetuning methods but also surpass state-of-the-art deep learning methods. Codes are available at https://github.com/zqiao11/MSFT.
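One way to picture the multi-scale ingredient: build average-pooled views of a series at several sampling scales. This is a generic sketch; how MSFT wires these views into the finetuning objective is specific to the paper:

```python
import numpy as np

def multi_scale_views(x: np.ndarray, scales=(1, 2, 4)) -> list[np.ndarray]:
    """Downsample a 1-D series by average pooling at several temporal scales,
    truncating any remainder so the series reshapes cleanly."""
    return [x[: len(x) - len(x) % s].reshape(-1, s).mean(axis=1) for s in scales]
```

Each view exposes temporal patterns at a different granularity, which is the structure naive single-scale finetuning leaves on the table.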
[576] Untangling Component Imbalance in Hybrid Linear Attention Conversion Methods
Martin Benfeghoul, Teresa Delgado, Adnan Oomerjee, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas
Main category: cs.LG
TL;DR: Existing hybrid linear attention methods inadvertently bypass the linear component, relying mainly on sliding-window softmax. The paper proposes three solutions to ensure balanced component usage while maintaining computational efficiency.
Details
Motivation: Transformers' quadratic complexity limits scalability, and while linear attention reduces this to linear complexity, pre-training such models is expensive. Existing post-training linearisation methods have a critical flaw where they bypass the linear component.
Method: Three solutions: (1) inference-time hybridisation of linear-only conversions with sliding-window softmax, (2) HedgeCATs combining attention-weight transfer with targeted LoRA fine-tuning, and (3) Scheduled Sliding-window Dropout (SSD) that stochastically suppresses softmax during training.
Result: The methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption.
Conclusion: The proposed solutions restore the validity of performance attributions in hybrid conversions by preventing component collapse and ensuring balanced usage of linear attention components.
Abstract: Transformers’ quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax. We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, combining attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most base model performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
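A sketch of the SSD idea: stochastically drop the softmax branch so the linear branch is forced to carry signal. The annealing schedule and the additive combination of branches below are assumptions for illustration, not the paper's exact design:

```python
import torch

def hybrid_attention(x, linear_attn, swa_attn, step: int, total_steps: int):
    """Hybrid layer whose sliding-window softmax branch is stochastically
    suppressed during training (probability annealed over the run)."""
    p_drop = 0.5 * (1 - step / total_steps)  # illustrative schedule
    out = linear_attn(x)
    if torch.rand(()) > p_drop:
        out = out + swa_attn(x)
    return out
```

Without some such suppression, the diagnostics in the paper show gradients flow almost entirely through the softmax branch and the linear component collapses into dead weight.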
[577] Spatiotemporal Forecasting as Planning: A Model-Based Reinforcement Learning Approach with Generative World Models
Hao Wu, Yuan Gao, Xingjian Shi, Shuaipeng Li, Fan Xu, Fan Zhang, Zhihong Zhu, Weiyan Wang, Xiao Luo, Kun Wang, Xian Wu, Xiaomeng Huang
Main category: cs.LG
TL;DR: SFP is a new paradigm that treats spatiotemporal forecasting as a planning problem using model-based reinforcement learning, addressing stochasticity and non-differentiable metrics through generative world modeling and beam search planning.
Details
Motivation: To overcome the challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, which traditional methods struggle with.
Method: Proposes Spatiotemporal Forecasting as Planning (SFP) using a Generative World Model for simulation, beam search-based planning algorithm that uses non-differentiable metrics as rewards, and iterative self-training with high-reward sequences as pseudo-labels.
Result: Significantly reduces prediction error and demonstrates exceptional performance on critical domain metrics, particularly in capturing extreme events.
Conclusion: SFP provides an effective framework for spatiotemporal forecasting that successfully handles stochasticity and non-differentiable metrics through reinforcement learning and planning approaches.
Abstract: To address the dual challenges of inherent stochasticity and non-differentiable metrics in physical spatiotemporal forecasting, we propose Spatiotemporal Forecasting as Planning (SFP), a new paradigm grounded in Model-Based Reinforcement Learning. SFP constructs a novel Generative World Model to simulate diverse, high-fidelity future states, enabling an “imagination-based” environmental simulation. Within this framework, a base forecasting model acts as an agent, guided by a beam search-based planning algorithm that leverages non-differentiable domain metrics as reward signals to explore high-return future sequences. These identified high-reward candidates then serve as pseudo-labels to continuously optimize the agent’s policy through iterative self-training, significantly reducing prediction error and demonstrating exceptional performance on critical domain metrics like capturing extreme events.
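A generic sketch of the planning loop, with the world model, reward, and budget parameters as placeholders; SFP's actual model and metrics are domain-specific:

```python
def plan(world_model, reward, state, beam_width=8, horizon=4, samples=32):
    """Beam search over imagined futures: sample next states from the world
    model, score whole sequences with a (non-differentiable) domain metric,
    and keep the top-scoring beams."""
    beams = [([], state, 0.0)]
    for _ in range(horizon):
        candidates = []
        for seq, s, _ in beams:
            for _ in range(max(1, samples // len(beams))):
                nxt = world_model(s)          # one imagined next state
                r = reward(seq + [nxt])       # reward needs no gradients
                candidates.append((seq + [nxt], nxt, r))
        beams = sorted(candidates, key=lambda c: c[2], reverse=True)[:beam_width]
    return beams[0][0]  # highest-reward sequence -> pseudo-label for self-training
```

Because the metric only needs to score candidates, not backpropagate, this is how SFP sidesteps non-differentiability.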
[578] Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
Marco Jiralerspong, Esther Derman, Danilo Vucetic, Nikolay Malkin, Bilun Sun, Tianyu Zhang, Pierre-Luc Bacon, Gauthier Gidel
Main category: cs.LG
TL;DR: The paper introduces TGM, a novel RL algorithm that addresses the issue of overly diverse candidate generation in large search spaces by combining regularized RL operators into a unified framework that targets peakier sampling distributions.
Details
Motivation: Existing RL methods for narrowing large candidate sets (like proteins or molecules) often generate overly diverse, suboptimal candidates in large search spaces, failing to effectively filter for high-quality candidates with desirable properties.
Method: The authors propose a unified operator framework that combines several regularized RL operators to target peakier sampling distributions, and introduce TGM (trajectory general mellowmax) algorithm based on a robust RL perspective that treats regularization as robustness to compositional uncertainty in proxy functions.
Result: TGM identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks, demonstrating improved performance in filtering large candidate sets.
Conclusion: The proposed TGM algorithm effectively addresses the limitations of existing methods by providing a robust RL framework that better balances diversity and quality in candidate generation for scientific discovery applications.
Abstract: A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy’s evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
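The mellowmax operator at the core of TGM interpolates between a mean and a max; a small sketch of the standard operator (how TGM applies it along trajectories is beyond this snippet):

```python
import math

def mellowmax(values, omega: float) -> float:
    """Mellowmax: (1/omega) * log(mean(exp(omega * v))). Assumes omega > 0;
    approaches max(values) as omega -> inf and mean(values) as omega -> 0."""
    m = max(values)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(omega * (v - m)) for v in values) / len(values)) / omega
```

Larger omega yields the "peakier" target distributions the paper argues are needed in large search spaces, while still retaining some diversity.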
[579] Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Haiquan Qiu, Quanming Yao
Main category: cs.LG
TL;DR: The paper explains why low-precision training with flash attention fails due to similar low-rank representations and biased rounding errors, and proposes a simple fix to stabilize training.
Details
Motivation: To understand and solve the persistent problem of catastrophic loss explosion when training transformer models with flash attention in low-precision settings.
Method: Conducted mechanistic analysis of training failures, identified two key phenomena (similar low-rank representations and biased rounding errors), and introduced a minimal modification to flash attention to mitigate rounding bias.
Result: The analysis revealed that the failure is systematic, caused by a vicious cycle of error accumulation corrupting weight updates. The proposed modification successfully stabilized training.
Conclusion: The paper provides the first mechanistic explanation for low-precision training failures with flash attention and offers a practical solution through a simple modification that addresses biased rounding errors.
Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.
[580] Towards the Training of Deeper Predictive Coding Neural Networks
Chang Qi, Matteo Forasassi, Thomas Lukasiewicz, Tommaso Salvatori
Main category: cs.LG
TL;DR: Predictive coding networks suffer performance degradation in deep architectures due to imbalanced error propagation. The paper proposes three solutions: precision-weighted optimization, novel weight updates, and auxiliary neurons for residuals, achieving backpropagation-comparable performance on deep ResNets.
Details
Motivation: Predictive coding networks perform well in shallow architectures but degrade significantly beyond 5-7 layers due to exponentially imbalanced errors between layers, ineffective predictions in deeper layers, and energy propagation issues in skip connections.
Method: Three novel methods: 1) Precision-weighted optimization of latent variables to balance error distributions during relaxation, 2) New weight update mechanism to reduce error accumulation in deeper layers, 3) Auxiliary neurons to slow down energy propagation in residual connections.
Result: Empirically achieves performance comparable to backpropagation on deep models like ResNets, enabling predictive coding to work effectively in complex tasks with deep architectures.
Conclusion: The proposed methods successfully address the limitations of predictive coding in deep networks, opening new possibilities for using predictive coding in complex tasks that require deep architectures.
Abstract: Predictive coding networks are neural models that perform inference through an iterative energy minimization process, whose operations are local in space and time. While effective in shallow architectures, they suffer significant performance degradation beyond five to seven layers. In this work, we show that this degradation is caused by exponentially imbalanced errors between layers during weight updates, and by predictions from the previous layers not being effective in guiding updates in deeper layers. Furthermore, when training models with skip connections, the energy propagated by the residuals reaches higher layers faster than that propagated by the main pathway, affecting test accuracy. We address the first issue by introducing a novel precision-weighted optimization of latent variables that balances error distributions during the relaxation phase, the second issue by proposing a novel weight update mechanism that reduces error accumulation in deeper layers, and the third one by using auxiliary neurons that slow down the propagation of the energy in the residual connections. Empirically, our methods achieve performance comparable to backpropagation on deep models such as ResNets, opening new possibilities for predictive coding in complex tasks.
[581] MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering
Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He
Main category: cs.LG
TL;DR: MLLMEraser is a training-free framework for test-time unlearning in multimodal LLMs that uses activation steering to erase specific knowledge without parameter updates, outperforming existing methods.
Details
Motivation: Address concerns about memorized private data, outdated knowledge, and harmful content in deployed MLLMs, while avoiding computational costs and knowledge distortion of training-based unlearning approaches.
Method: Uses activation steering with multimodal erasure directions constructed from contrastive analysis of knowledge-recall vs knowledge-erasure image-text pairs, with input-aware adaptive steering to preserve utility.
Result: Outperforms state-of-the-art MLLM unlearning baselines on LLaVA-1.5 and Qwen-2.5-VL, achieving stronger forgetting with lower computational cost and minimal utility degradation.
Conclusion: MLLMEraser provides an effective, efficient training-free solution for dynamic knowledge erasure in multimodal LLMs, enabling safer deployment while preserving retained knowledge utility.
Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.
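A minimal sketch of contrastive activation steering, assuming cached hidden states for recall/erasure pairs; the input-aware gating that decides when the direction is applied is omitted:

```python
import torch

def erasure_direction(h_recall: torch.Tensor, h_erase: torch.Tensor) -> torch.Tensor:
    """Steering vector from (N, d) hidden states of knowledge-recall vs.
    knowledge-erasure pairs: the normalized difference of their means."""
    d = h_erase.mean(dim=0) - h_recall.mean(dim=0)
    return d / d.norm()

def steer(hidden: torch.Tensor, d: torch.Tensor, alpha: float) -> torch.Tensor:
    # Applied at inference time only; no parameter updates, hence reversible.
    return hidden + alpha * d
```

Because the edit lives entirely in activation space, removing the steering hook restores the original model, unlike gradient-ascent unlearning.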
[582] Unraveling the Black Box of Neural Networks: A Dynamic Extremum Mapper
Shengjian Chen
Main category: cs.LG
TL;DR: Neural networks are not black boxes and generalize by mapping datasets to model extrema. The number of extrema correlates with parameter count. A new linear equation-based algorithm is proposed as an alternative to backpropagation, addressing gradient vanishing and overfitting.
Details
Motivation: To challenge the black box perception of neural networks and provide a mathematical framework explaining their generalization capabilities, while addressing limitations of backpropagation like gradient vanishing and overfitting.
Method: Proposes a new algorithm that solves systems of linear equations to obtain parameter values, rather than using backpropagation. The approach leverages the relationship between network extrema and parameters.
Result: The method provides a mathematical explanation for neural network generalization and offers an alternative to backpropagation that can handle difficult training situations.
Conclusion: Neural networks’ generalization can be mathematically explained through extrema mapping, and the proposed linear equation-based algorithm provides a viable alternative to backpropagation with better handling of gradient vanishing and overfitting issues.
Abstract: We point out that neural networks are not black boxes, and their generalization stems from the ability to dynamically map a dataset to the extrema of the model function. We further prove that the number of extrema in a neural network is positively correlated with the number of its parameters. We then propose a new algorithm, significantly different from the back-propagation algorithm, which obtains parameter values mainly by solving a system of linear equations. Some difficult situations, such as gradient vanishing and overfitting, can be simply explained and dealt with in this framework.
[583] Revealing Interconnections between Diseases: from Statistical Methods to Large Language Models
Alina Ermilova, Dmitrii Kornilov, Sofia Samoilova, Ekaterina Laptenkova, Anastasia Kolesnikova, Ekaterina Podplutova, Senotrusova Sofya, Maksim G. Sharaev
Main category: cs.LG
TL;DR: Systematic evaluation of 7 approaches for uncovering disease relationships using EHR data and ICD-10 codes shows LLMs produce less diverse interconnections compared to domain-specific methods, suggesting limited potential for discovering new disease connections.
Details
Motivation: Manual analysis of clinical data for disease interconnections is labor-intensive and subjective, while existing ML approaches face challenges in method selection, data source reliability, and lack of ground truth for unexplored disease relationships.
Method: Evaluated 7 approaches: statistical co-occurrence analysis, masked language modeling, domain-specific BERT variants (Med-BERT, BioClinicalBERT), general-purpose BERT with document retrieval, and 4 LLMs (Mistral, DeepSeek, Qwen, YandexGPT) using MIMIC-IV EHR data and ICD-10 codes with/without descriptions.
Result: LLM-based approaches produced interconnections with the lowest diversity of ICD code connections compared to text-based and domain-based methods, indicating limited potential for discovering new disease interconnections.
Conclusion: In absence of ground truth databases, the results provide a valuable medical disease ontology that can serve as a foundational resource for future clinical research and AI applications in healthcare.
Abstract: Identifying disease interconnections through manual analysis of large-scale clinical data is labor-intensive, subjective, and prone to expert disagreement. While machine learning (ML) shows promise, three critical challenges remain: (1) selecting optimal methods from the vast ML landscape, (2) determining whether real-world clinical data (e.g., electronic health records, EHRs) or structured disease descriptions yield more reliable insights, (3) the lack of “ground truth,” as some disease interconnections remain unexplored in medicine. Large language models (LLMs) demonstrate broad utility, yet they often lack specialized medical knowledge. To address these gaps, we conduct a systematic evaluation of seven approaches for uncovering disease relationships based on two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the full set of ICD-10 codes, both with and without textual descriptions. Our framework integrates the following: (i) a statistical co-occurrence analysis and a masked language modeling (MLM) approach using real clinical data; (ii) domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral, DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the obtained interconnection matrices shows that the LLM-based approach produces interconnections with the lowest diversity of ICD code connections to different diseases compared to other methods, including text-based and domain-based approaches. This suggests an important implication: LLMs have limited potential for discovering new interconnections. In the absence of ground truth databases for medical interconnections between ICD codes, our results constitute a valuable medical disease ontology that can serve as a foundational resource for future clinical research and artificial intelligence applications in healthcare.
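Of the seven approaches, the statistical baseline is the most compact to illustrate. A sketch of pairwise ICD-code co-occurrence counting over patient records (codes in the example are arbitrary):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(patient_codes: list[list[str]]) -> Counter:
    """Count how often each unordered pair of ICD-10 codes appears
    together in the same patient record."""
    pair_counts = Counter()
    for codes in patient_codes:
        for a, b in combinations(sorted(set(codes)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# cooccurrence([["I10", "E11"], ["I10", "N18", "E11"]])[("E11", "I10")] == 2
```

The paper's graph-based comparison builds interconnection matrices of exactly this shape from each method, which is what makes the LLM outputs directly comparable to the statistical baseline.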
[584] Transferable Parasitic Estimation via Graph Contrastive Learning and Label Rebalancing in AMS Circuits
Shan Shen, Shenglu Hua, Jiajun Zou, Jiawei Liu, Jianwang Zhai, Chuan Shi, Wenjian Yu
Main category: cs.LG
TL;DR: CircuitGCL is a graph contrastive learning framework for AMS circuits that addresses data scarcity and label imbalance through hyperspherical representation scattering and balanced loss functions, achieving state-of-the-art performance in parasitic estimation tasks.
Details
Motivation: Address challenges in graph representation learning for AMS circuits including scarcity of design data, unbalanced label distribution, and diversity of circuit implementations to enable robust and transferable circuit representations.
Method: Proposes CircuitGCL framework with self-supervised topology-invariant node embeddings via hyperspherical representation scattering, and introduces balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses to handle label distribution disparities.
Result: Outperforms all state-of-the-art methods on TSMC 28nm AMS designs: achieves 33.64%-44.20% R² improvement for edge regression (parasitic capacitance estimation) and 0.9×-2.1× F1-score gain for node classification (ground capacitance classification).
Conclusion: CircuitGCL effectively addresses data scarcity and label imbalance in AMS circuit representation learning, demonstrating superior transferability and performance across heterogeneous circuit graphs for parasitic estimation tasks.
Abstract: Graph representation learning on Analog-Mixed Signal (AMS) circuits is crucial for various downstream tasks, e.g., parasitic estimation. However, the scarcity of design data, the unbalanced distribution of labels, and the inherent diversity of circuit implementations pose significant challenges to learning robust and transferable circuit representations. To address these limitations, we propose CircuitGCL, a novel graph contrastive learning framework that integrates representation scattering and label rebalancing to enhance transferability across heterogeneous circuit graphs. CircuitGCL employs a self-supervised strategy to learn topology-invariant node embeddings through hyperspherical representation scattering, eliminating dependency on large-scale data. Simultaneously, balanced mean squared error (BMSE) and balanced softmax cross-entropy (BSCE) losses are introduced to mitigate label distribution disparities between circuits, enabling robust and transferable parasitic estimation. Evaluated on parasitic capacitance estimation (edge-level task) and ground capacitance classification (node-level task) across TSMC 28nm AMS designs, CircuitGCL outperforms all state-of-the-art (SOTA) methods, with an $R^2$ improvement of $33.64\% \sim 44.20\%$ for edge regression and an F1-score gain of $0.9\times \sim 2.1\times$ for node classification. Our code is available at https://github.com/ShenShan123/CircuitGCL.
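For the label-rebalancing piece, here is a sketch of a balanced softmax cross-entropy in the style of Ren et al. (logits shifted by log class frequency); CircuitGCL's exact BSCE and BMSE formulations may differ:

```python
import torch
import torch.nn.functional as F

def balanced_softmax_ce(logits: torch.Tensor, targets: torch.Tensor,
                        class_counts: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with logits shifted by log class frequency, so that
    head classes do not swamp the gradient signal of rare classes."""
    adjusted = logits + torch.log(class_counts.float())
    return F.cross_entropy(adjusted, targets)
```

The shift bakes the label prior into the loss, which is what lets the same trained head transfer across circuits with very different label distributions.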
[585] Networked Information Aggregation via Machine Learning
Michael Kearns, Aaron Roth, Emily Ryu
Main category: cs.LG
TL;DR: Distributed learning in DAGs where agents observe subsets of features and parent predictions, achieving information aggregation when DAG depth is sufficient.
Details
Motivation: To understand when sequential learning in DAGs can aggregate information across agents with partial feature access to compete with centralized models.
Method: Agents learn sequentially in topological order, using their observed features and parent predictions as inputs, with theoretical analysis of linear and general hypothesis classes.
Result: DAG depth is crucial: information aggregation occurs over sufficiently long paths with well-represented features, but fails in shallow DAGs like hub-and-spokes topologies.
Conclusion: Information aggregation in distributed DAG learning depends critically on network depth, with sufficient depth enabling competitive performance despite partial feature access.
Abstract: We study a distributed learning problem in which learning agents are embedded in a directed acyclic graph (DAG). There is a fixed and arbitrary distribution over feature/label pairs, and each agent or vertex in the graph is able to directly observe only a subset of the features – potentially a different subset for every agent. The agents learn sequentially in some order consistent with a topological sort of the DAG, committing to a model mapping observations to predictions of the real-valued label. Each agent observes the predictions of their parents in the DAG, and trains their model using both the features of the instance that they directly observe, and the predictions of their parents as additional features. We ask when this process is sufficient to achieve \emph{information aggregation}, in the sense that some agent in the DAG is able to learn a model whose error is competitive with the best model that could have been learned (in some hypothesis class) with direct access to \emph{all} features, despite the fact that no single agent in the network has such access. We give upper and lower bounds for this problem for both linear and general hypothesis classes. Our results identify the \emph{depth} of the DAG as the key parameter: information aggregation can occur over sufficiently long paths in the DAG, assuming that all of the relevant features are well represented along the path, and there are distributions over which information aggregation cannot occur even in the linear case, and even in arbitrarily large DAGs that do not have sufficient depth (such as a hub-and-spokes topology in which the spoke vertices collectively see all the features). We complement our theoretical results with a comprehensive set of experiments.
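A toy version of the chain case, assuming a linear hypothesis class: each agent regresses the label on its own feature plus its parent's prediction, and the deepest agent approaches the centralized fit even though no agent sees all features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Chain DAG with three agents, each observing one of three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

parent_pred = np.zeros((1000, 0))  # the first agent has no parent
for j in range(3):
    inputs = np.hstack([X[:, [j]], parent_pred])
    model = LinearRegression().fit(inputs, y)
    parent_pred = model.predict(inputs).reshape(-1, 1)
# parent_pred now holds the last agent's predictions; its error approaches
# that of a single regression on all three features as the chain deepens.
```

A hub-and-spokes topology breaks this: the hub receives all the spoke predictions at depth one, which the paper shows is not enough for aggregation even in the linear case.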
[586] Fewer Weights, More Problems: A Practical Attack on LLM Pruning
Kazuki Egashira, Robin Staab, Thibaud Gloaguen, Mark Vero, Martin Vechev
Main category: cs.LG
TL;DR: Model pruning can be maliciously exploited to create backdoored LLMs that appear benign but exhibit malicious behaviors after pruning.
Details
Motivation: To investigate the security implications of model pruning, which has become a prominent approach for reducing LLM memory footprint but whose security risks remain underexplored.
Method: Adversaries compute a proxy metric to estimate parameter pruning likelihood, inject malicious behavior into unlikely-to-be-pruned parameters, and repair the model using likely-to-be-pruned parameters to hide the behavior in unpruned models.
Result: After pruning with vLLM methods (Magnitude, Wanda, SparseGPT), models consistently exhibit strong malicious behaviors with success rates up to 95.7% for jailbreak, 98.7% for benign instruction refusal, and 99.5% for targeted content injection.
Conclusion: The work reveals a critical deployment-time security gap in model compression and underscores the urgent need for stronger security awareness in this area.
Abstract: Model pruning, i.e., removing a subset of model weights, has become a prominent approach to reducing the memory footprint of large language models (LLMs) during inference. Notably, popular inference engines, such as vLLM, enable users to conveniently prune downloaded models before they are deployed. While the utility and efficiency of pruning methods have improved significantly, the security implications of pruning remain underexplored. In this work, for the first time, we show that modern LLM pruning methods can be maliciously exploited. In particular, an adversary can construct a model that appears benign yet, once pruned, exhibits malicious behaviors. Our method is based on the idea that the adversary can compute a proxy metric that estimates how likely each parameter is to be pruned. With this information, the adversary can first inject a malicious behavior into those parameters that are unlikely to be pruned. Then, they can repair the model by using parameters that are likely to be pruned, effectively canceling out the injected behavior in the unpruned model. We demonstrate the severity of our attack through extensive evaluation on five models; after any of the pruning methods available in vLLM (Magnitude, Wanda, or SparseGPT) is applied, the model consistently exhibits strong malicious behaviors in a diverse set of attack scenarios (success rates of up to $95.7\%$ for jailbreak, $98.7\%$ for benign instruction refusal, and $99.5\%$ for targeted content injection). Our results reveal a critical deployment-time security gap and underscore the urgent need for stronger security awareness in model compression.
[587] ROC-n-reroll: How verifier imperfection affects test-time scaling
Florian E. Dorner, Yatong Chen, André F. Cruz, Fanny Yang
Main category: cs.LG
TL;DR: The paper provides theoretical analysis showing that test-time scaling methods like Best-of-N and Rejection Sampling are characterized by the verifier’s ROC curve geometry, with RS outperforming BoN at fixed compute but both converging to the same accuracy with infinite compute.
Details
Motivation: To address the lack of theoretical understanding of how verifier imperfection affects test-time scaling performance in language models.
Method: Theoretical analysis proving that instance-level accuracy is characterized by the geometry of the verifier's ROC curve, with experimental validation using Qwen and LLama models on GSM8K and MATH500 datasets.
Result: Rejection Sampling outperforms Best-of-N for fixed compute, both methods converge to the same accuracy in the infinite-compute limit, and high-compute performance cannot be predicted from low-compute observations.
Conclusion: The geometry of the verifier’s ROC curve precisely characterizes test-time scaling performance, revealing important practical implications for method selection and performance prediction.
Abstract: Test-time scaling aims to improve language model performance by leveraging additional compute during inference. Many works have empirically studied techniques such as Best-of-N (BoN) and Rejection Sampling (RS) that make use of a verifier to enable test-time scaling. However, to date there is little theoretical understanding of how verifier imperfection affects performance – a gap we address in this work. Specifically, we prove that the instance-level accuracy of these methods is precisely characterized by the geometry of the verifier’s ROC curve. Our theory has two important takeaways, confirmed by experiments with Qwen and LLama models on GSM8K and MATH500. First, RS outperforms BoN for fixed compute, while both methods converge to the same accuracy in the infinite-compute limit. Second, it is generally impossible to predict the high-compute performance of either method based on observations in the low-compute regime.
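The two procedures under analysis, in generic form (`generate`, `verifier_score`, and `verifier_accept` are placeholders for a sampler and an imperfect verifier, not API names from the paper):

```python
def best_of_n(prompt, generate, verifier_score, n: int):
    """BoN: draw n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def rejection_sample(prompt, generate, verifier_accept, max_tries: int):
    """RS: keep drawing until the (imperfect) verifier accepts,
    falling back to the last draw if the budget runs out."""
    last = None
    for _ in range(max_tries):
        last = generate(prompt)
        if verifier_accept(last):
            return last
    return last
```

The paper's result is that, for both procedures, accuracy as a function of compute is pinned down by where the verifier sits on its ROC curve, with RS ahead at any fixed budget.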
[588] Synthetic Series-Symbol Data Generation for Time Series Foundation Models
Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang
Main category: cs.LG
TL;DR: SymTime is a foundation model for time series analysis that uses series-symbol data generation to overcome data scarcity issues, achieving competitive performance across five major TSA tasks.
Details
Motivation: To address challenges in time series foundation models including training data scarcity and imbalance, inspired by complex dynamic system theories.
Method: Developed a series-symbol data generation mechanism to create unlimited high-quality time series data paired with symbolic expressions, and built SymTime as a pre-trained foundation model leveraging these correlated series-symbol pairs.
Result: SymTime demonstrates competitive performance across five major time series analysis tasks when fine-tuned, rivaling foundation models pre-trained on real-world datasets.
Conclusion: The approach shows the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance in time series analysis.
Abstract: Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as training data scarcity and imbalance continue to hinder their development. Inspired by complex dynamic system theories, we design a series-symbol data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic expressions. To leverage series-symbol data pairs with strong correlations, we develop SymTime, a pre-trained foundation model for enhancing time series representation using symbolic information. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of series-symbol data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance. The code is available at https://github.com/wwhenxuan/SymTime.
[589] Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning
Tolga Dimlioglu, Anna Choromanska
Main category: cs.LG
TL;DR: The paper introduces DPPF, a distributed training algorithm that improves communication efficiency and model generalization by encouraging workers to find flatter minima through a pull-push mechanism.
Details
Motivation: To improve the trade-off between communication efficiency and model performance in distributed DNN training, addressing limitations of local gradient methods and leveraging the flat-minima hypothesis for better generalization.
Method: Proposes Distributed Pull-Push Force (DPPF) algorithm that incorporates a sharpness measure (Inverse Mean Valley) as a lightweight regularizer, creating a push force that counteracts consensus pulling to collaboratively seek wide minima.
Result: DPPF outperforms other communication-efficient approaches, achieves better generalization than local gradient methods and synchronous gradient averaging while maintaining communication efficiency, and successfully locates flatter minima as confirmed by loss landscape visualizations.
Conclusion: DPPF effectively guides workers to span flat valleys with self-stabilizing dynamics, provides theoretical generalization guarantees linked to valley width, and proves convergence in non-convex settings, offering a practical solution for communication-efficient distributed training with improved generalization.
Abstract: We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while maintaining communication efficiency. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
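Schematically, each worker update combines a gradient step, a consensus pull toward the average model, and an opposing push that keeps workers spread across a valley. The exact form of the push below is an illustrative guess, not the paper's Inverse Mean Valley relaxation:

```python
import torch

def dppf_step(w: torch.Tensor, w_avg: torch.Tensor, grad: torch.Tensor,
              lr: float, pull: float, push: float) -> torch.Tensor:
    """One illustrative worker update: gradient descent, a pull toward the
    consensus point, and a push away from it that resists full collapse."""
    direction = w - w_avg
    return w - lr * grad - pull * direction + push * direction / (direction.norm() + 1e-12)
```

The theory ties the width of the valley the workers end up spanning to the balance between the pull and push strengths.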
[590] On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Joe Suk, Yaqi Duan
Main category: cs.LG
TL;DR: This paper provides a theoretical foundation for RLVR (Reinforcement Learning with Verifiable Rewards), analyzing why it works by examining the training process at both trajectory and token levels through the concept of Gradient Gap.
Details
Motivation: RLVR has shown empirical success using simple binary feedback to post-train large language models, but there has been a lack of principled understanding of why it works.
Method: The authors analyze RLVR's training process at full-response (trajectory) and token levels, introducing the Gradient Gap concept to formalize improvement direction from low-reward to high-reward regions. They prove convergence depends on aligning update direction with this Gradient Gap and derive step-size thresholds.
Result: The theory predicts a sharp step-size threshold: below it learning converges, above it performance collapses. It explains why practical heuristics like length normalization improve stability and shows that with fixed learning rate, success rate can stagnate below 100%. These predictions are validated through bandit simulations.
Conclusion: The paper establishes a theoretical foundation for RLVR, explaining its empirical success through the Gradient Gap concept and providing principled insights into convergence conditions and practical implementation considerations.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has shown significant empirical success. However, a principled understanding of why it works has been lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. We validate these predictions through controlled bandit simulations.
[591] Prototype Learning to Create Refined Interpretable Digital Phenotypes from ECGs
Sahil Sethi, David Chen, Michael C. Burkhart, Nipun Bhandari, Bashar Ramadan, Brett Beaulieu-Jones
Main category: cs.LG
TL;DR: Prototype-based neural networks trained for ECG classification capture clinically meaningful patterns that transfer to external populations, showing strong associations with hospital discharge diagnoses beyond their original training objectives.
Details
Motivation: To determine whether prototype-based models, trained solely for classification, capture underlying clinical structure that aligns with broader clinical phenotypes when applied to external populations.
Method: Used a prototype-based deep learning model trained for multi-label ECG classification on PTB-XL dataset, then performed inference on MIMIC-IV clinical database without modification. Assessed associations between individual prototypes and hospital discharge diagnoses (phecodes).
Result: Individual prototypes showed significantly stronger and more specific associations with clinical outcomes compared to class predictions, NLP-extracted concepts, or broader prototype classes. Prototypes achieved strong predictive performance (AUCs 0.89-0.91) for cardiac conditions and showed substantial signal for non-cardiac conditions like sepsis and renal disease.
Conclusion: Prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.
Abstract: Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification using the PTB-XL dataset. Then without modification we performed inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier’s class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p $<$ 0.0001), indicating the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.
[592] Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
Yuichi Kondo, Hideaki Iiduka
Main category: cs.LG
TL;DR: SGDM with dynamic learning-rate and batch-size schedules analyzed using novel Lyapunov function, showing increasing batch size guarantees convergence while constant batch size doesn’t, and simultaneous increase of both achieves faster decay.
Details
Motivation: To understand the convergence behavior of stochastic gradient descent with momentum under practical dynamic scheduling strategies commonly used in deep learning.
Method: Introduced a novel Lyapunov function to extend the existing theoretical framework, and analyzed three scheduling strategies: constant batch size with decaying learning rate, increasing batch size with decaying learning rate, and increasing batch size with increasing learning rate.
Result: Clear hierarchy in convergence: constant batch size doesn’t guarantee convergence of expected gradient norm, increasing batch size does, and simultaneously increasing both achieves provably faster decay. Empirical results validate theory.
Conclusion: Dynamically scheduled SGDM significantly outperforms fixed-hyperparameter counterpart, with warm-up schedule empirically performing best among all strategies.
Abstract: We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior.
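For instance, the best-behaved regimes pair batch-size growth with learning-rate decay (or simultaneous growth); a toy doubling schedule, with constants chosen only for illustration:

```python
def schedule(epoch: int, b0: int = 32, lr0: float = 0.1,
             growth: int = 2, interval: int = 10) -> tuple[int, float]:
    """One of the analyzed regimes: batch size doubles every `interval`
    epochs while the learning rate decays by the same factor."""
    k = epoch // interval
    return b0 * growth**k, lr0 / growth**k
```

Under the paper's hierarchy, holding `b0` fixed forever gives no guarantee on the expected gradient norm, whereas this growing-batch schedule does.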
[593] Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models
Jelke Wibbeke, Nico Schönfisch, Sebastian Rohjans, Andreas Rauh
Main category: cs.LG
TL;DR: This paper systematically analyzes regression calibration metrics, revealing significant inconsistencies across different metrics and identifying ENCE and CWC as the most reliable ones.
Details
Motivation: In safety-critical applications, data-driven models need accurate uncertainty estimates (calibration), but existing calibration metrics vary widely in definitions and assumptions, making results difficult to compare across studies.Method: The authors systematically extracted and categorized regression calibration metrics from literature and conducted controlled experiments using real-world, synthetic, and artificially miscalibrated data to benchmark these metrics independently.
Result: Experiments showed that calibration metrics frequently produce conflicting results, with many metrics disagreeing in their evaluation of the same recalibration result and some indicating contradictory conclusions.
Conclusion: Metric selection is critical in calibration research due to substantial inconsistencies across metrics; ENCE and CWC were identified as the most dependable metrics in the tests.
Abstract: In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
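For reference, a minimal implementation of ENCE as it is commonly defined in the regression-calibration literature; equal-mass binning by predicted standard deviation is one common convention, and the paper's exact setup may differ:

```python
import numpy as np

def ence(y_true, y_pred, y_std, n_bins=10):
    """Expected Normalized Calibration Error: bin by predicted std, then
    compare the root-mean-variance to the empirical RMSE in each bin."""
    order = np.argsort(y_std)
    bins = np.array_split(order, n_bins)
    err = 0.0
    for idx in bins:
        rmv = np.sqrt(np.mean(y_std[idx] ** 2))                   # predicted spread
        rmse = np.sqrt(np.mean((y_true[idx] - y_pred[idx]) ** 2))  # observed error
        err += abs(rmv - rmse) / rmv
    return err / len(bins)

# Perfectly calibrated toy model: residuals drawn with the predicted std.
rng = np.random.default_rng(0)
std = rng.uniform(0.5, 2.0, 5000)
y = rng.normal(0.0, std)
print(ence(y, np.zeros_like(y), std))  # close to 0 for calibrated predictions
```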
[594] Revisiting associative recall in modern recurrent models
Destiny Okpekpe, Antonio Orvieto
Main category: cs.LG
TL;DR: Modern recurrent models like SSMs face optimization challenges in associative recall tasks, with learning rate sensitivity and scaling differences compared to transformers. 1-layer transformers surprisingly show induction head formation despite poor performance.
Details
Motivation: To investigate why modern recurrent models underperform transformers on reasoning tasks like associative recall, and examine scaling properties and optimization stability of different architectures.Method: Analyzed associative recall performance across different architectures, studied learning rate sensitivity, compared width vs depth scaling effects, examined 1-layer transformer training dynamics, and conducted architectural ablations on Transformer and Mamba.
Result: Recurrent models are highly sensitive to learning rate choice, transformers and recurrent models scale differently (width vs depth), 1-layer transformers surprisingly form induction heads despite poor AR performance, and architectural components affect optimization stability.
Conclusion: Modern recurrent models require better optimization strategies, transformers and recurrent architectures have fundamentally different scaling properties, and 1-layer transformers exhibit unexpected training dynamics that merit further investigation.
Abstract: Despite the advantageous subquadratic complexity of modern recurrent deep learning models – such as state-space models (SSMs) – recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one such benchmark: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaling in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how individual components affect the performance and optimization stability of Transformer and Mamba.
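For readers unfamiliar with the benchmark, here is a minimal generator for the associative recall task under its common key-value formulation (sequence length, vocabulary split, and batch size are illustrative, not the paper's configuration):

```python
import numpy as np

def make_ar_batch(batch=32, n_pairs=8, vocab=64, seed=0):
    """Synthetic associative recall: a sequence of key-value pairs followed
    by a query key; the target is the value bound to that key."""
    rng = np.random.default_rng(seed)
    seqs, targets = [], []
    for _ in range(batch):
        keys = rng.choice(vocab // 2, size=n_pairs, replace=False)  # distinct keys
        vals = rng.integers(vocab // 2, vocab, size=n_pairs)        # paired values
        q = rng.integers(n_pairs)                                   # pair to query
        seq = np.empty(2 * n_pairs + 1, dtype=np.int64)
        seq[0:-1:2], seq[1:-1:2], seq[-1] = keys, vals, keys[q]     # k v k v ... k?
        seqs.append(seq); targets.append(vals[q])
    return np.stack(seqs), np.array(targets)

x, y = make_ar_batch()
print(x.shape, y.shape)  # (32, 17) (32,)
```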
[595] HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions
Rafael Bischof, Michal Piovarči, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel
Main category: cs.LG
TL;DR: HyPINO is a multi-physics neural operator that achieves zero-shot generalization across parametric PDEs using a Swin Transformer hypernetwork with mixed supervision from analytical solutions and physics-informed objectives, outperforming existing methods without task-specific fine-tuning.
Details
Motivation: To develop a neural operator that can generalize across diverse PDEs without requiring task-specific fine-tuning, addressing the limitations of current methods that struggle with zero-shot performance on parametric PDEs with varying geometries and boundary conditions.Method: Combines Swin Transformer-based hypernetwork with mixed supervision using labeled data from Method of Manufactured Solutions and unlabeled samples optimized via physics-informed objectives. Includes iterative refinement procedure that generates delta PINNs to progressively reduce error through ensemble formation.
Result: Achieves strong zero-shot accuracy on seven benchmark problems, outperforming U-Nets, Poseidon, and PINO. Iterative refinement achieves over 100x gain in average L2 loss. Fine-tuned PINNs initialized by HyPINO converge faster and to lower error than random initialization and Reptile-meta-learned PINNs.
Conclusion: HyPINO demonstrates scalable potential as a foundation for extending neural operators to solve complex, nonlinear, and high-dimensional PDE problems, with publicly available code and model weights.
Abstract: We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a “delta” PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
[596] DQS: A Low-Budget Query Strategy for Enhancing Unsupervised Data-driven Anomaly Detection Approaches
Lucas Correia, Jan-Christoph Goos, Thomas Bäck, Anna V. Kononova
Main category: cs.LG
TL;DR: This paper introduces an active learning approach for time series anomaly detection that uses a novel dissimilarity-based query strategy (DQS) to refine threshold selection, addressing limitations of purely unsupervised methods.
Details
Motivation: Existing unsupervised time series anomaly detection methods suffer from poor threshold setting or require labeled data for calibration, which is often unavailable in real-world scenarios.Method: Integrates active learning with unsupervised anomaly detection using a novel dissimilarity-based query strategy (DQS) that evaluates similarity between anomaly scores using dynamic time warping to maximize diversity of queried samples.
Result: DQS performs best in small-budget scenarios, though other strategies are more robust to mislabelling. All query strategies outperform the unsupervised threshold even with mislabelling.
Conclusion: When feasible to query an oracle, employing active learning-based threshold is recommended, with query strategy choice depending on oracle expertise and labeling budget.
Abstract: Truly unsupervised approaches for time series anomaly detection are rare in the literature. Those that exist suffer from a poorly set threshold, which hampers detection performance, while others, despite claiming to be unsupervised, need to be calibrated using a labelled data subset, which is often not available in the real world. This work integrates active learning with an existing unsupervised anomaly detection method by selectively querying the labels of multivariate time series, which are then used to refine the threshold selection process. To achieve this, we introduce a novel query strategy called the dissimilarity-based query strategy (DQS). DQS aims to maximise the diversity of queried samples by evaluating the similarity between anomaly scores using dynamic time warping. We assess the detection performance of DQS in comparison to other query strategies and explore the impact of mislabelling, a topic that is underexplored in the literature. Our findings indicate that DQS performs best in small-budget scenarios, though the others appear to be more robust when faced with mislabelling. Therefore, in the real world, the choice of query strategy depends on the expertise of the oracle and the number of samples they are willing to label. Regardless, all query strategies outperform the unsupervised threshold even in the presence of mislabelling. Thus, whenever it is feasible to query an oracle, employing an active learning-based threshold is recommended.
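A sketch of one plausible reading of the strategy: compute DTW distances between candidates' anomaly-score curves, then greedily pick a maximally diverse subset to send to the oracle. The greedy max-min rule is an assumption; the paper's exact DQS procedure may differ:

```python
import numpy as np

def dtw(a, b):
    """Plain O(len(a) * len(b)) dynamic time warping distance."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def diversity_query(score_series, budget):
    """Greedy max-min selection over DTW distances between anomaly-score
    curves, so queried samples are as dissimilar as possible."""
    n = len(score_series)
    d = np.array([[dtw(score_series[i], score_series[j]) for j in range(n)]
                  for i in range(n)])
    chosen = [int(np.argmax(d.sum(axis=1)))]          # start with an outlier
    while len(chosen) < budget:
        rest = [i for i in range(n) if i not in chosen]
        chosen.append(max(rest, key=lambda i: min(d[i, c] for c in chosen)))
    return chosen

series = [np.sin(np.linspace(0, k, 50)) for k in (1, 2, 3, 9)]
print(diversity_query(series, budget=2))
```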
[597] Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
Lei Wang, Jieming Bian, Letian Zhang, Jie Xu
Main category: cs.LG
TL;DR: FedLEASE is a federated learning framework that adaptively allocates and selects LoRA experts for fine-tuning LLMs across heterogeneous clients, improving performance while maintaining communication efficiency.
Details
Motivation: Fine-tuning LLMs for domain-specific applications requires substantial distributed data, but federated learning faces computational constraints and single LoRA modules struggle with heterogeneous data across diverse domains.Method: Proposes FedLEASE framework that adaptively clusters clients based on representation similarity to allocate domain-specific LoRA experts, and uses adaptive top-M Mixture-of-Experts mechanism for client expert selection.
Result: Extensive experiments on diverse benchmark datasets show FedLEASE significantly outperforms existing federated fine-tuning approaches in heterogeneous client settings.
Conclusion: FedLEASE effectively addresses challenges in federated LoRA fine-tuning by adaptively allocating experts and enabling selective utilization, achieving superior performance with communication efficiency.
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks, but fine-tuning them for domain-specific applications often requires substantial domain-specific data that may be distributed across multiple organizations. Federated Learning (FL) offers a privacy-preserving solution, but faces challenges with computational constraints when applied to LLMs. Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient fine-tuning approach, though a single LoRA module often struggles with heterogeneous data across diverse domains. This paper addresses two critical challenges in federated LoRA fine-tuning: 1. determining the optimal number and allocation of LoRA experts across heterogeneous clients, and 2. enabling clients to selectively utilize these experts based on their specific data characteristics. We propose FedLEASE (Federated adaptive LoRA Expert Allocation and SElection), a novel framework that adaptively clusters clients based on representation similarity to allocate and train domain-specific LoRA experts. It also introduces an adaptive top-$M$ Mixture-of-Experts mechanism that allows each client to select the optimal number of utilized experts. Our extensive experiments on diverse benchmark datasets demonstrate that FedLEASE significantly outperforms existing federated fine-tuning approaches in heterogeneous client settings while maintaining communication efficiency.
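A schematic of the two ingredients, with plain k-means as a stand-in for the paper's representation-similarity clustering and a dot-product affinity as a stand-in for its adaptive MoE router:

```python
import numpy as np

def allocate_and_select(client_reprs, n_experts, M, iters=20):
    """Allocate LoRA experts by clustering client representations, then let
    each client keep its top-M experts by affinity. A schematic stand-in for
    FedLEASE's clustering and routing, not the paper's exact algorithm."""
    rng = np.random.default_rng(0)
    centers = client_reprs[rng.choice(len(client_reprs), n_experts, replace=False)]
    for _ in range(iters):
        assign = np.argmax(client_reprs @ centers.T, axis=1)  # closest expert
        for k in range(n_experts):
            if np.any(assign == k):
                centers[k] = client_reprs[assign == k].mean(axis=0)
    scores = client_reprs @ centers.T                 # client-expert affinity
    top_m = np.argsort(-scores, axis=1)[:, :M]        # per-client expert choice
    return assign, top_m

reprs = np.random.default_rng(1).normal(size=(16, 8))
assign, top_m = allocate_and_select(reprs, n_experts=4, M=2)
print(assign)
print(top_m[:3])
```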
[598] Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media
M. Giselle Fernández-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill
Main category: cs.LG
TL;DR: A multi-field spatio-temporal model (MSTM) that unifies seven coupled fields for predicting shock wave dynamics in porous and architected materials, achieving high accuracy with significant speedup.
Details
Motivation: To address the challenge of predicting shock wave behavior in porous materials for planetary defense and inertial fusion energy applications, where capturing phenomena like pore collapse and localized heating is crucial.Method: Developed a multi-field spatio-temporal model that integrates seven coupled fields (pressure, density, temperature, energy, material distribution, and two velocity components) into a single autoregressive surrogate trained on high-fidelity hydrocode data.
Result: Achieved mean errors of 1.4% for porous and 3.2% for architected configurations, with over three orders of magnitude speedup. Reduced mean-squared error and structural dissimilarity by 94% compared to single-field models.
Conclusion: MSTM transforms previously intractable problems into tractable design studies, providing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy applications.
Abstract: The ability to predict how shock waves traverse porous and architected materials is a key challenge in planetary defense and in the pursuit of inertial fusion energy. Yet capturing pore collapse, anomalous Hugoniot responses, and localized heating - phenomena that strongly influence asteroid deflection or fusion ignition - has remained a major challenge despite recent advances in single-field and reduced representations. We introduce a multi-field spatio-temporal model (MSTM) that unifies seven coupled fields - pressure, density, temperature, energy, material distribution, and two velocity components - into a single autoregressive surrogate. Trained on high-fidelity hydrocode data, MSTM captures nonlinear shock-driven dynamics across porous and architected configurations, achieving mean errors of 1.4% and 3.2% respectively, all while delivering over three orders of magnitude in speedup. MSTM reduces mean-squared error and structural dissimilarity by 94% relative to single-field spatio-temporal models. This advance transforms problems once considered intractable into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation and inertial fusion energy.
[599] MolSpectLLM: A Molecular Foundation Model Bridging Spectroscopy, Molecule Elucidation, and 3D Structure Generation
Shuaike Shen, Jiaqing Xie, Zhuo Yang, Antong Zhang, Shuzhou Sun, Ben Gao, Tianfan Fu, Biqing Qi, Yuqiang Li
Main category: cs.LG
TL;DR: MolSpectLLM is a molecular foundation model that integrates experimental spectroscopy with 3D molecular structure, achieving state-of-the-art performance on spectrum-related tasks and generating accurate 3D structures from SMILES or spectral inputs.
Details
Motivation: Existing molecular foundation models rely only on SMILES representations and overlook experimental spectra and 3D structural information, limiting their effectiveness in real-world scenarios where stereochemistry, spatial conformation, and experimental validation are critical.Method: Pretrained on Qwen2.5-7B, MolSpectLLM unifies experimental spectroscopy with molecular 3D structure by explicitly modeling molecular spectra.
Result: Achieves state-of-the-art performance with average accuracy of 0.53 across NMR, IR, and MS benchmarks, 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, and generates accurate 3D molecular structures directly from SMILES or spectral inputs.
Conclusion: MolSpectLLM bridges spectral analysis, molecular elucidation, and molecular design, demonstrating strong performance across multiple molecular tasks while incorporating crucial experimental and structural information.
Abstract: Recent advances in molecular foundation models have shown impressive performance in molecular property prediction and de novo molecular design, with promising applications in areas such as drug discovery and reaction prediction. Nevertheless, most existing approaches rely exclusively on SMILES representations and overlook both experimental spectra and 3D structural information-two indispensable sources for capturing molecular behavior in real-world scenarios. This limitation reduces their effectiveness in tasks where stereochemistry, spatial conformation, and experimental validation are critical. To overcome these challenges, we propose MolSpectLLM, a molecular foundation model pretrained on Qwen2.5-7B that unifies experimental spectroscopy with molecular 3D structure. By explicitly modeling molecular spectra, MolSpectLLM achieves state-of-the-art performance on spectrum-related tasks, with an average accuracy of 0.53 across NMR, IR, and MS benchmarks. MolSpectLLM also shows strong performance on the spectra analysis task, obtaining 15.5% sequence accuracy and 41.7% token accuracy on Spectra-to-SMILES, substantially outperforming large general-purpose LLMs. More importantly, MolSpectLLM not only achieves strong performance on molecular elucidation tasks, but also generates accurate 3D molecular structures directly from SMILES or spectral inputs, bridging spectral analysis, molecular elucidation, and molecular design. Code is available at https://github.com/Eurekashen/MolSpectLLM.
[600] Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data
Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis
Main category: cs.LG
TL;DR: Cell2Text is a multimodal framework that translates single-cell RNA sequencing profiles into natural language descriptions, combining gene embeddings with large language models to generate interpretable cellular summaries.
Details
Motivation: Current single-cell foundation models use discrete prediction heads that collapse cellular complexity into predefined labels, failing to capture the richer contextual explanations needed by biologists.Method: Integrates gene-level embeddings from single-cell foundation models with pretrained large language models to generate structured natural language descriptions of cells.
Result: Outperforms baselines on classification accuracy, shows strong ontological consistency using PageRank-based metrics, and achieves high semantic fidelity in text generation.
Conclusion: Coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, enabling scalable label-efficient characterization of unseen cells.
Abstract: Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
[601] Machine Learning Detection of Lithium Plating in Lithium-ion Cells: A Gaussian Process Approach
Ayush Patnaik, Jackson Fogelquist, Adam B Zufall, Stephen K Robinson, Xinfan Lin
Main category: cs.LG
TL;DR: A Gaussian Process framework is proposed for robust lithium plating detection by modeling charge-voltage relationships with calibrated uncertainty, enabling noise-aware inference and closed-form derivatives without ad hoc smoothing.
Details
Motivation: Lithium plating during fast charging accelerates battery degradation and poses safety risks. Conventional dQ/dV computation methods amplify sensor noise and introduce bias in peak detection.Method: The paper uses Gaussian Processes to model Q(V) as a stochastic process, leveraging GP properties to analytically infer dQ/dV with uncertainty quantification. The framework learns hyperparameters from data and supports online variants for battery management systems.
Result: Experimental validation on Li-ion coin cells across various C-rates (0.2C-1C) and temperatures (0-40°C) shows reliable plating peak detection under low-temperature, high-rate charging, with correct negative results in baseline cases.
Conclusion: The method provides accurate and robust lithium plating detection with uncertainty quantification, establishing a practical pathway for real-time monitoring in battery management systems.
Abstract: Lithium plating during fast charging is a critical degradation mechanism that accelerates capacity fade and can trigger catastrophic safety failures. Recent work has identified a distinctive dQ/dV peak above 4.0 V as a reliable signature of plating onset; however, conventional methods for computing dQ/dV rely on finite differencing with filtering, which amplifies sensor noise and introduces bias in peak location. In this paper, we propose a Gaussian Process (GP) framework for lithium plating detection by directly modeling the charge-voltage relationship Q(V) as a stochastic process with calibrated uncertainty. Leveraging the property that derivatives of GPs remain GPs, we infer dQ/dV analytically and probabilistically from the posterior, enabling robust detection without ad hoc smoothing. The framework provides three key benefits: (i) noise-aware inference with hyperparameters learned from data, (ii) closed-form derivatives with credible intervals for uncertainty quantification, and (iii) scalability to online variants suitable for embedded BMS. Experimental validation on Li-ion coin cells across a range of C-rates (0.2C-1C) and temperatures (0-40 °C) demonstrates that the GP-based method reliably detects plating peaks under low-temperature, high-rate charging, while correctly reporting no peaks in baseline cases. The concurrence of GP-identified differential peaks, reduced charge throughput, and capacity fade measured via reference performance tests confirms the method’s accuracy and robustness, establishing a practical pathway for real-time lithium plating detection.
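The key analytic fact, that differentiating the kernel gives the posterior of dQ/dV directly, fits in a few lines. This sketch uses a fixed RBF kernel with placeholder hyperparameters (the paper learns them from data), and a toy Q(V) curve with a sigmoid step standing in for a plating signature:

```python
import numpy as np

def gp_dqdv(V, Q, V_star, ell=0.05, sf=1.0, noise=1e-3):
    """Posterior mean of dQ/dV from a GP over Q(V) with an RBF kernel.
    Derivatives of a GP are GPs, so dQ/dV follows from differentiating
    the cross-covariance k(V*, V) analytically -- no finite differencing."""
    def k(a, b):
        return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(V, V) + noise * np.eye(len(V))
    alpha = np.linalg.solve(K, Q)
    dk = -(V_star[:, None] - V[None, :]) / ell**2 * k(V_star, V)  # d k / d V*
    return dk @ alpha

V = np.linspace(3.0, 4.2, 200)
Q = np.tanh(8 * (V - 4.05)) + 0.01 * np.random.default_rng(0).normal(size=V.size)
Vs = np.linspace(3.5, 4.2, 100)
print(Vs[np.argmax(gp_dqdv(V, Q, Vs))])  # peak near 4.05 V in this toy curve
```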
[602] CarbonX: An Open-Source Tool for Computational Decarbonization Using Time Series Foundation Models
Diptyaroop Maji, Kang Yang, Prashant Shenoy, Ramesh K Sitaraman, Mani Srivastava
Main category: cs.LG
TL;DR: CarbonX is an open-source tool using Time Series Foundation Models for carbon intensity forecasting and imputation, achieving global coverage with minimal data requirements and providing uncertainty estimates.
Details
Motivation: Existing carbon intensity forecasting tools have limitations: they require grid-specific electricity mix data, lack global coverage due to separate models, and don't provide uncertainty estimates, limiting reliability for carbon-aware applications.Method: Leverages Time Series Foundation Models (TSFMs) using only historical carbon intensity data with a single general model, enabling zero-shot forecasting and imputation across diverse grids without grid-specific data requirements.
Result: Achieves zero-shot forecasting MAPE of 15.82% across 214 grids worldwide. On 13 benchmark grids: average MAPE of 9.59%, tail forecasting MAPE of 16.54%, with 95% coverage prediction intervals. Provides forecasts up to 21 days with minimal accuracy degradation. Fine-tuned version outperforms statistical baselines by 1.2-3.9X on imputation.
Conclusion: CarbonX can be easily used on any grid with limited data while delivering strong performance, making it a practical tool for global-scale decarbonization applications.
Abstract: Computational decarbonization aims to reduce carbon emissions in computing and societal systems such as data centers, transportation, and built environments. This requires accurate, fine-grained carbon intensity forecasts, yet existing tools have several key limitations: (i) they require grid-specific electricity mix data, restricting use where such information is unavailable; (ii) they depend on separate grid-specific models that make it challenging to provide global coverage; and (iii) they provide forecasts without uncertainty estimates, limiting reliability for downstream carbon-aware applications. In this paper, we present CarbonX, an open-source tool that leverages Time Series Foundation Models (TSFMs) for a range of decarbonization tasks. CarbonX utilizes the versatility of TSFMs to provide strong performance across multiple tasks, such as carbon intensity forecasting and imputation, and across diverse grids. Using only historical carbon intensity data and a single general model, our tool achieves a zero-shot forecasting Mean Absolute Percentage Error (MAPE) of 15.82% across 214 grids worldwide. Across 13 benchmark grids, CarbonX performance is comparable with the current state-of-the-art, with an average MAPE of 9.59% and tail forecasting MAPE of 16.54%, while also providing prediction intervals with 95% coverage. CarbonX can provide forecasts for up to 21 days with minimal accuracy degradation. Further, when fully fine-tuned, CarbonX outperforms the statistical baselines by 1.2–3.9X on the imputation task. Overall, these results demonstrate that CarbonX can be used easily on any grid with limited data and still deliver strong performance, making it a practical tool for global-scale decarbonization.
[603] TabImpute: Accurate and Fast Zero-Shot Missing-Data Imputation with a Pre-Trained Transformer
Jacob Feitelberg, Dwaipayan Saha, Kyuseong Choi, Zaid Ahmad, Anish Agarwal, Raaz Dwivedi
Main category: cs.LG
TL;DR: TabImpute is a pre-trained transformer for zero-shot tabular data imputation that requires no fitting or hyperparameter tuning, offering fast and accurate performance across diverse domains.
Details
Motivation: Existing imputation methods have huge performance variance across domains and require time-consuming hyperparameter tuning, with no default solution available for tabular data.Method: Built on TabPFN foundation model, uses entry-wise featurization for 100x speedup, synthetic training data with realistic missingness patterns, and comprehensive MissBench evaluation framework.
Result: TabImpute shows robust performance compared to 11 established imputation methods across 42 OpenML datasets and 13 missingness patterns in medicine, finance, and engineering domains.
Conclusion: TabImpute provides an effective zero-shot solution for tabular data imputation that eliminates the need for fitting and hyperparameter tuning while maintaining strong performance across diverse real-world scenarios.
Abstract: Missing data is a pervasive problem in tabular settings. Existing solutions range from simple averaging to complex generative adversarial networks. However, due to huge variance in performance across real-world domains and time-consuming hyperparameter tuning, no default imputation method exists. Building on TabPFN, a recent tabular foundation model for supervised learning, we propose TabImpute, a pre-trained transformer that delivers accurate and fast zero-shot imputations requiring no fitting or hyperparameter tuning at inference-time. To train and evaluate TabImpute, we introduce (i) an entry-wise featurization for tabular settings, which enables a $100\times$ speedup over the previous TabPFN imputation method, (ii) a synthetic training data generation pipeline incorporating realistic missingness patterns, which boosts test-time performance, and (iii) MissBench, a comprehensive benchmark for evaluation of imputation methods with $42$ OpenML datasets and $13$ missingness patterns. MissBench spans domains such as medicine, finance, and engineering, showcasing TabImpute’s robust performance compared to $11$ established imputation methods.
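One plausible reading of an entry-wise featurization, in which each missing cell becomes its own prediction instance built from its row's observed values plus a column identifier; the encoding actually used by TabImpute may differ:

```python
import numpy as np

def entrywise_instances(X):
    """Turn each missing cell into one prediction instance: the (masked)
    observed entries of its row concatenated with a one-hot column id.
    A schematic reading of 'entry-wise featurization', not the paper's."""
    n, d = X.shape
    feats, coords = [], []
    for i in range(n):
        for j in range(d):
            if np.isnan(X[i, j]):
                row = np.nan_to_num(X[i], nan=0.0)   # zero out missing values
                col_onehot = np.eye(d)[j]            # which column to impute
                feats.append(np.concatenate([row, col_onehot]))
                coords.append((i, j))
    return np.array(feats), coords

X = np.array([[1.0, np.nan, 3.0], [np.nan, 2.0, 1.0]])
F, where = entrywise_instances(X)
print(F.shape, where)  # (2, 6) [(0, 1), (1, 0)]
```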
[604] Early-Warning of Thunderstorm-Driven Power Outages with a Two-Stage Machine Learning Model
Iryna Stanishevska
Main category: cs.LG
TL;DR: Developed a 24-48h early-warning model for thunderstorm-related power outages using only open data sources, featuring a two-stage logistic gate + LSTM design that outperforms baseline models in detecting outage peaks.
Details
Motivation: Thunderstorm outages are hard to predict due to chaotic convective processes, most storms not causing damage, and noisy/incomplete public data. Need for early warning systems using accessible data.Method: Two-stage model combining logistic gate and LSTM regressor; uses EAGLE-I outage data and METAR weather data; employs kriging with hourly variograms, causal spatio-temporal features capturing convection precursors (moisture advection, wind shifts, pressure drops).
Result: Two-Stage model detects more outage peaks across all time windows (3/4 vs 2/4 at ±48h, F1 66.7% vs 57.1%) with modest amplitude gains near peaks (2-3% lower cMASE at ±0-12h) and comparable overall errors to baseline.
Conclusion: Despite open-data noise, the feature-driven pipeline provides actionable early warnings for thunderstorm outages, with SHAP analysis confirming the value of engineered features like moisture-advection and wind precursors.
Abstract: Thunderstorm-driven outages are difficult to predict because most storms do not cause damage, convective processes occur rapidly and chaotically, and the available public data are both noisy and incomplete. We develop a 24-48 h early-warning model for summer, thunderstorm-related outages in Michigan using only open sources (EAGLE-I for ground truth; METAR for weather). We use the publicly released EAGLE-I outage dataset (2014-2022), maintained by Oak Ridge National Laboratory for the U.S. Department of Energy. The pipeline preserves convective micro-signals from a sparse station network via parameter-specific kriging with hourly variograms and targeted overdrafting to retain extremes, and builds causal spatio-temporal features (lags/rolling statistics; k-NN/IDW spatial aggregates) capturing precursors of severe convection (moisture advection, wind shifts, and pressure drops). The two-stage model design, combining a logistic gate and an LSTM regressor, limits routine periods and reduces noise exposure. The study uses event-centric metrics (cluster-based hits/misses/false alarms) and peak-conditional MASE (cMASE) in ±Δ-hour windows around state-level peaks (≥ 50,000), with uncertainty quantified by hourly moving-block bootstrap. On the test sample, Two-Stage detects more reference peaks across all windows (e.g., at ±48 h it records 3/4 vs. 2/4; F1 66.7% vs. 57.1%) with one extra false alarm. Near peaks, it shows modest amplitude gains (2-3% lower cMASE at ±0-12 h; bootstrap medians +9-13% at ±6-12 h) but small losses at ±36-48 h (~3-4%). Overall, errors are comparable to the one-step LSTM baseline. SHAP analysis confirms moisture-advection and wind/gust precursors, underscoring the value of the feature engineering. Despite open-data noise, the feature-driven pipeline yields actionable, event-focused early warnings for thunderstorm outages.
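A minimal sketch of a peak-conditional MASE, assuming the standard in-sample naive-forecast denominator and a window mask around reference peaks; the details of the paper's cMASE may differ:

```python
import numpy as np

def cmase(y, yhat, peak_idx, delta):
    """MASE with the numerator restricted to +/-delta steps around reference
    peaks, scaled by the usual in-sample naive one-step error."""
    scale = np.mean(np.abs(np.diff(y)))        # naive-forecast error scale
    mask = np.zeros(len(y), dtype=bool)
    for p in peak_idx:
        mask[max(0, p - delta): p + delta + 1] = True
    return np.mean(np.abs(y[mask] - yhat[mask])) / scale

rng = np.random.default_rng(0)
y = rng.gamma(2.0, 1.0, 500); y[[100, 300]] += 50   # two synthetic peaks
yhat = y + rng.normal(0, 1.0, 500)
print(cmase(y, yhat, peak_idx=[100, 300], delta=12))
```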
[605] Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
Jinhui Bai, Andreas Christmann, Lei Shi
Main category: cs.LG
TL;DR: A novel kernel SGD algorithm with improved efficiency and scalability using adaptive regularization and finite-dimensional projection, achieving minimax-optimal convergence rates.
Details
Motivation: To address the computational inefficiency and scalability issues of traditional kernel SGD methods for large-scale supervised learning with general losses.Method: Uses an innovative regularization strategy that projects stochastic gradients onto finite-dimensional spaces via spherical radial basis function expansion, with adaptive scaling based on bias-variance trade-off. Incorporates coordinate-wise updates from linear SGD to reduce computational complexity.
Result: Achieves minimax-optimal convergence rates for both last iterate and suffix average, optimal strong convergence in RKHS, and significantly reduces computational and storage complexity. Works with various loss functions including least-squares, Huber, and logistic losses.
Conclusion: The proposed kernel SGD algorithm provides an efficient and scalable solution for large-scale supervised learning with theoretical guarantees and practical performance improvements over traditional kernel methods.
Abstract: In this paper, we propose a novel kernel stochastic gradient descent (SGD) algorithm for large-scale supervised learning with general losses. Compared to traditional kernel SGD, our algorithm improves efficiency and scalability through an innovative regularization strategy. By leveraging the infinite series expansion of spherical radial basis functions, this strategy projects the stochastic gradient onto a finite-dimensional hypothesis space, which is adaptively scaled according to the bias-variance trade-off, thereby enhancing generalization performance. Based on a new estimation of the spectral structure of the kernel-induced covariance operator, we develop an analytical framework that unifies optimization and generalization analyses. We prove that both the last iterate and the suffix average converge at minimax-optimal rates, and we further establish optimal strong convergence in the reproducing kernel Hilbert space. Our framework accommodates a broad class of classical loss functions, including least-squares, Huber, and logistic losses. Moreover, the proposed algorithm significantly reduces computational complexity and achieves optimal storage complexity by incorporating coordinate-wise updates from linear SGD, thereby avoiding the costly pairwise operations typical of kernel SGD and enabling efficient processing of streaming data. Finally, extensive numerical experiments demonstrate the efficiency of our approach.
[606] Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
Kristi Topollai, Anna Choromanska
Main category: cs.LG
TL;DR: Introduces an adaptive memory mechanism that replaces constant momentum with dynamic momentum coefficients adjusted online during optimization, outperforming standard optimizers with fixed momentum.
Details
Motivation: Current momentum-based optimizers use constant momentum coefficients (typically β=0.9) that are suboptimal, as they don't adapt during training despite the changing optimization landscape.Method: Uses a proximal framework with two-plane approximation: one from current gradient and another from accumulated past gradients memory, enabling dynamic adjustment of momentum coefficients during optimization.
Result: Adaptive memory variants of SGD and AdamW outperform standard versions across various tasks from convex problems to large-scale deep learning, without requiring extra hyperparameter tuning.
Conclusion: The adaptive memory approach is novel, simple to use, and opens new ways for inducing adaptivity in optimization, demonstrating that dynamic momentum coefficients can improve training performance.
Abstract: The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer’s memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model training, a strategy widely used by practitioners, yet suboptimal. In this paper, we introduce an adaptive memory mechanism that replaces constant momentum with a dynamic momentum coefficient that is adjusted online during optimization. We derive our method by approximating the objective function using two planes: one derived from the gradient at the current iterate and the other obtained from the accumulated memory of the past gradients. To the best of our knowledge, such a proximal framework was never used for momentum-based optimization. Our proposed approach is novel, extremely simple to use, and does not rely on extra assumptions or hyperparameter tuning. We implement adaptive memory variants of both SGD and AdamW across a wide range of learning tasks, from simple convex problems to large-scale deep learning scenarios, demonstrating that our approach can outperform standard SGD and Adam with hand-tuned momentum coefficients. Finally, our work opens doors for new ways of inducing adaptivity in optimization.
[607] GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration
Tingfeng Hong, Pingye Ren, Xinlong Xiao, Chao Wang, Chenyi Lei, Wenwu Ou, Han Li
Main category: cs.LG
TL;DR: GRADE is a reinforcement learning framework for personalized multi-task fusion that uses group-relative policy optimization and Dirichlet exploration to dynamically adapt fusion weights based on individual user intent, achieving significant performance improvements in large-scale recommender systems.
Details
Motivation: Current multi-task fusion methods use static, manually-tuned weights that cannot capture individual user intent, while traditional RL approaches suffer from training instability and sparse rewards in large-scale systems.Method: GRADE uses a critic-free Group Relative Policy Optimization (GRPO) paradigm with Dirichlet distribution for structured exploration of weight space, and a composite reward function combining sparse user feedback with dense model priors and rule-based constraints.
Result: In large-scale A/B tests with hundreds of millions of daily active users, GRADE achieved +0.595% CTR, +1.193% CVR, +1.788% OPM, and +1.568% total order volume improvements over established baselines.
Conclusion: GRADE has been successfully deployed in Kuaishou’s marketplace search scenario, serving hundreds of millions of users, demonstrating its effectiveness for personalized multi-task fusion in large-scale recommender systems.
Abstract: Balancing multiple objectives is critical for user satisfaction in modern recommender and search systems, yet current Multi-Task Fusion (MTF) methods rely on static, manually-tuned weights that fail to capture individual user intent. While Reinforcement Learning (RL) offers a path to personalization, traditional approaches often falter due to training instability and the sparse rewards inherent in these large-scale systems. To address these limitations, we propose Group-relative Reinforcement learning with Adaptive Dirichlet Exploration (GRADE), a novel and robust framework for personalized multi-task fusion. GRADE leverages a critic-free, Group Relative Policy Optimization (GRPO) paradigm, enabling stable and efficient policy learning by evaluating the relative performance of candidate weight groups. Its core innovations include employing the Dirichlet distribution for principled and structured exploration of the weight space, and a composite reward function that combines sparse user feedback with dense model priors and rule-based constraints to guide the search effectively. Deployed in the in-app marketplace of an application with hundreds of millions of daily active users, GRADE significantly outperforms established baselines, achieving substantial gains in rigorous large-scale A/B tests: +0.595% in CTR, +1.193% in CVR, +1.788% in OPM, and +1.568% in total order volume. Following its strong performance, GRADE has been fully deployed in the marketplace search scenario of Kuaishou, serving hundreds of millions of users.
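The exploration-and-advantage core of a GRPO-style update with Dirichlet sampling fits in a few lines; the reward function below is a toy stand-in for online user feedback plus the paper's dense priors and rule-based constraints:

```python
import numpy as np

def grade_step(reward_fn, n_tasks=4, group_size=8, alpha=2.0, seed=0):
    """One GRPO-style signal: sample a group of fusion-weight vectors from a
    Dirichlet (each sums to 1), score them, and compute group-relative
    advantages (reward minus group mean, scaled by group std)."""
    rng = np.random.default_rng(seed)
    group = rng.dirichlet(alpha * np.ones(n_tasks), size=group_size)
    rewards = np.array([reward_fn(w) for w in group])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return group, adv

# Toy reward preferring weight on the first two objectives (e.g. CTR, CVR).
toy_reward = lambda w: 2.0 * w[0] + 1.5 * w[1] + 0.5 * w[2] + 0.1 * w[3]
weights, advantages = grade_step(toy_reward)
print(weights[np.argmax(advantages)])  # best candidate in the group
```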
[608] SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott, Rajkumar Buyya
Main category: cs.LG
TL;DR: SketchGuard is a Byzantine-robust decentralized federated learning framework that uses Count Sketch compression to reduce communication and computational costs while maintaining security against malicious clients.
Details
Motivation: Existing Byzantine-robust DFL methods require expensive full model comparisons between neighbors, creating prohibitive communication and computational costs that prevent web-scale deployment.Method: SketchGuard compresses high-dimensional models to low-dimensional sketches using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, decoupling Byzantine filtering from model aggregation.
Result: SketchGuard reduces computation time by up to 82% and communication overhead by 50-70% while maintaining identical robustness to state-of-the-art methods, with benefits scaling multiplicatively with model dimensionality and network connectivity.
Conclusion: SketchGuard establishes sketch-based compression as a fundamental enabler of robust decentralized federated learning at web scale by providing rigorous convergence guarantees and significant efficiency improvements.
Abstract: Decentralized Federated Learning (DFL) enables privacy-preserving collaborative training without centralized servers, but remains vulnerable to Byzantine attacks where malicious clients submit corrupted model updates. Existing Byzantine-robust DFL defenses rely on similarity-based neighbor screening that requires every client to exchange and compare complete high-dimensional model vectors with all neighbors in each training round, creating prohibitive communication and computational costs that prevent deployment at web scale. We propose SketchGuard, a general framework that decouples Byzantine filtering from model aggregation through sketch-based neighbor screening. SketchGuard compresses $d$-dimensional models to $k$-dimensional sketches ($k \ll d$) using Count Sketch for similarity comparisons, then selectively fetches full models only from accepted neighbors, reducing per-round communication complexity from $O(d|N_i|)$ to $O(k|N_i| + d|S_i|)$, where $|N_i|$ is the neighbor count and $|S_i| \le |N_i|$ is the accepted neighbor count. We establish rigorous convergence guarantees in both strongly convex and non-convex settings, proving that Count Sketch compression preserves Byzantine resilience with controlled degradation bounds where approximation errors introduce only a $(1+O(\epsilon))$ factor in the effective threshold parameter. Comprehensive experiments across multiple datasets, network topologies, and attack scenarios demonstrate that SketchGuard maintains identical robustness to state-of-the-art methods while reducing computation time by up to 82% and communication overhead by 50-70% depending on filtering effectiveness, with benefits scaling multiplicatively with model dimensionality and network connectivity. These results establish the viability of sketch-based compression as a fundamental enabler of robust DFL at web scale.
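Count Sketch itself is simple, and the property the screening step relies on, approximate preservation of inner products (hence cosine similarity), is easy to verify numerically. A minimal sketch, with a shared random seed standing in for the shared hash functions clients would need to agree on:

```python
import numpy as np

def count_sketch(v, k, seed=0):
    """Compress a d-dimensional vector to k buckets with a random hash and
    random signs; inner products between sketches approximate the original
    inner products, which is all similarity screening needs."""
    rng = np.random.default_rng(seed)          # shared seed => shared hashes
    h = rng.integers(0, k, size=v.size)        # bucket for each coordinate
    s = rng.choice([-1.0, 1.0], size=v.size)   # sign for each coordinate
    sk = np.zeros(k)
    np.add.at(sk, h, s * v)                    # signed accumulation per bucket
    return sk

d, k = 100_000, 2_000
rng = np.random.default_rng(1)
a = rng.normal(size=d); b = a + 0.1 * rng.normal(size=d)  # two similar "models"
cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos(a, b), cos(count_sketch(a, k), count_sketch(b, k)))  # close values
```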
[609] Accelerated Evolving Set Processes for Local PageRank Computation
Binbin Huang, Luo Luo, Yanghua Xiao, Deqing Yang, Baojian Zhou
Main category: cs.LG
TL;DR: A novel framework using nested evolving set processes to accelerate Personalized PageRank computation, achieving time complexity independent of graph size under certain conditions.
Details
Motivation: To develop more efficient algorithms for Personalized PageRank computation that can handle large graphs by reducing dependency on graph size.Method: Uses nested evolving set processes with localized inexact proximal point iterations to solve simplified linear systems, requiring only O~(1/√α) such systems to be solved.
Result: Achieves time complexity of min{O~(R²/ε²), O~(m)} for ε-approximation, and O~(R²/(√αε²)) when 1/ε² ≪ m, independent of graph size. Validated by experiments showing early convergence.
Conclusion: The framework successfully accelerates PPR computation and resolves an open conjecture, providing graph-size-independent complexity under practical conditions.
Abstract: This work proposes a novel framework based on nested evolving set processes to accelerate Personalized PageRank (PPR) computation. At each stage of the process, we employ a localized inexact proximal point iteration to solve a simplified linear system. We show that the time complexity of such localized methods is upper bounded by $\min\{\tilde{\mathcal{O}}(R^2/\epsilon^2), \tilde{\mathcal{O}}(m)\}$ to obtain an $\epsilon$-approximation of the PPR vector, where $m$ denotes the number of edges in the graph and $R$ is a constant defined via nested evolving set processes. Furthermore, the algorithms induced by our framework require solving only $\tilde{\mathcal{O}}(1/\sqrt{\alpha})$ such linear systems, where $\alpha$ is the damping factor. When $1/\epsilon^2\ll m$, this implies the existence of an algorithm that computes an $\epsilon$-approximation of the PPR vector with an overall time complexity of $\tilde{\mathcal{O}}\left(R^2 / (\sqrt{\alpha}\epsilon^2)\right)$, independent of the underlying graph size. Our result resolves an open conjecture from existing literature. Experimental results on real-world graphs validate the efficiency of our methods, demonstrating significant convergence in the early stages.
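For context, here is the classic local forward-push routine that local PPR methods build on; its cost depends on the push threshold rather than the graph size, which is the property the accelerated framework refines:

```python
from collections import defaultdict

def forward_push_ppr(graph, source, alpha=0.15, eps=1e-6):
    """Local forward-push approximation of personalized PageRank: only nodes
    whose residual mass exceeds eps * degree are ever touched, so running
    time depends on eps and alpha rather than on the graph size."""
    p = defaultdict(float)                     # PPR estimate
    r = defaultdict(float); r[source] = 1.0    # residual mass
    queue = [source]
    while queue:
        u = queue.pop()
        deg = len(graph[u])
        if r[u] < eps * deg:                   # stale queue entry: skip
            continue
        p[u] += alpha * r[u]                   # settle alpha fraction at u
        push = (1.0 - alpha) * r[u] / deg      # spread the rest to neighbors
        r[u] = 0.0
        for v in graph[u]:
            r[v] += push
            if r[v] >= eps * len(graph[v]):
                queue.append(v)
    return dict(p)

g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(forward_push_ppr(g, source=0))
```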
[610] Unsupervised Multi-Source Federated Domain Adaptation under Domain Diversity through Group-Wise Discrepancy Minimization
Larissa Reichart, Cem Ata Baykara, Ali Burak Ünal, Harlin Lee, Mete Akgün
Main category: cs.LG
TL;DR: GALA is a scalable federated unsupervised multi-source domain adaptation framework that uses inter-group discrepancy minimization and centroid-based weighting to handle large numbers of heterogeneous source domains efficiently.
Details
Motivation: Existing distributed UMDA methods don't scale well with many heterogeneous domains, leading to high computational overhead or unstable performance.Method: Proposes GALA with two key components: (1) inter-group discrepancy minimization for efficient domain alignment without quadratic computation, and (2) temperature-controlled centroid-based weighting to dynamically prioritize source domains based on target alignment.
Result: GALA achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
Conclusion: GALA provides a scalable and robust solution for federated UMDA that works effectively with large numbers of heterogeneous source domains.
Abstract: Unsupervised multi-source domain adaptation (UMDA) aims to learn models that generalize to an unlabeled target domain by leveraging labeled data from multiple, diverse source domains. While distributed UMDA methods address privacy constraints by avoiding raw data sharing, existing approaches typically assume a small number of sources and fail to scale effectively. Increasing the number of heterogeneous domains often makes existing methods impractical, leading to high computational overhead or unstable performance. We propose GALA, a scalable and robust federated UMDA framework that introduces two key components: (1) a novel inter-group discrepancy minimization objective that efficiently approximates full pairwise domain alignment without quadratic computation; and (2) a temperature-controlled, centroid-based weighting strategy that dynamically prioritizes source domains based on alignment with the target. Together, these components enable stable and parallelizable training across large numbers of heterogeneous sources. To evaluate performance in high-diversity scenarios, we introduce Digit-18, a new benchmark comprising 18 digit datasets with varied synthetic and real-world domain shifts. Extensive experiments show that GALA consistently achieves competitive or state-of-the-art results on standard benchmarks and significantly outperforms prior methods in diverse multi-source settings where others fail to converge.
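A schematic of the inter-group idea, replacing O(S^2) pairwise source alignment with distances between G group centroids plus group-to-target terms; the grouping rule and the squared-Euclidean discrepancy here are assumptions, not GALA's exact objective:

```python
import numpy as np

def inter_group_discrepancy(source_feats, target_feat, n_groups=3):
    """Align group-level mean embeddings instead of all source pairs:
    O(G^2) group terms rather than O(S^2) source terms."""
    rng = np.random.default_rng(0)
    groups = np.array_split(rng.permutation(len(source_feats)), n_groups)
    means = np.stack([source_feats[g].mean(axis=0) for g in groups])
    gg = sum(np.sum((means[i] - means[j]) ** 2)          # group-to-group
             for i in range(n_groups) for j in range(i + 1, n_groups))
    gt = np.sum((means - target_feat) ** 2)              # group-to-target
    return gg + gt

feats = np.random.default_rng(1).normal(size=(30, 8))    # stacked source features
print(inter_group_discrepancy(feats, feats.mean(axis=0)))
```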
[611] Characterizing the Multiclass Learnability of Forgiving 0-1 Loss Functions
Jacob Trauger, Tyson Trauger, Ambuj Tewari
Main category: cs.LG
TL;DR: Characterization of learnability for forgiving 0-1 loss functions in finite label multiclass setting using a new Generalized Natarajan Dimension.
Details
Motivation: To understand when hypothesis classes are learnable with forgiving 0-1 loss functions in multiclass settings with finite labels.Method: Created a new combinatorial dimension based on Natarajan Dimension, proved learnability equivalence to finite Generalized Natarajan Dimension, and connected to set-valued feedback learning.
Result: Showed that a hypothesis class is learnable with forgiving 0-1 loss if and only if the Generalized Natarajan Dimension is finite.
Conclusion: Learnability of set learning problems is characterized by the Natarajan Dimension, establishing a fundamental connection between combinatorial dimensions and multiclass learning theory.
Abstract: In this paper we will give a characterization of the learnability of forgiving 0-1 loss functions in the finite label multiclass setting. To do this, we create a new combinatorial dimension that is based off of the Natarajan Dimension and we show that a hypothesis class is learnable in our setting if and only if this Generalized Natarajan Dimension is finite. We also show a connection to learning with set-valued feedback. Through our results we show that the learnability of a set learning problem is characterized by the Natarajan Dimension.
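For reference, the standard Natarajan shattering condition that the paper's Generalized Natarajan Dimension adapts to forgiving losses:

```latex
% A set S = {x_1, ..., x_n} is N-shattered by a hypothesis class H if there
% exist two witness functions f_0, f_1 : S -> Y with f_0(x) != f_1(x) for
% every x in S, such that every dichotomy of S is realized:
\forall\, T \subseteq S \;\; \exists\, h \in \mathcal{H}:\quad
h(x) = f_0(x)\ \text{for } x \in T,
\qquad h(x) = f_1(x)\ \text{for } x \in S \setminus T .
% Ndim(H) is the largest such n. The paper's characterization: a class is
% learnable with a forgiving 0-1 loss iff its Generalized Ndim is finite.
```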
[612] Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints
Zilin Kang, Chonghua Liao, Tingqiang Xu, Huazhe Xu
Main category: cs.LG
TL;DR: ERA is a new paradigm that constrains sampling entropy above given thresholds using specially designed output activations, achieving significant performance improvements across LLMs, RL agents, and image classification with minimal computational overhead.
Details
Motivation: To develop a simple yet effective method for controlling entropy in model outputs that can broadly improve performance across different domains without significant computational cost.Method: Apply specially designed activations to model outputs to constrain sampling entropy above given thresholds, enabling entropy control during inference.
Result: 37.4% boost in AIME 2025 score for Qwen2.5-Math-7B, >30% performance improvement on HumanoidBench over SAC baseline, 0.69% ImageNet top-1 accuracy gain for ResNet-50, all with <7% computational overhead.
Conclusion: Output activation is a powerful tool for entropy control that enables simpler and more robust algorithm design across multiple domains.
Abstract: We propose ERA, a new paradigm that constrains the sampling entropy above given thresholds by applying specially designed activations to the outputs of models. Our approach demonstrates broad effectiveness across different domains: 1) for large language models (LLMs), boosting the AIME 2025 score for Qwen2.5-Math-7B by 37.4%; 2) for continuous control reinforcement learning agents, improving performance by more than 30% over strong baselines such as SAC on the challenging HumanoidBench; 3) for image classification, enhancing ImageNet top-1 accuracy by 0.69% for ResNet-50. These gains are achieved with a computational overhead of less than 7%. Our work validates output activation as a powerful tool for entropy control, opening a new direction for designing simpler and more robust algorithms.
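As one illustrative way an output transform could enforce an entropy floor (an assumption for intuition only; ERA's actual activation design is specified in the paper and not reproduced here): rescale categorical logits by a temperature found via bisection so the softmax entropy meets the threshold.

```python
import numpy as np

def entropy_floor_logits(logits, h_min, iters=30):
    """Find, by bisection, a temperature that lifts softmax entropy to at
    least h_min, then return the rescaled logits. Assumes h_min is below
    log(len(logits)), the maximum possible entropy."""
    def entropy(t):
        z = logits / t
        p = np.exp(z - z.max()); p /= p.sum()
        return -np.sum(p * np.log(p + 1e-12))
    lo, hi = 1.0, 1e4
    if entropy(lo) >= h_min:
        return logits                     # already diverse enough
    for _ in range(iters):                # entropy grows with temperature
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if entropy(mid) >= h_min else (mid, hi)
    return logits / hi                    # hi always satisfies the floor

x = np.array([8.0, 1.0, 0.5, 0.2])
print(entropy_floor_logits(x, h_min=1.0))
```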
cs.MA
[613] AgenticAD: A Specialized Multiagent System Framework for Holistic Alzheimer Disease Management
Adib Bazgir, Amir Habibdoust, Xing Song, Yuwen Zhang
Main category: cs.MA
TL;DR: Proposes a multi-agent system framework for holistic Alzheimer’s disease management using specialized AI agents that collaborate across caregiver support, data analysis, and multimodal workflows.
Details
Motivation: Current AI applications for Alzheimer's disease are siloed and address only singular aspects, lacking systemic integration needed for comprehensive disease management.Method: Develops a framework with eight specialized, interoperable agents using LLMs (GPT-4o, Gemini), multi-agent orchestration, RAG, web scraping, multimodal data processing, and database querying.
Result: Presents a detailed architectural blueprint for an integrated AI ecosystem that enables collaborative, adaptive, and personalized Alzheimer’s care solutions.
Conclusion: The multi-agent framework establishes a foundation for future systems that can synthesize diverse data streams to improve patient outcomes and reduce caregiver burden.
Abstract: Alzheimer’s disease (AD) presents a complex, multifaceted challenge to patients, caregivers, and the healthcare system, necessitating integrated and dynamic support solutions. While artificial intelligence (AI) offers promising avenues for intervention, current applications are often siloed, addressing singular aspects of the disease such as diagnostics or caregiver support without systemic integration. This paper proposes a novel methodological framework for a comprehensive, multi-agent system (MAS) designed for holistic Alzheimer’s disease management. The objective is to detail the architecture of a collaborative ecosystem of specialized AI agents, each engineered to address a distinct challenge in the AD care continuum, from caregiver support and multimodal data analysis to automated research and clinical data interpretation. The proposed framework is composed of eight specialized, interoperable agents. These agents are categorized by function: (1) Caregiver and Patient Support, (2) Data Analysis and Research, and (3) Advanced Multimodal Workflows. The methodology details the technical architecture of each agent, leveraging a suite of advanced technologies including large language models (LLMs) such as GPT-4o and Gemini, multi-agent orchestration frameworks, Retrieval-Augmented Generation (RAG) for evidence-grounded responses, and specialized tools for web scraping, multimodal data processing, and in-memory database querying. This paper presents a detailed architectural blueprint for an integrated AI ecosystem for AD care. By moving beyond single-purpose tools to a collaborative, multi-agent paradigm, this framework establishes a foundation for developing more adaptive, personalized, and proactive solutions. This methodological approach aims to pave the way for future systems capable of synthesizing diverse data streams to improve patient outcomes and reduce caregiver burden.
[614] GRPO-GCC: Enhancing Cooperation in Spatial Public Goods Games via Group Relative Policy Optimization with Global Cooperation Constraint
Zhaoqilin Yang, Chanchan Li, Tianqi Liu, Hongxin Zhao, Youliang Tian
Main category: cs.MA
TL;DR: GRPO-GCC is a new deep reinforcement learning framework for spatial public goods games that combines group relative policy optimization with a global cooperation constraint to achieve sustainable cooperation.
Details
Motivation: Inspired by self-regulating cooperation in collective institutions, the authors aim to prevent collapse into universal defection or unconditional cooperation in structured populations. Method: Integrates group relative policy optimization with global cooperation constraint, combining group-normalized advantage estimation, reference-anchored KL penalty, and dynamic global incentive term.
Result: Achieves accelerated cooperation onset, stabilized policy adaptation, and long-term sustainability by reshaping incentives toward resilient cooperation.
Conclusion: GRPO-GCC provides a new paradigm for multi-agent reinforcement learning in socio-technical systems using simple global signals to reshape cooperative incentives.
Abstract: Inspired by the principle of self-regulating cooperation in collective institutions, we propose the Group Relative Policy Optimization with Global Cooperation Constraint (GRPO-GCC) framework. This work is the first to introduce GRPO into spatial public goods games, establishing a new deep reinforcement learning baseline for structured populations. GRPO-GCC integrates group relative policy optimization with a global cooperation constraint that strengthens incentives at intermediate cooperation levels while weakening them at extremes. This mechanism aligns local decision making with sustainable collective outcomes and prevents collapse into either universal defection or unconditional cooperation. The framework advances beyond existing approaches by combining group-normalized advantage estimation, a reference-anchored KL penalty, and a global incentive term that dynamically adjusts cooperative payoffs. As a result, it achieves accelerated cooperation onset, stabilized policy adaptation, and long-term sustainability. GRPO-GCC demonstrates how a simple yet global signal can reshape incentives toward resilient cooperation, and provides a new paradigm for multi-agent reinforcement learning in socio-technical systems.
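As a rough illustration of how group-relative advantages could be combined with a global cooperation signal, the sketch below normalizes rewards within a group and adds a shaping bonus that peaks at intermediate cooperation levels and vanishes at the extremes. The quadratic rho*(1-rho) form and the `incentive_scale` parameter are our own assumptions; the paper's exact constraint may differ.

```python
import numpy as np

def grpo_gcc_advantages(rewards, coop_rate, incentive_scale=1.0):
    """Group-normalized advantages with a global cooperation bonus.

    The quadratic rho * (1 - rho) shaping term is an illustrative choice:
    it peaks at intermediate cooperation levels and vanishes at the extremes,
    mirroring the constraint described in the paper.
    """
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)        # group-relative baseline
    bonus = incentive_scale * coop_rate * (1.0 - coop_rate)
    return adv + bonus

# Example: a group of 4 agents at a 60% global cooperation rate.
print(grpo_gcc_advantages([1.0, 0.5, 0.0, 2.0], coop_rate=0.6))
```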
[615] Scalable Multi-Agent Path Finding using Collision-Aware Dynamic Alert Mask and a Hybrid Execution Strategy
Bharath Muppasani, Ritirupa Dey, Biplav Srivastava, Vignesh Narayanan
Main category: cs.MA
TL;DR: A hybrid MAPF framework combining decentralized RL planning with lightweight centralized coordination, using minimal conflict alerts to achieve scalable collision-free solutions in large-scale scenarios.
Details
Motivation: Address limitations of centralized MAPF algorithms (computationally expensive in large-scale scenarios) and distributed methods (poor solution quality), seeking a balance between scalability and solution quality. Method: Hybrid framework with decentralized reinforcement learning for path planning and a lightweight central coordinator that provides minimal conflict alerts (static conflict-cell flags or brief conflict tracks) to guide agents.
Result: Reduces inter-agent information sharing while consistently finding feasible collision-free solutions, even in large-scale scenarios with high agent counts.
Conclusion: The hybrid approach effectively balances computational efficiency and solution quality by combining decentralized planning with targeted centralized coordination.
Abstract: Multi-agent pathfinding (MAPF) remains a critical problem in robotics and autonomous systems, where agents must navigate shared spaces efficiently while avoiding conflicts. Traditional centralized algorithms that have global information, such as Conflict-Based Search (CBS), provide high-quality solutions but become computationally expensive in large-scale scenarios due to the combinatorial explosion of conflicts that need resolution. Conversely, distributed approaches that have local information, particularly learning-based methods, offer better scalability by operating with relaxed information availability, yet often at the cost of solution quality. To address these limitations, we propose a hybrid framework that combines decentralized path planning with a lightweight centralized coordinator. Our framework leverages reinforcement learning (RL) for decentralized planning, enabling agents to adapt their planning based on minimal, targeted alerts, such as static conflict-cell flags or brief conflict tracks, dynamically shared by the central coordinator for effective conflict resolution. We empirically study the effect of the information available to an agent on its planning performance. Our approach reduces inter-agent information sharing compared to fully centralized and distributed methods, while still consistently finding feasible, collision-free solutions, even in large-scale scenarios with higher agent counts.
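A minimal sketch of what a central coordinator's conflict-cell alert could look like: scan the agents' planned paths for cells claimed at the same timestep and report only those. The alert payload and data layout are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def conflict_alerts(paths):
    """Flag (cell, timestep) pairs claimed by more than one agent.

    `paths` maps agent id -> list of grid cells, one per timestep. Returning
    only the conflicting cells mirrors the "minimal, targeted alerts" idea;
    the paper's actual alert payload may differ.
    """
    claims = defaultdict(list)
    for agent, path in paths.items():
        for t, cell in enumerate(path):
            claims[(cell, t)].append(agent)
    return {key: agents for key, agents in claims.items() if len(agents) > 1}

paths = {
    "a1": [(0, 0), (0, 1), (0, 2)],
    "a2": [(1, 1), (0, 1), (0, 0)],   # collides with a1 at (0, 1), t=1
}
print(conflict_alerts(paths))          # {((0, 1), 1): ['a1', 'a2']}
```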
[616] Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol
Xinxing Ren, Caelum Forder, Qianbo Zang, Ahsen Tahir, Roman J. Georgio, Suman Deb, Peter Carroll, Önder Gürcan, Zekun Guo
Main category: cs.MA
TL;DR: Anemoi is a semi-centralized multi-agent system that enables structured inter-agent collaboration through direct communication, reducing dependency on a single planner and improving performance with smaller LLMs.
Details
Motivation: Traditional multi-agent systems rely heavily on a central planner's capability and have limited inter-agent communication, leading to degraded performance with smaller LLMs and inefficient collaboration through prompt concatenation. Method: Built on Coral Protocol's Agent-to-Agent communication MCP server, Anemoi enables real-time structured collaboration where all agents can monitor progress, assess results, identify bottlenecks, and propose refinements directly.
Result: On the GAIA benchmark, Anemoi achieved 52.73% accuracy with GPT-4.1-mini as planner, surpassing the strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings.
Conclusion: Anemoi’s semi-centralized approach with structured inter-agent communication reduces planner dependency, supports adaptive planning, minimizes redundant context passing, and enables more scalable multi-agent system execution.
Abstract: Recent advances in generalist multi-agent systems (MAS) have largely followed a context-engineering plus centralized paradigm, where a planner agent coordinates multiple worker agents through unidirectional prompt passing. While effective under strong planner models, this design suffers from two critical limitations: (1) strong dependency on the planner’s capability, which leads to degraded performance when a smaller LLM powers the planner; and (2) limited inter-agent communication, where collaboration relies on prompt concatenation rather than genuine refinement through structured discussions. To address these challenges, we propose Anemoi, a semi-centralized MAS built on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol. Unlike traditional designs, Anemoi enables structured and direct inter-agent collaboration, allowing all agents to monitor progress, assess results, identify bottlenecks, and propose refinements in real time. This paradigm reduces reliance on a single planner, supports adaptive plan updates, and minimizes redundant context passing, resulting in more scalable execution. Evaluated on the GAIA benchmark, Anemoi achieved 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner, surpassing the strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings. Our implementation is publicly available at https://github.com/Coral-Protocol/Anemoi.
cs.MM
[617] Iola Walker: A Mobile Footfall Detection System for Music Composition
William B. James
Main category: cs.MM
TL;DR: Iola Walker is a wearable music playback system that adapts music based on the listener’s gait, aiming to create a new preferred music experience medium.
Details
Motivation: To find a method for materially enhancing music using hardware/software and potentially create a new music experience medium that could reform the music industry. Method: Developed Iola Walker - a wearable device system that allows musicians to compose music that changes according to the listener's walking patterns.
Result: A functional music playback system infrastructure that adapts music to gait, with artifacts available on GitHub.
Conclusion: This research represents a step toward prosocial reform in the music industry through new playback technologies that could address societal problems in entertainment.
Abstract: This outing is part of a larger music technology research project. The objective is to find a method for materially enhancing music using hardware and software. There is a strong likelihood that there exists a new medium for experiencing music via a wearable device that ordinary listeners prefer over the current state of the art. If such a medium is discovered, it is a step towards altruistic, prosocial reform in the music industry. A new playback system infrastructure has a chance to soothe some of the societal problems tied to the larger entertainment industry ecosystem. Iola Walker is a music playback system that allows musicians to compose music that changes in accordance with the listener's gait. Artifacts are available here: https://github.com/willbjames/iolawalker
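Footfall-driven playback ultimately reduces to estimating cadence from motion sensors and mapping it to a musical parameter. The sketch below is a deliberately simple peak-detection estimate of steps per minute from accelerometer magnitude; the sampling rate, thresholds, and the tempo mapping are our own assumptions, since the abstract does not describe the pipeline at this level.

```python
import numpy as np
from scipy.signal import find_peaks

def cadence_to_bpm(accel_mag, fs=50.0, min_step_interval=0.3):
    """Estimate steps per minute from accelerometer magnitude samples.

    A deliberately simple peak detector; the actual Iola Walker pipeline
    is not described at this level of detail in the abstract.
    """
    peaks, _ = find_peaks(accel_mag, distance=int(min_step_interval * fs),
                          prominence=0.5)
    duration_min = len(accel_mag) / fs / 60.0
    return len(peaks) / duration_min if duration_min > 0 else 0.0

# Synthetic 10 s walk at ~2 steps/s (about 120 steps per minute).
t = np.arange(0, 10, 1 / 50.0)
signal = np.sin(2 * np.pi * 2.0 * t) + 0.1 * np.random.randn(t.size)
bpm = cadence_to_bpm(signal)
print(f"estimated cadence: {bpm:.0f} steps/min -> adapt playback tempo")
```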
eess.AS
[618] Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion
Ahmed Adel Attia, Jing Liu, Carol Espy Wilson
Main category: eess.AS
TL;DR: This paper proposes using articulatory features as both an auxiliary task and pseudo-input in deep learning ASR models, showing consistent improvements especially in low-resource scenarios.
Details
Motivation: Previous work on articulatory features for ASR was limited to shallow acoustic models, but their potential benefits in modern deep learning architectures haven't been fully explored. Method: Uses speech inversion as an auxiliary prediction task and injects predicted articulatory features as a query stream in cross-attention modules with acoustic embeddings as keys/values.
Result: Experiments on LibriSpeech show consistent improvements over strong transformer-based baselines, particularly under low-resource conditions.
Conclusion: Articulatory features, previously sidelined in ASR research, can provide meaningful benefits when reintroduced with modern deep learning architectures.
Abstract: Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
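The fusion step described above, articulatory features as the query stream attending over acoustic embeddings as keys and values, can be sketched directly with a standard multi-head attention layer. The dimensions (12 articulatory dims, 256-d acoustic embeddings) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ArticulatoryFusion(nn.Module):
    """Cross-attention with predicted articulatory features as the query
    stream and acoustic embeddings as keys/values, as described above.
    All dimensions are illustrative."""
    def __init__(self, d_model=256, n_heads=4, d_artic=12):
        super().__init__()
        self.proj = nn.Linear(d_artic, d_model)      # lift articulatory dims
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, artic, acoustic):
        q = self.proj(artic)                         # (B, T, d_model)
        fused, _ = self.attn(q, acoustic, acoustic)  # acoustic as K and V
        return fused

B, T = 2, 100
fusion = ArticulatoryFusion()
out = fusion(torch.randn(B, T, 12), torch.randn(B, T, 256))
print(out.shape)                                     # torch.Size([2, 100, 256])
```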
[619] Dynamic Stress Detection: A Study of Temporal Progression Modelling of Stress in Speech
Vishakha Lall, Yisi Liu
Main category: eess.AS
TL;DR: The paper proposes modeling stress as a temporally evolving phenomenon in speech, using dynamic labeling from emotional states and cross-attention sequential models to achieve significant accuracy improvements over existing baselines.
Details
Motivation: Prior work treats stress as a static label, but the authors argue stress should be modeled as a dynamic phenomenon influenced by historical emotional state, which is critical for high-pressure settings. Method: Proposed dynamic labeling strategy that derives fine-grained stress annotations from emotional labels, and introduced cross-attention-based sequential models (Unidirectional LSTM and Transformer Encoder) to capture temporal stress progression.
Result: Achieved notable accuracy gains: +5% on MuSE and +18% on StressID datasets over existing baselines, with good generalization to a custom real-world dataset.
Conclusion: Modeling stress as a dynamic construct in speech provides significant value and improved detection performance compared to static approaches.
Abstract: Detecting psychological stress from speech is critical in high-pressure settings. While prior work has leveraged acoustic features for stress detection, most treat stress as a static label. In this work, we model stress as a temporally evolving phenomenon influenced by historical emotional state. We propose a dynamic labelling strategy that derives fine-grained stress annotations from emotional labels and introduce cross-attention-based sequential models, a Unidirectional LSTM and a Transformer Encoder, to capture temporal stress progression. Our approach achieves notable accuracy gains on MuSE (+5%) and StressID (+18%) over existing baselines, and generalises well to a custom real-world dataset. These results highlight the value of modelling stress as a dynamic construct in speech.
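One way to read the dynamic labelling idea is as a recurrence in which the current stress score blends the present emotional state with accumulated history. The emotion-to-stress mapping and the exponential moving average below are our own stand-ins for the paper's annotation scheme, shown only to make the concept concrete.

```python
import numpy as np

# Hypothetical emotion-to-stress scores; the paper's actual mapping from
# emotional labels to stress annotations may differ.
EMOTION_STRESS = {"neutral": 0.0, "happy": 0.1, "frustrated": 0.7, "angry": 0.9}

def dynamic_stress_labels(emotions, history_weight=0.6):
    """Turn a sequence of segment-level emotion labels into fine-grained
    stress scores, letting past state influence the present (an exponential
    moving average stands in for 'historical emotional state')."""
    stress, prev = [], 0.0
    for e in emotions:
        s = history_weight * prev + (1 - history_weight) * EMOTION_STRESS[e]
        stress.append(s)
        prev = s
    return np.array(stress)

print(dynamic_stress_labels(["neutral", "frustrated", "angry", "neutral"]))
```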
[620] BaldWhisper: Faster Whisper with Head Shearing and Layer Merging
Yaya Sy, Christophe Cerisara, Irina Illina
Main category: eess.AS
TL;DR: A new pruning method for Whisper models in low-resource languages (Bambara) that uses embedding compression and layer merging instead of vocabulary pruning, achieving 48% size reduction and 2.15x speedup while preserving 90% performance with only 32h of data.
Details
Motivation: Existing pruning methods like Distill-Whisper require massive retraining data (21,000 hours), which is unavailable for most low-resource languages. There's a need for efficient pruning techniques that work with limited data for edge device deployment. Method: Proposes embedding compression via low-rank decomposition and feature distillation, plus layer merging instead of layer removal. Avoids vocabulary pruning due to frequent code-switching in Bambara.
Result: Final model is 48% smaller, 2.15x faster on MacBook Air M1, while preserving 90% of original Whisper performance using only 32h of speech data.
Conclusion: Effective pruning for low-resource languages is possible without massive data by using embedding compression and layer merging strategies.
Abstract: Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
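The embedding-compression half of the recipe can be sketched with a truncated SVD that replaces one large table with two thin factors. The rank below, and how feature distillation recovers the lost accuracy, are illustrative assumptions rather than the paper's settings.

```python
import torch

def factorize_embedding(weight, rank):
    """Low-rank factorization of an embedding matrix via truncated SVD.

    Replaces a (V, d) table with (V, r) @ (r, d). The rank used for Bambara
    and the feature-distillation step are details not reproduced here.
    """
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (V, r)
    B = Vh[:rank, :]                    # (r, d)
    return A, B

V, d, r = 1000, 384, 64
W = torch.randn(V, d)
A, B = factorize_embedding(W, r)
approx = A @ B
print(f"params: {V * d} -> {V * r + r * d}, "
      f"rel. error {torch.norm(W - approx) / torch.norm(W):.3f}")
```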
[621] Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization
Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang
Main category: eess.AS
TL;DR: VAPO is a novel post-training method that improves ASR for academic lectures by enforcing structured visual reasoning through reinforcement learning, significantly enhancing domain-specific term recognition.
Details
Motivation: ASR systems struggle with domain-specific terminology in academic lectures, and existing methods either use complex pipelines or degenerate into simple OCR systems when using omni-modal LLMs. Method: Proposed Visually-Anchored Policy Optimization (VAPO), which enforces a structured “Look before Transcription” procedure.
Result: VAPO significantly improves recognition of domain-specific terms and establishes an effective end-to-end paradigm for SlideASR, as demonstrated through extensive experiments on the new SlideASR-Bench benchmark.
Conclusion: VAPO provides an effective solution for improving ASR in specialized domains by controlling model reasoning through structured visual anchoring, offering a promising approach for domain-specific speech recognition tasks.
Abstract: Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model’s reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured “Look before Transcription” procedure using a
[622] Impact of HRTF individualisation and head movements in a real/virtual localisation task
Vincent Martin, Lorenzo Picinali
Main category: eess.AS
TL;DR: Individualized HRTFs improve perceived realism but not localization in static scenarios, while the opposite occurs when head movements are allowed.
Details
Motivation: To examine how individualized HRTFs affect localization accuracy and perceived realism of virtual sound sources aligned with real visual objects in AAR applications. Method: Participants localized virtual and real speech sources using headphones with individualized/non-individualized HRTFs and a spherical loudspeaker array. Tests included static and head movement scenarios with 30 visual sources.
Result: Individual HRTFs improved perceived realism in static conditions but not localization. With head movements, the opposite pattern was observed.
Conclusion: Head movement capability significantly impacts how individualized HRTFs affect localization and realism perception in AAR applications.
Abstract: The objective of Audio Augmented Reality (AAR) applications is to seamlessly integrate virtual sound sources within a real environment. It is critical for these applications that virtual sources are localised precisely at the intended position, and that the acoustic environments are accurately matched. One effective method for spatialising sound on headphones is through Head-Related Transfer Functions (HRTFs). These characterise how the physical features of a listener modify sound waves before they reach the eardrum. This study examines the influence of using individualised HRTFs on the localisation and the perceived realism of virtual sound sources associated with a real visual object. Participants were tasked with localising virtual and real speech sources presented via headphones and through a spherical loudspeaker array, respectively. The assessment focussed on perceived realism and source location. All sources were associated with one of thirty real visual sources (loudspeakers) arranged in a semi-anechoic room. Various sound source renderings were compared, including single loudspeaker rendering and binaural rendering with individualised or non-individualised HRTFs. Additionally, the impact of head movements was explored: ten participants completed the same task with and without the possibility to move their head. The results showed that using individual HRTFs improved perceived realism but not localisation performance in the static scenario. Surprisingly, the opposite was observed when head movements were possible and encouraged.
[623] Unsupervised lexicon learning from speech is limited by representations rather than clustering
Danel Adendorff, Simon Malan, Herman Kamper
Main category: eess.AS
TL;DR: This paper investigates whether performance limitations in zero-resource word segmentation and clustering come from word segment representations or clustering methods, finding that representation variability is the primary bottleneck.
Details
Motivation: To understand why current zero-resource word segmentation and clustering systems produce imperfect lexicons, and to determine whether the main limitation lies in how word segments are represented or how they are clustered. Method: Combined various self-supervised speech features (continuous/discrete, frame/word-level) with different clustering methods (K-means, hierarchical, graph-based) on English and Mandarin data, using controlled experiments that isolate either representations or clustering.
Result: The best system uses graph clustering with dynamic time warping on continuous features. Faster alternatives use graph clustering with cosine distance on averaged continuous features or edit distance on discrete unit sequences.
Conclusion: Representation variability across segments of the same word type – rather than clustering methods – is the primary factor limiting performance in zero-resource word segmentation and clustering systems.
Abstract: Zero-resource word segmentation and clustering systems aim to tokenise speech into word-like units without access to text labels. Despite progress, the induced lexicons are still far from perfect. In an idealised setting with gold word boundaries, we ask whether performance is limited by the representation of word segments, or by the clustering methods that group them into word-like types. We combine a range of self-supervised speech features (continuous/discrete, frame/word-level) with different clustering methods (K-means, hierarchical, graph-based) on English and Mandarin data. The best system uses graph clustering with dynamic time warping on continuous features. Faster alternatives use graph clustering with cosine distance on averaged continuous features or edit distance on discrete unit sequences. Through controlled experiments that isolate either the representations or the clustering method, we demonstrate that representation variability across segments of the same word type – rather than clustering – is the primary factor limiting performance.
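The "faster alternative" the authors mention, graph clustering with cosine distance on averaged continuous features, can be approximated by thresholding pairwise cosine similarities and taking connected components. The threshold and this particular graph construction are illustrative assumptions, not necessarily the paper's.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_cluster(segment_feats, sim_threshold=0.85):
    """Cluster word segments by thresholding pairwise cosine similarity and
    taking connected components: a stand-in for the graph clustering used in
    the paper (whose exact construction and threshold may differ)."""
    X = np.asarray(segment_feats, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    sim = X @ X.T                                  # cosine similarity matrix
    adj = csr_matrix(sim >= sim_threshold)
    n_types, labels = connected_components(adj, directed=False)
    return n_types, labels

rng = np.random.default_rng(0)
a, b = rng.normal(size=16), rng.normal(size=16)
feats = [a + 0.01 * rng.normal(size=16) for _ in range(3)] + \
        [b + 0.01 * rng.normal(size=16) for _ in range(2)]
print(graph_cluster(feats))                        # expect 2 word types
```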
[624] Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality – an experimental evaluation
Michele Buccoli, Yu Du, Jacob Soendergaard, Simone Shawn Cazzaniga
Main category: eess.AS
TL;DR: This paper experimentally studies how microphone bandwidth and frequency response affect speech quality and ASR performance in automotive applications, using real vehicle noise recordings and standard metrics.
Details
Motivation: There's no consensus on how microphone characteristics affect performance in automotive applications, and achieving preferred bandwidth is challenging due to placement constraints and environmental requirements. Method: Used noise signals recorded in real vehicles under various driving conditions to study the relationship between microphone characteristics and audio quality/ASR performance, focusing on bandwidth and frequency response variations.
Result: Findings provide knowledge about which microphone frequency response characteristics are more relevant for audio quality, helping inform proper microphone specification choices for automotive applications.
Conclusion: The study establishes relationships between microphone characteristics and performance metrics, offering guidance for selecting appropriate microphone specifications in automotive hands-free communication and ASR systems.
Abstract: Upon choosing microphones for automotive hands-free communication or Automatic Speech Recognition (ASR) applications, OEMs typically specify wideband, super wideband or even fullband requirements following established standard recommendations (e.g., ITU-P.1110, ITU-P.1120). In practice, it is often challenging to achieve the preferred bandwidth for an automotive microphone when considering limitations and constraints on microphone placement inside the cabin, and the automotive grade environmental robustness requirements. On the other hand, there seems to be no consensus or sufficient data on the effect of each microphone characteristic on the actual performance. As an attempt to answer this question, we used noise signals recorded in real vehicles and under various driving conditions to experimentally study the relationship between the microphones’ characteristics and the final audio quality of speech communication and performance of ASR engines. We focus on how variations in microphone bandwidth and amplitude frequency response shapes affect the perceptual speech quality. The speech quality results are compared by using ETSI TS 103 281 metrics (S-MOS, N-MOS, G-MOS) and ancillary metrics such as SNR. The ASR results are evaluated with standard metrics such as Word Error Rate (WER). Findings from this study provide knowledge in the understanding of what microphone frequency response characteristics are more relevant for audio quality and choice of proper microphone specifications, particularly for automotive applications.
[625] Target speaker anonymization in multi-speaker recordings
Natalia Tomashenko, Junichi Yamagishi, Xin Wang, Yun Liu, Emmanuel Vincent
Main category: eess.AS
TL;DR: This paper addresses speaker anonymization in multi-speaker conversational audio, focusing on anonymizing only one target speaker (like customers in call centers) rather than all speakers, and proposes improved evaluation methods for this specific scenario.
Details
Motivation: Existing speaker anonymization research focuses on single-speaker audio, but real-world applications like call centers require anonymizing only specific speakers in multi-speaker conversations. Current methods and evaluation metrics are inadequate for this targeted anonymization scenario. Method: The study explores effective strategies for targeted speaker anonymization in conversational audio, identifies problems in developing such methods, and proposes corresponding improved evaluation methodologies.
Result: The work highlights the limitations of conventional anonymization methods for multi-speaker scenarios and demonstrates the need for specialized approaches when only a single target speaker needs anonymization.
Conclusion: There is a critical need for developing targeted speaker anonymization techniques and improved evaluation frameworks specifically designed for multi-speaker conversational audio scenarios where privacy protection must be applied selectively to individual speakers.
Abstract: Most of the existing speaker anonymization research has focused on single-speaker audio, leading to the development of techniques and evaluation metrics optimized for such condition. This study addresses the significant challenge of speaker anonymization within multi-speaker conversational audio, specifically when only a single target speaker needs to be anonymized. This scenario is highly relevant in contexts like call centers, where customer privacy necessitates anonymizing only the customer’s voice in interactions with operators. Conventional anonymization methods are often not suitable for this task. Moreover, current evaluation methodology does not allow us to accurately assess privacy protection and utility in this complex multi-speaker scenario. This work aims to bridge these gaps by exploring effective strategies for targeted speaker anonymization in conversational audio, highlighting potential problems in their development and proposing corresponding improved evaluation methodologies.
[626] A Study of the Removability of Speaker-Adversarial Perturbations
Liping Chen, Chenyang Guo, Kong Aik Lee, Zhen-Hua Ling, Wu Guo
Main category: eess.AS
TL;DR: This paper investigates the removability of speaker-adversarial perturbations under different awareness scenarios, finding that full removal is only possible when fully informed about the perturbation generator.
Details
Motivation: Current defense techniques only reduce the effects of speaker-adversarial attacks but don't fully remove perturbations and restore original speech. The paper aims to study whether these perturbations can actually be eliminated. Method: The study examines three scenarios with varying awareness of the perturbation generator (ignorant, semi-informed, well-informed) and considers both optimization-based and feedforward perturbation generation methods on the LibriSpeech dataset.
Result: 1) In ignorant scenario: perturbations cannot be eliminated but their impact is reduced. 2) In semi-informed scenario: optimization-based perturbations cannot be fully removed, but feedforward ones can be considerably reduced. 3) In well-informed scenario: perturbations are nearly eliminated and original speech can be restored.
Conclusion: Speaker-adversarial perturbations can only be fully removed when the defender has complete knowledge about the perturbation generator, highlighting the importance of awareness in effective defense strategies.
Abstract: Recent advancements in adversarial attacks have demonstrated their effectiveness in misleading speaker recognition models, making wrong predictions about speaker identities. On the other hand, defense techniques against speaker-adversarial attacks focus on reducing the effects of speaker-adversarial perturbations on speaker attribute extraction. These techniques do not seek to fully remove the perturbations and restore the original speech. To this end, this paper studies the removability of speaker-adversarial perturbations. Specifically, the investigation is conducted assuming various degrees of awareness of the perturbation generator across three scenarios: ignorant, semi-informed, and well-informed. Besides, we consider both the optimization-based and feedforward perturbation generation methods. Experiments conducted on the LibriSpeech dataset demonstrated that: 1) in the ignorant scenario, speaker-adversarial perturbations cannot be eliminated, although their impact on speaker attribute extraction is reduced, 2) in the semi-informed scenario, the speaker-adversarial perturbations cannot be fully removed, while those generated by the feedforward model can be considerably reduced, and 3) in the well-informed scenario, speaker-adversarial perturbations are nearly eliminated, allowing for the restoration of the original speech. Audio samples can be found in https://voiceprivacy.github.io/Perturbation-Generation-Removal/.
[627] Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings
Li Li, Ming Cheng, Hongyu Zhang, Juan Liu, Ming Li
Main category: eess.AS
TL;DR: SA-S2SND integrates DOA cues into sequence-to-sequence neural diarization using a two-stage training strategy and simulated DOA generation, achieving significant DER improvements on AliMeeting dataset.
Details
Motivation: To enhance speaker diarization performance by incorporating spatial information (DOA cues) with sequence-to-sequence neural networks, addressing limitations of single-channel approaches. Method: Two-stage training: first with single-channel audio and DOA features, then multi-channel inputs under DOA guidance. Uses simulated DOA generation to reduce dependency on multi-channel corpora.
Result: 7.4% relative DER reduction in offline mode and over 19% improvement with channel attention on AliMeeting dataset. Shows spatial cues complement cross-channel modeling effectively.
Conclusion: Spatial cues are highly complementary to cross-channel modeling, enabling good performance in both online and offline diarization settings.
Abstract: This paper proposes a Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) framework, which integrates direction-of-arrival (DOA) cues estimated by SRP-DNN into the S2SND backbone. A two-stage training strategy is adopted: the model is first trained with single-channel audio and DOA features, and then further optimized with multi-channel inputs under DOA guidance. In addition, a simulated DOA generation scheme is introduced to alleviate dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in the offline mode and over 19% improvement when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding good performance in both online and offline settings.
eess.IV
[628] Interlaced dynamic XCT reconstruction with spatio-temporal implicit neural representations
Mathias Boulanger, Ericmoore Jossou
Main category: eess.IV
TL;DR: Spatio-temporal Implicit Neural Representations (INRs) combined with ADMM optimization and INCODE conditioning framework for dynamic X-ray CT reconstruction under interlaced acquisition, outperforming state-of-the-art methods and demonstrating robustness to noise.
Details
Motivation: To develop an efficient reconstruction method for dynamic X-ray computed tomography under interlaced acquisition schemes that can handle undersampling, spatial complexity, and noise while providing practical extensions for real-world applications. Method: Combines ADMM-based optimization with INCODE conditioning framework, uses spatio-temporal Implicit Neural Representations (INRs), incorporates weighted least squares data fidelity for noise modeling, and enables modular extensions for detector non-idealities and ring artifact correction.
Result: Outperforms Time-Interlaced Model-Based Iterative Reconstruction (TIMBIR) across all tested scenarios. Shows strong robustness to moderate noise levels and significant performance improvement in challenging noise regimes. Successfully demonstrates 4D volumetric reconstruction through joint optimization of batched axial slices.
Conclusion: The proposed INR-based approach provides an effective and modular framework for dynamic XCT reconstruction, offering superior performance over state-of-the-art methods, good noise robustness, and practical extensions for real-world applications including massive parallelization capabilities.
Abstract: In this work, we investigate the use of spatio-temporal Implicit Neural Representations (INRs) for dynamic X-ray computed tomography (XCT) reconstruction under interlaced acquisition schemes. The proposed approach combines ADMM-based optimization with INCODE, a conditioning framework incorporating prior knowledge, to enable efficient convergence. We evaluate our method under diverse acquisition scenarios, varying the severity of global undersampling, spatial complexity (quantified via spatial information), and noise levels. Across all settings, our model achieves strong performance and outperforms Time-Interlaced Model-Based Iterative Reconstruction (TIMBIR), a state-of-the-art model-based iterative method. In particular, we show that the inductive bias of the INR provides good robustness to moderate noise levels, and that introducing explicit noise modeling through a weighted least squares data fidelity term significantly improves performance in more challenging regimes. The final part of this work explores extensions toward a practical reconstruction framework. We demonstrate the modularity of our approach by explicitly modeling detector non-idealities, incorporating ring artifact correction directly within the reconstruction process. Additionally, we present a proof-of-concept 4D volumetric reconstruction by jointly optimizing over batched axial slices, an approach which opens up the possibilities for massive parallelization, a critical feature for processing large-scale datasets.
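At its core, the reconstruction rests on a coordinate network that maps a spatio-temporal location to an attenuation value. The sketch below shows such an INR with a simple Fourier feature encoding; INCODE conditioning, the ADMM coupling to projection data, and the weighted least squares fidelity term are all omitted, and the architecture choices are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalINR(nn.Module):
    """A coordinate MLP mapping (x, y, t) to attenuation, the basic INR
    ingredient of the pipeline above. INCODE conditioning and the ADMM
    coupling to projections are omitted; sizes are illustrative."""
    def __init__(self, hidden=128, n_freqs=6):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 * 2 * n_freqs                   # Fourier-encoded (x, y, t)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def encode(self, coords):
        freqs = 2.0 ** torch.arange(self.n_freqs, device=coords.device)
        angles = coords.unsqueeze(-1) * freqs      # (N, 3, n_freqs)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return enc.flatten(start_dim=1)            # (N, 3 * 2 * n_freqs)

    def forward(self, coords):                     # coords: (N, 3) in [-1, 1]
        return self.net(self.encode(coords))

model = SpatioTemporalINR()
xyt = torch.rand(4096, 3) * 2 - 1
print(model(xyt).shape)                            # torch.Size([4096, 1])
```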
[629] Progressive Uncertainty-Guided Evidential U-KAN for Trustworthy Medical Image Segmentation
Zhen Yang, Yansong Ma, Lei Chen
Main category: eess.IV
TL;DR: The paper proposes Evidential U-KAN, a trustworthy medical image segmentation method that combines progressive evidence uncertainty guided attention (PEUA) and semantic-preserving evidence learning (SAEL) to improve accuracy and reliability in clinical decision-making.
Details
Motivation: Existing evidence deep learning (EDL) methods for medical image segmentation often neglect using uncertainty maps to refine ambiguous boundary segmentation and suppress evidence of incorrect classes indiscriminately, which impairs uncertainty assessment in ambiguous areas. Method: Proposes PEUA mechanism that progressively refines attention using uncertainty maps with low-rank learning to denoise attention weights, and SAEL strategy that includes semantic-smooth evidence generator and fidelity-enhancing regularization to retain critical semantics.
Result: Extensive experiments on 4 datasets demonstrate superior accuracy and reliability over competing methods.
Conclusion: Evidential U-KAN provides a novel solution for trustworthy medical image segmentation by effectively leveraging uncertainty guidance and preserving semantic information.
Abstract: Trustworthy medical image segmentation aims to deliver accurate and reliable results for clinical decision-making. Most existing methods adopt the evidence deep learning (EDL) paradigm due to its computational efficiency and theoretical robustness. However, the EDL-based methods often neglect leveraging uncertainty maps rich in attention cues to refine ambiguous boundary segmentation. To address this, we propose a progressive evidence uncertainty guided attention (PEUA) mechanism to guide the model to focus on the feature representation learning of hard regions. Unlike conventional approaches, PEUA progressively refines attention using uncertainty maps while employing low-rank learning to denoise attention weights, enhancing feature learning for challenging regions. Concurrently, standard EDL methods suppress evidence of incorrect classes indiscriminately via Kullback-Leibler (KL) regularization, impairing the uncertainty assessment in ambiguous areas and consequently distorting the corresponding attention guidance. We thus introduce a semantic-preserving evidence learning (SAEL) strategy, integrating a semantic-smooth evidence generator and a fidelity-enhancing regularization term to retain critical semantics. Finally, by embedding PEUA and SAEL with the state-of-the-art U-KAN, we propose Evidential U-KAN, a novel solution for trustworthy medical image segmentation. Extensive experiments on 4 datasets demonstrate superior accuracy and reliability over the competing methods. The code is available at https://anonymous.4open.science/r/Evidence-U-KAN-BBE8.
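For readers unfamiliar with the EDL quantities the paper builds on, the standard formulation maps network outputs to Dirichlet evidence, from which per-pixel belief and uncertainty follow in closed form. The sketch below shows only this generic head; PEUA and SAEL themselves are not reproduced.

```python
import torch

def evidential_outputs(logits):
    """Per-pixel belief and uncertainty under the standard EDL formulation:
    evidence e = softplus(logits), Dirichlet alpha = e + 1, Dirichlet
    strength S = sum_k alpha_k, belief b_k = e_k / S, uncertainty u = K / S.
    This is the generic EDL head, not the paper's PEUA/SAEL modules."""
    evidence = torch.nn.functional.softplus(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=1, keepdim=True)      # S
    belief = evidence / strength                   # b_k
    uncertainty = logits.shape[1] / strength       # u in (0, 1]
    return belief, uncertainty

logits = torch.randn(1, 4, 8, 8)                   # (B, K classes, H, W)
belief, u = evidential_outputs(logits)
print(belief.shape, u.shape)                       # (1, 4, 8, 8) (1, 1, 8, 8)
```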
[630] FS-RWKV: Leveraging Frequency Spatial-Aware RWKV for 3T-to-7T MRI Translation
Yingtie Lei, Zimeng Li, Chi-Man Pun, Yupeng Liu, Xuhang Chen
Main category: eess.IV
TL;DR: FS-RWKV is a novel framework for synthesizing 7T-quality MRI images from 3T scans using RWKV architecture with frequency-spatial processing modules to overcome computational limitations of existing methods while preserving anatomical details.
Details
Motivation: 7T MRI provides superior image quality but has limited clinical adoption due to high costs and technical demands. There's a need for computational methods to synthesize 7T-quality images from more accessible 3T scans, but existing CNN and Transformer approaches have limitations in spatial coverage and computational efficiency. Method: Proposed FS-RWKV framework with two key modules: (1) FSO-Shift that performs discrete wavelet decomposition and omnidirectional spatial shifting on low-frequency components to enhance global context while preserving high-frequency details, and (2) SFEB that adaptively reinforces anatomical structure through frequency-aware feature fusion.
Result: FS-RWKV consistently outperforms CNN-, Transformer-, GAN-, and RWKV-based baselines on UNC and BNU datasets across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
Conclusion: The FS-RWKV framework provides an efficient and effective solution for 3T-to-7T MRI translation, addressing computational limitations while maintaining high-quality anatomical detail preservation and global tissue contrast recovery.
Abstract: Ultra-high-field 7T MRI offers enhanced spatial resolution and tissue contrast that enables the detection of subtle pathological changes in neurological disorders. However, the limited availability of 7T scanners restricts widespread clinical adoption due to substantial infrastructure costs and technical demands. Computational approaches for synthesizing 7T-quality images from accessible 3T acquisitions present a viable solution to this accessibility challenge. Existing CNN approaches suffer from limited spatial coverage, while Transformer models demand excessive computational overhead. RWKV architectures offer an efficient alternative for global feature modeling in medical image synthesis, combining linear computational complexity with strong long-range dependency capture. Building on this foundation, we propose Frequency Spatial-RWKV (FS-RWKV), an RWKV-based framework for 3T-to-7T MRI translation. To better address the challenges of anatomical detail preservation and global tissue contrast recovery, FS-RWKV incorporates two key modules: (1) Frequency-Spatial Omnidirectional Shift (FSO-Shift), which performs discrete wavelet decomposition followed by omnidirectional spatial shifting on the low-frequency branch to enhance global contextual representation while preserving high-frequency anatomical details; and (2) Structural Fidelity Enhancement Block (SFEB), a module that adaptively reinforces anatomical structure through frequency-aware feature fusion. Comprehensive experiments on UNC and BNU datasets demonstrate that FS-RWKV consistently outperforms existing CNN-, Transformer-, GAN-, and RWKV-based baselines across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
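The first step of FSO-Shift, separating a slice into a low-frequency band carrying global contrast and high-frequency bands carrying anatomical detail, corresponds to a one-level 2D discrete wavelet transform, sketched below with PyWavelets. The wavelet choice ("haar") is an assumption; the omnidirectional shift itself is not shown.

```python
import numpy as np
import pywt

def frequency_split(img):
    """One-level 2D discrete wavelet decomposition, separating the
    low-frequency band (global contrast) from the three high-frequency
    bands (anatomical detail), i.e., the split FSO-Shift operates on."""
    low, (lh, hl, hh) = pywt.dwt2(img, "haar")
    return low, (lh, hl, hh)

img = np.random.rand(256, 256).astype(np.float32)
low, highs = frequency_split(img)
print(low.shape, [h.shape for h in highs])
# (128, 128) [(128, 128), (128, 128), (128, 128)]
```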
[631] SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation
Yeqing Yang, Le Xu, Lixia Tian
Main category: eess.IV
TL;DR: SAM2-3dMed adapts SAM2 for 3D medical imaging by addressing domain gaps through bidirectional slice dependency modeling and boundary enhancement, achieving state-of-the-art segmentation performance.
Details
Motivation: Direct application of SAM2 to 3D medical images faces domain gaps: bidirectional anatomical continuity vs unidirectional video flow, and insufficient boundary precision for medical analysis. Method: Proposes SAM2-3dMed with two innovations: Slice Relative Position Prediction (SRPP) module for bidirectional inter-slice dependencies, and Boundary Detection (BD) module for enhanced boundary accuracy.
Result: Extensive experiments on MSD datasets (Lung, Spleen, Pancreas) show SAM2-3dMed significantly outperforms state-of-the-art methods in segmentation overlap and boundary precision.
Conclusion: The approach advances 3D medical image segmentation and provides a general paradigm for adapting video-centric foundation models to spatial volumetric data.
Abstract: Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
[632] Rewiring Development in Brain Segmentation: Leveraging Adult Brain Priors for Enhancing Infant MRI Segmentation
Alemu Sisay Nigru, Michele Svanera, Austin Dibble, Connor Dalby, Mattia Savardi, Sergio Benini
Main category: eess.IV
TL;DR: LODi is a framework that transfers knowledge from adult brain MRI segmentation models to improve infant brain MRI segmentation, using transfer learning and domain adaptation with weakly supervised learning on infant scans.
Details
Motivation: Infant brain MRI segmentation is challenging due to evolving anatomy, motion artifacts, and limited labeled data, while adult brain MRI data is abundant. Method: Pre-train on adult brain MRI data, then adapt to infants using transfer learning, domain adaptation, and weakly supervised learning with FreeSurfer silver-standard labels, incorporating hierarchical feature refinement and multi-level consistency constraints.
Result: Superior performance over traditional supervised learning and domain-specific models on internal and external datasets, enabling fast, accurate, age-adaptive segmentation while reducing scanner/site biases.
Conclusion: Leveraging adult brain priors provides a foundation for age-flexible neuroimaging analysis, enabling more reliable and generalizable brain MRI segmentation across the lifespan.
Abstract: Accurate segmentation of infant brain MRI is critical for studying early neurodevelopment and diagnosing neurological disorders. Yet, it remains a fundamental challenge due to the continuously evolving anatomy of the subjects, motion artifacts, and the scarcity of high-quality labeled data. In this work, we present LODi, a novel framework that utilizes prior knowledge from an adult brain MRI segmentation model to enhance the segmentation performance of infant scans. Given the abundance of publicly available adult brain MRI data, we pre-train a segmentation model on a large adult dataset as a starting point. Through transfer learning and domain adaptation strategies, we progressively adapt the model to the 0-2 year-old population, enabling it to account for the anatomical and imaging variability typical of infant scans. The adaptation of the adult model is carried out using weakly supervised learning on infant brain scans, leveraging silver-standard ground truth labels obtained with FreeSurfer. By introducing a novel training strategy that integrates hierarchical feature refinement and multi-level consistency constraints, our method enables fast, accurate, age-adaptive segmentation, while mitigating scanner and site-specific biases. Extensive experiments on both internal and external datasets demonstrate the superiority of our approach over traditional supervised learning and domain-specific models. Our findings highlight the advantage of leveraging adult brain priors as a foundation for age-flexible neuroimaging analysis, paving the way for more reliable and generalizable brain MRI segmentation across the lifespan.
[633] MIP-Based Tumor Segmentation: A Radiologist-Inspired Approach
Romario Zarik, Nahum Kiryati, Michael Green, Liran Domachevsky, Arnaldo Mayer
Main category: eess.IV
TL;DR: Proposes training segmentation models directly on Multi-Angle Maximum Intensity Projections (MIPs) instead of 3D volumes, achieving comparable performance with significant computational efficiency gains and introducing occlusion correction for improved segmentation.
Details
Motivation: Current PET/CT tumor segmentation relies on 3D volumetric data, which is computationally intensive and time-consuming, while radiologists' clinical workflow already uses MIPs for initial assessment but they are underutilized in automated methods. Method: Trains segmentation models directly on MIPs rather than 3D volumes, introduces occlusion correction to restore annotations hidden by high-intensity structures, and evaluates optimal MIP count for segmentation.
Result: MIP-based approach achieves comparable segmentation performance to 3D (≤1% Dice difference, 26.7% better Hausdorff Distance) with 55.8-75.8% reduction in training time, 71.7-76% reduction in energy per epoch, and two orders of magnitude reduction in TFLOPs. For classification, outperforms 3D with 10x faster training and 93.35% energy reduction.
Conclusion: MIP-based segmentation provides a scalable clinical solution that aligns with radiologists’ workflow, offering near-identical performance to 3D methods with dramatically improved computational efficiency, with 48 MIP views identified as optimal balance.
Abstract: PET/CT imaging is the gold standard for tumor detection, offering high accuracy in identifying local and metastatic lesions. Radiologists often begin assessment with rotational Multi-Angle Maximum Intensity Projections (MIPs) from PET, confirming findings with volumetric slices. This workflow is time-consuming, especially in metastatic cases. Despite their clinical utility, MIPs are underutilized in automated tumor segmentation, where 3D volumetric data remains the norm. We propose an alternative approach that trains segmentation models directly on MIPs, bypassing the need to segment 3D volumes and then project. This better aligns the model with its target domain and yields substantial gains in computational efficiency and training time. We also introduce a novel occlusion correction method that restores MIP annotations occluded by high-intensity structures, improving segmentation. Using the autoPET 2022 Grand Challenge dataset, we evaluate our method against standard 3D pipelines in terms of performance and training/computation efficiency for segmentation and classification, and analyze how MIP count affects segmentation. Our MIP-based approach achieves segmentation performance on par with 3D (<=1% Dice difference, 26.7% better Hausdorff Distance), while reducing training time (convergence time) by 55.8-75.8%, energy per epoch by 71.7-76%, and TFLOPs by two orders of magnitude, highlighting its scalability for clinical use. For classification, using 16 MIPs only as input, we surpass 3D performance while reducing training time by over 10x and energy consumption per epoch by 93.35%. Our analysis of the impact of MIP count on segmentation identified 48 views as optimal, offering the best trade-off between performance and efficiency.
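Generating the rotational MIPs that this pipeline trains on amounts to rotating the volume about the axial axis and max-projecting at each angle, as sketched below with SciPy. The axis conventions and interpolation order are assumptions; 48 views is the optimum the authors report.

```python
import numpy as np
from scipy.ndimage import rotate

def multi_angle_mips(volume, n_views=48):
    """Generate rotational MIPs from a 3D volume (z, y, x) by rotating about
    the axial (z) axis and taking the maximum along one in-plane axis.
    48 views is the optimum reported above; axis layout is an assumption."""
    angles = np.linspace(0.0, 360.0, n_views, endpoint=False)
    mips = []
    for angle in angles:
        rot = rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
        mips.append(rot.max(axis=2))               # project along x
    return np.stack(mips)                          # (n_views, z, y)

vol = np.random.rand(32, 64, 64).astype(np.float32)
print(multi_angle_mips(vol, n_views=4).shape)      # (4, 32, 64)
```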
[634] A Biophysically-Conditioned Generative Framework for 3D Brain Tumor MRI Synthesis
Valentin Biller, Lucas Zimmer, Can Erdur, Sandeep Nagar, Daniel Rückert, Niklas Bubeck, Jonas Weidner
Main category: eess.IV
TL;DR: A generative model using latent diffusion conditioned on tumor concentrations and tissue segmentations for brain tumor MRI synthesis and healthy tissue restoration.
Details
Motivation: To support clinical and research applications by enabling high-fidelity brain tumor MRI inpainting and healthy tissue restoration. Method: Latent diffusion model conditioned on both tissue segmentations and tumor concentrations to generate 3D spatially coherent and anatomically consistent images.
Result: Achieved PSNR of 18.5 for healthy inpainting and 17.4 for tumor inpainting.
Conclusion: The model successfully generates high-fidelity brain tumor MRIs and performs healthy tissue restoration for the BraTS 2025 Inpainting Challenge.
Abstract: Magnetic resonance imaging (MRI) inpainting supports numerous clinical and research applications. We introduce the first generative model that conditions on voxel-level, continuous tumor concentrations to synthesize high-fidelity brain tumor MRIs. For the BraTS 2025 Inpainting Challenge, we adapt this architecture to the complementary task of healthy tissue restoration by setting the tumor concentrations to zero. Our latent diffusion model conditioned on both tissue segmentations and the tumor concentrations generates 3D spatially coherent and anatomically consistent images for both tumor synthesis and healthy tissue inpainting. For healthy inpainting, we achieve a PSNR of 18.5, and for tumor inpainting, we achieve 17.4. Our code is available at: https://github.com/valentin-biller/ldm.git
[635] Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration
Darshan Thaker, Abhishek Goyal, René Vidal
Main category: eess.IV
TL;DR: The paper analyzes approximation errors in diffusion-based image restoration methods and proposes a frequency-domain filtering approach with adaptive curriculum to improve performance on challenging tasks like motion deblurring and dehazing.
Details
Motivation: Existing diffusion-based image restoration methods rely on approximations that can lead to significant errors and compromised sample quality. The authors aim to provide rigorous analysis of these approximation errors and develop improved methods. Method: Proposes a time-varying low-pass filter in the frequency domain of measurements, progressively incorporating higher frequencies during restoration. Develops an adaptive curriculum for frequency schedule based on the underlying data distribution.
Result: The method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
Conclusion: The proposed frequency-domain filtering approach with adaptive curriculum effectively addresses approximation errors in diffusion-based restoration methods and achieves superior performance on difficult restoration tasks.
Abstract: Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
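The core operation, a low-pass filter on the measurements whose cutoff grows as restoration proceeds, can be sketched as a radial mask in the 2D Fourier domain. The radial mask shape and the `keep_fraction` parameterization are our own simplifications; the paper's adaptive schedule is not reproduced.

```python
import numpy as np

def lowpass_measurement(y, keep_fraction):
    """Low-pass filter a 2D measurement in the Fourier domain, keeping only
    the lowest `keep_fraction` of frequencies. During sampling, keep_fraction
    would be raised from near 0 to 1 on a schedule adapted to the data
    distribution (the schedule itself is not reproduced here)."""
    Y = np.fft.fftshift(np.fft.fft2(y))
    h, w = y.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = keep_fraction * np.hypot(h / 2, w / 2)
    mask = np.hypot(yy, xx) <= radius
    return np.real(np.fft.ifft2(np.fft.ifftshift(Y * mask)))

y = np.random.rand(64, 64)
for frac in (0.1, 0.5, 1.0):                       # progressively add detail
    print(frac, float(np.abs(y - lowpass_measurement(y, frac)).mean()))
```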
[636] Convergent Primal-Dual Plug-and-Play Image Restoration: A General Algorithm and Applications
Yodai Suzuki, Ryosuke Isono, Shunsuke Ono
Main category: eess.IV
TL;DR: A general deep plug-and-play algorithm with theoretical convergence guarantees is proposed, combining PnP with primal-dual splitting to efficiently handle various image restoration problems.
Details
Motivation: Existing PnP methods lack theoretical convergence guarantees under realistic assumptions and have inconsistent behavior, while convergent methods are often limited to specific settings or computationally expensive for non-quadratic data-fidelity terms and constraints. Method: Integrates PnP paradigm with primal-dual splitting (PDS) methodology to develop a general convergent PnP framework, establishing theoretical convergence conditions under reasonable assumptions and solving monotone inclusion problems.
Result: The proposed approach efficiently handles a broad class of image restoration problems with guaranteed theoretical convergence, validated through numerical experiments on specific image restoration tasks.
Conclusion: The developed PnP-PDS framework provides a general solution for image restoration with theoretical convergence guarantees, addressing limitations of existing PnP methods while maintaining practical effectiveness.
Abstract: We propose a general deep plug-and-play (PnP) algorithm with a theoretical convergence guarantee. PnP strategies have demonstrated outstanding performance in various image restoration tasks by exploiting the powerful priors underlying Gaussian denoisers. However, existing PnP methods often lack theoretical convergence guarantees under realistic assumptions due to their ad-hoc nature, resulting in inconsistent behavior. Moreover, even when convergence guarantees are provided, they are typically designed for specific settings or require a considerable computational cost in handling non-quadratic data-fidelity terms and additional constraints, which are key components in many image restoration scenarios. To tackle these challenges, we integrate the PnP paradigm with primal-dual splitting (PDS), an efficient proximal splitting methodology for solving a wide range of convex optimization problems, and develop a general convergent PnP framework. Specifically, we establish theoretical conditions for the convergence of the proposed PnP algorithm under a reasonable assumption. Furthermore, we show that the problem solved by the proposed PnP algorithm is not a standard convex optimization problem but a more general monotone inclusion problem, where we provide a mathematical representation of the solution set. Our approach efficiently handles a broad class of image restoration problems with guaranteed theoretical convergence. Numerical experiments on specific image restoration tasks validate the practicality and effectiveness of our theoretical results.
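For intuition, the following is a minimal sketch of a PnP primal-dual iteration in the spirit described above, assuming dense matrices for the forward operator, a quadratic data-fidelity term, and a generic `denoiser` callable; the paper's exact formulation, step-size conditions, and convergence guarantees are given in the original work.

```python
# Minimal sketch: PnP with primal-dual splitting for A x ≈ b.
# Primal step: the prox of the prior is replaced by a Gaussian denoiser.
# Dual step:   prox of the conjugate of f(z) = 0.5 * ||z - b||^2,
#              obtained in closed form via the Moreau decomposition.
import numpy as np

def pnp_pds(b, A, denoiser, gamma1=0.5, gamma2=0.5, iters=100):
    x = A.T @ b                    # crude initialization of the primal variable
    y = np.zeros_like(b)           # dual variable
    for _ in range(iters):
        x_new = denoiser(x - gamma1 * (A.T @ y))       # denoiser as prior prox
        z = y + gamma2 * (A @ (2 * x_new - x))
        y = (z - gamma2 * b) / (1 + gamma2)            # prox of f* in closed form
        x = x_new
    return x
```

Here `denoiser` could be any pretrained Gaussian denoiser wrapped as a NumPy callable; the convergence conditions in the paper constrain which denoisers and step sizes are admissible.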
[637] EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision
Ahmed Jaheen, Abdelrahman Elsayed, Damir Kim, Daniil Tikhonov, Matheus Scatolin, Mohor Banerjee, Qiankun Ji, Mostafa Salem, Hu Wang, Sarim Hashmi, Mohammad Yaqub
Main category: eess.IV
TL;DR: EMedNeXt is an enhanced brain tumor segmentation framework designed for sub-Saharan Africa, addressing challenges of low-quality MRI scanners and scarce radiology expertise through improved architecture and robust ensembling.
Details
Motivation: Manual MRI segmentation of brain tumors is time-consuming, requires expert radiologists, and is infeasible in under-resourced healthcare systems, especially in low-income regions with poor MRI quality and limited data.
Method: Enhanced MedNeXt V2 framework with deep supervision and optimized post-processing, featuring a larger region of interest, an improved nnU-Net v2-based architecture, and a robust model-ensembling system tailored for sub-Saharan Africa.
Result: Achieved an average LesionWise DSC of 0.897, with average LesionWise NSD of 0.541 at 0.5 mm tolerance and 0.84 at 1.0 mm tolerance on the hidden validation set.
Conclusion: EMedNeXt provides an effective solution for robust brain tumor segmentation in resource-constrained settings like sub-Saharan Africa, addressing image quality degradation and data scarcity challenges.
Abstract: Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt – an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.
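As a rough illustration of the ensembling component, here is a minimal probability-averaging sketch (hypothetical API, not the EMedNeXt pipeline): each member maps a volume to per-class softmax maps, and the averaged probabilities are argmaxed into a final label map. The actual ensembling rules and post-processing are specified in the paper.

```python
# Minimal sketch: average per-class probabilities across ensemble members,
# then take the argmax over classes to produce a label volume.
import numpy as np

def ensemble_segment(volume: np.ndarray, models: list) -> np.ndarray:
    """`models` are callables mapping a volume to (C, D, H, W) class probs."""
    probs = np.mean([m(volume) for m in models], axis=0)  # average members
    return np.argmax(probs, axis=0)                       # (D, H, W) labels
```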
[638] Frequency-Aware Ensemble Learning for BraTS 2025 Pediatric Brain Tumor Segmentation
Yuxiao Yi, Qingyao Zhuang, Zhi-Qin John Xu, Xiaowen Wang, Yan Ren, Tianming Qiu
Main category: eess.IV
TL;DR: An ensemble method combining nnU-Net, Swin UNETR, and HFF-Net won first place in BraTS-PED 2025 challenge for pediatric brain tumor segmentation, achieving high Dice scores across multiple tumor sub-regions.
Details
Motivation: Pediatric brain tumor segmentation is challenging due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning.
Method: Ensemble approach integrating three models: nnU-Net with adjustable initialization scales, Swin UNETR with transfer learning from BraTS 2021, and HFF-Net with frequency-domain decomposition that separates low-frequency contours from high-frequency textures.
Result: Achieved Dice scores of 62.7% (CC), 83.2% (ED), 72.9% (ET), 85.7% (NET), 91.8% (TC), and 92.6% (WT) on unseen test dataset, ranking first in BraTS 2025 Pediatric Brain Tumor Segmentation Challenge.
Conclusion: The proposed ensemble framework demonstrates state-of-the-art performance for pediatric brain tumor segmentation, successfully addressing the unique challenges of this domain through complementary model architectures and specialized adaptations.
Abstract: Pediatric brain tumor segmentation presents unique challenges due to the rarity and heterogeneity of these malignancies, yet remains critical for clinical diagnosis and treatment planning. We propose an ensemble approach integrating nnU-Net, Swin UNETR, and HFF-Net for the BraTS-PED 2025 challenge. Our method incorporates three key extensions: adjustable initialization scales for optimal nnU-Net complexity control, transfer learning from BraTS 2021 pre-trained models to enhance Swin UNETR’s generalization on the pediatric dataset, and frequency domain decomposition for HFF-Net to separate low-frequency tissue contours from high-frequency texture details. Our final ensemble framework combines nnU-Net ($\gamma=0.7$), fine-tuned Swin UNETR, and HFF-Net, achieving Dice scores of 62.7% (CC), 83.2% (ED), 72.9% (ET), 85.7% (NET), 91.8% (TC), and 92.6% (WT) on the unseen test dataset. Our proposed method achieves first place in the BraTS 2025 Pediatric Brain Tumor Segmentation Challenge.
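The frequency decomposition idea behind HFF-Net can be illustrated with a minimal sketch (not HFF-Net's actual decomposition): split an MRI slice into a low-frequency component capturing coarse tissue contours and a high-frequency residual capturing texture detail. The Gaussian-blur split and `sigma` value below are illustrative assumptions.

```python
# Minimal sketch: decompose an image into low-frequency contours and a
# high-frequency texture residual via a Gaussian low-pass split.
import numpy as np
from scipy.ndimage import gaussian_filter

def frequency_decompose(image: np.ndarray, sigma: float = 2.0):
    low = gaussian_filter(image, sigma=sigma)  # smooth -> coarse contours
    high = image - low                         # residual -> fine textures
    return low, high
```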
[639] Content-Adaptive Inference for State-of-the-art Learned Video Compression
Ahmet Bilican, M. Akın Yılmaz, A. Murat Tekalp
Main category: eess.IV
TL;DR: Proposes a model-agnostic framework to improve learned video codec performance on videos with complex motion by adaptively downsampling frames during inference to match training motion vector ranges.
Details
Motivation: Learned video codecs perform poorly on videos with complex or large motion compared to simple-motion scenes because they fail to generalize to unseen motion-vector ranges, degrading both flow estimation and compression.
Method: A content-adaptive inference framework that controls motion-vector scale by adaptively downsampling frames during encoding so that test-time motion-vector ranges match those seen in training, improving flow estimation and compression.
Result: Improves BD-rate performance of state-of-the-art DCVC-FM codec by up to 41% on individual videos without model fine-tuning, with motion and scene complexity measures predicting framework effectiveness.
Conclusion: Adaptive frame downsampling during inference effectively addresses motion vector generalization issues in learned video codecs, significantly improving performance on complex motion videos.
Abstract: While the BD-rate performance of recent learned video codec models in both low-delay and random-access modes exceeds that of the respective modes of traditional codecs on average over common benchmarks, the performance improvement for individual videos with complex/large motion is much smaller than for scenes with simple motion. This stems from the inability of a learned encoder model to generalize to motion vector ranges that were not seen in the training set, which degrades both the coding of flow fields and frame prediction and coding. As a remedy, we propose a generic (model-agnostic) framework that controls the scale of motion vectors in a scene during inference (encoding), approximately matching the motion vector ranges of the test and training videos by adaptively downsampling frames. The resulting down-scaled motion vectors enable: i) better flow estimation and hence better frame prediction, and ii) more efficient flow compression. We show that the proposed framework for content-adaptive inference improves the BD-rate performance of the already state-of-the-art low-delay video codec DCVC-FM by up to 41% on individual videos without any model fine-tuning. Ablation studies show that measures of motion and scene complexity can predict the effectiveness of the proposed framework.
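To illustrate the adaptive-downsampling idea, here is a minimal sketch with a hypothetical flow-estimator interface; the constant `TRAIN_MAX_MOTION`, the factor-of-two schedule, and the mean-magnitude statistic are all illustrative assumptions, not the paper's actual procedure.

```python
# Minimal sketch: estimate per-frame motion magnitude, then pick a downscale
# factor so the effective motion vectors shrink into the range the codec saw
# during training. Frames would be downsampled by `factor` before encoding.
import numpy as np

TRAIN_MAX_MOTION = 16.0  # assumed max motion magnitude (pixels) seen in training

def choose_downscale(frames, estimate_flow) -> int:
    """Return an integer downscale factor (1 = no downsampling)."""
    mags = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        flow = estimate_flow(prev, cur)                 # (H, W, 2) motion field
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    peak = max(mags) if mags else 0.0
    factor = 1
    while peak / factor > TRAIN_MAX_MOTION:
        factor *= 2                                     # halve resolution per step
    return factor
```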