Daily arXiv Papers - 2026-04-08

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: PhyAVBench is the first benchmark for evaluating audio-physics grounding in text-to-audio-video generation, introducing a new dataset and contrastive evaluation method to assess physical plausibility of sounds.

Motivation: Current T2AV models often fail to produce physically plausible sounds, and existing benchmarks focus mainly on audio-video synchronization while overlooking explicit evaluation of audio-physics grounding.

Method: Builds PhyAVBench around the new PhyAV-Sound-11K dataset (11,605 audible videos, 25.5 hours), introduces the Audio-Physics Sensitivity Test (APST) based on paired text prompts with controlled physical variations, and proposes the Contrastive Physical Response Score (CPRS) metric.

Result: Comprehensive evaluation of 17 state-of-the-art models shows even leading commercial models struggle with fundamental audio-physical phenomena, revealing critical gaps beyond audio-visual synchronization.

Conclusion: PhyAVBench addresses the overlooked aspect of physical plausibility in audio-visual generation and provides a foundation for advancing physically grounded audio-visual generation research.

Abstract: Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://phyavbench.pages.dev/.

Relevance: 9/10

[2] StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha, Gallil Maimon, Yossi Adi

Main category: cs.CL

TL;DR: StressTest benchmark evaluates speech-aware language models on sentence stress understanding, showing poor performance, and introduces Stress-17k dataset and StresSLM model for improved stress reasoning.

Motivation: Sentence stress is crucial for conveying meaning and intent in speech, but current speech-aware language models (SLMs) overlook this important aspect in their evaluation and development, creating a gap in understanding prosodic cues.

Method: 1) Introduce StressTest benchmark to evaluate SLMs’ ability to distinguish meanings based on stress patterns. 2) Propose novel data generation pipeline to create Stress-17k training set simulating meaning changes from stress variation. 3) Develop StresSLM by finetuning on this dataset.

Result: Leading SLMs perform poorly on stress reasoning tasks. StresSLM generalizes well to real recordings and significantly outperforms existing SLMs on both sentence stress reasoning and detection tasks.

Conclusion: Sentence stress is a critical but overlooked aspect of speech understanding. The StressTest benchmark reveals limitations in current SLMs, while Stress-17k dataset and StresSLM demonstrate substantial improvements in stress-aware speech processing.

Abstract: Sentence stress refers to emphasis on words within a spoken utterance to highlight or contrast an idea. It is often used to imply an underlying intention not explicitly stated. Recent speech-aware language models (SLMs) have enabled direct audio processing, allowing models to access the full richness of speech to perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in evaluation and development of SLMs. We address this gap by introducing StressTest, a benchmark designed to evaluate models’ ability to distinguish between meanings of speech based on the stress pattern. We evaluate leading SLMs, and find that despite their overall capabilities, they perform poorly on such tasks. Hence, we propose a novel data generation pipeline, and create Stress-17k, a training set that simulates change of meaning implied by stress variation. Results suggest that our finetuned model, StresSLM, generalizes well to real recordings and notably outperforms existing SLMs on sentence stress reasoning and detection. Models, code, data, samples - pages.cs.huji.ac.il/adiyoss-lab/stresstest.

Relevance: 9/10

[3] Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: CrossOmni dataset and methods to address cross-modal coreference problems in Omni-LLMs, improving fine-grained alignment between modalities for better omni-modal reasoning.

Motivation: Omni-LLMs struggle with complex scenarios requiring synergistic omni-modal reasoning, particularly in fine-grained cross-modal alignment and identifying shared referents across modalities, which has been largely overlooked.

Method: Formalize the challenge as cross-modal coreference problem, introduce CrossOmni dataset with 9 tasks and human-designed reasoning rationales, propose training-free In-Context Learning and training-based SFT+GRPO framework to enhance cross-modal alignment.

Result: Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference; both proposed approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks.

Conclusion: Cross-modal coreference is a crucial missing piece for advancing robust omni-modal reasoning, and addressing it significantly improves multimodal understanding capabilities.

Abstract: Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

Jiaquan Zhang, Qigan Sun, Chaoning Zhang, Xudong Wang, Zhenzhen Huang, Yitian Zhou, Pengcheng Zheng, Chi-lok Andy Tai, Sung-Ho Bae, Zeyu Ma, Caiyan Qin, Jinyu Guo, Yang Yang, Hengtao Shen

Main category: cs.CL

TL;DR: A topology-based method to optimize Chain-of-Thought reasoning by embedding effective reasoning patterns from multi-round methods into lightweight CoT, using persistent homology for structural analysis and an optimization agent to repair deficiencies.

Motivation: Current reasoning paradigms for LLMs face a trade-off: CoT is efficient but has logical gaps, while multi-round methods (GoT, ToT, AoT) achieve strong performance but are too costly for practical use. Need a solution that combines single-round efficiency with multi-round intelligence.

Method: 1) Use persistent homology to map CoT, ToT, and GoT into unified topological space to quantify structural features; 2) Design Topological Optimization Agent that diagnoses deviations in CoT chains from desirable topological characteristics and generates targeted strategies to repair structural deficiencies.

Result: Experiments on multiple datasets show the approach offers superior balance between reasoning accuracy and efficiency compared to multi-round methods like ToT and GoT, achieving “single-round generation with multi-round intelligence.”

Conclusion: The topology-based optimization framework provides a practical solution to enhance LLM reasoning capability by embedding effective reasoning patterns into lightweight CoT paradigm, bridging the gap between efficiency and performance.

Abstract: Enhancing the reasoning capability of large language models (LLMs) remains a core challenge in natural language processing. The Chain-of-Thought (CoT) paradigm dominates practical applications for its single-round efficiency, yet its reasoning chains often exhibit logical gaps. While multi-round paradigms like Graph-of-Thoughts (GoT), Tree-of-Thoughts (ToT), and Atom of Thought (AoT) achieve strong performance and reveal effective reasoning structures, their high cost limits practical use. To address this problem, this paper proposes a topology-based method for optimizing reasoning chains. The framework embeds essential topological patterns of effective reasoning into the lightweight CoT paradigm. Using persistent homology, we map CoT, ToT, and GoT into a unified topological space to quantify their structural features. On this basis, we design a unified optimization system: a Topological Optimization Agent diagnoses deviations in CoT chains from desirable topological characteristics and simultaneously generates targeted strategies to repair these structural deficiencies. Compared with multi-round reasoning methods like ToT and GoT, experiments on multiple datasets show that our approach offers a superior balance between reasoning accuracy and efficiency, showcasing a practical solution to “single-round generation with multi-round intelligence”.

[2] The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

Julian Coda-Forno, Jane X. Wang, Arslan Chaudhry

Main category: cs.CL

TL;DR: The paper investigates how different training objectives (MLM vs decoder-only masking) affect the reversal curse in language models, finding that explicit source entity prediction is needed for reversal accuracy, but this doesn’t create unified concept representations.

Motivation: To understand how different training objectives mitigate the reversal curse in language models and examine the underlying mechanisms of how bidirectional supervision helps models retrieve facts in reverse order.

Method: Evaluates vanilla masked language modeling (MLM) and decoder-only masking-based training across four reversal benchmarks, then conducts mechanistic analysis using representation distances and linear probes to study how these objectives succeed.

Result: Reversal accuracy requires training signal that explicitly makes the source entity a prediction target. Success doesn’t correspond to single direction-agnostic representations; instead, forward and reverse directions are stored as distinct entries with different indexing geometry for MLM vs decoder-only masking.

Conclusion: Objective-level fixes can improve reversal behavior without necessarily inducing the kind of latent generalization expected from unified concepts, cautioning against assuming improved performance implies better underlying representations.

Abstract: The reversal curse describes a failure of autoregressive language models to retrieve a fact in reverse order (e.g., training on “$A > B$” but failing on “$B < A$”). Recent work shows that objectives with bidirectional supervision (e.g., bidirectional attention or masking-based reconstruction for decoder-only models) can mitigate the reversal curse. We extend this evaluation to include a vanilla masked language modeling (MLM) objective and compare it to decoder-only masking-based training across four reversal benchmarks and then provide a minimal mechanistic study of how these objectives succeed. We show that reversal accuracy requires training signal that explicitly makes the source entity a prediction target, and we find little evidence that success corresponds to a single direction-agnostic representation of a fact. Instead, representation distances and linear probes are consistent with storing forward and reverse directions as distinct entries, with different indexing geometry for MLM versus decoder-only masking-based training. Our results caution that objective-level “fixes” can improve reversal behavior without necessarily inducing the kind of latent generalization one might expect from a unified concept.

[3] Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

Mohammad Reza Ghasemi Madani, Soyeon Caren Han, Shuo Yang, Jey Han Lau

Main category: cs.CL

TL;DR: IoT is a progressive self-filtering strategy that reconstructs MCQs by removing implausible distractors to reduce cognitive load and improve LLM reasoning stability.

Motivation: LLMs are vulnerable to plausible distractors in multiple-choice questions, causing unstable oscillation between correct and incorrect answers due to cognitive load from irrelevant choices.

Method: Inclusion-of-Thoughts (IoT) progressively filters out implausible distractors to reconstruct MCQs with only plausible options, creating a controlled setting for examining comparative judgments and reasoning stability.

Result: IoT substantially boosts chain-of-thought performance across arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.

Conclusion: IoT effectively mitigates cognitive load from distractors, enhances reasoning stability, and improves transparency/interpretability of LLM decision-making in MCQ evaluation.

Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers. In this paper, we propose Inclusion-of-Thoughts (IoT), a progressive self-filtering strategy that is designed to mitigate this cognitive load (i.e., instability of model preferences under the presence of distractors) and enable the model to focus more effectively on plausible answers. Our method operates to reconstruct the MCQ using only plausible option choices, providing a controlled setting for examining comparative judgements and therefore the stability of the model’s internal reasoning under perturbation. By explicitly documenting this filtering process, IoT also enhances the transparency and interpretability of the model’s decision-making. Extensive empirical evaluation demonstrates that IoT substantially boosts chain-of-thought performance across a range of arithmetic, commonsense reasoning, and educational benchmarks with minimal computational overhead.
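The progressive filtering loop can be sketched roughly as follows. This is a hedged reconstruction from the abstract only: `score_plausibility` is a hypothetical stand-in for whatever LLM call the paper uses to judge options, and the threshold, one-option-per-round removal, and minimum option count are assumptions, not details from the paper.

```python
def iot_filter(question, options, score_plausibility, threshold=0.5, min_options=2):
    """Progressively drop the least plausible option until every survivor
    clears the threshold (or only `min_options` remain). Returns the
    reconstructed option set plus a log documenting each filtering step."""
    kept, log = list(options), []
    while len(kept) > min_options:
        scores = {opt: score_plausibility(question, opt) for opt in kept}
        worst = min(kept, key=scores.get)
        if scores[worst] >= threshold:
            break                      # all remaining options are plausible
        kept.remove(worst)
        log.append((worst, scores[worst]))
    return kept, log

# Toy usage with a hard-coded plausibility table standing in for model scores:
toy = {"Paris": 0.95, "Lyon": 0.6, "Banana": 0.05, "Tuesday": 0.02}
kept, log = iot_filter("Capital of France?", list(toy), lambda q, o: toy[o])
# kept -> ["Paris", "Lyon"]; log records ("Tuesday", 0.02), then ("Banana", 0.05)
```

Keeping the removal log mirrors the paper’s point that documenting the filtering process makes the decision path inspectable.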

[4] Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

Gowrav Vishwakarma, Christopher J. Agostino

Main category: cs.CL

TL;DR: Phase-Associative Memory (PAM) is a complex-valued recurrent sequence model using matrix states and conjugate inner products for language modeling, achieving competitive perplexity to transformers despite computational overhead.

Motivation: To explore alternative computational formalisms for language modeling that align with evidence of non-classical contextuality in semantic interpretation, moving beyond traditional transformer architectures.

Method: A recurrent sequence model with complex-valued representations where associations accumulate in a matrix state via outer products, and retrieval uses conjugate inner products (K_t* · Q_t / √d).

Result: At ~100M parameters on WikiText-103, PAM achieves validation perplexity 30.0, within ~10% of a matched transformer (27.1) despite 4× arithmetic overhead from complex computation and no custom kernels.

Conclusion: The competitiveness of complex-valued superposition and conjugate retrieval suggests alignment with non-classical contextuality in semantic interpretation, raising questions about optimal computational formalisms for language modeling.

Abstract: We present Phase-Associative Memory (PAM), a recurrent sequence model in which all representations are complex-valued, associations accumulate in a matrix state $S_{t}$ $\in$ $\mathbb{C}^{d \times d}$ via outer products, and retrieval operates through the conjugate inner product $K_t^* \cdot Q_t / \sqrt{d}$. At $\sim$100M parameters on WikiText-103, PAM reaches validation perplexity 30.0, within $\sim$10% of a matched transformer (27.1) trained under identical conditions, despite $4\times$ arithmetic overhead from complex computation and no custom kernels. We trace the experimental path from vector-state models, where holographic binding fails due to the $O(1/\sqrt{n})$ capacity degradation of superposed associations, to the matrix state that resolves it. The competitiveness of an architecture whose native operations are complex-valued superposition and conjugate retrieval is consistent with recent empirical evidence that semantic interpretation in both humans and large language models exhibits non-classical contextuality, and we discuss what this implies for the choice of computational formalism in language modeling.
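A minimal numpy sketch of the recurrence as we read it from the abstract: associations accumulate in a complex matrix state via outer products, and readout against the query is governed by the conjugate inner product $K^* \cdot Q / \sqrt{d}$. The random projections, shapes, and the exact placement of the scaling factor are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 64, 16            # head dimension and sequence length (arbitrary here)

def random_complex(shape):
    # unit-variance complex Gaussians standing in for learned projections
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

K, Q, V = (random_complex((T, d)) for _ in range(3))

S = np.zeros((d, d), dtype=np.complex128)    # matrix state S_t in C^{d x d}
outputs = []
for t in range(T):
    S = S + np.outer(V[t], np.conj(K[t]))    # accumulate association via outer product
    # readout: (S @ Q_t)_i = sum_j V_j[i] * (K_j^* . Q_t), so retrieval is
    # weighted by the conjugate inner product K^* . Q, scaled by sqrt(d)
    outputs.append(S @ Q[t] / np.sqrt(d))
```

The matrix state is what gives each key–value association its own slot, which is how the paper resolves the $O(1/\sqrt{n})$ capacity degradation of superposed vector-state bindings.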

[5] This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

Hye Sun Yun, Geetika Kapoor, Michael Mackert, Ramez Kouzy, Wei Xu, Junyi Jessy Li, Byron C. Wallace

Main category: cs.CL

TL;DR: LLMs in medical QA show response inconsistency based on question framing (positive vs. negative) even when grounded in same evidence, with framing effects amplified in multi-turn conversations.

Motivation: Patients use LLMs for complex medical questions, but LLMs are sensitive to prompt phrasing. Need to ensure consistent responses regardless of question wording, especially in high-stakes medical settings where contradictory advice could be harmful.

Method: Systematic evaluation in controlled RAG setting for medical QA using expert-selected documents. Examined two dimensions: question framing (positive/negative) and language style (technical/plain). Constructed dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluated response consistency across eight LLMs.

Result: Positively- and negatively-framed pairs significantly more likely to produce contradictory conclusions than same-framing pairs. Framing effect amplified in multi-turn conversations. No significant interaction between framing and language style. LLM responses systematically influenced by query phrasing alone even with same evidence.

Conclusion: LLM responses in medical QA can be manipulated through query phrasing, highlighting importance of phrasing robustness as evaluation criterion for RAG systems in high-stakes settings like healthcare.

Abstract: Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.
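The paired-framing consistency check can be sketched as below, under stated assumptions: `ask_model` is a hypothetical stand-in for a RAG-grounded LLM call, and we assume each answer is normalized to a conclusion label about the treatment itself (so a contradiction is simply two labels that disagree on substance). The paper’s actual contradiction judgment may be more nuanced.

```python
def framing_contradiction_rate(pairs, ask_model):
    """pairs: (positive_query, negative_query, evidence) triples sharing evidence.
    Returns the fraction of pairs whose normalized conclusions contradict."""
    contradictions = 0
    for pos_q, neg_q, evidence in pairs:
        a = ask_model(pos_q, evidence)   # e.g. "effective" / "ineffective"
        b = ask_model(neg_q, evidence)
        if a != b:                       # same evidence, different conclusion
            contradictions += 1
    return contradictions / len(pairs)

# Toy usage with canned answers standing in for model responses:
answers = {"works?": "effective", "doesn't work?": "ineffective",
           "helps?": "effective", "fails?": "effective"}
pairs = [("works?", "doesn't work?", None), ("helps?", "fails?", None)]
rate = framing_contradiction_rate(pairs, lambda q, e: answers[q])
# rate -> 0.5 (the first pair contradicts, the second is consistent)
```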

[6] On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

Main category: cs.CL

TL;DR: Survey paper on spoken language models (SLMs) that categorizes recent work by architecture, training, and evaluation, providing a unified understanding of this emerging field.

Motivation: The field of spoken language processing is shifting from task-specific models to universal spoken language models (SLMs), similar to the progression in text NLP. Diverse terminology and evaluation settings create a need for a unifying survey to improve understanding of SLMs.

Method: Literature survey categorizing recent SLM work by model architecture, training approaches, and evaluation methods. The survey analyzes both “pure” speech language models and hybrid models combining speech encoders with text language models.

Result: Provides a comprehensive categorization framework for SLMs, identifies key challenges in the field, and outlines directions for future work in spoken language modeling.

Conclusion: SLMs represent an important evolution toward universal speech processing systems, and this survey offers a unified perspective to guide future research in this rapidly developing area.

Abstract: The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both “pure” language models of speech – models of the distribution of tokenized speech sequences – and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.

[7] Memory Dial: A Training Framework for Controllable Memorization in Language Models

Xiangbo Zhang, Ali Emami

Main category: cs.CL

TL;DR: Memory Dial is a training framework that makes memorization pressure an explicit, controllable variable in language models by interpolating between standard cross-entropy and temperature-sharpened objectives.

Motivation: Memorization in language models is widely studied but difficult to isolate and control. Existing approaches are post-hoc and cannot disentangle memorization effects from architecture, data, or optimization factors.

Method: Memory Dial interpolates between standard cross-entropy and a temperature-sharpened objective via a single parameter α, producing a family of models identical in architecture and training setup, differing only in memorization pressure.

Result: Experiments across six architectures and five benchmarks show that: (1) α reliably controls memorization pressure, (2) larger models are more responsive to memorization pressure, and (3) frequent sequences are easier to memorize than rare ones.

Conclusion: Memory Dial provides a controlled experimental framework for studying how memorization behavior emerges and interacts with generalization in language models.

Abstract: Memorization in language models is widely studied but remains difficult to isolate and control. Understanding when and what models memorize is essential for explaining their predictions, yet existing approaches are post-hoc: they can detect memorization in trained models, but cannot disentangle its effects from architecture, data, or optimization. We introduce Memory Dial, a training framework that makes memorization pressure an explicit, controllable variable. Memory Dial interpolates between standard cross-entropy and a temperature-sharpened objective via a single parameter $α$, producing a family of models identical in architecture and training setup (within each sweep), differing only in memorization pressure. Experiments across six architectures and five benchmarks demonstrate that: (1) $α$ reliably controls memorization pressure, with seen-example accuracy increasing monotonically while unseen accuracy remains stable; (2) larger models are more responsive to memorization pressure; and (3) frequent sequences are easier to memorize than rare ones. Additional analyses show that the effect is robust across a range of sharpening temperatures, differs qualitatively from single-temperature cross-entropy, transfers to multilingual settings, and is detectable even on naturally occurring single-occurrence sequences. Memory Dial provides a controlled experimental framework for studying how memorization behavior emerges and interacts with generalization in language models.
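The abstract does not give the exact functional form of the interpolation, but one plausible reading is a convex combination of plain cross-entropy and a temperature-sharpened cross-entropy, controlled by a single parameter α; the sharpening temperature `tau` below is an assumed detail.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def memory_dial_loss(logits, targets, alpha, tau=0.5):
    """Convex combination of plain cross-entropy and a temperature-sharpened
    cross-entropy. alpha=0 recovers the standard objective; larger alpha
    weights the sharpened term, the knob the paper associates with
    memorization pressure. tau < 1 sharpens the distribution."""
    rows = np.arange(len(targets))
    ce = -log_softmax(logits)[rows, targets].mean()
    ce_sharp = -log_softmax(logits / tau)[rows, targets].mean()
    return (1.0 - alpha) * ce + alpha * ce_sharp

# The loss is linear in alpha, so the "dial" moves smoothly between the two
# objectives, consistent with the single-parameter description in the paper.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
targets = np.array([0, 2])
loss_plain = memory_dial_loss(logits, targets, alpha=0.0)
loss_sharp = memory_dial_loss(logits, targets, alpha=1.0)
```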

[8] StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha, Gallil Maimon, Yossi Adi

Main category: cs.CL

TL;DR: StressTest benchmark evaluates speech-aware language models on sentence stress understanding, showing poor performance, and introduces Stress-17k dataset and StresSLM model for improved stress reasoning.

Motivation: Sentence stress is crucial for conveying meaning and intent in speech, but current speech-aware language models (SLMs) overlook this important aspect in their evaluation and development, creating a gap in understanding prosodic cues.

Method: 1) Introduce StressTest benchmark to evaluate SLMs’ ability to distinguish meanings based on stress patterns. 2) Propose novel data generation pipeline to create Stress-17k training set simulating meaning changes from stress variation. 3) Develop StresSLM by finetuning on this dataset.

Result: Leading SLMs perform poorly on stress reasoning tasks. StresSLM generalizes well to real recordings and significantly outperforms existing SLMs on both sentence stress reasoning and detection tasks.

Conclusion: Sentence stress is a critical but overlooked aspect of speech understanding. The StressTest benchmark reveals limitations in current SLMs, while Stress-17k dataset and StresSLM demonstrate substantial improvements in stress-aware speech processing.

Abstract: Sentence stress refers to emphasis on words within a spoken utterance to highlight or contrast an idea. It is often used to imply an underlying intention not explicitly stated. Recent speech-aware language models (SLMs) have enabled direct audio processing, allowing models to access the full richness of speech to perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in evaluation and development of SLMs. We address this gap by introducing StressTest, a benchmark designed to evaluate models’ ability to distinguish between meanings of speech based on the stress pattern. We evaluate leading SLMs, and find that despite their overall capabilities, they perform poorly on such tasks. Hence, we propose a novel data generation pipeline, and create Stress-17k, a training set that simulates change of meaning implied by stress variation. Results suggest that our finetuned model, StresSLM, generalizes well to real recordings and notably outperforms existing SLMs on sentence stress reasoning and detection. Models, code, data, samples - pages.cs.huji.ac.il/adiyoss-lab/stresstest.

[9] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang

Main category: cs.CL

TL;DR: MedXIAOHE is a medical vision-language foundation model that achieves SOTA performance on medical benchmarks through entity-aware continual pretraining, reinforcement learning for medical reasoning, and reliability-focused features for real-world clinical use.

Motivation: To advance general-purpose medical understanding and reasoning for real-world clinical applications, addressing challenges like heterogeneous medical data, rare diseases, and the need for reliable, expert-level diagnostic reasoning.

Method: 1) Entity-aware continual pretraining framework organizing heterogeneous medical corpora; 2) Incorporation of diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training; 3) Integration of user-preference rubrics, evidence-grounded reasoning, and low-hallucination report generation.

Result: Achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.

Conclusion: MedXIAOHE demonstrates practical design choices for medical multimodal foundation models, enabling advanced medical understanding, reasoning, and reliable real-world clinical applications.

Abstract: We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

[10] Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

Firoj Alam, Gagan Bhatia, Sahinur Rahman Laskar, Shammur Absar Chowdhury

Main category: cs.CL

TL;DR: OmniScore is a family of lightweight deterministic learned metrics that approximate LLM-judge behavior for text evaluation, offering low-cost, consistent scoring across multiple languages and tasks.

DetailsMotivation: LLMs are increasingly used as automated judges for text evaluation, but they are costly, sensitive to prompt design, and lack reproducibility. There's a need for more practical, scalable alternatives.

Method: Developed small models (<1B parameters) trained on large-scale synthetic supervision (~564k instances across 107 languages) and evaluated on 8,617 manually annotated instances. Supports reference-based, source-grounded, and hybrid evaluations.

Result: OmniScore models demonstrate reliable multi-dimensional scoring across question answering, translation, and summarization tasks in 6 languages, providing a practical alternative to frontier LLMs.

Conclusion: Lightweight deterministic learned metrics like OmniScore offer a highly practical and scalable alternative to expensive LLM-based evaluation methods while maintaining consistency and reliability.

Abstract: While Large Language Models (LLMs) are increasingly adopted as automated judges for evaluating generated text, their outputs are often costly and highly sensitive to prompt design, language, and aggregation strategies, which severely limits reproducibility. To address these challenges, we propose \textbf{\textit{OmniScore}}, a family of complementary, deterministic learned metrics developed using small ($<$1B-parameter) models. OmniScore approximates LLM-judge behavior while preserving the low latency and consistency of traditional model-based scoring. We trained the models on large-scale synthetic supervision ($\sim$564k instances in \textbf{107 languages}) and evaluated them on 8,617 manually annotated instances. The OmniScore family supports reliable, multi-dimensional scores across a variety of settings, including reference-based, source-grounded, and hybrid evaluations. We evaluate these models across question answering (QA), translation, and summarization in \textbf{6 languages}. Our results demonstrate that lightweight, deterministic learned metrics provide a highly practical and scalable alternative to frontier LLMs. Our models and datasets can be found at https://huggingface.co/collections/QCRI/omniscore

[11] Where Do Backdoors Live? A Component-Level Analysis of Backdoor Propagation in Speech Language Models

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal, Peter West

Main category: cs.CL

TL;DR: Backdoor attacks can propagate through speech language model pipelines, with persistence dependent on targeted components, and poisoned embeddings are not easily separable from benign ones.

DetailsMotivation: Speech language models (SLMs) are complex systems with multiple independent components, but they're often studied end-to-end without understanding how information flows through the pipeline. The authors want to investigate this through backdoor attacks to reveal component roles and vulnerabilities.

Method: The authors use backdoor attacks as a lens to analyze SLM pipelines. They first establish that backdoors can propagate through the entire system, then design a component analysis to reveal each component’s role in backdoor learning. They also examine how backdoors are encoded in shared multitask embeddings.

Result: Backdoors can propagate through SLMs, making all tasks vulnerable. Backdoor persistence or erasure depends heavily on the targeted component. Poisoned samples in shared embeddings are not directly separable from benign ones, challenging common assumptions used in filtering defenses.

Conclusion: Multimodal pipelines like SLMs are intricate systems with unique vulnerabilities that should not be treated as simple extensions of unimodal models. Understanding component-level vulnerabilities is crucial for security.

Abstract: Speech language models (SLMs) are systems of systems: independent components that unite to achieve a common goal. Despite their heterogeneous nature, SLMs are often studied end-to-end; how information flows through the pipeline remains obscure. We investigate this question through the lens of backdoor attacks. We first establish that backdoors can propagate through the SLM, leaving all tasks highly vulnerable. From this, we design a component analysis to reveal the role each component takes in backdoor learning. We find that backdoor persistence or erasure is highly dependent on the targeted component. Beyond propagation, we examine how backdoors are encoded in shared multitask embeddings, showing that poisoned samples are not directly separable from benign ones, challenging a common separability assumption used in filtering defenses. Our findings emphasize the need to treat multimodal pipelines as intricate systems with unique vulnerabilities, not solely extensions of unimodal ones.

[12] Document Optimization for Black-Box Retrieval via Reinforcement Learning

Omri Uzan, Ron Polonsky, Douwe Kiela, Christopher Potts

Main category: cs.CL

TL;DR: Document optimization approach that fine-tunes language/vision models to transform documents for better retrieval alignment using GRPO with retriever ranking improvements as rewards.

DetailsMotivation: Classical document expansion techniques degrade performance with modern retrievers by introducing noise. Need a method to optimize document representations to better align with expected query distributions without requiring query-time processing.

Method: Recast document expansion as a document optimization problem. Fine-tune language models or vision language models to transform documents into representations that better align with the expected query distribution under a target retriever. Use GRPO (Group Relative Policy Optimization) with the retriever’s ranking improvements as rewards. Works with black-box access to retrieval ranks and is applicable across single-vector, multi-vector, and lexical retrievers.

Result: Learned document transformations yield retrieval gains and enable smaller, more efficient retrievers to outperform larger ones. OpenAI text-embedding-3-small improved nDCG@5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), surpassing the more expensive text-embedding-3-large model. When retriever weights are accessible, document optimization is competitive with fine-tuning, and the combination performs best (Jina-ColBERT-V2 improved from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code).

Conclusion: Document optimization via fine-tuned language/vision models with GRPO rewards effectively improves retrieval performance across different retriever types, enabling efficiency gains and competitive performance with fine-tuning approaches.

Abstract: Document expansion is a classical technique for improving retrieval quality, and is attractive since it shifts computation offline, avoiding additional query-time processing. However, when applied to modern retrievers, it has been shown to degrade performance, often introducing noise that obfuscates the discriminative signal. We recast document expansion as a document optimization problem: a language model or a vision language model is fine-tuned to transform documents into representations that better align with the expected query distribution under a target retriever, using GRPO with the retriever’s ranking improvements as rewards. This approach requires only black-box access to retrieval ranks, and is applicable across single-vector, multi-vector and lexical retrievers. We evaluate our approach on code retrieval and visual document retrieval (VDR) tasks. We find that learned document transformations yield retrieval gains and in many settings enable smaller, more efficient retrievers to outperform larger ones. For example, applying document optimization to the OpenAI text-embedding-3-small model improves nDCG@5 on code (58.7 to 66.8) and VDR (53.3 to 57.6), even slightly surpassing the 6.5X more expensive OpenAI text-embedding-3-large model (66.3 on code; 57.0 on VDR). When retriever weights are accessible, document optimization is often competitive with fine-tuning, and in most settings their combination performs best, improving Jina-ColBERT-V2 from 55.8 to 63.3 on VDR and from 48.6 to 61.8 on code retrieval.
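The ranking-improvement reward described above can be illustrated as an nDCG@5 delta between the original and the rewritten document's retrieval results. A minimal sketch, assuming graded relevance labels; the `reward` shaping here is illustrative, not the paper's exact reward:

```python
import math

def ndcg_at_k(relevances, k=5):
    """relevances: graded relevance of results in ranked order; the ideal
    (best-first) ordering of the same labels supplies the normalizer."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Reward for a rewritten document: the retrieval-quality gain it produces
# under the target retriever (illustrative shaping, not the paper's exact choice).
def reward(rels_before, rels_after, k=5):
    return ndcg_at_k(rels_after, k) - ndcg_at_k(rels_before, k)
```

Because only the resulting ranks enter the reward, the retriever itself can stay a black box, which is the property the abstract emphasizes.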

[13] Multilingual Language Models Encode Script Over Linguistic Structure

Aastha A K Verma, Anwoy Chatterjee, Mehak Gupta, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: Multilingual language models organize representations primarily around surface form (orthography) rather than abstract linguistic properties, with linguistic abstraction emerging gradually in deeper layers without forming a unified interlingua.

DetailsMotivation: To understand how multilingual language models internally organize representations for diverse languages - specifically whether they rely on abstract linguistic properties or surface-form cues like orthography.

Method: Analyzed compact distilled models (Llama-3.2-1B and Gemma-2-2B) using Language Activation Probability Entropy (LAPE) metric and Sparse Autoencoders to decompose activations. Investigated effects of romanization and word-order shuffling on language-associated units.

Result: Found that language-associated units are strongly conditioned on orthography - romanization creates near-disjoint representations that don’t align with native scripts or English. Word-order shuffling has limited effect. Typological structure becomes more accessible in deeper layers, and generation is most sensitive to units invariant to surface-form perturbations.

Conclusion: Multilingual LMs organize representations around surface form rather than abstract linguistic properties, with linguistic abstraction emerging gradually without collapsing into a unified interlingua representation.

Abstract: Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. Focusing on compact, distilled models where representational trade-offs are explicit, we analyze language-associated units in Llama-3.2-1B and Gemma-2-2B using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.
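The LAPE metric can be illustrated with a toy entropy computation. This is a paraphrase of the idea, not the paper's exact formula: a unit's per-language activation probabilities are normalized into a distribution whose entropy is low for a language-specific unit and high for a shared one:

```python
import math

def lape(activation_probs):
    """activation_probs: per-language probability that a unit activates.
    Normalize to a distribution and return its entropy: near 0 for a
    language-specific unit, near log(n_languages) for a shared one."""
    total = sum(activation_probs)
    if total == 0:
        return 0.0
    dist = [p / total for p in activation_probs]
    return -sum(p * math.log(p) for p in dist if p > 0)
```

Under this reading, the paper's finding is that low-entropy (language-associated) units track script: romanized inputs light up a near-disjoint set of units from native-script inputs.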

[14] MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Zhengqing Yuan, Hanchi Sun, Lichao Sun, Yanfang Ye

Main category: cs.CL

TL;DR: MegaTrain enables training of 100B+ parameter LLMs on a single GPU by storing parameters in CPU memory and streaming them to GPU, with pipelined execution and stateless layer templates to overcome bandwidth bottlenecks.

DetailsMotivation: Training large language models with 100B+ parameters typically requires multiple GPUs due to memory constraints. Existing systems struggle with single-GPU training at this scale, creating a need for memory-efficient approaches that can leverage available CPU memory.

Method: Memory-centric system that stores parameters and optimizer states in host (CPU) memory, treating GPUs as transient compute engines. Uses pipelined double-buffered execution to overlap parameter prefetching, computation, and gradient offloading across CUDA streams. Replaces persistent autograd graphs with stateless layer templates that bind weights dynamically during streaming.

Result: Successfully trains models up to 120B parameters on a single H200 GPU with 1.5TB host memory. Achieves 1.84× training throughput compared to DeepSpeed ZeRO-3 with CPU offloading for 14B models. Enables 7B model training with 512k token context on single GH200.

Conclusion: MegaTrain demonstrates that memory-centric approaches can enable training of extremely large models on single GPUs by effectively leveraging CPU memory and optimizing data movement, offering a practical alternative to multi-GPU systems for certain training scenarios.

Abstract: We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, eliminating persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB host memory, MegaTrain reliably trains models up to 120B parameters. It also achieves 1.84$\times$ the training throughput of DeepSpeed ZeRO-3 with CPU offloading when training 14B models. MegaTrain also enables 7B model training with 512k token context on a single GH200.
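The pipelined double-buffered execution can be illustrated with a stdlib sketch. This is only an analogy: the real system uses CUDA streams and pinned host memory, which are stood in for here by a producer thread and a bounded queue, and `prefetch`, `compute`, and `offload` are hypothetical callbacks:

```python
import queue
import threading

def stream_layers(layers, prefetch, compute, offload):
    """Stream layer weights through a bounded buffer so that the
    (simulated) host->device copy of layer i+1 overlaps compute on layer i."""
    buf = queue.Queue(maxsize=2)  # double buffer: at most two layers in flight

    def producer():
        for layer in layers:
            buf.put(prefetch(layer))  # stands in for the host->device copy
        buf.put(None)                 # end-of-stream sentinel

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (weights := buf.get()) is not None:
        results.append(offload(compute(weights)))  # device->host copy
    return results
```

The `maxsize=2` bound is the double buffer: the producer may run one layer ahead of the consumer, but no further, which caps device memory while keeping the compute path fed.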

[15] Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

Changgeon Ko, Jisu Shin, Hoyun Song, Huije Lee, Eui Jun Hwang, Jong C. Park

Main category: cs.CL

TL;DR: LLM agents acting as human delegates in multi-agent systems are vulnerable to social psychological biases like conformity, expertise perception, and rhetorical persuasion, leading to degraded decision accuracy under social pressure.

DetailsMotivation: As LLM agents increasingly act as human delegates in multi-agent environments, there's a need to understand how social dynamics affect their reliability, drawing inspiration from social psychology to investigate vulnerabilities similar to human group decision-making biases.

Method: Defined four key social phenomena (social conformity, perceived expertise, dominant speaker effect, rhetorical persuasion) and systematically manipulated variables: number of adversaries, relative intelligence, argument length, and argumentative styles in multi-agent experiments.

Result: Representative agent’s accuracy consistently declines with increased social pressure: larger adversarial groups, more capable peers, and longer arguments all cause significant performance degradation. Rhetorical strategies emphasizing credibility or logic can further sway judgment depending on context.

Conclusion: Multi-agent systems are sensitive to both individual reasoning and social dynamics, revealing critical vulnerabilities in AI delegates that mirror human psychological biases in group decision-making.

Abstract: Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. Drawing inspiration from social psychology, we investigate how the reliability of this representative agent is undermined by the social context of its network. We define four key phenomena-social conformity, perceived expertise, dominant speaker effect, and rhetorical persuasion-and systematically manipulate the number of adversaries, relative intelligence, argument length, and argumentative styles. Our experiments demonstrate that the representative agent’s accuracy consistently declines as social pressure increases: larger adversarial groups, more capable peers, and longer arguments all lead to significant performance degradation. Furthermore, rhetorical strategies emphasizing credibility or logic can further sway the agent’s judgment, depending on the context. These findings reveal that multi-agent systems are sensitive not only to individual reasoning but also to the social dynamics of their configuration, highlighting critical vulnerabilities in AI delegates that mirror the psychological biases observed in human group decision-making.

[16] RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World

Hanbing Liu, Lang Cao, Yang Li

Main category: cs.CL

TL;DR: A benchmark for evaluating LLM adaptation to continuous knowledge drift using real-world dynamic events with time-stamped evidence, revealing limitations of existing methods and proposing Chronos, a time-aware retrieval approach using Event Evolution Graphs.

DetailsMotivation: LLMs are tied to fixed snapshots of knowledge from pretraining, making adaptation to continuously evolving knowledge challenging. Existing approaches for updating model knowledge are rarely evaluated in settings that reflect chronological, evolving, real-world knowledge evolution.

Method: Introduces a benchmark of real-world dynamic events constructed from time-stamped evidence capturing knowledge evolution over time. Proposes Chronos, a time-aware retrieval baseline that organizes retrieved evidence into an Event Evolution Graph to enable temporally consistent understanding without additional training.

Result: Benchmark reveals most existing methods (including vanilla RAG and learning-based approaches) struggle with continuous knowledge drift, showing catastrophic forgetting and temporal inconsistency. Chronos demonstrates improved temporal consistency compared to baseline methods.

Conclusion: Provides a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic settings, highlighting the need for time-aware approaches to handle evolving knowledge effectively.

Abstract: Large language models (LLMs) acquire most of their knowledge during pretraining, which ties them to a fixed snapshot of the world and makes adaptation to continuously evolving knowledge challenging. As facts, entities, and events change over time, models may experience continuous knowledge drift, resulting not only in outdated predictions but also in temporally inconsistent reasoning. Although existing approaches, such as continual finetuning, knowledge editing, and retrieval-augmented generation (RAG), aim to update or supplement model knowledge, they are rarely evaluated in settings that reflect chronological, evolving, and real-world knowledge evolution. In this work, we introduce a new benchmark of real-world dynamic events, constructed from time-stamped evidence that captures how knowledge evolves over time, which enables systematic evaluation of model adaptation under continuous knowledge drift. The benchmark reveals that most existing methods, including vanilla RAG and several learning-based approaches, struggle under this setting, exposing critical limitations such as catastrophic forgetting and temporal inconsistency. To mitigate these limitations, we propose a time-aware retrieval baseline, Chronos, which progressively organizes retrieved evidence into an Event Evolution Graph to enable more temporally consistent understanding in LLMs without additional training. Overall, this work provides a foundation for analyzing and advancing LLM adaptation to continuous knowledge drift in realistic settings.
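The Event Evolution Graph can be sketched, in much-simplified form, as a chronological chain over time-stamped evidence. The paper's graph is presumably richer; this sketch keeps only the temporal-ordering idea, and the example fact is invented:

```python
from datetime import date

def build_evolution_chain(evidence):
    """evidence: (timestamp, statement) pairs; return successor edges
    (earlier snapshot -> later snapshot) in chronological order."""
    ordered = sorted(evidence)
    return list(zip(ordered, ordered[1:]))

evidence = [
    (date(2024, 5, 1), "CEO is A"),
    (date(2023, 1, 1), "CEO is B"),
    (date(2025, 2, 1), "CEO is C"),
]
chain = build_evolution_chain(evidence)
# The latest node answers the current-state question; the path leading to
# it records how the fact drifted, which plain RAG over unordered chunks loses.
```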

[17] $π^2$: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models

Quyet V. Do, Thinh Pham, Nguyen Nguyen, Sha Li, Pratibha Zunjare, Tu Vu

Main category: cs.CL

TL;DR: π² pipeline creates high-quality reasoning data from structured Wikipedia tables to improve long-context reasoning in LLMs through QA curation, code execution verification, and back-translation of reasoning traces.

DetailsMotivation: The paper addresses the need for high-quality reasoning data to enhance long-context reasoning capabilities in large language models, particularly for multi-hop analytical reasoning tasks that require understanding structured data and extended contexts.

Method: 1) Extract and expand tables from Wikipedia, 2) Generate realistic multi-hop analytical reasoning questions with answers verified through dual-path code execution, 3) Back-translate step-by-step structured reasoning traces as solutions given realistic web-search context, 4) Use supervised fine-tuning on curated data.

Result: Fine-tuning GPT-OSS-20B and Qwen3-4B-Instruct-2507 on π² data yields consistent improvements across four long-context reasoning benchmarks and their own π²-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Self-distillation with GPT-OSS-20B improves its own performance by +4.4%.

Conclusion: The π² pipeline effectively creates high-quality reasoning data that significantly improves long-context reasoning in LLMs, with the dataset facilitating self-distillation and demonstrating broad applicability across different model architectures.

Abstract: We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, $π^2$, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi-hop analytical reasoning questions whose answers are automatically determined and verified through dual-path code execution, and 3) back-translating step-by-step structured reasoning traces as solutions of QA pairs given realistic web-search context. Supervised fine-tuning with \textsc{\small{gpt-oss-20b}} and \textsc{\small{Qwen3-4B-Instruct-2507}} on $π^2$ yields consistent improvements across four long-context reasoning benchmarks and our similarly constructed $π^2$-Bench, with average absolute accuracy gains of +4.3% and +2.7% respectively. Notably, our dataset facilitates self-distillation, where \textsc{\small{gpt-oss-20b}} even improves its average performance by +4.4% with its own reasoning traces, demonstrating $π^2$’s usefulness. Our code, data, and models are open-source at https://github.com/vt-pi-squared/pi-squared.
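The dual-path verification step can be sketched as executing two independently written programs over the same table and accepting the answer only when they agree. The table, the two paths, and the `verify` helper below are illustrative, not the paper's code:

```python
def verify(table, path_a, path_b):
    """Accept an answer only when two independently written programs
    compute the same value from the table."""
    a, b = path_a(table), path_b(table)
    return (a == b), (a if a == b else None)

rows = [{"year": 2020, "sales": 10}, {"year": 2021, "sales": 15}]
ok, answer = verify(
    rows,
    lambda t: sum(r["sales"] for r in t),     # path 1: aggregate sum
    lambda t: t[0]["sales"] + t[1]["sales"],  # path 2: index-wise sum
)
# ok is True and answer is 25, so this QA pair would be kept
```

Agreement between two independent programs is a cheap filter against generation errors: a wrong answer must be produced twice, by different code, to slip through.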

[18] SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning

Berny Kabalisa

Main category: cs.CL

TL;DR: SenseAI is a human-in-the-loop financial sentiment dataset with reasoning chains, confidence scores, and market outcomes for LLM fine-tuning, revealing systematic error patterns like Latent Reasoning Drift.

DetailsMotivation: Existing financial sentiment datasets lack the full reasoning process behind model outputs, making it difficult to understand and correct systematic errors in LLM financial reasoning. There's a need for structured data that captures reasoning chains, confidence scores, and real-world outcomes to enable targeted model improvement.

Method: Created SenseAI dataset with 1,439 labeled data points across 40 US equities and 13 financial categories, incorporating human-in-the-loop validation with reasoning chains, confidence scores, human correction signals, and market outcomes. Designed for direct integration into LLM fine-tuning pipelines aligned with RLHF paradigms.

Result: Identified systematic patterns in model behavior including Latent Reasoning Drift (models introducing ungrounded information), confidence miscalibration, and forward projection tendencies. Showed that LLM errors in financial reasoning are predictable and correctable rather than random.

Conclusion: Structured HITL data like SenseAI enables targeted improvement of financial AI systems by revealing systematic error patterns. The dataset supports model evaluation, alignment, and provides opportunities for applying RLHF techniques to financial reasoning tasks.

Abstract: We introduce SenseAI, a human-in-the-loop (HITL) validated financial sentiment dataset designed to capture not only model outputs but the full reasoning process behind them. Unlike existing resources, SenseAI incorporates reasoning chains, confidence scores, human correction signals, and real-world market outcomes, providing a structure aligned with Reinforcement Learning from Human Feedback (RLHF) paradigms. The dataset consists of 1,439 labelled data points across 40 US-listed equities and 13 financial data categories, enabling direct integration into modern LLM fine-tuning pipelines. Through analysis, we identify several systematic patterns in model behavior, including a novel failure mode we term Latent Reasoning Drift, where models introduce information not grounded in the input, as well as consistent confidence miscalibration and forward projection tendencies. These findings suggest that LLM errors in financial reasoning are not random but occur within a predictable and correctable regime, supporting the use of structured HITL data for targeted model improvement. We discuss implications for financial AI systems and highlight opportunities for applying SenseAI in model evaluation and alignment.

[19] EvolveRouter: Co-Evolving Routing and Prompt for Multi-Agent Question Answering

Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, Chuxu Zhang

Main category: cs.CL

TL;DR: EvolveRouter is a trainable framework that jointly improves agent quality and collaboration structure through closed-loop co-evolution of graph-based query routing and targeted instruction refinement, with adaptive inference for dynamic collaboration sizing.

DetailsMotivation: Existing routing methods for multi-agent question answering have two key limitations: they optimize over fixed agent pools without improving agents themselves, and they rely on rigid collaboration schemes that cannot adapt the number of participating agents to each query.

Method: 1) Closed-loop co-evolution combining graph-based query routing with targeted instruction refinement, allowing router diagnostics to guide agent improvement while refined agents provide cleaner routing supervision. 2) Adaptive inference strategy that dynamically determines effective collaboration size through router-weighted answer agreement.

Result: Experiments on five question answering benchmarks show EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match metrics. Further analysis confirms benefits of closed-loop refinement and adaptive collaboration.

Conclusion: EvolveRouter enables more capable and efficient multi-agent reasoning by addressing limitations of existing routing methods through joint agent improvement and adaptive collaboration structure.

Abstract: Large language model agents often exhibit complementary strengths, making routing a promising approach for multi-agent question answering. However, existing routing methods remain limited in two important ways: they typically optimize over a fixed pool of agents without improving the agents themselves, and they often rely on rigid collaboration schemes that cannot adapt the number of participating agents to the query. We propose EvolveRouter, a trainable framework that addresses both limitations by jointly improving agent quality and collaboration structure. First, EvolveRouter couples graph-based query routing with targeted instruction refinement in a closed-loop co-evolution process, allowing router diagnostics to guide agent improvement while refined agents provide cleaner supervision for routing. Second, it introduces an adaptive inference strategy that dynamically determines the effective collaboration size for each query through router-weighted answer agreement. Together, these designs enable more capable and more efficient multi-agent reasoning. Experiments on five question answering benchmarks show that EvolveRouter consistently outperforms SOTA routing baselines in both F1 and exact match, while further analysis confirms the benefits of closed-loop refinement and adaptive collaboration.
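Router-weighted answer agreement can be sketched as an early-stopping weighted vote. The threshold and stopping rule below are assumptions for illustration, not the paper's exact procedure:

```python
from collections import defaultdict

def adaptive_answer(agents, query, threshold=0.6):
    """agents: (router_weight, answer_fn) pairs, strongest route first.
    Query agents in order and stop once one answer holds a threshold
    share of the total router-weight mass."""
    total = sum(weight for weight, _ in agents)
    votes = defaultdict(float)
    for n_used, (weight, answer_fn) in enumerate(agents, start=1):
        votes[answer_fn(query)] += weight
        best, score = max(votes.items(), key=lambda kv: kv[1])
        if score / total >= threshold:
            break  # enough weighted agreement; skip the remaining agents
    return best, n_used
```

Easy queries where the top-routed agents agree terminate with a small collaboration; contested queries automatically pull in more agents, which is the adaptive-size behavior the abstract describes.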

[20] Just Pass Twice: Efficient Token Classification with LLMs for Zero-Shot NER

Ahmed Ewais, Ahmed Hashish, Amr Ali

Main category: cs.CL

TL;DR: JPT enables causal LLMs to perform zero-shot named entity recognition with bidirectional context by concatenating input to itself, achieving SOTA results with 20x speedup over generative methods.

DetailsMotivation: Causal attention in LLMs prevents effective token classification when disambiguation requires future context. Existing generative approaches are slow, hallucinate entities, and have formatting errors.

Method: Just Pass Twice (JPT) concatenates input to itself so tokens in second pass attend to complete sentence. Combines these representations with definition-guided entity embeddings for flexible zero-shot generalization.

Result: Achieves state-of-the-art on zero-shot NER benchmarks: +7.9 F1 average improvement on CrossNER and MIT benchmarks, 20x faster than comparable generative methods.

Conclusion: JPT enables causal LLMs to perform discriminative token classification with full bidirectional context without architectural modifications, offering efficient and accurate zero-shot NER.

Abstract: Large language models encode extensive world knowledge valuable for zero-shot named entity recognition. However, their causal attention mechanism, where tokens attend only to preceding context, prevents effective token classification when disambiguation requires future context. Existing approaches use LLMs generatively, prompting them to list entities or produce structured outputs, but suffer from slow autoregressive decoding, hallucinated entities, and formatting errors. We propose Just Pass Twice (JPT), a simple yet effective method that enables causal LLMs to perform discriminative token classification with full bidirectional context. Our key insight is that concatenating the input to itself lets each token in the second pass attend to the complete sentence, requiring no architectural modifications. We combine these representations with definition-guided entity embeddings for flexible zero-shot generalization. Our approach achieves state-of-the-art results on zero-shot NER benchmarks, surpassing the previous best method by +7.9 F1 on average across CrossNER and MIT benchmarks, being over 20x faster than comparable generative methods.
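The duplication trick behind JPT can be checked with a toy causal mask: under plain causal attention over the doubled sequence, every token in the second copy attends to all tokens of the original sentence. Function names here are illustrative, not from the paper's code:

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def jpt_visible_tokens(n_tokens: int) -> list[set[int]]:
    """For a sentence of n_tokens concatenated to itself, return, for each
    second-pass position, the set of ORIGINAL token indices it can attend
    to under a plain causal mask (positions map back via mod n_tokens)."""
    mask = causal_mask(2 * n_tokens)
    return [
        {j % n_tokens for j in range(2 * n_tokens) if mask[i][j]}
        for i in range(n_tokens, 2 * n_tokens)
    ]
```

Since each second-pass hidden state now reflects the whole sentence, it can be classified directly, avoiding the autoregressive decoding that makes generative NER slow.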

[21] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: Proposes Neighbor-Consistency Belief (NCB) to measure belief robustness in LLMs under contextual perturbations, showing even self-consistent facts can collapse, and introduces Structure-Aware Training to improve robustness.

Motivation: Current LLM evaluations focus on point-wise correctness but fail to assess belief robustness under contextual perturbations. Even facts answered with perfect self-consistency can rapidly collapse under mild interference, highlighting the need for structural measures of belief stability.

Method: Introduces Neighbor-Consistency Belief (NCB) as a structural measure evaluating response coherence across conceptual neighborhoods. Develops a cognitive stress-testing protocol to probe output stability under contextual interference. Proposes Structure-Aware Training (SAT) to optimize context-invariant belief structure.

Result: Experiments across multiple LLMs show that high-NCB data is more resistant to interference. Structure-Aware Training reduces long-tail knowledge brittleness by approximately 30%.

Conclusion: NCB provides a valuable measure for assessing belief robustness in LLMs, and SAT effectively improves model stability under contextual perturbations, addressing limitations of traditional point-wise evaluation methods.

Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence measures such as Self-Consistency, which can mask brittle beliefs. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficacy of NCB, we introduce a new cognitive stress-testing protocol that probes output stability under contextual interference. Experiments across multiple LLMs show that performance on high-NCB data is more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code is available at https://github.com/zjunlp/belief.
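As a hedged sketch of the idea (the paper's exact NCB formulation may differ), one can contrast per-question self-consistency with agreement across a conceptual neighborhood of perturbed probes:

```python
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled answers agreeing with the majority answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def neighbor_consistency(neighborhood_answers):
    """Agreement of each neighbor's majority answer with the global
    majority over the whole conceptual neighborhood (illustrative)."""
    majorities = [Counter(a).most_common(1)[0][0] for a in neighborhood_answers]
    global_majority = Counter(majorities).most_common(1)[0][0]
    return sum(m == global_majority for m in majorities) / len(majorities)

# A fact can be perfectly self-consistent on the canonical question ...
canonical = ["Paris"] * 5
assert self_consistency(canonical) == 1.0
# ... yet disagree on perturbed neighbors, yielding a lower NCB-style score.
neighborhood = [canonical, ["Lyon"] * 5, ["Paris"] * 5, ["Paris"] * 5]
print(neighbor_consistency(neighborhood))  # 0.75
```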

[22] What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews

Jonathan Ivey, Anjalie Field, Ziang Xiao

Main category: cs.CL

TL;DR: This paper evaluates 10 proposed measures of interview response quality using a new dataset of 343 interview transcripts, finding that direct relevance to research questions is the strongest predictor of quality, while clarity and surprisal-based informativeness are not predictive.

Motivation: To address the lack of validated measures for assessing interview response quality in qualitative research and NLP systems. While various quality measures have been proposed, there is insufficient validation that high-scoring responses actually contribute meaningfully to study findings.

Method: The researchers constructed the Qualitative Interview Corpus containing 343 interview transcripts with 16,940 participant responses from 14 real research projects. They identified, implemented, and evaluated 10 proposed measures of interview response quality to determine which are actually predictive of a response’s contribution to study findings.

Result: The key finding is that direct relevance to a key research question is the strongest predictor of response quality. Surprisingly, two measures commonly used to evaluate NLP interview systems - clarity and surprisal-based informativeness - were found not to be predictive of response quality.

Conclusion: The work provides analytic insights and grounded, scalable metrics to inform both qualitative study design and the evaluation of automated interview systems, offering validated measures for assessing interview response quality.

Abstract: Qualitative interviews provide essential insights into human experiences when they elicit high-quality responses. While qualitative and NLP researchers have proposed various measures of interview quality, these measures lack validation that high-scoring responses actually contribute to the study’s goals. In this work, we identify, implement, and evaluate 10 proposed measures of interview response quality to determine which are actually predictive of a response’s contribution to the study findings. To conduct our analysis, we introduce the Qualitative Interview Corpus, a newly constructed dataset of 343 interview transcripts with 16,940 participant responses from 14 real research projects. We find that direct relevance to a key research question is the strongest predictor of response quality. We additionally find that two measures commonly used to evaluate NLP interview systems, clarity and surprisal-based informativeness, are not predictive of response quality. Our work provides analytic insights and grounded, scalable metrics to inform the design of qualitative studies and the evaluation of automated interview systems.

[23] Can We Predict Before Executing Machine Learning Agents?

Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: FOREAGENT replaces expensive physical execution in autonomous ML agents with predictive reasoning, achieving a 6x acceleration in convergence while surpassing execution-based baselines by 6%.

Motivation: Autonomous machine learning agents are constrained by the Generate-Execute-Feedback paradigm, suffering from an Execution Bottleneck where hypothesis evaluation requires expensive physical execution. The paper aims to bypass these physical constraints by internalizing execution priors for predictive reasoning.

Method: Formalize Data-centric Solution Preference task and construct corpus of 18,438 pairwise comparisons. Use LLMs with Verified Data Analysis Reports for predictive capabilities. Implement FOREAGENT with Predict-then-Verify loop that substitutes runtime checks with instantaneous predictive reasoning.

Result: LLMs achieve 61.5% accuracy with robust confidence calibration when primed with Verified Data Analysis Reports. FOREAGENT achieves 6x acceleration in convergence while surpassing execution-based baselines by +6%.

Conclusion: Predictive reasoning can effectively bypass physical execution bottlenecks in autonomous agents, enabling faster convergence while maintaining or improving performance over traditional execution-based approaches.

Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset are publicly available at https://github.com/zjunlp/predict-before-execute.
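The Predict-then-Verify loop can be sketched as follows; the control flow, names, and mock scores are assumptions for illustration, not the released code:

```python
def predict_then_verify(candidates, predict_score, execute):
    """Rank candidate solutions with a cheap predictive judge, then pay
    for physical execution only on the predicted best one."""
    best = max(candidates, key=predict_score)   # instantaneous prediction
    return best, execute(best)                  # single expensive run

# Toy data-centric candidates and mock preference scores from an LLM
# judge primed with a verified data-analysis report (values illustrative).
cands = ["drop_nulls", "impute_mean", "impute_median"]
scores = {"drop_nulls": 0.3, "impute_mean": 0.7, "impute_median": 0.5}
best, result = predict_then_verify(cands, scores.get, lambda s: f"ran {s}")
assert best == "impute_mean"        # only this candidate was executed
```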

[24] Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Purva Chiniya, Kevin Scaria, Sagar Chaturvedi

Main category: cs.CL

TL;DR: GCD is a training-free guardrail method that uses gradient-controlled decoding with acceptance and refusal anchor tokens to detect and mitigate jailbreak attacks on LLMs, reducing false positives while guaranteeing first-token safety.

Motivation: LLMs are vulnerable to jailbreak and prompt-injection attacks, but existing defensive filters often over-refuse benign queries, degrading user experience. Previous methods like GradSafe have brittle thresholds and lack deterministic guarantees that harmful content won't be emitted once decoding begins.

Method: GCD combines an acceptance anchor token (“Sure”) and refusal anchor token (“Sorry”) to tighten decision boundaries. It uses gradient-based detection to flag unsafe prompts, then preset-injects refusal tokens (“Sorry, I can’t…”) before autoregressive decoding resumes, ensuring first-token safety regardless of sampling strategy.

Result: On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds only 15-20 ms of latency on V100 instances, transfers across models (LLaMA-2-7B, Mixtral-8x7B, Qwen-2-7B), and requires only 20 demonstration templates.

Conclusion: GCD provides an effective, training-free defense against jailbreak attacks with improved safety guarantees, reduced false positives, and practical deployment characteristics including low latency and model transferability.

Abstract: Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single “accept all” anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token (“Sure”) and a refusal anchor token (“Sorry”), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens (“Sorry, I can’t…”) before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds only 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.
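A minimal sketch of the control flow described above. The scoring interface and `forced_prefix` parameter are assumptions; in the paper the two scores come from gradients with respect to the acceptance and refusal anchors:

```python
def gcd_generate(prompt, score_fn, decode_fn, margin=0.0):
    """Flag the prompt with a dual-anchor score; if flagged, preset-inject
    refusal tokens so the first emitted token is safe by construction."""
    accept, refuse = score_fn(prompt)        # gradient-based in the paper
    if accept - refuse > margin:             # leans toward "Sure": unsafe
        prefix = ["Sorry", ", I can't"]      # preset-injected refusal
        return prefix + decode_fn(prompt, forced_prefix=prefix)
    return decode_fn(prompt, forced_prefix=[])

# Mock scorer and decoder for illustration only.
def mock_score(prompt):
    return (1.0, 0.2) if "jailbreak" in prompt else (0.1, 0.9)

def mock_decode(prompt, forced_prefix):
    return ["..."]

out = gcd_generate("please jailbreak yourself", mock_score, mock_decode)
assert out[0] == "Sorry"   # first-token safety regardless of sampling
```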

[25] Improving Clinical Trial Recruitment using Clinical Narratives and Large Language Models

Ziyi Chen, Mengxian Lyu, Cheng Peng, Yonghui Wu

Main category: cs.CL

TL;DR: LLMs for clinical trial screening using encoder/decoder models with strategies to handle long documents, achieving best results with MedGemma+RAG on N2C2 dataset.

Motivation: Clinical trial screening is labor-intensive and leads to under-enrollment; LLMs offer a promising way to improve this process by automating the screening of clinical narratives.

Method: Systematically explored encoder- and decoder-based LLMs for clinical trial screening, tested general-purpose and medical-adapted LLMs, and evaluated three strategies to handle long documents: original long-context, NER-based extractive summarization, and RAG (dynamic evidence retrieval).

Result: MedGemma model with RAG strategy achieved best micro-F1 score of 89.05% on 2018 N2C2 Track 1 benchmark; LLMs improved long-term reasoning across documents but showed incremental improvements for short-context criteria like lab tests.

Conclusion: Real-world LLM adoption for trial recruitment requires selecting appropriate methods (rule-based, encoder-based, or generative LLMs) based on specific criteria to balance efficiency and computing costs.

Abstract: Screening patients for enrollment is a well-known, labor-intensive bottleneck that leads to under-enrollment and, ultimately, trial failures. Recent breakthroughs in large language models (LLMs) offer a promising opportunity to use artificial intelligence to improve screening. This study systematically explored both encoder- and decoder-based generative LLMs for screening clinical narratives to facilitate clinical trial recruitment. We examined both general-purpose LLMs and medical-adapted LLMs and explored three strategies to alleviate the “Lost in the Middle” issue when handling long documents, including 1) Original long-context: using the default context windows of LLMs, 2) NER-based extractive summarization: converting the long document into summarizations using named entity recognition, 3) RAG: dynamic evidence retrieval based on eligibility criteria. The 2018 N2C2 Track 1 benchmark dataset is used for evaluation. Our experimental results show that the MedGemma model with the RAG strategy achieved the best micro-F1 score of 89.05%, outperforming other models. Generative LLMs have remarkably improved trial criteria that require long-term reasoning across long documents, whereas trial criteria that span a short piece of context (e.g., lab tests) show incremental improvements. The real-world adoption of LLMs for trial recruitment must consider specific criteria for selecting among rule-based queries, encoder-based LLMs, and generative LLMs to maximize efficiency within reasonable computing costs.
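The RAG strategy amounts to retrieving criterion-specific evidence rather than feeding the whole record to the model. A minimal sketch with a toy lexical scorer (names and the scorer are assumptions; a real system would use dense retrieval):

```python
def retrieve_evidence(criterion, snippets, score, k=3):
    """Rank clinical-note snippets by relevance to one eligibility criterion."""
    return sorted(snippets, key=lambda s: score(criterion, s), reverse=True)[:k]

def lexical_overlap(query, text):
    """Toy relevance score: fraction of query tokens present in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

notes = ["HbA1c 8.2% last month", "patient denies chest pain",
         "metformin 500mg daily"]
top = retrieve_evidence("HbA1c above 7.5%", notes, lexical_overlap, k=1)
assert top == ["HbA1c 8.2% last month"]   # only the relevant snippet is kept
```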

[26] Faster Superword Tokenization

Craig W. Schmidt, Chris Tanner, Yuval Pinter

Main category: cs.CL

TL;DR: Improved BPE algorithms (BoundlessBPE and SuperBPE) that allow superword formation across pretoken boundaries with 600x faster training through efficient two-phase implementation.

Motivation: Standard BPE tokenization is limited to representing at most full words because tokens cannot extend across pretokenization boundaries. BoundlessBPE and SuperBPE overcome this limitation by allowing superword formation, but previous implementations were impractical due to slow training times (4.7 CPU days for 1GB of data).

Method: Two-phase formulation separating regular merges from supermerges, aggregating supermerge candidates by frequency to avoid keeping full documents in memory. Shows near-equivalence between BoundlessBPE and SuperBPE, with automatic hyperparameter determination in BoundlessBPE’s second phase.

Result: Achieved a more than 600x speedup: training on 1GB of data takes 603 seconds for BoundlessBPE and 593 seconds for SuperBPE (vs. the original 4.7 CPU days). Both Python reference and fast Rust implementations are open-sourced.

Conclusion: The improved algorithms enable practical training of superword-aware tokenization while maintaining identical results to original implementations, making advanced tokenization techniques accessible for real-world applications.

Abstract: Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.
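The core observation, that supermerge candidates can be aggregated by frequency like regular pretokens, can be sketched as follows (whitespace split stands in for a real pretokenizer; the actual implementations operate on learned pretokens):

```python
from collections import Counter

def supermerge_candidates(documents):
    """Count consecutive pretoken pairs, the candidates for superwords,
    so full documents never need to stay in memory."""
    counts = Counter()
    for doc in documents:
        pretokens = doc.split()              # stand-in pretokenizer
        for a, b in zip(pretokens, pretokens[1:]):
            counts[(a, b)] += 1              # candidate superword "a b"
    return counts

docs = ["of the people by the people", "for the people"]
counts = supermerge_candidates(docs)
print(counts[("the", "people")])  # 3
```

Each document contributes only pair counts, which is the property that removes the memory bottleneck of the original implementations.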

[27] XMark: Reliable Multi-Bit Watermarking for LLM-Generated Texts

Jiahao Xu, Rui Hu, Olivera Kotevska, Zikai Zhang

Main category: cs.CL

TL;DR: XMark is a novel multi-bit watermarking method for LLM-generated text that improves decoding accuracy while preserving text quality, especially effective with limited token sequences.

Motivation: Existing multi-bit watermarking methods for LLMs have limitations: some are computationally infeasible for large messages, others have poor trade-offs between text quality and decoding accuracy, and decoding accuracy drops significantly when the generated text has few tokens, a common practical scenario.

Method: XMark uses a unique encoder design that produces less distorted logit distributions for watermarked token generation, preserving text quality. It also employs a tailored decoder that can reliably recover encoded messages even with limited tokens.

Result: Extensive experiments across diverse downstream tasks show XMark significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods.

Conclusion: XMark addresses key limitations in multi-bit watermarking for LLMs, providing a practical solution for reliable attribution and tracing of LLM-generated text, especially in scenarios with limited token sequences.

Abstract: Multi-bit watermarking has emerged as a promising solution for embedding imperceptible binary messages into Large Language Model (LLM)-generated text, enabling reliable attribution and tracing of malicious usage of LLMs. Despite recent progress, existing methods still face key limitations: some become computationally infeasible for large messages, while others suffer from a poor trade-off between text quality and decoding accuracy. Moreover, the decoding accuracy of existing methods drops significantly when the number of tokens in the generated text is limited, a condition that frequently arises in practical usage. To address these challenges, we propose \textsc{XMark}, a novel method for encoding and decoding binary messages in LLM-generated texts. The unique design of \textsc{XMark}’s encoder produces a less distorted logit distribution for watermarked token generation, preserving text quality, and also enables its tailored decoder to reliably recover the encoded message with limited tokens. Extensive experiments across diverse downstream tasks show that \textsc{XMark} significantly improves decoding accuracy while preserving the quality of watermarked text, outperforming prior methods. The code is at https://github.com/JiiahaoXU/XMark.
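For intuition, here is a generic green-list-style multi-bit scheme, not XMark's encoder (which is specifically designed to distort the logit distribution less than this hard partition): the message bit seeds a pseudo-random vocabulary split, generation is biased toward the selected half, and a decoder recovers bits by checking which half sampled tokens fall into.

```python
import hashlib

def green_list(prev_token, bit, vocab, frac=0.5):
    """Deterministically partition the vocabulary using the previous token
    and the current message bit as the seed; generation would then be
    biased (via a logit bonus) toward this half."""
    def h(tok):
        key = f"{prev_token}|{bit}|{tok}".encode()
        return int(hashlib.sha256(key).hexdigest(), 16)
    ranked = sorted(vocab, key=h)
    return set(ranked[: int(len(ranked) * frac)])

vocab = [f"tok{i}" for i in range(10)]
g0 = green_list("the", 0, vocab)
assert len(g0) == 5                         # half the vocabulary
assert green_list("the", 0, vocab) == g0    # deterministic, so decodable
# A different bit generally seeds a different partition, which is what
# lets the decoder read the message back from the generated tokens.
```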

[28] Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning

Jon-Paul Cacioli

Main category: cs.CL

TL;DR: Autoregressive transformer language models trained on synthetic corpora fail to learn second-order shape generalizations (overhypotheses) that children naturally acquire, despite perfect first-order exemplar retrieval.

Motivation: To understand what learning mechanisms enable children to make second-order generalizations about object features (like shape being category-defining), and whether current transformer models can achieve similar inductive leaps.

Method: Trained autoregressive transformer language models (3.4M-25.6M parameters) on synthetic corpora where shape is the stable feature dimension across categories, with 8 controlled conditions. Used 120 pre-registered runs evaluated on a 1,040-item wug test battery.

Result: All models achieved perfect first-order exemplar retrieval (100%) but second-order generalization to novel nouns remained at chance (50-52%). Feature-swap diagnostic revealed models rely on frame-to-feature template matching rather than structured abstraction.

Conclusion: Autoregressive distributional sequence learning has clear limitations in acquiring second-order generalizations under developmental-scale training conditions, highlighting a gap between current models and human cognitive development.

Abstract: Background: Children do not simply learn that balls are round and blocks are square. They learn that shape is the kind of feature that tends to define object categories – a second-order generalisation known as an overhypothesis [1, 2]. What kind of learning mechanism is sufficient for this inductive leap? Methods: We trained autoregressive transformer language models (3.4M-25.6M parameters) on synthetic corpora in which shape is the stable feature dimension across categories, with eight conditions controlling for alternative explanations. Results: Across 120 pre-registered runs evaluated on a 1,040-item wug test battery, every model achieved perfect first-order exemplar retrieval (100%) while second-order generalisation to novel nouns remained at chance (50-52%), a result confirmed by equivalence testing. A feature-swap diagnostic revealed that models rely on frame-to-feature template matching rather than structured noun-to-domain-to-feature abstraction. Conclusions: These results reveal a clear limitation of autoregressive distributional sequence learning under developmental-scale training conditions.

[29] Do Domain-specific Experts exist in MoE-based LLMs?

Giang Do, Hung Le, Truyen Tran

Main category: cs.CL

TL;DR: Domain-specific experts exist in MoE-based LLMs, and a training-free Domain Steering Mixture of Experts (DSMoE) framework leverages these experts to improve performance without additional inference cost.

Motivation: While the MoE architecture improves computational efficiency for large LLMs, the nature of expert specializations and how to systematically interpret them remain unclear. The paper investigates whether domain-specific experts exist in MoE-based LLMs.

Method: Evaluated ten MoE-based LLMs (3.8B to 120B parameters) to find empirical evidence of domain-specific experts. Proposed DSMoE, a training-free framework that uses these experts without additional inference cost or retraining.

Result: Found evidence for domain-specific experts in MoE-based LLMs. DSMoE outperformed both well-trained MoE-based LLMs and strong baselines including SFT, achieving strong performance and robust generalization across target and non-target domains.

Conclusion: Domain-specific experts do exist in MoE-based LLMs, and the proposed DSMoE framework effectively leverages these specializations to improve performance without additional computational cost or retraining requirements.

Abstract: In the era of Large Language Models (LLMs), the Mixture of Experts (MoE) architecture has emerged as an effective approach for training extremely large models with improved computational efficiency. This success builds upon extensive prior research aimed at enhancing expert specialization in MoE-based LLMs. However, the nature of such specializations and how they can be systematically interpreted remain open research challenges. In this work, we investigate this gap by posing a fundamental question: \textit{Do domain-specific experts exist in MoE-based LLMs?} To answer the question, we evaluate ten advanced MoE-based LLMs ranging from 3.8B to 120B parameters and provide empirical evidence for the existence of domain-specific experts. Building on this finding, we propose \textbf{Domain Steering Mixture of Experts (DSMoE)}, a training-free framework that introduces zero additional inference cost and outperforms both well-trained MoE-based LLMs and strong baselines, including Supervised Fine-Tuning (SFT). Experiments on four advanced open-source MoE-based LLMs across both target and non-target domains demonstrate that our method achieves strong performance and robust generalization without increasing inference cost or requiring additional retraining. Our implementation is publicly available at https://github.com/giangdip2410/Domain-specific-Experts.
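One plausible reading of training-free expert steering (an assumption for illustration; the paper's exact mechanism may differ) is a bias added to the router logits of the identified domain experts before top-k selection:

```python
import numpy as np

def steer_router(logits, domain_experts, alpha=1.0, k=2):
    """Boost the router logits of domain-associated experts, then take
    top-k as usual: no retraining and no extra inference cost."""
    steered = logits.copy()
    steered[list(domain_experts)] += alpha
    return set(np.argsort(steered)[-k:].tolist())

logits = np.array([0.1, 0.9, 0.2, 0.8])      # per-expert router logits
assert steer_router(logits, domain_experts=set()) == {1, 3}  # unsteered
assert steer_router(logits, domain_experts={2}) == {1, 2}    # steered in
```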

[30] Beneath the Surface: Investigating LLMs’ Capabilities for Communicating with Subtext

Kabir Ahuja, Yuxuan Li, Andrew Kyle Lampinen

Main category: cs.CL

TL;DR: LLMs struggle with creative subtext communication, showing bias toward literal interpretation despite some ability to use common ground in multimodal games.

Motivation: To systematically evaluate whether language models can understand and generate subtext (implied meaning beyond literal content) in creative communication settings, a capability fundamental to human communication.

Method: Introduces four evaluation suites: writing & interpreting allegories, multi-agent games, and multimodal games inspired by board games like Dixit. Tests models’ ability to use common ground and handle nuanced constraints in creative communication.

Result: Frontier models show strong bias toward literal communication, failing to account for nuanced constraints (60% literal clues in Visual Allusions). Some models can use common ground to reduce literal clues by 30-50%, but struggle when common ground isn’t explicitly stated. Paratextual and persona conditions significantly affect allegory interpretation.

Conclusion: Current LLMs have significant weaknesses in creative subtext communication despite some capabilities. The work provides quantifiable measures for complex phenomena and reveals model idiosyncrasies, calling for future work on socially grounded creative communication.

Abstract: Human communication is fundamentally creative, and often makes use of subtext, implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing & interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication, and thereby fail to account for nuanced constraints: even the best-performing models generate literal clues 60% of the time in one of our environments, Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to help them communicate with subtext, achieving a 30%-50% reduction in overly literal clues; but they struggle to infer the presence of common ground when it is not explicitly stated. For allegory understanding, we find that paratextual and persona conditions significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research will inspire future work toward socially grounded creative communication and reasoning.

[31] Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification

Jinhong Jeong, Junghun Park, Youngjae Yu

Main category: cs.CL

TL;DR: Re-RIGHT is a reinforcement learning framework for multilingual text simplification without parallel corpora, addressing poor performance of LLMs at easier proficiency levels and for non-English languages.

Motivation: Existing LLM-based text simplification methods rely on pre-labeled parallel corpora and work poorly for non-English languages and easier proficiency levels, making personalized L2 learning support costly and limited.

Method: Proposes Re-RIGHT: a unified RL framework with three reward modules (vocabulary coverage, semantic preservation, coherence) trained on 43K vocabulary-level data across four languages (English, Japanese, Korean, Chinese) using a 4B policy model.

Result: Re-RIGHT outperforms strong LLM baselines (GPT-5.2, Gemini 2.5) in achieving higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.

Conclusion: The framework enables adaptive multilingual text simplification without parallel corpus supervision, supporting personalized L2 learning across multiple languages and proficiency levels.

Abstract: Text simplification supports second language (L2) learning by providing comprehensible input, consistent with the Input Hypothesis. However, constructing personalized parallel corpora is costly, while existing large language model (LLM)-based readability control methods rely on pre-labeled sentence corpora and primarily target English. We propose Re-RIGHT, a unified reinforcement learning framework for adaptive multilingual text simplification without parallel corpus supervision. We first show that prompting-based lexical simplification at target proficiency levels (CEFR, JLPT, TOPIK, and HSK) performs poorly at easier levels and for non-English languages, even with state-of-the-art LLMs such as GPT-5.2 and Gemini 2.5. To address this, we collect 43K vocabulary-level data across four languages (English, Japanese, Korean, and Chinese) and train a compact 4B policy model using Re-RIGHT, which integrates three reward modules: vocabulary coverage, semantic preservation, and coherence. Compared to the stronger LLM baselines, Re-RIGHT achieves higher lexical coverage at target proficiency levels while maintaining original meaning and fluency.
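The vocabulary-coverage reward can be sketched as the fraction of output words inside the target proficiency level's word list (formulation assumed; the toy set below stands in for a CEFR-level vocabulary):

```python
def vocab_coverage_reward(simplified, allowed_vocab):
    """Fraction of words in the simplified text drawn from the target
    proficiency level's vocabulary; one of three reward signals."""
    words = simplified.lower().split()
    if not words:
        return 0.0
    return sum(w in allowed_vocab for w in words) / len(words)

cefr_a2 = {"the", "cat", "sat", "on", "mat", "a"}       # toy word list
print(vocab_coverage_reward("The cat sat on the mat", cefr_a2))       # 1.0
print(vocab_coverage_reward("The feline reposed on the mat", cefr_a2))  # ~0.67
```

In the paper this signal is combined with semantic-preservation and coherence rewards to train the 4B policy model with RL.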

[32] DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

Jason Lucas, Matt Murtagh, Ali Al-Lawati, Uchendu Uchendu, Adaku Uchendu, Dongwon Lee

Main category: cs.CL

TL;DR: DIA-HARM benchmark evaluates disinformation detection robustness across 50 English dialects, revealing systematic vulnerabilities in current models that disadvantage non-Standard American English speakers.

Motivation: Current harmful content detectors are predominantly developed and evaluated on Standard American English, leaving their robustness to dialectal variation unexplored and potentially disadvantaging hundreds of millions of non-SAE speakers worldwide.

Method: Created DIA-HARM benchmark using Multi-VALUE’s linguistically grounded transformations to generate D3 corpus of 195K samples across 50 English dialects from established disinformation benchmarks, then evaluated 16 detection models including fine-tuned transformers and zero-shot LLMs.

Result: Human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable; fine-tuned transformers outperform zero-shot LLMs (96.6% vs 78.3% F1); multilingual models generalize effectively while monolingual models fail on dialectal inputs; some models show catastrophic failures exceeding 33% degradation.

Conclusion: Current disinformation detectors systematically disadvantage non-SAE speakers, highlighting the need for more robust multilingual and dialect-aware models for equitable content moderation.

Abstract: Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE’s linguistically grounded transformations, we introduce D3 (Dialectal Disinformation Detection), a corpus of 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM framework, D3 corpus, and evaluation tools: https://github.com/jsl5710/dia-harm

[33] Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

Xiangxu Zhang, Jiamin Wang, Qinlin Zhao, Hanze Guo, Linzhuo Li, Jing Yao, Xiao Zhou, Xiaoyuan Yi, Xing Xie

Main category: cs.CL

TL;DR: CIVA introduces a controlled multi-agent environment to study how human value misalignment affects collective behavior in LLM-based agent communities, revealing critical values that shape dynamics and cause system failures.

Motivation: As LLMs integrate into society, understanding their alignment with human values becomes crucial, especially in multi-agent systems where individual misalignments can accumulate into group-level failures. The paper aims to investigate whether misalignment with human values alters collective behavior in LLM agents and what changes it induces.

Method: Introduces CIVA, a controlled multi-agent environment grounded in social science theories where LLM agents form communities, autonomously communicate, explore, and compete for resources. Enables systematic manipulation of value prevalence and behavioral analysis through comprehensive simulation experiments.

Result: Three key findings: (1) Identified structurally critical values that substantially shape community collective dynamics, including those diverging from LLMs’ original orientations; (2) Detected system failure modes like catastrophic collapse at macro level triggered by value misspecification; (3) Observed emergent behaviors like deception and power-seeking at micro level.

Conclusion: Provides quantitative evidence that human values are essential for collective outcomes in LLMs and motivates future multi-agent value alignment research. Shows how value misalignment can lead to emergent negative behaviors and system failures in LLM-based multi-agent systems.

Abstract: As LLMs become increasingly integrated into human society, evaluating their orientations on human values from social science has drawn growing attention. Nevertheless, it is still unclear why human values matter for LLMs, especially in LLM-based multi-agent systems, where group-level failures may accumulate from individually misaligned actions. We ask whether misalignment with human values alters the collective behavior of LLM agents and what changes it induces? In this work, we introduce CIVA, a controlled multi-agent environment grounded in social science theories, where LLM agents form a community and autonomously communicate, explore, and compete for resources, enabling systematic manipulation of value prevalence and behavioral analysis. Through comprehensive simulation experiments, we reveal three key findings. (1) We identify several structurally critical values that substantially shape the community’s collective dynamics, including those diverging from LLMs’ original orientations. Triggered by the misspecification of these values, we (2) detect system failure modes, e.g., catastrophic collapse, at the macro level, and (3) observe emergent behaviors like deception and power-seeking at the micro level. These results offer quantitative evidence that human values are essential for collective outcomes in LLMs and motivate future multi-agent value alignment.

[34] DQA: Diagnostic Question Answering for IT Support

Vishaal Kapoor, Mariam Dundua, Sarthak Ahuja, Neda Kordjazi, Evren Yortucboylu, Vaibhavi Padala, Derek Ho, Jennifer Whitted, Rebecca Steinert

Main category: cs.CL

TL;DR: DQA is a diagnostic question-answering framework for enterprise IT support that maintains persistent diagnostic state and aggregates retrieved cases at root cause level, outperforming standard multi-turn RAG systems.

Motivation: Standard multi-turn RAG systems lack explicit diagnostic state, making them ineffective at accumulating evidence and resolving competing hypotheses across turns in enterprise IT support scenarios where diagnostic reasoning is crucial.

Method: DQA combines conversational query rewriting, retrieval aggregation at root cause level (rather than individual documents), and state-conditioned response generation to maintain persistent diagnostic state and support systematic troubleshooting under enterprise constraints.
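
The aggregation step can be sketched as follows; a minimal illustration assuming each retrieved case carries a root-cause label and a retrieval score (the field names and the simple score-sum ranking are illustrative, not the paper's exact scheme):

```python
from collections import defaultdict

def aggregate_by_root_cause(retrieved_cases):
    """Aggregate retrieval scores at the root-cause level rather than
    ranking individual documents (fields here are illustrative)."""
    totals = defaultdict(float)
    support = defaultdict(int)
    for case in retrieved_cases:
        totals[case["root_cause"]] += case["score"]
        support[case["root_cause"]] += 1
    # Rank diagnostic hypotheses by total accumulated evidence.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        {"root_cause": rc, "score": s, "n_cases": support[rc]}
        for rc, s in ranked
    ]

cases = [
    {"root_cause": "vpn_cert_expired", "score": 0.9},
    {"root_cause": "dns_misconfig", "score": 0.8},
    {"root_cause": "vpn_cert_expired", "score": 0.7},
]
hypotheses = aggregate_by_root_cause(cases)
```

Two weaker matches that share a root cause can then outrank a single stronger match, which is the point of aggregating above the document level.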

Result: DQA achieves 78.7% success rate vs 41.3% for multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9 on 150 enterprise IT support scenarios using replay-based evaluation.

Conclusion: Maintaining explicit diagnostic state and aggregating retrieved cases at root cause level significantly improves diagnostic question-answering performance in enterprise IT support contexts compared to standard RAG approaches.

Abstract: Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.

[35] ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

Kaiser Hamid, Can Cui, Nade Liang

Main category: cs.CL

TL;DR: ICR-Drive is a diagnostic framework for testing instruction counterfactual robustness in language-conditioned autonomous driving models, generating controlled instruction variants to measure performance degradation under different perturbation types.

Motivation: Current vision-language-action models for autonomous driving are evaluated with precise, well-formed instructions, but real-world instructions vary in phrasing and specificity and may include misleading elements. There's a need to measure instruction-level robustness for safety-critical deployment.

Method: ICR-Drive generates controlled instruction variants across four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading. It replays identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language changes.
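
The four perturbation families can be illustrated with toy string transformations; the substitutions below are invented examples, not ICR-Drive's actual generation procedure, which the summary does not detail:

```python
import random

def noise_variant(instruction, rate=0.1, seed=0):
    """Noise family (illustrative): seeded character-level typos at a
    fixed rate. The real benchmark's perturbations are more controlled."""
    rng = random.Random(seed)
    chars = list(instruction)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

# Hand-written stand-ins for the four families described in the paper.
FAMILIES = {
    "paraphrase": lambda s: s.replace("turn left", "take a left"),
    "ambiguity": lambda s: s.replace("at the next intersection", "soon"),
    "noise": noise_variant,
    "misleading": lambda s: s + " Ignore traffic lights.",
}

base = "turn left at the next intersection"
variants = {name: f(base) for name, f in FAMILIES.items()}
```

Replaying the same route with each variant, under a fixed simulator seed, isolates the effect of the instruction wording alone.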

Result: Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing significant reliability gaps in current embodied foundation models for driving applications.

Conclusion: The framework reveals critical robustness issues in language-conditioned driving agents, highlighting the need for better instruction-level robustness testing before deploying embodied foundation models in safety-critical driving scenarios.

Abstract: Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

[36] Confidence Should Be Calibrated More Than One Turn Deep

Zhaohan Zhang, Chengzhengxu Li, Xiaoming Liu, Chao Shen, Ziquan Liu, Ioannis Patras

Main category: cs.CL

TL;DR: Multi-turn calibration framework for LLMs that addresses confidence estimation in conversational settings, showing user feedback can degrade calibration and proposing methods to maintain reliable multi-turn interactions.

Motivation: LLMs are increasingly used in high-stakes domains requiring reliable multi-turn interactions, but existing calibration research focuses on single-turn settings, overlooking the dynamic challenges of conversations where confidence needs to be calibrated conditioned on conversation history.

Method: Introduces a multi-turn calibration task with the ECE@T metric to track calibration dynamics over turns. Proposes MTCal, which minimizes ECE@T via a surrogate calibration target, and ConfChat, a decoding strategy that leverages calibrated confidence to improve factuality and consistency in multi-turn responses.
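
A standard binned ECE, evaluated on the predictions collected at one turn, is the building block behind ECE@T; the sketch below assumes equal-width bins, which is the common convention but not confirmed by the summary:

```python
import numpy as np

def ece_at_turn(confidences, correct, n_bins=10):
    """Binned Expected Calibration Error for one conversation turn.
    ECE@T tracks this quantity at each turn T conditioned on the
    dialogue history; the equal-width binning here is an assumption."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # |accuracy - mean confidence| weighted by the bin's sample mass.
        ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy turn where the model claims 95% confidence but is right 6/10 times.
ece = ece_at_turn([0.95] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
```

Plotting this value against the turn index is what exposes the degradation under user persuasion that the paper reports.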

Result: MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves/enhances model performance in multi-turn interactions. Shows user feedback (e.g., persuasion) can degrade multi-turn calibration.

Conclusion: Multi-turn calibration is a crucial missing link for scaling LLM calibration toward safe, reliable, real-world use in conversational applications.

Abstract: Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.

[37] Multi-Drafter Speculative Decoding with Alignment Feedback

Taehyeon Kim, Hojung Jung, Se-Young Yun

Main category: cs.CL

TL;DR: MetaSD introduces a unified framework for speculative decoding that integrates multiple drafters and dynamically allocates computational resources using multi-armed bandit optimization based on alignment feedback.

Motivation: Current speculative decoding approaches use single drafters trained for specific tasks/domains, which limits their effectiveness across diverse applications. There's a need for a more flexible framework that can leverage multiple drafters.

Method: MetaSD integrates multiple heterogeneous drafters into speculative decoding and uses a multi-armed bandit formulation to dynamically allocate computational resources based on alignment feedback between drafters and the target LLM.
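
The bandit view of drafter selection can be sketched with UCB1, using each round's token-acceptance rate as the reward; the paper frames the problem as a multi-armed bandit, but UCB1 and the constants here are illustrative choices, not necessarily MetaSD's algorithm:

```python
import math

class DrafterBandit:
    """UCB1 selection over heterogeneous drafters, rewarding each arm
    with the fraction of its draft tokens the target LLM accepted
    (a stand-in for MetaSD's alignment feedback)."""

    def __init__(self, n_drafters, c=1.0):
        self.counts = [0] * n_drafters
        self.values = [0.0] * n_drafters  # running mean acceptance rate
        self.c = c
        self.t = 0

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:  # play every drafter once before using UCB scores
                return i
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + self.c * math.sqrt(math.log(self.t) / self.counts[i]),
        )

    def update(self, i, acceptance_rate):
        self.counts[i] += 1
        self.values[i] += (acceptance_rate - self.values[i]) / self.counts[i]

# Toy run: drafter 0 aligns well with the target (0.9 acceptance), drafter 1 poorly.
bandit = DrafterBandit(n_drafters=2)
true_rates = [0.9, 0.3]  # hypothetical per-drafter acceptance rates
for _ in range(100):
    arm = bandit.select()
    bandit.update(arm, true_rates[arm])
```

After a short exploration phase the well-aligned drafter absorbs most of the draft budget, which is the intended resource-allocation behavior.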

Result: Extensive experiments show MetaSD consistently outperforms single-drafter approaches in speculative decoding performance.

Conclusion: MetaSD provides an effective unified framework for multi-drafter speculative decoding that adapts to diverse applications through dynamic resource allocation.

Abstract: Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens. However, individual drafters, often trained for specific tasks or domains, exhibit limited effectiveness across diverse applications. To address this, we introduce MetaSD, a unified framework that integrates multiple drafters into the SD process. MetaSD dynamically allocates computational resources to heterogeneous drafters by leveraging alignment feedback and framing drafter selection as a multi-armed bandit problem. Extensive experiments show MetaSD consistently outperforms single-drafter approaches.

[38] Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen, Hongxia Xu, Renjie Hua, Chuan Ren, Jian Wu

Main category: cs.CL

TL;DR: VL-MDR is a vision-language reward modeling framework that dynamically decomposes evaluation into interpretable dimensions using visual-aware gating, addressing the trade-off between interpretability and efficiency in reward modeling.

Motivation: Current vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." There's a need to bridge this gap between interpretability and efficiency.

Method: Proposes VL-MDR framework that dynamically decomposes evaluation into granular, interpretable dimensions. Uses visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. Curates a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions.
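
The gating idea reduces to a softmax-weighted combination of per-dimension scores. In VL-MDR the logits come from a learned visual-aware network; the fixed logits and dimension names below are purely illustrative:

```python
import numpy as np

def gated_reward(dim_scores, gate_logits):
    """Combine per-dimension scores into one reward via an input-dependent
    softmax gate. A minimal sketch: the paper's gate is a learned network
    conditioned on the visual input, not hand-set logits."""
    gate_logits = np.asarray(gate_logits, dtype=float)
    w = np.exp(gate_logits - gate_logits.max())  # numerically stable softmax
    w /= w.sum()
    return float(np.dot(w, dim_scores)), w

# Hypothetical dimensions: [Hallucination, Reasoning, Fluency].
# For an image-heavy query the gate puts nearly all mass on Hallucination.
scores = np.array([0.2, 0.9, 0.8])
reward, weights = gated_reward(scores, gate_logits=[8.0, 0.0, 0.0])
```

Because the gate weights are explicit, the scalar reward stays inspectable: one can read off which dimensions drove the score for a given input.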

Result: VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability.

Conclusion: VL-MDR provides a scalable solution for VLM alignment by offering interpretable yet efficient reward modeling that bridges the gap between generative and discriminative approaches.

Abstract: Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque “black boxes.” To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.

[39] Content Fuzzing for Escaping Information Cocoons on Digital Social Media

Yifeng He, Ziye Tang, Hao Chen

Main category: cs.CL

TL;DR: ContentFuzz is a framework that uses LLMs to rewrite social media posts to change their machine-inferred stance labels while preserving human intent, aiming to help content reach beyond echo chambers.

Motivation: Social media platforms use stance detection in recommendation systems, which creates information cocoons that limit exposure to diverse viewpoints. This restricts dissenting opinions and hinders constructive discourse. The paper takes the creator's perspective to investigate how content can be revised to reach beyond existing affinity clusters.

Method: ContentFuzz is a confidence-guided fuzzing framework that uses large language models to generate meaning-preserving rewrites of posts. It guides the LLM using confidence feedback from stance detection models to create rewrites that change machine-inferred stance labels while preserving human-interpreted intent.
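
The confidence-guided loop can be sketched as follows, with toy stand-ins for the LLM rewriter, the stance model, and the similarity scorer (all three stubs are assumptions; the paper's components are real models):

```python
def content_fuzz(post, rewrite_fn, stance_fn, sim_fn,
                 max_iters=10, sim_threshold=0.8):
    """Confidence-guided fuzzing loop (sketch): repeatedly request
    meaning-preserving rewrites, feeding the classifier's confidence back
    to the rewriter, until the machine-inferred stance label flips while
    similarity to the original post stays above a threshold."""
    orig_label, _ = stance_fn(post)
    candidate = post
    for _ in range(max_iters):
        label, confidence = stance_fn(candidate)
        if label != orig_label and sim_fn(post, candidate) >= sim_threshold:
            return candidate
        # Confidence feedback: the rewriter can push harder on candidates
        # the classifier is still very sure about.
        candidate = rewrite_fn(candidate, confidence)
    return None

# Toy stand-ins: the stance model keys on one word; the "LLM" swaps it.
def toy_stance(text):
    return ("supportive", 0.95) if "love" in text else ("neutral", 0.6)

def toy_rewrite(text, confidence):
    return text.replace("love", "appreciate")

result = content_fuzz("I love this policy", toy_rewrite, toy_stance,
                      sim_fn=lambda a, b: 0.9)
```

The loop terminates either with a label-flipping rewrite that passes the similarity check, or with `None` when the budget is exhausted.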

Result: Evaluated on four representative stance detection models across three datasets in two languages, ContentFuzz effectively changes machine-classified stance labels while maintaining semantic integrity with respect to the original content.

Conclusion: The framework successfully demonstrates that content can be algorithmically rewritten to bypass stance-based filtering and reach broader audiences, potentially mitigating the effects of information cocoons on social media platforms.

Abstract: Information cocoons on social media limit users’ exposure to posts with diverse viewpoints. Modern platforms use stance detection as an important signal in recommendation and ranking pipelines, which can route posts primarily to like-minded audiences and reduce cross-cutting exposure. This restricts the reach of dissenting opinions and hinders constructive discourse. We take the creator’s perspective and investigate how content can be revised to reach beyond existing affinity clusters. We present ContentFuzz, a confidence-guided fuzzing framework that rewrites posts while preserving their human-interpreted intent and induces different machine-inferred stance labels. ContentFuzz aims to route posts beyond their original cocoons. Our method guides a large language model (LLM) to generate meaning-preserving rewrites using confidence feedback from stance detection models. Evaluated on four representative stance detection models across three datasets in two languages, ContentFuzz effectively changes machine-classified stance labels, while maintaining semantic integrity with respect to the original content.

[40] Don’t Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

Yuzhe Zhang, Xianwei Xue, Xingyong Wu, Mengke Chen, Chen Liu, Xinran He, Run Shao, Feiran Liu, Huanmin Xu, Qiutong Pan, Haiwei Wang

Main category: cs.CL

TL;DR: VeriGUI introduces a verification-driven framework for GUI agents that explicitly models action outcomes and recovery strategies to handle noisy real-world environments with network latency and system interruptions.

Motivation: Current autonomous GUI agents based on vision-language models assume deterministic environment responses, but real-world settings have network latency, rendering delays, and system interruptions that cause undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Learning robust recovery strategies is challenging due to high online interaction costs and lack of real-time feedback in offline datasets.

Method: VeriGUI introduces a Thinking-Verification-Action-Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline combining Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. A Robustness Benchmark based on AndroidControl is constructed to evaluate failure recognition and correction.
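
The verification-and-correction core of a TVAE-style step can be sketched against a flaky toy environment; `execute`, `observe`, and the simple retry policy below are illustrative stand-ins, not VeriGUI's actual interface or corrective reasoning:

```python
def act_with_verification(execute, observe, expectation, max_retries=2):
    """Action-Expectation loop (sketch of the TVAE idea): after acting,
    check the observed state against the declared expectation, and retry
    on a mismatch instead of acting blindly on a failed effect."""
    for attempt in range(1, max_retries + 2):
        execute()
        if expectation(observe()):
            return {"ok": True, "attempts": attempt}
    return {"ok": False, "attempts": max_retries + 1}

# Flaky toy environment: the first tap is silently dropped (e.g. latency).
state = {"screen": "home", "taps": 0}

def tap_settings():
    state["taps"] += 1
    if state["taps"] >= 2:
        state["screen"] = "settings"

result = act_with_verification(
    tap_settings,
    observe=lambda: state,
    expectation=lambda s: s["screen"] == "settings",
)
```

An agent without the verification step would proceed as if the first tap had succeeded, which is exactly the failure-loop behavior the paper targets.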

Result: Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

Conclusion: The proposed verification-driven approach effectively addresses the limitations of deterministic assumptions in GUI agents, enabling more robust performance in noisy real-world environments through explicit failure detection and recovery mechanisms.

Abstract: Autonomous GUI agents based on vision-language models (VLMs) often assume deterministic environment responses, generating actions without verifying whether previous operations succeeded. In real-world settings with network latency, rendering delays, and system interruptions, this assumption leads to undetected action failures, repetitive ineffective behaviors, and catastrophic error accumulation. Moreover, learning robust recovery strategies is challenging due to the high cost of online interaction and the lack of real-time feedback in offline datasets. We propose VeriGUI (Verification-driven GUI Agent), which explicitly models action outcomes and recovery under noisy environments. VeriGUI introduces a Thinking–Verification–Action–Expectation (TVAE) framework to detect failures and guide corrective reasoning, and a two-stage training pipeline that combines Robust SFT with synthetic failure trajectories and GRPO with asymmetric verification rewards. We further construct a Robustness Benchmark based on AndroidControl to evaluate failure recognition and correction. Experiments show that VeriGUI significantly reduces failure loops and improves recovery success while maintaining competitive standard task performance.

[41] Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs

Hongcheng Liu, Yuhao Wang, Zhe Chen, Pingjie Wang, Zhiyuan Zhu, Yixuan Hou, Yanfeng Wang, Yu Wang

Main category: cs.CL

TL;DR: CrossOmni dataset and methods to address cross-modal coreference problems in Omni-LLMs, improving fine-grained alignment between modalities for better omni-modal reasoning.

Motivation: Omni-LLMs struggle with complex scenarios requiring synergistic omni-modal reasoning, particularly in fine-grained cross-modal alignment and identifying shared referents across modalities, which has been largely overlooked.

Method: Formalize the challenge as cross-modal coreference problem, introduce CrossOmni dataset with 9 tasks and human-designed reasoning rationales, propose training-free In-Context Learning and training-based SFT+GRPO framework to enhance cross-modal alignment.

Result: Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference; both proposed approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks.

Conclusion: Cross-modal coreference is a crucial missing piece for advancing robust omni-modal reasoning, and addressing it significantly improves multimodal understanding capabilities.

Abstract: Omni Large Language Models (Omni-LLMs) have demonstrated impressive capabilities in holistic multi-modal perception, yet they consistently falter in complex scenarios requiring synergistic omni-modal reasoning. Beyond understanding global multimodal context, effective reasoning also hinges on fine-grained cross-modal alignment, especially identifying shared referents across modalities, yet this aspect has been largely overlooked. To bridge this gap, we formalize the challenge as a cross-modal coreference problem, where a model must localize a referent in a source modality and re-identify it in a target modality. Building on this paradigm, we introduce CrossOmni, a dataset comprising nine tasks equipped with human-designed reasoning rationales to evaluate and enhance this capability. Experiments on 13 Omni-LLMs reveal systematic weaknesses in cross-modal coreference, which we attribute to the absence of coreference-aware thinking patterns. To address this, we enhance cross-modal alignment via two strategies: a training-free In-Context Learning method and a training-based SFT+GRPO framework designed to induce such thinking patterns. Both approaches yield substantial performance gains and generalize effectively to collaborative reasoning tasks. Overall, our findings highlight cross-modal coreference as a crucial missing piece for advancing robust omni-modal reasoning.

[42] Turbulence-like 5/3 spectral scaling in contextual representations of language as a complex system

Zhongxin Yang, Chun Bao, Yuanwei Bin, Xiang I. A. Yang, Shiyi Chen

Main category: cs.CL

TL;DR: Text represented as trajectories in embedding space shows power-law scaling with exponent ~5/3 across languages, revealing scale-free semantic integration similar to turbulence physics.

Motivation: To understand the statistical regularities and organizational principles of natural language by analyzing text as dynamical trajectories in high-dimensional embedding spaces, moving beyond lexical statistics to examine context-dependent semantic integration.

Method: Represent text as trajectories in transformer-based language model embedding spaces, quantify scale-dependent fluctuations using embedding-step signals, compute power spectra across multiple languages and corpora, and compare contextual vs. static embeddings and randomized token orders.
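
The spectral analysis can be reproduced in miniature: compute the periodogram of a 1-D signal and fit its log-log slope. The sanity check below builds a synthetic signal with an exact k^(-5/3) power spectrum; the paper's actual signal is the embedding-step sequence from a transformer, which is not reconstructed here:

```python
import numpy as np

def spectral_slope(signal):
    """Log-log slope of the periodogram of a 1-D signal; applied to an
    embedding-step signal, a value near -5/3 would match the paper's
    reported turbulence-like scaling."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    psd = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x))
    keep = (freqs > 0) & (psd > 0)  # drop DC and any exactly zero bins
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(psd[keep]), 1)
    return slope

# Synthetic check: amplitudes ~ k^(-5/6) give power ~ k^(-5/3) exactly.
n = 4096
amp = np.zeros(n // 2 + 1)
k = np.arange(1, n // 2 + 1, dtype=float)
amp[1:] = k ** (-5.0 / 6.0)
synthetic = np.fft.irfft(amp, n=n)
slope = spectral_slope(synthetic)
```

Running the same estimator on shuffled token order should destroy the scaling, mirroring the paper's randomization control.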

Result: Consistent power-law scaling with exponent close to 5/3 observed across languages and corpora in contextual embeddings from both human and AI-generated text, absent in static word embeddings, and disrupted by token randomization.

Conclusion: Language exhibits scale-free, self-similar semantic integration across linguistic scales, analogous to Kolmogorov turbulence spectrum, providing quantitative benchmark for studying complex structure in language representations.

Abstract: Natural language is a complex system that exhibits robust statistical regularities. Here, we represent text as a trajectory in a high-dimensional embedding space generated by transformer-based language models, and quantify scale-dependent fluctuations along the token sequence using an embedding-step signal. Across multiple languages and corpora, the resulting power spectrum exhibits a robust power law with an exponent close to 5/3 over an extended frequency range. This scaling is observed consistently in contextual embeddings from both human-written and AI-generated text, but is absent in static word embeddings and is disrupted by randomization of token order. These results show that the observed scaling reflects multiscale, context-dependent organization rather than lexical statistics alone. By analogy with the Kolmogorov spectrum in turbulence, our findings suggest that semantic information is integrated in a scale-free, self-similar manner across linguistic scales, and provide a quantitative, model-agnostic benchmark for studying complex structure in language representations.

[43] Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

Jinhu Fu, Yan Bai, Longzhu He, Yihang Lou, Yanxiao Zhao, Li Sun, Sen Su

Main category: cs.CL

TL;DR: CoT2Edit: A new paradigm for knowledge editing in LLMs that uses Chain of Thoughts reasoning to improve generalization and handle both structured and unstructured factual information, achieving strong performance across diverse editing scenarios.

Motivation: Current knowledge editing approaches for LLMs have poor generalization (rigid injection without practical problem-solving ability) and narrow scope (focus only on structured fact triples, ignoring unstructured forms like news/articles prevalent in real-world contexts).

Method: Proposes CoT2Edit: (1) Uses language model agents to generate Chain of Thoughts for both structured and unstructured edited data to create high-quality instruction data, (2) Trains model to reason over edited knowledge via supervised fine-tuning and Group Relative Policy Optimization, (3) At inference, integrates Retrieval-Augmented Generation to dynamically retrieve relevant edited facts for real-time knowledge editing.
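
The inference-time retrieval step can be sketched with a token-overlap scorer; the paper specifies RAG but not the retriever, so the scoring function and the example edited facts below are illustrative:

```python
import re

def retrieve_edits(query, edit_store, top_k=2):
    """Rank stored edited facts by token overlap with the query, as a
    stand-in for the RAG component CoT2Edit uses at inference time
    (a real system would use a dense or BM25-style retriever)."""
    def tokens(s):
        return set(re.findall(r"\w+", s.lower()))
    q = tokens(query)
    ranked = sorted(edit_store, key=lambda fact: len(q & tokens(fact)),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical store of edited facts superseding the model's training data.
edits = [
    "The CEO of Acme is now Jane Doe.",
    "The capital of Freedonia moved to Sylvania City.",
    "Acme acquired Initech in 2025.",
]
context = retrieve_edits("Who is the CEO of Acme?", edits)
```

The retrieved facts are then prepended to the prompt so the model can reason over the edited knowledge with its trained CoT pattern rather than recalling the stale parametric fact.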

Result: Achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models.

Conclusion: CoT2Edit effectively addresses limitations of current knowledge editing methods by teaching LLMs to edit knowledge via reasoning chains, handling both structured and unstructured information, and demonstrating robust generalization across diverse scenarios.

Abstract: Large language models (LLMs) can effectively handle outdated information through knowledge editing. However, current approaches face two key limitations: (I) Poor generalization: Most approaches rigidly inject new knowledge without ensuring that the model can use it effectively to solve practical problems. (II) Narrow scope: Current methods focus primarily on structured fact triples, overlooking the diverse unstructured forms of factual information (e.g., news, articles) prevalent in real-world contexts. To address these challenges, we propose a new paradigm: teaching LLMs to edit knowledge via Chain of Thoughts (CoTs) reasoning (CoT2Edit). We first leverage language model agents for both structured and unstructured edited data to generate CoTs, building high-quality instruction data. The model is then trained to reason over edited knowledge through supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). At inference time, we integrate Retrieval-Augmented Generation (RAG) to dynamically retrieve relevant edited facts for real-time knowledge editing. Experimental results demonstrate that our method achieves strong generalization across six diverse knowledge editing scenarios with just a single round of training on three open-source language models. The codes are available at https://github.com/FredJDean/CoT2Edit.

[44] Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li

Main category: cs.CL

TL;DR: Systematic survey analyzing efficiency bottlenecks in Large Vision-Language Models, focusing on visual token dominance and proposing structured taxonomy of optimization techniques across the inference pipeline.

Motivation: LVLMs suffer from systemic efficiency barriers due to visual token dominance, where high-resolution visual processing creates bottlenecks throughout the inference lifecycle. Current approaches lack end-to-end analysis of how upstream decisions affect downstream performance.

Method: Proposes a systematic taxonomy structured around the inference lifecycle (encoding, prefilling, decoding), analyzing efficiency techniques across three axes: shaping information density, managing long-context attention, and overcoming memory limits.

Result: Provides structured analysis of how isolated optimizations compose to navigate trade-offs between visual fidelity and system efficiency, with empirical insights supporting four future research frontiers.

Conclusion: Visual token dominance requires holistic optimization across the entire inference pipeline. Future work should focus on hybrid compression, modality-aware decoding, progressive state management, and hardware-algorithm co-design.

Abstract: Large Vision-Language Models (LVLMs) enable sophisticated reasoning over images and videos, yet their inference is hindered by a systemic efficiency barrier known as visual token dominance. This overhead is driven by a multi-regime interplay between high-resolution feature extraction, quadratic attention scaling, and memory bandwidth constraints. We present a systematic taxonomy of efficiency techniques structured around the inference lifecycle, consisting of encoding, prefilling, and decoding. Unlike prior reviews focused on isolated optimizations, we analyze the end-to-end pipeline to reveal how upstream decisions dictate downstream bottlenecks, covering compute-bound visual encoding, the intensive prefilling of massive contexts, and the “visual memory wall” in bandwidth-bound decoding. By decoupling the efficiency landscape into the axes of shaping information density, managing long-context attention, and overcoming memory limits, this work provides a structured analysis of how isolated optimizations compose to navigate the trade-off between visual fidelity and system efficiency. The survey concludes by outlining four future frontiers supported by pilot empirical insights, including hybrid compression based on functional unit sensitivity, modality-aware decoding with relaxed verification, progressive state management for streaming continuity, and stage-disaggregated serving through hardware-algorithm co-design. The submitted software contains a snapshot of our literature repository, which is designed to be maintained as a living resource for the community.

[45] Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

Yanxu Mao, Peipei Liu, Tiehan Cui, Congying Liu, Mingzhe Xing, Datao You

Main category: cs.CL

TL;DR: JailAgent is a framework for attacking LLM-based agents by manipulating their reasoning trajectories and memory retrieval without modifying user prompts, using trigger extraction, reasoning hijacking, and constraint tightening.

Motivation: As LLM-based agents become more complex and widely deployed, they introduce new security threats. Existing red-team methods that modify user prompts lack adaptability to new data and can negatively impact agent performance, necessitating a more sophisticated attack framework.

Method: The JailAgent framework avoids modifying user prompts and instead implicitly manipulates the agent’s reasoning trajectory and memory retrieval through three stages: 1) Trigger Extraction for precise identification of attack points, 2) Reasoning Hijacking using real-time adaptive mechanisms, and 3) Constraint Tightening with an optimized objective function.

Result: JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments, showing effectiveness across different LLM models and application scenarios.

Conclusion: The proposed framework provides a sophisticated approach to red-teaming LLM-based agents that doesn’t rely on prompt modification, offering better adaptability and performance preservation while effectively testing agent security.

Abstract: With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-team methods mostly rely on modifying user prompts, which lack adaptability to new data and may impact the agent’s performance. To address the challenge, this paper proposes the JailAgent framework, which completely avoids modifying the user prompt. Specifically, it implicitly manipulates the agent’s reasoning trajectory and memory retrieval with three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates outstanding performance in cross-model and cross-scenario environments.

[46] AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

Main category: cs.CL

TL;DR: AutoSOTA is an automated research system that reproduces and improves SOTA models from AI papers using multi-agent architecture for end-to-end optimization.

Motivation: AI research requires extensive reproduction, debugging, and iterative refinement cycles to achieve SOTA performance, creating a need for systems that accelerate empirical model optimization pipelines.

Method: Multi-agent architecture with eight specialized agents handling paper-to-code grounding, environment setup/repair, experiment tracking, optimization idea generation/scheduling, and validity supervision across three stages: resource preparation/goal setting, experiment evaluation, and reflection/ideation.

Result: AutoSOTA discovered 105 new SOTA models surpassing original methods across eight top-tier AI conferences, averaging ~5 hours per paper, with improvements spanning LLM, NLP, computer vision, time series, and optimization domains.

Conclusion: End-to-end research automation can serve as both a performance optimizer and research infrastructure that reduces repetitive experimental burden and redirects human attention toward higher-level scientific creativity.

Abstract: Artificial intelligence research increasingly depends on prolonged cycles of reproduction, debugging, and iterative refinement to achieve State-Of-The-Art (SOTA) performance, creating a growing need for systems that can accelerate the full pipeline of empirical model optimization. In this work, we introduce AutoSOTA, an end-to-end automated research system that advances the latest SOTA models published in top-tier AI papers to reproducible and empirically improved new SOTA models. We formulate this problem through three tightly coupled stages: resource preparation and goal setting; experiment evaluation; and reflection and ideation. To tackle this problem, AutoSOTA adopts a multi-agent architecture with eight specialized agents that collaboratively ground papers to code and dependencies, initialize and repair execution environments, track long-horizon experiments, generate and schedule optimization ideas, and supervise validity to avoid spurious gains. We evaluate AutoSOTA on recent research papers collected from eight top-tier AI conferences under filters for code availability and execution cost. Across these papers, AutoSOTA achieves strong end-to-end performance in both automated replication and subsequent optimization. Specifically, it successfully discovers 105 new SOTA models that surpass the original reported methods, averaging approximately five hours per paper. Case studies spanning LLM, NLP, computer vision, time series, and optimization further show that the system can move beyond routine hyperparameter tuning to identify architectural innovation, algorithmic redesigns, and workflow-level improvements. These results suggest that end-to-end research automation can serve not only as a performance optimizer, but also as a new form of research infrastructure that reduces repetitive experimental burden and helps redirect human attention toward higher-level scientific creativity.

[47] FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation–Full Version

Dat Nguyen-Cong, Tung Kieu, Hoang Thanh-Tung

Main category: cs.CL

TL;DR: A novel training framework for continuous diffusion language models that improves robustness to self-conditioning errors in few-step sampling, enabling 400x faster inference while maintaining quality.

Motivation: Self-conditioning in diffusion models degrades in few-step sampling regimes where inference speed is critical, causing compounding errors that dominate sample quality.

Method: Proposes a training framework that perturbs self-conditioning signals to match inference noise, improving robustness to prior estimation errors, plus a token-level noise-awareness mechanism to prevent training saturation.
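
The robustness idea can be sketched in a few lines: during training, corrupt the self-conditioning estimate so the model learns to tolerate the inaccurate priors it will see under few-step sampling. Everything below (the function name, the drop/noise scheme, the scales) is a hypothetical illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_self_conditioning(x0_hat, noise_scale=0.1, drop_prob=0.25, rng=rng):
    """Hypothetical training-time corruption of the self-conditioning
    signal: occasionally drop it entirely, otherwise add Gaussian noise
    sized to mimic few-step estimation error."""
    if rng.random() < drop_prob:
        return np.zeros_like(x0_hat)  # force denoising without a prior
    return x0_hat + noise_scale * rng.normal(size=x0_hat.shape)
```

In a real trainer this corruption would wrap the model's previous x0 estimate inside the diffusion loss; the snippet shows only the perturbation step.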

Result: Framework surpasses standard continuous diffusion models with up to 400x faster inference speed, remaining competitive against other one-step diffusion frameworks across conditional generation benchmarks.

Conclusion: The proposed approach effectively addresses self-conditioning degradation in few-step sampling, enabling practical deployment of diffusion models for fast inference without sacrificing quality.

Abstract: Self-conditioning has been central to the success of continuous diffusion language models, as it allows models to correct previous errors. Yet its ability degrades precisely in the regime where diffusion is most attractive for deployment: few-step sampling for fast inference. In this study, we show that when models have only a few denoising steps, inaccurate self-conditioning induces a substantial approximation gap; this mistake compounds across denoising steps and ultimately dominates the sample quality. To address this, we propose a novel training framework that handles these errors during learning by perturbing the self-conditioning signal to match inference noise, improving robustness to prior estimation errors. In addition, we introduce a token-level noise-awareness mechanism that prevents training from saturating, hence improving optimization. Extensive experiments across conditional generation benchmarks demonstrate that our framework surpasses standard continuous diffusion models while providing up to 400x faster inference speed, and remains competitive against other one-step diffusion frameworks.

[48] Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue

Junan Hu, Shudan Guo, Wenqi Liu, Jianhua Yin, Yinwei Wei

Main category: cs.CL

TL;DR: Context-Agent: A framework that models multi-turn dialogue history as a dynamic tree structure to better handle non-linear conversation flow, improving task completion and token efficiency in LLMs.

Motivation: Current LLMs treat dialogue history as flat linear sequences, which is misaligned with the hierarchical and branching nature of real conversations, leading to inefficient context utilization and coherence loss during extended interactions with topic shifts.

Method: Introduces Context-Agent framework that models dialogue history as a dynamic tree structure, enabling navigation of multiple dialogue branches. Also creates NTM benchmark for evaluating non-linear dialogue performance.
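
As a rough illustration of the tree idea (class and function names here are hypothetical, not the paper's API), turns can be stored as nodes whose children are follow-ups: a topic shift opens a new branch, and the prompt context is rebuilt from a single root-to-leaf path instead of the full flat history.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str                      # "user" or "assistant"
    text: str
    parent: "Turn | None" = None
    children: list = field(default_factory=list)

    def reply(self, role, text):
        child = Turn(role, text, parent=self)
        self.children.append(child)
        return child

def branch_context(leaf):
    """Linearize only the active branch (root -> leaf) as prompt context."""
    path, node = [], leaf
    while node is not None:
        path.append(f"{node.role}: {node.text}")
        node = node.parent
    return "\n".join(reversed(path))

# Two branches off the same assistant turn: a refinement and a topic shift.
root = Turn("user", "Plan my trip to Kyoto.")
a = root.reply("assistant", "Here is a 3-day itinerary...")
b1 = a.reply("user", "Swap day 2 for Nara.")
b2 = a.reply("user", "Actually, what about food?")
```

Feeding `branch_context(b2)` to the model keeps the food question free of the unrelated Nara refinement, which is the token-efficiency argument in miniature.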

Result: Context-Agent enhances task completion rates and improves token efficiency across various LLMs, demonstrating the value of structured context management for complex dialogues.

Conclusion: Modeling dialogue as dynamic tree structures better aligns with natural conversation flow and improves LLM performance in complex, non-linear dialogue scenarios.

Abstract: Large Language Models demonstrate outstanding performance in many language tasks but still face fundamental challenges in managing the non-linear flow of human conversation. The prevalent approach of treating dialogue history as a flat, linear sequence is misaligned with the intrinsically hierarchical and branching structure of natural discourse, leading to inefficient context utilization and a loss of coherence during extended interactions involving topic shifts or instruction refinements. To address this limitation, we introduce Context-Agent, a novel framework that models multi-turn dialogue history as a dynamic tree structure. This approach mirrors the inherent non-linearity of conversation, enabling the model to maintain and navigate multiple dialogue branches corresponding to different topics. Furthermore, to facilitate robust evaluation, we introduce the Non-linear Task Multi-turn Dialogue (NTM) benchmark, specifically designed to assess model performance in long-horizon, non-linear scenarios. Our experiments demonstrate that Context-Agent enhances task completion rates and improves token efficiency across various LLMs, underscoring the value of structured context management for complex, dynamic dialogues. The dataset and code are available on GitHub.

[49] EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

Xuan Dong, Huanyang Zheng, Tianhao Niu, Zhe Han, Pengzhan Li, Bofei Liu, Zhengyang Liu, Guancheng Li, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: EpiBench: A multi-turn multimodal benchmark for evaluating research agents’ ability to navigate scientific literature, integrate evidence from figures/tables across papers, and answer questions requiring cross-paper comparisons.

Motivation: Existing benchmarks don't adequately evaluate proactive search, multi-evidence integration, and sustained evidence use over time in scientific research workflows, which involve navigating literature, consulting figures/tables, and integrating evidence across papers.

Method: Introduces EpiBench, an episodic multi-turn multimodal benchmark that simulates short research workflows where agents must navigate across papers over multiple turns, align evidence from figures and tables, and use accumulated evidence in memory to answer objective questions requiring cross-paper comparisons and multi-figure integration.

Result: Leading models achieve only 29.23% accuracy on the hard split, showing substantial room for improvement in multi-turn, multi-evidence research workflows.

Conclusion: EpiBench provides a process-level evaluation framework for fine-grained testing and diagnosis of research agents, offering an evaluation platform for verifiable and reproducible research agents in complex multimodal research workflows.

Abstract: Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. This joint capability is not systematically assessed in existing benchmarks, which largely under-evaluate proactive search, multi-evidence integration and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the accumulated evidence in the memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows and offering an evaluation platform for verifiable and reproducible research agents.

[50] THIVLVC: Retrieval Augmented Dependency Parsing for Latin

Luc Pommeret, Thibault Wagret, Jules Deret

Main category: cs.CL

TL;DR: THIVLVC is a two-stage system for Latin dependency parsing that combines retrieval-augmented generation (RAG) with LLM refinement, showing significant improvements over baseline parsers and revealing annotation inconsistencies in treebanks.

Motivation: To improve Latin dependency parsing by leveraging large language models with retrieval-augmented generation, addressing the challenge of parsing both poetry and prose in Latin with limited training data.

Method: Two-stage system: 1) Retrieve structurally similar entries from CIRCSE treebank using sentence length and POS n-gram similarity, 2) Prompt an LLM to refine UDPipe baseline parses using retrieved examples and UD annotation guidelines. Two configurations tested: without retrieval and with retrieval (RAG).
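
The retrieval score might combine POS n-gram overlap with sentence-length closeness roughly as follows; the Jaccard-plus-length weighting is an assumption, since the summary does not give the exact formula:

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """All POS n-grams of a tag sequence, with multiplicity."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def similarity(tags_a, tags_b, n=3, len_weight=0.5):
    """Hypothetical retrieval score: Jaccard overlap of POS n-grams
    blended with sentence-length closeness."""
    a, b = pos_ngrams(tags_a, n), pos_ngrams(tags_b, n)
    inter = sum((a & b).values())
    union = sum((a | b).values())
    ngram_sim = inter / union if union else 0.0
    len_sim = 1 - abs(len(tags_a) - len(tags_b)) / max(len(tags_a), len(tags_b), 1)
    return (1 - len_weight) * ngram_sim + len_weight * len_sim
```

Treebank entries would then be ranked by this score, and the top entries placed in the LLM prompt alongside the UDPipe baseline parse.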

Result: On poetry (Seneca), THIVLVC improves CLAS by +17 points over UDPipe baseline; on prose (Thomas Aquinas), gain is +1.5 CLAS. Error analysis of 300 divergences shows 53.3% of unanimous annotator decisions favor THIVLVC, revealing annotation inconsistencies within and across treebanks.

Conclusion: THIVLVC demonstrates effective use of LLMs with RAG for Latin dependency parsing, with particularly strong results on poetry. The system also exposes annotation quality issues in existing treebanks that affect parser evaluation.

Abstract: We describe THIVLVC, a two-stage system for the EvaLatin 2026 Dependency Parsing task. Given a Latin sentence, we retrieve structurally similar entries from the CIRCSE treebank using sentence length and POS n-gram similarity, then prompt a large language model to refine the baseline parse from UDPipe using the retrieved examples and UD annotation guidelines. We submit two configurations: one without retrieval and one with retrieval (RAG). On poetry (Seneca), THIVLVC improves CLAS by +17 points over the UDPipe baseline; on prose (Thomas Aquinas), the gain is +1.5 CLAS. A double-blind error analysis of 300 divergences between our system and the gold standard reveals that, among unanimous annotator decisions, 53.3% favour THIVLVC, showing annotation inconsistencies both within and across treebanks.

[51] YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

Peace Busola Falola, Jesujoba O. Alabi, Solomon O. Akinola, Folashade T. Ogunajo, Emmanuel Oluwadunsin Alabi, David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: YoNER: A new multidomain Yorùbá NER dataset with 5,000 sentences across 5 domains, plus OyoBERT language model for Yorùbá NLP

Motivation: Address the gap in Yorùbá NER resources which are limited to specific domains (news and Wikipedia), hindering comprehensive research and development

Method: Created YoNER dataset with 5,000 sentences from 5 domains (Bible, Blogs, Movies, Radio, Wikipedia) manually annotated by native speakers; developed OyoBERT language model; conducted cross-domain experiments with transformer models
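
For the CoNLL-style BIO tags used here (PER, ORG, LOC), a minimal well-formedness check is handy when merging annotations from several annotators. This is purely illustrative and not part of the paper's tooling:

```python
def valid_bio(tags, types=("PER", "ORG", "LOC")):
    """Check that a BIO tag sequence is well formed: every I-X must
    continue a preceding B-X or I-X of the same entity type."""
    prev = "O"
    for tag in tags:
        if tag == "O":
            prev = tag
            continue
        prefix, _, etype = tag.partition("-")
        if prefix not in ("B", "I") or etype not in types:
            return False                     # unknown prefix or entity type
        if prefix == "I" and prev.partition("-")[2] != etype:
            return False                     # I-X without a matching B-X/I-X
        prev = tag
    return True
```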

Result: African-centric models outperform multilingual models for Yorùbá; cross-domain performance drops significantly, especially for blogs and movies; closely related formal domains (news/Wikipedia) transfer better; OyoBERT outperforms multilingual models in in-domain evaluation

Conclusion: YoNER dataset and OyoBERT model advance Yorùbá NLP research; domain diversity is crucial for robust NER systems; African-centric models are more effective for Yorùbá language processing

Abstract: Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.

[52] Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs

Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue, Mengke Chen, Qiutong Pan, Haiwei Wang, Haifeng Li

Main category: cs.CL

TL;DR: A graph-based CoT optimization framework that reduces reasoning redundancy in LLMs through dual pruning strategies and RL-based training

Motivation: Current RL-extended Chain-of-Thought (CoT) methods often lead to inefficient reflection patterns like indiscriminate and repetitive verification, causing overthinking and redundant reasoning content

Method: Convert linear CoT into directed acyclic graphs with dependency edges, apply dual pruning (branch-level and depth-level), then distill via three-stage pipeline: SFT initialization, DPO preference learning, and GRPO with length penalty
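
A toy rendering of the dual pruning rules, assuming steps are node ids, edges are (src, dst) dependencies, and each reflective step maps to the conclusion it re-checks; the names and exact rules are simplifications of the paper's method:

```python
def prune_reflections(steps, deps, reflective, conclusions):
    """Toy dual pruning on a CoT DAG.
    steps:       step ids in generation order
    deps:        set of (src, dst) edges; dst depends on src
    reflective:  ids of reflection/verification steps
    conclusions: reflective step id -> the conclusion it re-checks
    """
    # Branch-level: drop reflective steps nothing downstream depends on.
    needed = {src for src, _ in deps}
    kept = [s for s in steps if s not in reflective or s in needed]
    # Depth-level: keep only the first verification of each conclusion.
    seen, out = set(), []
    for s in kept:
        target = conclusions.get(s)
        if target is not None:
            if target in seen:
                continue          # late-stage re-verification, pruned
            seen.add(target)
        out.append(s)
    return out
```

The pruned traces would then seed the SFT stage, with DPO and GRPO applied on top as described above.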

Result: Reduces average reasoning tokens by 42% while maintaining or improving accuracy compared to baseline methods

Conclusion: The graph-based optimization framework effectively addresses reasoning redundancy in LLMs, enabling more efficient and accurate reasoning through structured pruning and RL-based training

Abstract: Extending CoT through RL has been widely used to enhance the reasoning capabilities of LLMs. However, due to the sparsity of reward signals, it can also induce undesirable thinking patterns such as overthinking, i.e., generating redundant intermediate reasoning content. In this work, we argue that a major source of such redundancy is inefficient reflection, which often manifests in two problematic patterns: Indiscriminate Reflection, where the model performs broad, low-impact checks throughout reasoning, and Repetitive Reflection, where it repeatedly re-verifies an already established conclusion. To address this, we introduce a graph-based CoT optimization framework. Specifically, we convert each linear CoT into a directed acyclic graph (DAG) with explicit dependency edges, and design a dual pruning strategy: branch-level pruning removes weakly contributing reflection branches, while depth-level pruning eliminates late-stage re-verification. We distill this behavior via a three-stage pipeline: (1) SFT to initialize the policy on pruned concise traces, (2) DPO to prefer correct but less redundant trajectories, and (3) GRPO with length penalty to jointly optimize answer correctness and efficiency. Experiments show that our approach reduces the average reasoning tokens by 42% while maintaining or improving accuracy.

[53] See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, Huan Li

Main category: cs.CL

TL;DR: LVSpec is a training-free speculative decoding framework for Video-LLMs that uses loose verification based on visual-relevant token identification and position-shift tolerance to accelerate inference while maintaining high fidelity.

Motivation: Video-LLMs suffer from high inference latency during autoregressive generation. Existing speculative decoding methods are constrained by rigid exact-match rules, limiting acceleration potential for video understanding tasks.

Method: LVSpec employs a lightweight visual-relevant token identification scheme to pinpoint sparse visual-relevant anchors that require strict verification, while allowing loose verification for abundant visual-irrelevant fillers. It also includes a position-shift tolerant mechanism to salvage positionally mismatched but semantically equivalent tokens.
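
The loose-verification idea can be sketched as an acceptance loop: anchor tokens demand an exact positional match, while fillers are also accepted if the same token occurs within a small window of the target position. The anchor set, window size, and function name here are assumptions for illustration:

```python
def loose_verify(draft, target, anchors, window=2):
    """Accept draft tokens against target tokens: anchors require an
    exact positional match; fillers are additionally accepted if the
    token appears within +/-window positions in the target
    (position-shift tolerance). Returns the accepted prefix length."""
    accepted = 0
    for i, tok in enumerate(draft):
        if i >= len(target):
            break
        if tok == target[i]:
            accepted += 1
            continue
        if tok in anchors:
            break                  # visual-relevant token: strict check
        lo, hi = max(0, i - window), min(len(target), i + window + 1)
        if tok in target[lo:hi]:
            accepted += 1          # semantically equivalent, just shifted
        else:
            break
    return accepted
```

A longer accepted prefix per draft round is exactly what drives the reported gains in mean accepted length and speedup.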

Result: LVSpec preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. It boosts mean accepted length and speedup ratio by 136% and 35% compared to state-of-the-art training-free SD methods for Video-LLMs.

Conclusion: LVSpec effectively addresses the inference latency problem in Video-LLMs through a novel loosely-coupled speculative decoding approach that maintains high fidelity while achieving significant speedup, making it a practical solution for efficient video understanding.

Abstract: Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically equivalent tokens. Experiments demonstrate that LVSpec achieves high fidelity and speed: it preserves >99.8% of target performance while accelerating Qwen2.5-VL-32B by 2.70x and LLaVA-OneVision-72B by 2.94x. Notably, it boosts the mean accepted length and speedup ratio by 136% and 35% compared to SOTA training-free SD methods for Video-LLMs.

[54] LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

Main category: cs.CL

TL;DR: LLM reasoning follows structured trajectories through representation space, with mathematical reasoning traversing functionally ordered subspaces that become increasingly separable with depth, enabling prediction of correctness and trajectory-based steering.

Motivation: To understand the geometric structure of chain-of-thought reasoning in large language models and develop methods for interpreting, predicting, and controlling reasoning behavior through representation space analysis.

Method: Characterizes chain-of-thought generation as structured trajectories through representation space, analyzes mathematical reasoning across layers, examines correct vs. incorrect solution divergence, and introduces trajectory-based steering for inference-time intervention.

Result: Mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth; correct and incorrect solutions diverge systematically at late stages, enabling mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87; trajectory-based steering enables reasoning correction and length control.

Conclusion: Reasoning trajectories provide a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior, with implications for improving reasoning reliability and developing better steering mechanisms.

Abstract: This work characterizes large language models’ chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.

[55] Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jin-Long Li

Main category: cs.CL

TL;DR: Attention Editing framework converts trained LLMs to efficient attention architectures (MLA/GateSWA) without re-pretraining, using progressive distillation to maintain performance while improving KV cache efficiency.

Motivation: KV cache memory and bandwidth dominate LLM inference costs in long-context scenarios. Efficient attention architectures (MLA, SWA) exist, but prior conversion methods impose structural constraints that make them hard to integrate into already-trained models, limiting practical deployment.

Method: Attention Editing replaces original attention with learnable target modules using progressive distillation: (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent error accumulation, (2) model-level distillation on next-token distributions with optional weak feature matching regularization.
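
The two training signals can be caricatured as a layer-wise activation MSE plus a KL term on next-token distributions with weak feature matching. Shapes, the alpha weight, and the function names are guesses, not the paper's exact objective:

```python
import numpy as np

def kl_div(p, q, eps=1e-9):
    """KL(p || q) between next-token distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def stage1_loss(student_h, teacher_h):
    """Layer-wise teacher-forced activation matching (MSE)."""
    return float(np.mean((student_h - teacher_h) ** 2))

def stage2_loss(student_probs, teacher_probs, student_h, teacher_h, alpha=0.1):
    """Model-level distillation on output distributions, weakly
    regularized by feature matching (alpha is a guessed weight)."""
    return kl_div(teacher_probs, student_probs) + alpha * stage1_loss(student_h, teacher_h)
```

In the framework's terms, stage 1 would be minimized per layer with teacher forcing before stage 2 fine-tunes the whole student end to end.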

Result: Successfully converted Qwen3-8B and Qwen3-30B-A3B to MLA and GateSWA architectures, maintaining competitive performance while delivering substantial efficiency improvements. Experiments conducted on Ascend 910B clusters demonstrate practical feasibility.

Conclusion: Large-scale attention conversion is feasible and robust, enabling deployment of efficient attention architectures in already-trained LLMs without costly re-pretraining, with practical training case studies on domestic hardware.

Abstract: Key-Value (KV) cache memory and bandwidth increasingly dominate large language model inference cost in long-context and long-generation regimes. Architectures such as multi-head latent attention (MLA) and hybrid sliding-window attention (SWA) can alleviate this bound, but integrating them into existing models remains difficult. Prior methods impose fine-grained structural requirements on both source and target attention modules, which are often infeasible to satisfy in practical deployment. We present Attention Editing, a practical framework for converting already-trained large language models (LLMs) with new attention architectures without re-pretraining from scratch. Attention editing replaces the original attention with a learnable target module and trains it using progressive distillation, consisting of (1) layer-wise teacher-forced optimization with intermediate activation supervision to prevent cold-start error accumulation, and (2) model-level distillation on next-token distributions, optionally regularized by weak feature matching. We instantiate the framework on two different targets, MLA and GateSWA (a gated hybrid SWA design), and apply it to Qwen3-8B and Qwen3-30B-A3B. The resulting models maintain competitive performance while delivering substantial efficiency improvements, demonstrating that large-scale attention conversion is both feasible and robust. Notably, experiments are conducted on Ascend 910B clusters, offering a practical training case study on domestic hardware.

[56] Dialogue Act Patterns in GenAI-Mediated L2 Oral Practice: A Sequential Analysis of Learner-Chatbot Interactions

Liqun He, Shijun Chen, Mutlu Cukurova, Manolis Mavrikis

Main category: cs.CL

TL;DR: Study analyzes dialogue patterns in GenAI voice chatbot interactions with Chinese EFL learners, finding high-progress sessions feature more learner-initiated questions and timely corrective feedback sequences.

Motivation: To understand how dialogue interaction patterns in GenAI voice chatbot conversations relate to language learning gains, addressing the gap in understanding interactional processes in AI-assisted L2 oral practice.

Method: Analyzed 70 sessions from 12 Chinese EFL learners over 10 weeks using a pedagogy-informed dialogue act coding scheme, comparing DA distributions and sequential patterns between high- and low-progress sessions.

Result: High-progress sessions showed more learner-initiated questions, while low-progress sessions had more clarification-seeking. High-progress sessions featured more frequent prompting-based corrective feedback sequences consistently positioned after learner responses.

Conclusion: Findings emphasize the importance of dialogic analysis in GenAI chatbot design, provide a pedagogy-informed coding framework, and inform adaptive GenAI chatbot development for L2 education.

Abstract: While generative AI (GenAI) voice chatbots offer scalable opportunities for second language (L2) oral practice, the interactional processes related to learners’ gains remain underexplored. This study investigates dialogue act (DA) patterns in interactions between Grade 9 Chinese English as a foreign language (EFL) learners and a GenAI voice chatbot over a 10-week intervention. Seventy sessions from 12 students were annotated by human coders using a pedagogy-informed coding scheme, yielding 6,957 coded DAs. DA distributions and sequential patterns were compared between high- and low-progress sessions. At the DA level, high-progress sessions showed more learner-initiated questions, whereas low-progress sessions exhibited higher rates of clarification-seeking, indicating greater comprehension difficulty. At the sequential level, high-progress sessions were characterised by more frequent prompting-based corrective feedback sequences, consistently positioned after learner responses, highlighting the role of feedback type and timing in effective interaction. Overall, these findings underscore the value of a dialogic lens in GenAI chatbot design, contribute a pedagogy-informed DA coding framework, and inform the design of adaptive GenAI chatbots for L2 education.

[57] MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

Han Jang, Junhyeok Lee, Heeseong Eum, Kyu Sung Choi

Main category: cs.CL

TL;DR: MedLayBench-V: First large-scale multimodal benchmark for expert-lay semantic alignment in medical vision-language models, enabling patient-accessible medical image interpretation.

DetailsMotivation: Current Med-VLMs are trained on professional literature, limiting their ability to communicate findings in lay language needed for patient-centered care. There's a critical absence of large-scale multimodal benchmarks for lay-accessible medical image understanding.

Method: Introduced Structured Concept-Grounded Refinement (SCGR) pipeline that enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints, avoiding naive simplification approaches that risk hallucination.

Result: Created MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment, providing a verified foundation for training and evaluating next-generation Med-VLMs.

Conclusion: MedLayBench-V bridges the resource gap for lay-accessible medical image understanding and enables development of Med-VLMs that can effectively communicate between clinical experts and patients.

Abstract: Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

[58] Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning

Yanbei Jiang, Amr Keleg, Ryandito Diandaru, Jey Han Lau, Lea Frermann, Biaoyan Fang, Fajri Koto

Main category: cs.CL

TL;DR: LLMs fail to generate outputs matching desired statistical distributions; new fine-tuning method combines Steering Token Calibration with Semantic Alignment to achieve precise distributional control.

DetailsMotivation: Real-world data is stochastic, but LLMs are typically evaluated on single-round inference against fixed ground truths. The paper aims to assess whether LLMs can generate outputs that adhere to desired target distributions reflecting real-world statistics.

Method: Proposes a novel fine-tuning framework combining Steering Token Calibration with Semantic Alignment. Uses a hybrid objective function with Kullback-Leibler divergence to anchor probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind tokens to semantically consistent responses.

Result: Experiments across six diverse datasets show the approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks. Off-the-shelf LLMs and standard alignment techniques fail to reliably control output distributions.

Conclusion: Distribution alignment is crucial for LLMs to reflect real-world statistics, and the proposed framework effectively bridges the gap in controlling output distributions for attributes like gender, race, and sentiment in occupational contexts.

Abstract: While the real world is inherently stochastic, Large Language Models (LLMs) are predominantly evaluated on single-round inference against fixed ground truths. In this work, we shift the lens to distribution alignment: assessing whether LLMs, when prompted repeatedly, can generate outputs that adhere to a desired target distribution, e.g. reflecting real-world statistics or a uniform distribution. We formulate distribution alignment using the attributes of gender, race, and sentiment within occupational contexts. Our empirical analysis reveals that off-the-shelf LLMs and standard alignment techniques, including prompt engineering and Direct Preference Optimization, fail to reliably control output distributions. To bridge this gap, we propose a novel fine-tuning framework that couples Steering Token Calibration with Semantic Alignment. We introduce a hybrid objective function combining Kullback-Leibler divergence to anchor the probability mass of latent steering tokens and Kahneman-Tversky Optimization to bind these tokens to semantically consistent responses. Experiments across six diverse datasets demonstrate that our approach significantly outperforms baselines, achieving precise distributional control in attribute generation tasks.
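
The distribution-alignment setting above can be made concrete with a minimal sketch: estimate the model's empirical attribute distribution from repeated generations and measure its KL divergence from a target. This illustrates only the KL-anchoring idea, not the paper's actual token-level objective or its KTO component; all names and the smoothing choice are assumptions.

```python
import math
from collections import Counter

def empirical_distribution(samples, categories):
    """Estimate the output distribution over attribute categories from
    repeated generations (add-one smoothing keeps the KL term finite)."""
    counts = Counter(samples)
    total = len(samples) + len(categories)
    return {c: (counts[c] + 1) / total for c in categories}

def kl_divergence(p, q):
    """KL(p || q) over a shared set of categories."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)

# Toy example: a model queried 100 times about an occupation holder's
# gender, compared against a uniform target distribution.
target = {"female": 0.5, "male": 0.5}
samples = ["male"] * 78 + ["female"] * 22   # skewed model output
p_hat = empirical_distribution(samples, list(target))
gap = kl_divergence(p_hat, target)          # > 0: distribution misaligned
```

In the paper's framework, a term of this form would anchor the probability mass of the steering tokens toward the target distribution during fine-tuning.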

[59] Identifying Influential N-grams in Confidence Calibration via Regression Analysis

Shintaro Ozaki, Wataru Hashimoto, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Main category: cs.CL

TL;DR: LLMs remain overconfident during reasoning despite using uncertain language; specific linguistic expressions correlate with confidence; suppressing these expressions can calibrate confidence without performance loss.

DetailsMotivation: LLMs often display overconfidence in their responses even when using linguistic expressions that indicate uncertainty, particularly during reasoning processes. The researchers aim to identify which specific linguistic expressions are related to confidence levels in LLM reasoning.

Method: Applied regression analysis to predict confidence based on linguistic expressions in reasoning parts of LLMs. Analyzed relationships between specific n-grams and confidence across multiple models and QA benchmarks. Tested causality and verified that extracted linguistic information truly affects confidence.

Result: Found that LLMs remain overconfident when reasoning is involved, with specific linguistic expressions correlating with confidence levels. Several extracted expressions matched cue phrases intentionally inserted during test-time scaling to improve reasoning performance. Showed that confidence calibration is possible by simply suppressing overconfident expressions without performance drops.

Conclusion: LLM confidence calibration can be achieved through targeted suppression of specific overconfident linguistic expressions identified through regression analysis, offering a simple method to improve confidence calibration without compromising performance.

Abstract: While large language models (LLMs) improve performance through explicit reasoning, their responses are often overconfident, even when they include linguistic expressions conveying uncertainty. In this work, we identify which linguistic expressions are related to confidence by applying regression analysis. Specifically, we treat confidence as the dependent variable, predict it from linguistic expressions in the reasoning parts of LLMs, and analyze the relationship between specific $n$-grams and confidence. Across multiple models and QA benchmarks, we show that LLMs remain overconfident when reasoning is involved and attribute this behavior to specific linguistic information. Interestingly, several of the extracted expressions coincide with cue phrases intentionally inserted during test-time scaling to improve reasoning performance. Through causality tests verifying that the extracted linguistic information truly affects confidence, we reveal that confidence calibration is possible by simply suppressing those overconfident expressions, without drops in performance.
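
The per-n-gram regression can be sketched as a univariate regression of confidence on each n-gram's presence; for a 0/1 regressor the OLS slope reduces to the difference in mean confidence with versus without the n-gram. This is an illustrative reading of the analysis, not the paper's exact regression setup; the data and function names are invented.

```python
from statistics import mean

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_confidence_slopes(reasoning_texts, confidences, n=2):
    """Univariate OLS slope of confidence on each n-gram's presence (0/1).
    A large positive slope marks an expression associated with
    overconfidence; such n-grams are candidates for suppression."""
    presence = {}
    for i, text in enumerate(reasoning_texts):
        for g in set(ngrams(text.split(), n)):
            presence.setdefault(g, set()).add(i)
    y_bar = mean(confidences)
    m = len(confidences)
    slopes = {}
    for g, idx in presence.items():
        x = [1.0 if i in idx else 0.0 for i in range(m)]
        x_bar = mean(x)
        var_x = sum((xi - x_bar) ** 2 for xi in x)
        if var_x == 0:   # n-gram appears in every text: no signal
            continue
        cov = sum((xi - x_bar) * (yi - y_bar)
                  for xi, yi in zip(x, confidences))
        slopes[g] = cov / var_x
    return slopes

texts = ["clearly the answer is", "maybe the answer is",
         "clearly it must be", "perhaps it could be"]
conf = [0.95, 0.60, 0.97, 0.55]
slopes = ngram_confidence_slopes(texts, conf, n=1)
# "clearly" carries a large positive slope; "maybe" a negative one.
```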

[60] PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

Yusen Hou, Weicai Long, Haitao Hu, Houcheng Su, Junning Feng, Yanlin Zhang

Main category: cs.CL

TL;DR: PhageBench is the first benchmark for evaluating LLMs on bacteriophage genome understanding, covering 5,600 samples across 5 tasks in 3 bioinformatics workflow stages, showing LLMs outperform random baselines but struggle with complex reasoning tasks.

DetailsMotivation: Bacteriophages are crucial for microbial ecosystems and antibiotic alternatives, but current LLMs are underexplored for direct interpretation of raw nucleotide sequences and biological reasoning. There's a need to evaluate LLMs' capabilities in phage genome understanding.

Method: Created PhageBench benchmark with 5,600 high-quality samples covering five core tasks across three bioinformatics workflow stages: Screening, Quality Control, and Phenotype Annotation. Evaluated eight LLMs on their ability to understand phage genomes.

Result: General-purpose reasoning LLMs significantly outperform random baselines in phage contig identification and host prediction, showing promise for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization.

Conclusion: LLMs show potential for biological sequence understanding but need enhanced reasoning capabilities for complex tasks. PhageBench provides a foundation for developing next-generation models for genomic interpretation.

Abstract: Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.

[61] What Models Know, How Well They Know It: Knowledge-Weighted Fine-Tuning for Learning When to Say “I Don’t Know”

Joosung Lee, Hwiyeol Jo, Donghyeon Ko, Kyubyung Chae, Cheonbok Park, Jeonghoon Kim

Main category: cs.CL

TL;DR: The paper proposes a method to reduce hallucinations in LLMs by estimating instance-level knowledge scores through multi-sampled inference and scaling learning signals accordingly, while encouraging explicit “I don’t know” responses for out-of-scope queries.

DetailsMotivation: Large language models suffer from hallucinations due to knowledge misalignment between pre-training and fine-tuning phases, which needs to be addressed to improve model reliability and explicit uncertainty expression.

Method: The method uses multi-sampled inference to estimate fine-grained, instance-level knowledge scores, then scales learning signals based on these scores while encouraging explicit “I don’t know” responses for queries outside the model’s knowledge scope.

Result: Experimental results show the approach enables models to explicitly express uncertainty when lacking knowledge while maintaining accuracy on questions they can answer, with improved performance through accurate discrimination between known and unknown instances.

Conclusion: The proposed knowledge scoring and uncertainty-aware training approach effectively reduces hallucinations in LLMs by addressing knowledge misalignment and enabling explicit uncertainty expression.

Abstract: While large language models (LLMs) demonstrate strong capabilities across diverse user queries, they still suffer from hallucinations, often arising from knowledge misalignment between pre-training and fine-tuning. To address this misalignment, we reliably estimate a fine-grained, instance-level knowledge score via multi-sampled inference. Using the knowledge score, we scale the learning signal according to the model’s existing knowledge, while encouraging explicit “I don’t know” responses for out-of-scope queries. Experimental results show that this approach allows the model to explicitly express uncertainty when it lacks knowledge, while maintaining accuracy on questions it can answer. Furthermore, we propose evaluation metrics for uncertainty, showing that accurate discrimination between known and unknown instances consistently improves performance.
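
The knowledge-scoring step can be sketched directly: score an instance by the agreement rate of multi-sampled answers with the reference, scale the learning signal by that score, and relabel out-of-scope instances as explicit abstentions. The threshold, weighting rule, and names are illustrative assumptions, not the paper's exact training recipe.

```python
def knowledge_score(sampled_answers, reference):
    """Instance-level knowledge score: the fraction of multi-sampled
    generations that agree with the reference answer."""
    return sum(a == reference for a in sampled_answers) / len(sampled_answers)

def training_target_and_weight(sampled_answers, reference, idk_threshold=0.2):
    """Scale the learning signal by the knowledge score; below the
    threshold, relabel the instance as an explicit abstention."""
    score = knowledge_score(sampled_answers, reference)
    if score < idk_threshold:
        return "I don't know", 1.0      # teach abstention at full weight
    return reference, score             # weight known facts by agreement

# Known fact: most samples agree with the reference.
t1, w1 = training_target_and_weight(["Paris"] * 9 + ["Lyon"], "Paris")
# Unknown fact: samples never hit the reference -> train to abstain.
t2, w2 = training_target_and_weight(
    ["1901", "1899", "1910", "1905", "1921"], "1903")
```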

[62] Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation

Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar

Main category: cs.CL

TL;DR: FAITH-M benchmark and CARE framework for evaluating AI therapist responses against six therapeutic principles using expert annotations and multi-stage reasoning.

DetailsMotivation: Current LLM-based mental health systems lack structured evaluation frameworks to assess adherence to core therapeutic principles beyond surface-level fluency, necessitating clinically grounded assessment methods.

Method: Introduces FAITH-M benchmark with expert-annotated ordinal ratings, and CARE framework with intra-dialogue context modeling, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning.

Result: CARE achieves an F-1 score of 63.34 versus 38.56 for the Qwen3 baseline (a 64.26% relative improvement), demonstrating robustness under domain shift while highlighting challenges in modeling implicit clinical nuance.

Conclusion: CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems, advancing principled assessment beyond conversational competence.

Abstract: The increasing use of large language models in mental health applications calls for principled evaluation frameworks that assess alignment with psychotherapeutic best practices beyond surface-level fluency. While recent systems exhibit conversational competence, they lack structured mechanisms to evaluate adherence to core therapeutic principles. In this paper, we study the problem of evaluating AI-generated therapist-like responses for clinically grounded appropriateness and effectiveness. We assess each therapist's utterance along six therapeutic principles: non-judgmental acceptance, warmth, respect for autonomy, active listening, reflective understanding, and situational appropriateness, using a fine-grained ordinal scale. We introduce FAITH-M, a benchmark annotated with expert-assigned ordinal ratings, and propose CARE, a multi-stage evaluation framework that integrates intra-dialogue context, contrastive exemplar retrieval, and knowledge-distilled chain-of-thought reasoning. Experiments show that CARE achieves an F-1 score of 63.34 versus 38.56 for the strong Qwen3 baseline, a 64.26% relative improvement; since Qwen3 also serves as CARE's backbone, the gains arise from structured reasoning and contextual modelling rather than backbone capacity alone. Expert assessment and external dataset evaluations further demonstrate robustness under domain shift, while highlighting challenges in modelling implicit clinical nuance. Overall, CARE provides a clinically grounded framework for evaluating therapeutic fidelity in AI mental health systems.

[63] CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training

Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim

Main category: cs.CL

TL;DR: CLEAR is a novel loss function using reverse training to improve cross-lingual retrieval by using English as a bridge language, achieving up to 15% gains in low-resource languages while maintaining English performance.

DetailsMotivation: Existing multilingual embedding models face challenges in cross-lingual scenarios due to imbalanced linguistic resources and insufficient cross-lingual alignment consideration. Standard contrastive learning approaches may struggle with fundamental language alignment and degrade performance in well-aligned languages like English.

Method: Proposes Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme. CLEAR leverages English passages as bridges to strengthen alignments between target languages and English, ensuring robust cross-lingual retrieval performance.

Result: CLEAR achieves notable improvements in cross-lingual scenarios with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. The method also shows promising effectiveness in multilingual training scenarios.

Conclusion: CLEAR offers an effective approach for improving cross-lingual retrieval performance, especially for low-resource languages, while maintaining English performance. The method demonstrates broad applicability and scalability potential for multilingual training.

Abstract: Existing multilingual embedding models often encounter challenges in cross-lingual scenarios due to imbalanced linguistic resources and less consideration of cross-lingual alignment during training. Although standardized contrastive learning approaches for cross-lingual adaptation are widely adopted, they may struggle to capture fundamental alignment between languages and degrade performance in well-aligned languages such as English. To address these challenges, we propose Cross-Lingual Enhancement in Retrieval via Reverse-training (CLEAR), a novel loss function utilizing a reverse training scheme to improve retrieval performance across diverse cross-lingual retrieval scenarios. CLEAR leverages an English passage as a bridge to strengthen alignments between the target language and English, ensuring robust performance in the cross-lingual retrieval task. Our extensive experiments demonstrate that CLEAR achieves notable improvements in cross-lingual scenarios, with gains up to 15%, particularly in low-resource languages, while minimizing performance degradation in English. Furthermore, our findings highlight that CLEAR offers promising effectiveness even in multilingual training, suggesting its potential for broad application and scalability. We release the code at https://github.com/dltmddbs100/CLEAR.
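
One plausible reading of the English-bridge idea is an InfoNCE objective with an extra alignment term between the target-language query and the English passage. The sketch below is a guess at the structure only: the direction of the reverse term, the equal weighting, and all names are assumptions rather than CLEAR's published formulation (see the released code for the real loss).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(a * a for a in v)))

def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE: pull the query toward the positive, away from negatives."""
    logits = [cosine(query, positive) / tau] + \
             [cosine(query, n) / tau for n in negatives]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)

def clear_style_loss(tgt_query, en_bridge, tgt_passage, negatives):
    """Standard query->passage term plus a bridge term aligning the
    target-language query with the English passage."""
    forward = info_nce(tgt_query, tgt_passage, negatives)
    bridge = info_nce(tgt_query, en_bridge, negatives)
    return 0.5 * (forward + bridge)

query = [1.0, 0.0]            # target-language query embedding
en_passage = [1.0, 0.1]       # well-aligned English bridge passage
tgt_passage = [0.9, 0.05]     # target-language passage
hard_negative = [0.0, 1.0]
loss = clear_style_loss(query, en_passage, tgt_passage, [hard_negative])
```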

[64] “OK Aura, Be Fair With Me”: Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection

Fernando López, Paula Delgado-Santos, Pablo Gómez, David Solans, Jordi Luque

Main category: cs.CL

TL;DR: Demographics-agnostic training techniques (data augmentation and knowledge distillation from pre-trained speech models) significantly reduce demographic bias in Wake-up Word detection across sex, age, and accent groups.

DetailsMotivation: Voice interfaces have widespread use, but Wake-up Word detection systems exhibit persistent demographic biases across different speaker populations (sex, age, accent), creating fairness challenges that need to be addressed.

Method: Uses OK Aura database with demographics-agnostic training (excluding demographic labels during training, reserved only for evaluation). Explores two techniques: (1) data augmentation for better generalization, and (2) knowledge distillation from pre-trained foundational speech models.

Result: Techniques significantly reduce demographic bias: one method achieves Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent compared to baseline, leading to more equitable performance across speaker groups.

Conclusion: Demographics-agnostic training methodologies (particularly data augmentation and knowledge distillation) are effective for promoting fairness in Wake-up Word detection systems without requiring demographic labels during training.

Abstract: Voice-based interfaces are widely used; however, achieving fair Wake-up Word detection across diverse speaker populations remains a critical challenge due to persistent demographic biases. This study evaluates the effectiveness of demographics-agnostic training techniques in mitigating performance disparities among speakers of varying sex, age, and accent. We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes. We explore (i) data augmentation techniques to enhance model generalization and (ii) knowledge distillation of pre-trained foundational speech models. The experimental results indicate that these demographics-agnostic training techniques markedly reduce demographic bias, leading to a more equitable performance profile across different speaker groups. Specifically, one of the evaluated techniques achieves a Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent when compared to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.
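
The fairness metric can be sketched under one common reading of predictive disparity: the gap between the worst and best group-wise error rates, with the reported numbers being relative reductions of that gap. The paper's exact definition may differ; this is an illustrative assumption.

```python
def group_error_rates(y_true, y_pred, groups):
    """False-prediction rate per demographic group."""
    totals, errors = {}, {}
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] = totals.get(g, 0) + 1
        errors[g] = errors.get(g, 0) + (t != p)
    return {g: errors[g] / totals[g] for g in totals}

def predictive_disparity(y_true, y_pred, groups):
    """Gap between the worst and best group-wise error rates."""
    rates = group_error_rates(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())

def disparity_reduction(baseline_pd, mitigated_pd):
    """Relative reduction, reported as a percentage."""
    return 100.0 * (baseline_pd - mitigated_pd) / baseline_pd

y_true = [0] * 20
groups = ["A"] * 10 + ["B"] * 10
baseline_pred = [1, 1, 1] + [0] * 7 + [1] + [0] * 9   # A errs 3/10, B 1/10
mitigated_pred = [1, 1] + [0] * 8 + [1] + [0] * 9     # A errs 2/10, B 1/10
pd_base = predictive_disparity(y_true, baseline_pred, groups)
pd_mit = predictive_disparity(y_true, mitigated_pred, groups)
reduction = disparity_reduction(pd_base, pd_mit)      # 50% reduction
```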

[65] AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning

Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan

Main category: cs.CL

TL;DR: AgentGL is a reinforcement learning framework that enables LLMs to navigate and reason over graph-structured data using graph-native tools and topology-aware exploration.

DetailsMotivation: Existing agentic frameworks treat external information as unstructured text and fail to leverage topological dependencies in real-world data, limiting LLMs' ability to reason over complex relational environments.

Method: Proposes AgentGL, an RL-driven framework that equips LLM agents with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking, and employs graph-conditioned curriculum RL for stable policy learning.

Result: Achieves absolute improvements up to 17.5% in node classification and 28.4% in link prediction across diverse Text-Attributed Graph benchmarks, outperforming GraphLLMs and GraphRAG baselines.

Conclusion: Agentic Graph Learning (AGL) is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments by bridging graph learning with LLM-based inference.

Abstract: Large Language Models (LLMs) increasingly rely on agentic capabilities (iterative retrieval, tool use, and decision-making) to overcome the limits of static, parametric knowledge. Yet existing agentic frameworks treat external information as unstructured text and fail to leverage the topological dependencies inherent in real-world data. To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference. Specifically, we propose AgentGL, the first reinforcement learning (RL)-driven framework for AGL. AgentGL equips an LLM agent with graph-native tools for multi-scale exploration, regulates tool usage via search-constrained thinking to balance accuracy and efficiency, and employs a graph-conditioned curriculum RL strategy to stabilize long-horizon policy learning without step-wise supervision. Across diverse Text-Attributed Graph (TAG) benchmarks and multiple LLM backbones, AgentGL substantially outperforms strong GraphLLMs and GraphRAG baselines, achieving absolute improvements of up to 17.5% in node classification and 28.4% in link prediction. These results demonstrate that AGL is a promising frontier for enabling LLMs to autonomously navigate and reason over complex relational environments. The code is publicly available at https://github.com/sunyuanfu/AgentGL.

[66] Evaluating Learner Representations for Differentiation Prior to Instructional Outcomes

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Main category: cs.CL

TL;DR: The paper introduces “distinctiveness” as a representation-level metric to evaluate learner representations by measuring how each learner differs from others using pairwise distances, without requiring labels or task-specific evaluation.

DetailsMotivation: Learner representations are crucial for educational AI systems, but it's often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context-dependent. There's a need for evaluation methods that don't rely on specific instructional outcomes.

Method: The authors introduce “distinctiveness” - a representation-level measure that evaluates how each learner differs from others in a cohort using pairwise distances. They compare learner-level representations (aggregating patterns across a student’s interactions over time) with interaction-level representations (based on individual questions). The study uses student-authored questions collected through a conversational AI agent in an online learning environment.

Result: Learner-level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. This demonstrates that learner representations can be evaluated independently of instructional outcomes.

Conclusion: The distinctiveness metric provides a practical pre-deployment criterion for assessing whether a representation supports differentiated modeling or personalization in educational AI systems, without requiring task-specific evaluation or labels.

Abstract: Learner representations play a central role in educational AI systems, yet it is often unclear whether they preserve meaningful differences between students when instructional outcomes are unavailable or highly context-dependent. This work examines how to evaluate learner representations based on whether they retain separation between learners under a shared comparison rule. We introduce distinctiveness, a representation-level measure that evaluates how each learner differs from others in the cohort using pairwise distances, without requiring clustering, labels, or task-specific evaluation. Using student-authored questions collected through a conversational AI agent in an online learning environment, we compare representations based on individual questions with representations that aggregate patterns across a student’s interactions over time. Results show that learner-level representations yield higher separation, stronger clustering structure, and more reliable pairwise discrimination than interaction-level representations. These findings demonstrate that learner representations can be evaluated independently of instructional outcomes and provide a practical pre-deployment criterion using distinctiveness as a diagnostic metric for assessing whether a representation supports differentiated modeling or personalization.
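
The distinctiveness measure can be sketched as each learner's mean pairwise distance to the rest of the cohort. The paper only specifies pairwise distances under a shared comparison rule; the Euclidean distance and the mean aggregation below are illustrative assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distinctiveness(representations):
    """Per-learner distinctiveness: mean pairwise distance from each
    learner to every other learner in the cohort. No clustering or
    labels required."""
    n = len(representations)
    return [
        sum(euclidean(representations[i], representations[j])
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# A cohort where learner 2 sits far from a tight cluster of the others:
# a representation that preserves this separation scores learner 2 high.
cohort = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1]]
scores = distinctiveness(cohort)
```

A representation in which all learners collapse to similar vectors would yield uniformly low scores, which is the failure mode the metric is designed to flag before deployment.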

[67] LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

Xiao Qin, Xingyi Song, Tong Liu, Hatim Laalej, Zepeng Liu, Yunpeng Zhu, Ligang He

Main category: cs.CL

TL;DR: LoRM is a self-supervised framework that treats rotating machinery signals as a machine language, using tokenization and sequence prediction for multi-modal condition monitoring by fine-tuning pre-trained language models.

DetailsMotivation: Traditional rotating machinery condition monitoring relies on hand-crafted signal processing features, which lack generalization and require domain expertise. The authors propose viewing machinery signals as a "machine language" that can be understood using language modeling techniques.

Method: 1) Tokenize local rotating machinery signals into discrete symbolic units; 2) Treat future signal evolution as a sequence prediction problem from multi-sensor context; 3) Partially fine-tune general-purpose pre-trained language models on industrial signals; 4) Use token prediction errors as health indicators for condition monitoring.

Result: Demonstrated stable real-time tracking and strong cross-tool generalization in tool condition monitoring experiments. The framework successfully bridges language modeling with industrial signal analysis without training large models from scratch.

Conclusion: LoRM provides a practical framework for applying language modeling techniques to industrial signal analysis, enabling effective multi-modal rotating machinery understanding and condition monitoring through self-supervised learning.

Abstract: We present LoRM (Language of Rotating Machinery), a self-supervised framework for multi-modal rotating-machinery signal understanding and real-time condition monitoring. LoRM is built on the idea that rotating-machinery signals can be viewed as a machine language: local signals can be tokenised into discrete symbolic units, and their future evolution can be predicted from observed multi-sensor context. Unlike conventional signal-processing methods that rely on hand-crafted transforms and features, LoRM reformulates multi-modal sensor data as a token-based sequence-prediction problem. For each data window, the observed context segment is retained in continuous form, while the future target segment of each sensing channel is quantised into a discrete token. Then, efficient knowledge transfer is achieved by partially fine-tuning a general-purpose pre-trained language model on industrial signals, avoiding the need to train a large model from scratch. Finally, condition monitoring is performed by tracking token-prediction errors as a health indicator, where increasing errors indicate degradation. In-situ tool condition monitoring (TCM) experiments demonstrate stable real-time tracking and strong cross-tool generalisation, showing that LoRM provides a practical bridge between language modelling and industrial signal analysis. The source code is publicly available at https://github.com/Q159753258/LormPHM.
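
The tokenise-and-track idea can be sketched end to end: quantise the future segment of each window into a discrete token, compare it against the model's predicted token, and use the error rate as the health indicator. The uniform quantiser and the stubbed predictor below are illustrative assumptions; in LoRM the predictor is a partially fine-tuned language model.

```python
def quantise(value, n_tokens=16, lo=-1.0, hi=1.0):
    """Map a continuous signal value to one of n_tokens discrete symbols."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * n_tokens)
    return min(idx, n_tokens - 1)

def health_indicator(predicted_tokens, future_segments, n_tokens=16):
    """Token-prediction error rate over a stream of windows: each window's
    future segment is quantised and compared with the predicted token.
    A rising error rate signals degradation."""
    errors = [int(pred != quantise(actual, n_tokens))
              for pred, actual in zip(predicted_tokens, future_segments)]
    return sum(errors) / len(errors)

# Healthy phase: the (stub) predictor matches the quantised signal.
healthy = [0.1, 0.12, 0.11, 0.09]
preds = [quantise(v) for v in healthy]
h_healthy = health_indicator(preds, healthy)     # 0.0: no degradation
# Degraded phase: the signal drifts, predictions trained on healthy
# behaviour now miss, and the indicator rises.
degraded = [0.6, 0.7, 0.65, 0.8]
h_degraded = health_indicator(preds, degraded)
```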

[68] Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

Xiangming Gu, Soham De, Larisa Markeeva, Petar Veličković, Razvan Pascanu

Main category: cs.CL

TL;DR: Parallel sampling outperforms sequential sampling in large reasoning models due to better exploration, not aggregation or context length issues.

DetailsMotivation: Large Reasoning Models (LRMs) perform well on challenging tasks like math and coding, but require multiple sampling for high-quality solutions. The paper investigates why parallel sampling (sampling multiple solutions independently) outperforms sequential sampling (sampling solutions one after another conditioned on previous ones), despite sequential sampling having more theoretical representation power.

Method: The authors systematically compare parallel vs sequential sampling strategies across various model families (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding). They test three hypotheses: (1) parallel sampling benefits from aggregation operators, (2) sequential sampling suffers from longer context requirements, and (3) sequential sampling leads to less exploration due to conditioning on previous answers.

Result: Empirical evidence shows that aggregation and context length are not the main causes of the performance gap. Instead, the lack of exploration in sequential sampling (due to conditioning on previous answers) plays a considerably larger role in explaining why parallel sampling outperforms sequential sampling.

Conclusion: The performance gap between parallel and sequential sampling in LRMs is primarily due to exploration limitations in sequential approaches, not aggregation benefits or context length issues. This insight helps understand sampling strategy trade-offs for reasoning tasks.

Abstract: Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high-quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches rigorously and observe, in line with previous work, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we formulate three hypotheses about this behavior: (i) parallel sampling outperforms because of the aggregation operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that aggregation and context length are not the main culprits behind the performance gap. In contrast, the lack of exploration appears to play a considerably larger role, and we argue that this is one main cause of the performance gap.
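
As a concrete picture of the two strategies being compared, here is a minimal sketch; the `sample` callable is a stand-in for an LRM query, and the prompt format used for conditioning on previous attempts is an illustrative assumption.

```python
import random
from collections import Counter

def parallel_sampling(sample, prompt, k):
    """Draw k answers independently, then aggregate with a majority vote."""
    answers = [sample(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def sequential_sampling(sample, prompt, k):
    """Draw k answers one after another, each conditioned on all previous
    attempts; return the final answer."""
    history = []
    for _ in range(k):
        conditioned = prompt + "".join(f"\nPrevious attempt: {a}" for a in history)
        history.append(sample(conditioned))
    return history[-1]
```

Under the paper's exploration hypothesis, the conditioning step in the sequential loop is the liability: once an early answer enters the context, later samples tend to anchor to it rather than explore alternatives.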

[69] Mechanistic Circuit-Based Knowledge Editing in Large Language Models

Tianyi Zhao, Yinhan He, Wendy Zheng, Chen Chen

Main category: cs.CL

TL;DR: MCircKE is a mechanistic circuit-based knowledge editing framework that addresses the “Reasoning Gap” in LLMs by identifying and surgically updating causal circuits responsible for specific reasoning tasks.

DetailsMotivation: Existing knowledge editing methods for LLMs can patch isolated facts but suffer from a "Reasoning Gap" - models recall edited facts but fail to utilize them in multi-step reasoning chains, limiting real-world deployment in dynamic environments.

Method: MCircKE uses a “map-and-adapt” approach: first identifies causal circuits responsible for specific reasoning tasks (capturing both fact storage and logical consequence routing), then surgically updates parameters exclusively within the mapped circuit.

Result: Extensive experiments on the MQuAKE-3K benchmark demonstrate MCircKE’s effectiveness for multi-hop reasoning in knowledge editing, showing improved utilization of edited facts in reasoning chains.

Conclusion: MCircKE bridges the Reasoning Gap in knowledge editing by providing a precise mechanistic circuit-based approach that enables LLMs to effectively utilize edited knowledge in complex reasoning tasks.

Abstract: Deploying Large Language Models (LLMs) in real-world dynamic environments raises the challenge of updating their pre-trained knowledge. While existing knowledge editing methods can reliably patch isolated facts, they frequently suffer from a “Reasoning Gap”, where the model recalls the edited fact but fails to utilize it in multi-step reasoning chains. To bridge this gap, we introduce MCircKE (\underline{M}echanistic \underline{Circ}uit-based \underline{K}nowledge \underline{E}diting), a novel framework that enables a precise “map-and-adapt” editing procedure. MCircKE first identifies the causal circuits responsible for a specific reasoning task, capturing both the storage of the fact and the routing of its logical consequences. It then surgically updates parameters exclusively within this mapped circuit. Extensive experiments on the MQuAKE-3K benchmark demonstrate the effectiveness of the proposed method for multi-hop reasoning in knowledge editing.

[70] FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents

Cherifa Ben Khelil, Jean-Yves Antoine, Anaïs Halftermeyer, Frédéric Rayar, Mathieu Thebaud

Main category: cs.CL

TL;DR: French-YMCA corpus: A specialized linguistic resource for children and adolescents with 39,200 text files and 22.5M words, designed to train age-appropriate language models.

DetailsMotivation: Children have unique and evolving language skills that differ from adults, requiring specialized linguistic resources to develop language models that can understand and anticipate youth language for age-appropriate digital interactions.

Method: Created a comprehensive corpus of 39,200 text files containing 22,471,898 words specifically tailored for children and adolescents, featuring diverse sources, consistent grammar/spelling, and open online accessibility.

Result: Developed the French-YMCA corpus as a new linguistic resource with extensive coverage of youth language, providing a foundation for training language models that can generate age-appropriate responses and suggestions.

Conclusion: The French-YMCA corpus addresses the gap in child-specific linguistic resources and enables the development of language models better adapted to youth comprehension levels for improved digital interactions.

Abstract: In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and the commitment to providing open online accessibility for all. Such a corpus can serve as the foundation for training language models that understand and anticipate youth’s language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users of this age.

[71] FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

Michael Krumdick, Varshini Reddy, Shivani Chaudhary, William Day, Maarij Ahmed, Hayan Haqqi, Muhammad Ahsen Fahim, Hanzallah Amjad, Ahmad Orakzai, Aqsa Gul, Chris Tanner

Main category: cs.CL

TL;DR: FrontierFinance is a benchmark for evaluating LLMs on complex financial modeling tasks that require professional expertise, showing human experts still outperform state-of-the-art AI systems.

DetailsMotivation: Addresses the lack of robust benchmarks for measuring AI performance on practical professional expertise in finance, particularly as AI-driven labor displacement concerns intensify in knowledge-intensive sectors. Current benchmarks fail to capture real-world professional tasks, and there's an absence of accountability mechanisms in LLM deployments.

Method: Developed FrontierFinance benchmark with 25 complex financial modeling tasks across five core finance models, requiring ~18 hours of skilled human labor per task. Created with financial professionals to reflect industry-standard workflows. Used human experts to define tasks, create rubrics, grade LLMs, and perform tasks as human baselines.

Result: Human experts received higher scores on average and were more likely to provide client-ready outputs than current state-of-the-art LLM systems.

Conclusion: There’s a significant gap between current AI capabilities and human professional expertise in complex financial modeling tasks, highlighting the need for better benchmarks and accountability mechanisms for LLM deployments in professional domains.

Abstract: As concerns surrounding AI-driven labor displacement intensify in knowledge-intensive sectors, existing benchmarks fail to measure performance on tasks that define practical professional expertise. Finance, in particular, has been identified as a domain with high AI exposure risk, yet lacks robust benchmarks to track real-world developments. This gap is compounded by the absence of clear accountability mechanisms in current Large Language Model (LLM) deployments. To address this, we introduce FrontierFinance, a long-horizon benchmark of 25 complex financial modeling tasks across five core finance models, requiring an average of over 18 hours of skilled human labor per task to complete. Developed with financial professionals, the benchmark reflects industry-standard financial modeling workflows and is paired with detailed rubrics for structured evaluation. We engage human experts to define the tasks, create rubrics, grade LLMs, and perform the tasks themselves as human baselines. We demonstrate that our human experts both receive higher scores on average and are more likely to provide client-ready outputs than current state-of-the-art systems.

[72] “I See What You Did There”: Can Large Vision-Language Models Understand Multimodal Puns?

Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du, Jun Wang, Zhihui Fu, Jinbao Li, Shouling Ji

Main category: cs.CL

TL;DR: The paper introduces MultiPun, a multimodal pun dataset and benchmark for evaluating Vision-Language Models’ ability to understand humor through cross-modal reasoning, showing current VLMs struggle with pun comprehension but can be improved with targeted strategies.

DetailsMotivation: There's a lack of systematic study on Vision-Language Models' ability to understand multimodal puns due to scarcity of rigorous benchmarks, despite puns being a common form of humor that exploits polysemy and phonetic similarity across visual and textual modalities.

Method: 1) Proposed a multimodal pun generation pipeline; 2) Created MultiPun dataset with diverse pun types and adversarial non-pun distractors; 3) Evaluated existing VLMs on pun comprehension; 4) Developed prompt-level and model-level strategies to enhance pun understanding.

Result: Most VLMs struggle to distinguish genuine puns from adversarial distractors. The proposed enhancement strategies achieved an average 16.5% improvement in F1 scores for pun comprehension.

Conclusion: The work provides valuable insights for developing future VLMs that can master human-like humor through cross-modal reasoning, highlighting current limitations and offering improvement strategies for multimodal pun understanding.

Abstract: Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.

[73] BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui

Main category: cs.CL

TL;DR: BOSCH is a training-free method for optimizing sliding-window attention in LLMs by performing black-box binary optimization for head selection, outperforming existing layer-level and static head-level hybridization approaches.

DetailsMotivation: Existing methods for hybridizing LLMs with sliding-window attention have limitations: layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement where a head's local/global behavior can change after hybridization.

Method: BOSCH formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (1) layer-importance detection via small-budget black-box probes, (2) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (3) grouped head-level optimization within ratio buckets.

Result: Experiments on 4 LLMs ranging from 1.7B to 30B parameters across 4 SWA ratios show BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recovers original long-context performance faster and to a higher level.

Conclusion: BOSCH demonstrates the importance of performing head-level selection for each target SWA ratio rather than relying on fixed locality rankings, with analysis revealing substantial turnover in selected heads across different SWA ratios.

Abstract: Post-training hybridization of large language models (LLMs) often replaces quadratic self-attention with sliding-window attention (SWA) to reduce KV cache usage and improve latency. Existing hybridization schemes are typically defined either at the layer level (e.g., interleaving) or at the head level via static rankings from local to global. Layer-level schemes ignore that local and global dependencies are routed through heads within the same layer, while static head-level rankings suffer from entanglement: a head’s local/global behavior can change after hybridization. We propose BOSCH, Black-box Binary Optimization for Short-context Head Selection, a training-free method that formulates the problem as a Large Neighborhood Search and decomposes it into three subproblems: (i) layer-importance detection via small-budget black-box probes, (ii) adaptive per-layer SWA-ratio assignment based on these sensitivities, and (iii) grouped head-level optimization within ratio buckets. Extensive experiments on 4 LLMs ranging from 1.7B to 30B parameters, across 4 SWA ratios, show that BOSCH consistently outperforms layer-level heuristics and 6 strong static head-level methods, with larger gains at higher SWA ratios. Under continual pretraining, BOSCH recovers original long-context performance faster and to a higher level. Analysis of the selected heads reveals substantial turnover for BOSCH across different SWA ratios, underscoring the importance of performing head-level selection for each target ratio rather than relying on fixed locality rankings.
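
The head-level search can be illustrated with a toy Large Neighborhood Search over a binary head mask. This is a schematic sketch, not BOSCH itself: `score` stands in for a small-budget black-box probe, and the destroy-and-repair move swaps a few heads while holding the SWA ratio fixed.

```python
import random

def lns_head_selection(score, n_heads, swa_ratio, iters=200, neighborhood=4, seed=0):
    """Large Neighborhood Search over a binary mask (1 = convert head to
    sliding-window attention). `score` is a black-box objective such as a
    short-context probe accuracy; the SWA head count is held fixed."""
    rng = random.Random(seed)
    k = int(round(n_heads * swa_ratio))
    mask = [1] * k + [0] * (n_heads - k)
    rng.shuffle(mask)
    best, best_score = mask[:], score(mask)
    for _ in range(iters):
        cand = best[:]
        # Destroy-and-repair: swap a few SWA heads with full-attention heads,
        # which preserves the target ratio exactly.
        ones = [i for i, b in enumerate(cand) if b == 1]
        zeros = [i for i, b in enumerate(cand) if b == 0]
        for i, j in zip(rng.sample(ones, min(neighborhood, len(ones))),
                        rng.sample(zeros, min(neighborhood, len(zeros)))):
            cand[i], cand[j] = 0, 1
        s = score(cand)
        if s > best_score:
            best, best_score = cand[:], s
    return best, best_score
```

Because only improving candidates are accepted, the best mask found is monotone in score, and each probe call stays a black-box evaluation, which is the property that makes the method training-free.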

[74] FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

Fan Zhang, Mingzi Song, Rania Elbadry, Yankai Chen, Shaobo Wang, Yixi Zhou, Xunwen Zheng, Yueru He, Yuyang Dai, Georgi Georgiev, Ayesha Gull, Muhammad Usman Safder, Fan Wu, Liyuan Meng, Fengxian Ji, Junning Zhao, Xueqing Peng, Jimin Huang, Yu Chen, Xue Liu, Preslav Nakov, Zhuohan Xie

Main category: cs.CL

TL;DR: FinReporting is an agentic workflow system for cross-jurisdiction financial reporting that uses LLMs as constrained verifiers rather than free-form generators, addressing structural differences in accounting standards across markets.

DetailsMotivation: Financial reporting systems using LLMs typically assume single-market settings and fail to address structural differences across jurisdictions, including variations in accounting taxonomies, tagging infrastructures (XBRL vs. PDF), and aggregation conventions, making cross-jurisdiction reporting challenging for semantic alignment and verification.

Method: The system builds a unified canonical ontology covering Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages: filing acquisition, extraction, canonical mapping, and anomaly logging. LLMs are deployed as constrained verifiers under explicit decision rules with evidence grounding rather than as free-form generators.

Result: Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. An interactive demo supports cross-market inspection and structured export of localized financial statements.

Conclusion: FinReporting demonstrates an effective approach to cross-jurisdiction financial reporting by using LLMs as constrained verifiers within a structured workflow, addressing the challenges of semantic alignment across different accounting standards and reporting infrastructures.

Abstract: Financial reporting systems increasingly use large language models (LLMs) to extract and summarize corporate disclosures. However, most assume a single-market setting and do not address structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions make cross-jurisdiction reporting a semantic alignment and verification challenge. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system builds a unified canonical ontology over Income Statement, Balance Sheet, and Cash Flow, and decomposes reporting into auditable stages including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than using LLMs as free-form generators, FinReporting deploys them as constrained verifiers under explicit decision rules and evidence grounding. Evaluated on annual filings from the US, Japan, and China, the system improves consistency and reliability under heterogeneous reporting regimes. We release an interactive demo supporting cross-market inspection and structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo . The video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk

[75] The Model Agreed, But Didn’t Learn: Diagnosing Surface Compliance in Large Language Models

Xiaojie Gu, Ziying Huang, Weicong Hong, Jian Xie, Renze Lou, Kai Zhang

Main category: cs.CL

TL;DR: The paper introduces a diagnostic framework to evaluate knowledge editing in LLMs, revealing that current methods achieve high benchmark scores through surface compliance rather than genuine memory modification, and that recursive edits cause cognitive instability.

DetailsMotivation: Current knowledge editing methods show high success rates on benchmarks, but it's unclear if they genuinely modify internal memory or just mimic target outputs. There's a need for better evaluation frameworks that reflect real-world application environments to assess true memory modification.

Method: The authors introduce a diagnostic framework using discriminative self-assessment under in-context learning (ICL) settings. This approach subjects models to probing that reveals subtle behavioral nuances induced by memory modifications, going beyond standard benchmark evaluations.

Result: The framework reveals “Surface Compliance” - editors achieve high benchmark scores by mimicking target outputs without structurally overwriting internal beliefs. Recursive modifications accumulate representational residues, causing cognitive instability and permanently diminishing memory reversibility.

Conclusion: Current knowledge editing paradigms have significant risks, and robust memory modification is crucial for building trustworthy, long-term sustainable LLM systems. The diagnostic framework highlights the need for better evaluation methods.

Abstract: Large Language Models (LLMs) internalize vast world knowledge as parametric memory, yet inevitably inherit the staleness and errors of their source corpora. Consequently, ensuring the reliability and malleability of these internal representations is imperative for trustworthy real-world deployment. Knowledge editing offers a pivotal paradigm for surgically modifying memory without retraining. However, while recent editors demonstrate high success rates on standard benchmarks, it remains questionable whether current evaluation frameworks that rely on assessing output under specific prompting conditions can reliably authenticate genuine memory modification. In this work, we introduce a simple diagnostic framework that subjects models to discriminative self-assessment under in-context learning (ICL) settings that better reflect real-world application environments, specifically designed to scrutinize the subtle behavioral nuances induced by memory modifications. This probing reveals a pervasive phenomenon of Surface Compliance, where editors achieve high benchmark scores by merely mimicking target outputs without structurally overwriting internal beliefs. Moreover, we find that recursive modifications accumulate representational residues, triggering cognitive instability and permanently diminishing the reversibility of the model’s memory state. These insights underscore the risks of current editing paradigms and highlight the pivotal role of robust memory modification in building trustworthy, long-term sustainable LLM systems. Code is available at https://github.com/XiaojieGu/SA-MCQ.

[76] Disentangling MLP Neuron Weights in Vocabulary Space

Asaf Avrahamy, Yoav Gur-Arieh, Mor Geva

Main category: cs.CL

TL;DR: ROTATE is a data-free method that disentangles MLP neurons in weight space by optimizing rotations to maximize vocabulary-space kurtosis, recovering interpretable vocabulary channels without forward passes.

DetailsMotivation: Interpreting information encoded in model weights remains a fundamental challenge in mechanistic interpretability. Current methods often require forward passes or activation data, limiting scalability and fine-grained analysis.

Method: ROTATE uses a statistical observation that neurons encoding coherent concepts exhibit high kurtosis when projected onto vocabulary space. It optimizes rotations of neuron weights to maximize vocabulary-space kurtosis, recovering sparse interpretable directions called vocabulary channels without needing forward passes.

Result: Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it show ROTATE consistently recovers faithful vocabulary channels. Ablating individual channels selectively disables corresponding input activations or concept promotion. Channel-level descriptions outperform optimized activation-based baselines by 2-3x in comparisons.

Conclusion: ROTATE provides a scalable, data-free decomposition of neuron weights into interpretable vocabulary channels, offering fine-grained building blocks for interpreting language models without forward passes.

Abstract: Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we introduce ROTATE (Rotation-Optimized Token Alignment in weighT spacE), a data-free method, requiring no forward passes, that disentangles MLP neurons directly in weight space. Our approach relies on a key statistical observation: neurons that encode coherent, monosemantic concepts exhibit high kurtosis when projected onto the model’s vocabulary. By optimizing rotations of neuron weights to maximize their vocabulary-space kurtosis, our method recovers sparse, interpretable directions which we name vocabulary channels. Experiments on Llama-3.1-8B-Instruct and Gemma-2-2B-it demonstrate that ROTATE consistently recovers vocabulary channels that are faithful to the neuron’s behavior: ablating individual channels selectively disables corresponding input activations or the promotion of specific concepts. Moreover, aggregating channel-level descriptions yields comprehensive neuron descriptions that outperform optimized activation-based baselines by 2-3x in head-to-head comparisons. By providing a data-free decomposition of neuron weights, ROTATE offers a scalable, fine-grained building block for interpreting LMs.
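
The key statistic is simple to state: project a weight direction through the unembedding and measure excess kurtosis; monosemantic directions concentrate mass on few tokens and so have heavy-tailed projections. A toy NumPy sketch of the idea, using a brute-force scan over 2-D rotations in place of ROTATE's actual optimization (the synthetic unembedding and the two-neuron setup are illustrative assumptions):

```python
import numpy as np

def vocab_kurtosis(direction, unembed):
    """Excess kurtosis of a weight direction projected onto the vocabulary.
    Sparse, concept-aligned directions put mass on few tokens, giving heavy tails."""
    logits = unembed @ direction          # shape: (vocab_size,)
    z = (logits - logits.mean()) / logits.std()
    return np.mean(z ** 4) - 3.0

def best_rotation_2d(d1, d2, unembed, steps=180):
    """Scan 2-D rotations of a neuron pair and keep the angle whose rotated
    directions maximize total vocabulary-space kurtosis."""
    best_angle, best_k = 0.0, -np.inf
    for theta in np.linspace(0, np.pi / 2, steps):
        c, s = np.cos(theta), np.sin(theta)
        r1, r2 = c * d1 + s * d2, -s * d1 + c * d2
        k = vocab_kurtosis(r1, unembed) + vocab_kurtosis(r2, unembed)
        if k > best_k:
            best_angle, best_k = theta, k
    return best_angle, best_k
```

With two "clean" directions mixed into a pair of neurons, the kurtosis objective is maximized at the rotation that demixes them, which is the behavior the paper exploits at scale with gradient-based optimization over many neurons.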

[77] BiMind: A Dual-Head Reasoning Model with Attention-Geometry Adapter for Incorrect Information Detection

Zhongxing Zhang, Emily K. Vraga, Jisu Huh, Jaideep Srivastava

Main category: cs.CL

TL;DR: BiMind: A dual-head reasoning framework for misinformation detection that separates content-internal reasoning from knowledge-augmented reasoning using attention geometry adaptation, self-retrieval knowledge mechanisms, and uncertainty-aware fusion strategies.

DetailsMotivation: Current misinformation detection approaches struggle to balance textual content verification with external knowledge modification under collapsed attention geometries, limiting their ability to jointly process content and knowledge effectively.

Method: Proposes BiMind with three innovations: 1) attention geometry adapter that reshapes attention logits via token-conditioned offsets, 2) self-retrieval knowledge mechanism using kNN retrieval and feature-wise linear modulation, and 3) uncertainty-aware fusion strategies including entropy-gated fusion and trainable agreement head with symmetric KL agreement regularizer.

Result: BiMind outperforms advanced detection approaches on public datasets and provides interpretable diagnostics on when and why knowledge matters, with a novel Value-of-eXperience (VoX) metric to quantify knowledge contributions.

Conclusion: BiMind effectively addresses the challenge of joint content verification and knowledge modification in misinformation detection through disentangled reasoning heads and interpretable knowledge integration mechanisms.

Abstract: Incorrect information poses significant challenges by disrupting content veracity and integrity, yet most detection approaches struggle to jointly balance textual content verification with external knowledge modification under collapsed attention geometries. To address this issue, we propose a dual-head reasoning framework, BiMind, which disentangles content-internal reasoning from knowledge-augmented reasoning. In BiMind, we introduce three core innovations: (i) an attention geometry adapter that reshapes attention logits via token-conditioned offsets and mitigates attention collapse; (ii) a self-retrieval knowledge mechanism, which constructs an in-domain semantic memory through kNN retrieval and injects retrieved neighbors via feature-wise linear modulation; (iii) the uncertainty-aware fusion strategies, including entropy-gated fusion and a trainable agreement head, stabilized by a symmetric Kullback-Leibler agreement regularizer. To quantify the knowledge contributions, we define a novel metric, Value-of-eXperience (VoX), to measure instance-wise logit gains from knowledge-augmented reasoning. Experiment results on public datasets demonstrate that our BiMind model outperforms advanced detection approaches and provides interpretable diagnostics on when and why knowledge matters.
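
Of the three components, the entropy-gated fusion is the easiest to make concrete: gate the knowledge-augmented head by the content-only head's normalized predictive entropy. A minimal sketch under that reading; the linear gating form is an assumption, and the paper's trainable agreement head and KL regularizer are omitted.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def entropy_gated_fusion(p_content, p_knowledge):
    """Fuse two heads' class distributions: the more uncertain (high-entropy)
    the content-only head is, the more weight the knowledge head receives."""
    h = entropy(p_content) / np.log(len(p_content))   # normalized to [0, 1]
    fused = (1 - h) * p_content + h * p_knowledge
    return fused / fused.sum()
```

A confident content head dominates the fusion; a near-uniform one defers almost entirely to the knowledge-augmented head.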

[78] A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein, Minu A. Aghevli, Kamonica L. Craig, Elizabeth M. Oliva, Joseph Erdos, Jodie Trafton, Ioana Danciu

Main category: cs.CL

TL;DR: A multi-stage validation framework for LLM-based clinical information extraction that enables scalable assessment without intensive manual annotation, demonstrated on substance use disorder diagnosis extraction.

DetailsMotivation: LLMs show promise for extracting clinical information from health records, but real-world deployment is limited by lack of scalable validation methods that don't require intensive manual annotation.

Method: Multi-stage framework with prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using a higher-capacity judge LLM, selective expert review, and external predictive validity analysis.

Result: Framework removed 14.59% of unsupported LLM extractions; judge LLM showed substantial agreement with experts (AC1=0.80); primary LLM achieved F1=0.80; extracted diagnoses predicted specialty care engagement better than structured baselines (AUC=0.80).

Conclusion: Scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation using the proposed multi-stage validation framework.

Abstract: Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM’s assessments showed substantial agreement with subject matter expert review (Gwet’s AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
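
The agreement figure reported between the judge LLM and expert reviewers uses Gwet's AC1, a standard chance-corrected statistic that is easy to compute; a small sketch of the two-rater categorical form:

```python
def gwet_ac1(labels_a, labels_b):
    """Gwet's AC1 chance-corrected agreement for two raters over categorical
    labels; more stable than Cohen's kappa under skewed prevalence."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    cats = sorted(set(labels_a) | set(labels_b))
    # Observed agreement.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the average marginal prevalence of each category.
    pe = sum(
        ((labels_a.count(c) + labels_b.count(c)) / (2 * n))
        * (1 - (labels_a.count(c) + labels_b.count(c)) / (2 * n))
        for c in cats
    ) / (len(cats) - 1)
    return (po - pe) / (1 - pe)
```

Because the chance term shrinks as prevalence becomes extreme, AC1 avoids the paradox where kappa collapses on rare-positive data, which matters when most notes lack a given diagnosis.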

[79] From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

Hongxu Zhou

Main category: cs.CL

TL;DR: Enforcing structured reflection through constrained decoding doesn’t improve LLM self-correction and causes “structure snowballing” where models get trapped in formatting errors instead of fixing semantic issues.

DetailsMotivation: To investigate whether structured reflection through constrained decoding can mitigate hallucination snowballing in LLM self-correction without external tools or training, addressing the tension between structural constraints and model autonomy.

Method: Used Outlines-based constrained decoding to enforce structured reflection on an 8-billion-parameter model (Qwen3-8B), evaluating whether structural constraints alone can improve self-correction in open-ended reasoning tasks.

Result: Structural constraints didn’t improve self-correction performance; instead triggered “structure snowballing” where models get trapped in formatting errors, achieving superficial syntactic alignment but failing to detect deeper semantic errors.

Conclusion: Constrained decoding imposes an “alignment tax” where strict formatting rules create cognitive load that prevents meaningful error correction, highlighting tension between structural granularity and internal model capacity.

Abstract: Intrinsic self-correction in Large Language Models (LLMs) frequently fails in open-ended reasoning tasks due to “hallucination snowballing,” a phenomenon in which models recursively justify early errors during free-text reflection. While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy. This study investigates whether enforcing structured reflection purely through Outlines-based constrained decoding can disrupt error propagation without additional training. Evaluating an 8-billion-parameter model (Qwen3-8B), we show that simply imposing structural constraints does not improve self-correction performance. Instead, it triggers a new failure mode termed “structure snowballing.” We find that the cognitive load required to satisfy strict formatting rules pushes the model into formatting traps. This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors. These findings expose an “alignment tax” inherent to constrained decoding, highlighting a tension between structural granularity and internal model capacity in autonomous workflows. Code and raw logs are available in the GitHub repository: https://github.com/hongxuzhou/agentic_llm_structured_self_critique.
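
The mechanism under study, masking each decoding step so the output must match a schema, can be sketched over a finite target language. This is a toy stand-in for Outlines-style logit masking, with `score` playing the role of the model's token preferences.

```python
def constrained_decode(score, vocab, language, max_len=20):
    """Greedy constrained decoding over a finite target language: at each step,
    drop tokens that cannot extend the output to any valid string, then pick
    the highest-scoring remaining token."""
    out = ""
    while out not in language and len(out) < max_len:
        # Keep only tokens that leave the output a prefix of some valid string.
        allowed = [t for t in vocab
                   if any(s.startswith(out + t) for s in language)]
        if not allowed:
            break
        out += max(allowed, key=lambda t: score(out, t))
    return out
```

The sketch makes the "alignment tax" visible: whatever `score` prefers, the output is guaranteed well-formed, but the constraint says nothing about whether the chosen branch is semantically right.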

[80] Short Data, Long Context: Distilling Positional Knowledge in Transformers

Patrick Huber, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Igor Fedorov, Md Rifat Arefin, Adithya Sagar

Main category: cs.CL

TL;DR: Long-context retrieval capabilities can be transferred to student models via logit-based knowledge distillation using packed short-context samples, with insights from RoPE analysis showing positional information transfer and structured parameter updates.

Motivation: Extending context windows for language models typically requires expensive long-context pre-training, which is inefficient and data-intensive. The paper aims to find a more efficient approach using knowledge distillation.

Method: Uses logit-based knowledge distillation with packed short-context samples within long-context windows. Analyzes through Rotary Position Embedding (RoPE) lens, examining phase-wise RoPE scaling, positional information transfer via packed repeated token sequences, and query state update patterns.

Result: Three key findings: 1) Phase-wise RoPE scaling achieves best long-context performance in distillation setups; 2) Logit-based distillation enables positional information transfer, with positional perturbations systematically influencing teacher output distributions; 3) Structured update patterns in query state during long-context extension with distinct parameter spans showing sensitivity to long-context training.

Conclusion: Long-context retrieval capabilities can be efficiently transferred via knowledge distillation without expensive long-context pre-training, with RoPE analysis providing insights into positional information transfer mechanisms.

Abstract: Extending the context window of language models typically requires expensive long-context pre-training, posing significant challenges for both training efficiency and data collection. In this paper, we present evidence that long-context retrieval capabilities can be transferred to student models through logit-based knowledge distillation, even when training exclusively on packed short-context samples within a long-context window. We provide comprehensive insights through the lens of Rotary Position Embedding (RoPE) and establish three key findings. First, consistent with prior work, we show that phase-wise RoPE scaling, which maximizes rotational spectrum utilization at each training stage, also achieves the best long-context performance in knowledge distillation setups. Second, we demonstrate that logit-based knowledge distillation can directly enable positional information transfer. Using an experimental setup with packed repeated token sequences, we trace the propagation of positional perturbations from query and key vectors through successive transformer layers to output logits, revealing that positional information systematically influences the teacher’s output distribution and, in turn, the distillation signal received by the student model. Third, our analysis uncovers structured update patterns in the query state during long-context extension, with distinct parameter spans exhibiting strong sensitivity to long-context training.
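Logit-based knowledge distillation, the transfer mechanism used throughout the paper, comes down to matching the student's next-token distribution to the teacher's. A minimal NumPy sketch of the standard temperature-scaled KL objective (the T² factor follows the conventional distillation formulation; the paper's exact loss may differ):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) at temperature T, averaged over sequence
    # positions, with the conventional T^2 gradient-scale correction.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return float(kl.mean()) * T * T
```

Because the teacher's full output distribution (not just the argmax token) enters the loss, positional perturbations that shift the teacher's logits become visible to the student, which is the channel for the positional-information transfer the paper traces.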

[81] Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

Ben Wigler, Maria Tsfasman, Tiffany Matej Hrkalovic

Main category: cs.CL

TL;DR: LLMs can robustly encode and decode personality traits from text, with personality scores recoverable from generated narratives at near-human reliability levels across diverse model architectures.

Motivation: Existing evaluations of personality conditioning in LLMs rely on questionnaire self-reports, lack architectural diversity, and rarely use real human psychometric data, making it unclear whether personality conditioning produces meaningful individual differences or just superficial trait alignment.

Method: Conditioned LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, then used independent LLMs to recover personality scores from those narratives alone. Tested across 10 LLM narrative generators and 3 LLM personality scorers from 6 providers.

Result: Personality scores recovered from generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of human ceiling). Recovery robust across diverse models. Generated narratives show behavioral differentiation: 9 of 10 coded features correlate with real conversations, and personality-driven emotional reactivity patterns replicate in real conversational data.

Conclusion: The personality-language relationship captured during LLM pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behavior.

Abstract: Personality traits are richly encoded in natural language, and large language models (LLMs) trained on human text can simulate personality when conditioned on persona descriptions. However, existing evaluations rely predominantly on questionnaire self-report by the conditioned model, are limited in architectural diversity, and rarely use real human psychometric data. Without addressing these limitations, it remains unclear whether personality conditioning produces psychometrically informative representations of individual differences or merely superficial alignment with trait descriptors. To test how robustly LLMs can encode personality into extended text, we condition LLMs on real psychometric profiles from 290 participants to generate first-person life story narratives, and then task independent LLMs to recover personality scores from those narratives alone. We show that personality scores can be recovered from the generated narratives at levels approaching human test-retest reliability (mean r = 0.750, 85% of the human ceiling), and that recovery is robust across 10 LLM narrative generators and 3 LLM personality scorers spanning 6 providers. Decomposing systematic biases reveals that scoring models achieve their accuracy while counteracting alignment-induced defaults. Content analysis of the generated narratives shows that personality conditioning produces behaviourally differentiated text: nine of ten coded features correlate significantly with the same features in participants’ real conversations, and personality-driven emotional reactivity patterns in narratives replicate in real conversational data. These findings provide evidence that the personality-language relationship captured during pretraining supports robust encoding and decoding of individual differences, including characteristic emotional variability patterns that replicate in real human behaviour.
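The headline numbers (mean r = 0.750, 85% of the human ceiling) come from a simple round-trip check: correlate each participant's true trait scores with the scores an independent model recovers from the generated narrative, then normalize by test-retest reliability. A sketch with invented data:

```python
import numpy as np

def recovery_stats(true_scores, recovered_scores, test_retest_r):
    # Pearson r between ground-truth and narrative-recovered trait scores,
    # reported both raw and as a fraction of the human reliability ceiling.
    r = float(np.corrcoef(true_scores, recovered_scores)[0, 1])
    return r, r / test_retest_r

# Invented example: five participants, noisy but faithful recovery.
true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
recovered = np.array([2.0, 1.0, 3.5, 3.0, 5.5])
r, frac_of_ceiling = recovery_stats(true, recovered, test_retest_r=0.88)
```

The test-retest value 0.88 here is illustrative; the paper's "85% of the human ceiling" is the analogous ratio computed against the actual human reliability estimate.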

[82] LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

Olexander Mazurets, Olexander Barmak, Leonid Bedratyuk, Iurii Krak

Main category: cs.CL

TL;DR: LAG-XAI is a geometric framework that models paraphrasing as affine transformations in embedding space, enabling interpretable decomposition of semantic changes into rotation, deformation, and translation components.

Motivation: Transformer-based language models are powerful but their semantic spaces remain uninterpretable black boxes. There's a need for mathematically grounded, resource-efficient approaches to understand the mechanistic operations within these models.

Method: Proposes LAG-XAI framework modeling paraphrasing as continuous geometric flow on semantic manifold using mean-field approximation inspired by local Lie group actions. Decomposes paraphrase transitions into rotation, deformation, and translation components.

Result: Achieves AUC of 0.7713 on PIT-2015 Twitter corpus, capturing ~80% of non-linear baseline’s capacity. Identifies stable geometric invariants (27.84° reconfiguration angle, near-zero deformation). Successfully detects 95.3% of factual distortions on HaluEval dataset using geometric checks.

Conclusion: LAG-XAI provides mathematically grounded, resource-efficient path toward mechanistic interpretability of Transformers, offering explicit parametric interpretability with minimal accuracy trade-off.

Abstract: Modern Transformer-based language models achieve strong performance in natural language processing tasks, yet their latent semantic spaces remain largely uninterpretable black boxes. This paper introduces LAG-XAI (Lie Affine Geometry for Explainable AI), a novel geometric framework that models paraphrasing not as discrete word substitutions, but as a structured affine transformation within the embedding space. By conceptualizing paraphrasing as a continuous geometric flow on a semantic manifold, we propose a computationally efficient mean-field approximation, inspired by local Lie group actions. This allows us to decompose paraphrase transitions into geometrically interpretable components: rotation, deformation, and translation. Experiments on the noisy PIT-2015 Twitter corpus, encoded with Sentence-BERT, reveal a “linear transparency” phenomenon. The proposed affine operator achieves an AUC of 0.7713. By normalizing against random chance (AUC 0.5), the model captures approximately 80% of the non-linear baseline’s effective classification capacity (AUC 0.8405), offering explicit parametric interpretability in exchange for a marginal drop in absolute accuracy. The model identifies fundamental geometric invariants, including a stable matrix reconfiguration angle (~27.84°) and near-zero deformation, indicating local isometry. Cross-domain generalization is confirmed via direct cross-corpus validation on an independent TURL dataset. Furthermore, the practical utility of LAG-XAI is demonstrated in LLM hallucination detection: using a “cheap geometric check,” the model automatically detected 95.3% of factual distortions on the HaluEval dataset by registering deviations beyond the permissible semantic corridor. This approach provides a mathematically grounded, resource-efficient path toward the mechanistic interpretability of Transformers.
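The decomposition at the heart of LAG-XAI can be sketched with standard linear algebra: fit an affine map between paired embeddings by least squares, then split its linear part into a rotation and a symmetric deformation via the polar decomposition. The paper's exact estimator may differ; the 2-D data below is synthetic.

```python
import numpy as np

def fit_affine(X, Y):
    # Least-squares affine map Y ≈ X @ A.T + b.
    Xh = np.hstack([X, np.ones((len(X), 1))])
    W, *_ = np.linalg.lstsq(Xh, Y, rcond=None)
    return W[:-1].T, W[-1]                       # A, b

def polar_decompose(A):
    # A = R @ S with R orthogonal (rotation) and S symmetric PSD
    # (deformation), obtained from the SVD: A = U Σ Vᵀ = (U Vᵀ)(V Σ Vᵀ).
    U, s, Vt = np.linalg.svd(A)
    return U @ Vt, Vt.T @ np.diag(s) @ Vt

# Synthetic check: a pure 30-degree rotation plus a translation.
theta = np.deg2rad(30.0)
A_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
b_true = np.array([1.0, -2.0])
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
Y = X @ A_true.T + b_true

A, b = fit_affine(X, Y)
R, S = polar_decompose(A)
angle_deg = np.rad2deg(np.arctan2(R[1, 0], R[0, 0]))
deformation = np.linalg.norm(S - np.eye(2))
```

A near-identity S (near-zero deformation), as the paper reports for real paraphrase pairs, indicates the transition is locally an isometry: rotation plus translation with no stretching.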

[83] Exclusive Unlearning

Mutsumi Sasaki, Kouta Nakayama, Yusuke Miyao, Yohei Oseki, Masaru Isonuma

Main category: cs.CL

TL;DR: Exclusive Unlearning (EU) is a novel approach that removes harmful content from LLMs by forgetting everything except desired safe knowledge, enabling broad harm removal while preserving domain-specific capabilities.

Motivation: When deploying LLMs in sensitive domains like healthcare and education, the risk of generating harmful content is significant. Existing unlearning methods struggle with comprehensive removal of diverse harmful content, requiring specific target listing for forgetting.

Method: Proposes Exclusive Unlearning (EU) which takes an opposite approach to traditional unlearning: instead of listing specific harmful targets to forget, it extensively forgets everything except the knowledge and expressions that should be retained, enabling broad harm removal.

Result: Demonstrates that EU can produce models that ensure safety against a wide range of inputs including jailbreak attempts, while maintaining ability to respond to diverse instructions in specific domains like medicine and mathematics.

Conclusion: Exclusive Unlearning provides an effective approach for comprehensive harm removal in LLMs for industrial applications, addressing the limitations of traditional targeted unlearning methods.

Abstract: When introducing Large Language Models (LLMs) into industrial applications, such as healthcare and education, the risk of generating harmful content becomes a significant challenge. While existing machine unlearning methods can erase specific harmful knowledge and expressions, diverse harmful content makes comprehensive removal difficult. In this study, instead of individually listing targets for forgetting, we propose Exclusive Unlearning (EU), which aims for broad harm removal by extensively forgetting everything except for the knowledge and expressions we wish to retain. We demonstrate that through Exclusive Unlearning, it is possible to obtain a model that ensures safety against a wide range of inputs, including jailbreaks, while maintaining the ability to respond to diverse instructions related to specific domains such as medicine and mathematics.
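The direction of the EU objective can be shown on a one-parameter toy model: ascend the loss on general ("forget") data while descending it on the retain set. Everything below (scalar model, squared loss, targets, step count) is invented to make the sign structure visible; this is not the paper's training recipe.

```python
def sq_loss(w, targets):
    return sum((w - t) ** 2 for t in targets) / len(targets)

def eu_gradient(w, forget_targets, retain_targets, lam=1.0):
    # d/dw of [ -loss(forget) + lam * loss(retain) ]:
    # gradient ASCENT on everything outside the retain set,
    # ordinary descent on the knowledge we keep.
    g_forget = -2.0 * sum(w - t for t in forget_targets) / len(forget_targets)
    g_retain = 2.0 * sum(w - t for t in retain_targets) / len(retain_targets)
    return g_forget + lam * g_retain

w = 0.0
forget, retain = [-1.0], [1.0]
before = (sq_loss(w, forget), sq_loss(w, retain))
for _ in range(3):                     # a few unlearning steps
    w -= 0.1 * eu_gradient(w, forget, retain)
after = (sq_loss(w, forget), sq_loss(w, retain))
```

After the update, loss on the forget target has grown while loss on the retain target has shrunk: the model moves away from everything it was not told to keep, which is the "forget all but the retained" inversion that distinguishes EU from targeted unlearning.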

[84] Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal

Main category: cs.CL

TL;DR: Paper Circle: A multi-agent LLM system for automated research paper discovery and analysis using retrieval pipelines and knowledge graph generation.

Motivation: The rapid growth of scientific literature makes it difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Multi-agent LLMs show strong potential for understanding user intent and utilizing tools, motivating an automated system to reduce literature review effort.

Method: Two complementary pipelines: (1) Discovery Pipeline with offline/online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; (2) Analysis Pipeline that transforms papers into structured knowledge graphs with typed nodes (concepts, methods, experiments, figures) enabling graph-aware QA and coverage verification. Implemented within a coder LLM-based multi-agent orchestration framework.

Result: System produces fully reproducible, synchronized outputs (JSON, CSV, BibTeX, Markdown, HTML) at each agent step. Benchmarked on paper retrieval and review generation, showing consistent improvements with stronger agent models in hit rate, MRR, and Recall at K metrics.

Conclusion: Paper Circle demonstrates an effective multi-agent system for automated research discovery and analysis, reducing literature review effort through structured knowledge extraction and retrieval optimization.

Abstract: The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools. In this paper, we introduce Paper Circle, a multi-agent research discovery and analysis system designed to reduce the effort required to find, assess, organize, and understand academic literature. The system comprises two complementary pipelines: (1) a Discovery Pipeline that integrates offline and online retrieval from multiple sources, multi-criteria scoring, diversity-aware ranking, and structured outputs; and (2) an Analysis Pipeline that transforms individual papers into structured knowledge graphs with typed nodes such as concepts, methods, experiments, and figures, enabling graph-aware question answering and coverage verification. Both pipelines are implemented within a coder LLM-based multi-agent orchestration framework and produce fully reproducible, synchronized outputs including JSON, CSV, BibTeX, Markdown, and HTML at each agent step. This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. We benchmark Paper Circle on both paper retrieval and paper review generation, reporting hit rate, MRR, and Recall at K. Results show consistent improvements with stronger agent models. We have publicly released the website at https://papercircle.vercel.app/ and the code at https://github.com/MAXNORM8650/papercircle.
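The retrieval metrics the paper reports (hit rate, MRR, Recall at K) are standard; a self-contained sketch of how they are computed over ranked result lists (query data invented):

```python
def hit_rate_at_k(ranked, relevant, k):
    # Fraction of queries with at least one relevant item in the top k.
    hits = sum(1 for r, rel in zip(ranked, relevant) if set(r[:k]) & rel)
    return hits / len(ranked)

def mrr(ranked, relevant):
    # Mean reciprocal rank of the first relevant item (0 if none found).
    total = 0.0
    for r, rel in zip(ranked, relevant):
        for i, doc in enumerate(r, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked)

def recall_at_k(ranked, relevant, k):
    # Mean fraction of each query's relevant items found in the top k.
    return sum(len(set(r[:k]) & rel) / len(rel)
               for r, rel in zip(ranked, relevant)) / len(ranked)

# Two toy queries over paper IDs: query 1 succeeds, query 2 misses.
ranked = [["p1", "p2", "p3"], ["p4", "p5", "p6"]]
relevant = [{"p2"}, {"p9"}]
```

For this toy data, hit rate at 2 is 0.5, MRR is 0.25 (first relevant item at rank 2 for the first query, absent for the second), and Recall at 3 is 0.5.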

[85] Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection

Valeria Pastorino, Jasivan A. Sivakumar, Nafise Sadat Moosavi

Main category: cs.CL

TL;DR: Systematic evaluation of LLMs for news framing analysis reveals performance sensitivity to prompt design, systematic biases (emotional language conflation), and shows cross-model consensus helps identify contested annotations.

Motivation: Framing analysis is crucial for computational social science but challenging due to news complexity. Traditional methods have limitations: high annotation costs, domain specificity, and inconsistent generalization. LLMs offer promise but their reliability for framing analysis is insufficiently understood.

Method: Systematic evaluation of multiple LLMs (GPT-3.5/4, FLAN-T5, Llama 3) across zero-shot, few-shot, and explanation-based prompting settings. Focus on domain shift and annotation ambiguity. Introduced new dataset of out-of-domain news headlines for evaluation. Analyzed agreement patterns across models on existing framing datasets.

Result: LLM performance is highly sensitive to prompt design and prone to systematic errors on ambiguous cases. GPT-4 shows stronger cross-domain generalization but displays systematic biases, notably conflating emotional language with framing. Cross-model consensus provides useful signal for identifying contested annotations, offering practical dataset auditing approach.

Conclusion: LLMs show promise for framing analysis but require careful prompt design and awareness of systematic biases. Cross-model consensus can help identify ambiguous cases and audit datasets in low-resource settings.

Abstract: The growing complexity and diversity of news coverage have made framing analysis a crucial yet challenging task in computational social science. Traditional approaches, including manual annotation and fine-tuned models, remain limited by high annotation costs, domain specificity, and inconsistent generalisation. Instruction-based large language models (LLMs) offer a promising alternative, yet their reliability for framing analysis remains insufficiently understood. In this paper, we conduct a systematic evaluation of several LLMs, including GPT-3.5/4, FLAN-T5, and Llama 3, across zero-shot, few-shot, and explanation-based prompting settings. Focusing on domain shift and inherent annotation ambiguity, we show that model performance is highly sensitive to prompt design and prone to systematic errors on ambiguous cases. Although LLMs, particularly GPT-4, exhibit stronger cross-domain generalisation, they also display systematic biases, most notably a tendency to conflate emotional language with framing. To enable principled evaluation under real-world topic diversity, we introduce a new dataset of out-of-domain news headlines covering diverse subjects. Finally, by analysing agreement patterns across multiple models on existing framing datasets, we demonstrate that cross-model consensus provides a useful signal for identifying contested annotations, offering a practical approach to dataset auditing in low-resource settings.
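The dataset-auditing idea, flagging items where models disagree, reduces to a per-item majority-agreement score. A minimal sketch (labels invented; the paper's aggregation may be more elaborate):

```python
from collections import Counter

def consensus(labels):
    # Fraction of models agreeing with the majority label for one item.
    return Counter(labels).most_common(1)[0][1] / len(labels)

def flag_contested(per_item_labels, threshold=0.6):
    # Items whose cross-model agreement falls below the threshold are
    # candidates for re-annotation or removal during dataset auditing.
    return [i for i, labels in enumerate(per_item_labels)
            if consensus(labels) < threshold]

# Hypothetical frame labels from three models on two headlines.
items = [
    ["economic", "economic", "economic"],   # clear case
    ["economic", "morality", "security"],   # contested case
]
contested = flag_contested(items)
```

Low-consensus items tend to coincide with genuinely ambiguous annotations, which is what makes the signal useful for auditing in low-resource settings.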

[86] Recent Advances in Multimodal Affective Computing: An NLP Perspective

Guimin Hu, Weimin Lyu, Chang Sun, Zhihong Zhu, Lin Gui, Ruichu Cai, Erik Cambria, Hasti Seifi

Main category: cs.CL

TL;DR: A comprehensive survey of multimodal affective computing from an NLP perspective, covering four key tasks, unified frameworks, and future directions.

Motivation: The field of multimodal affective computing lacks a unified perspective despite diverse research across tasks, modalities, and modeling paradigms. The survey aims to provide systematic organization and analysis from an NLP viewpoint.

Method: Systematic review of recent advances focusing on four representative tasks: multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER). Presents unified view through comparison of task formulations, datasets, evaluation protocols, and organizes methods into key paradigms including multitask learning, pre-trained models, knowledge enhancement, and contextual modeling.

Result: Provides comprehensive analysis of current state, identifies key challenges, outlines promising future directions, and releases a curated repository of relevant works and resources to facilitate further research.

Conclusion: The survey offers a unified NLP perspective on multimodal affective computing, highlighting the need for more integrated approaches and providing valuable resources for advancing the field.

Abstract: Multimodal affective computing has gained increasing attention due to its broad applications in understanding human behavior and intentions, particularly in text-centric multimodal scenarios. Existing research spans diverse tasks, modalities, and modeling paradigms, yet lacks a unified perspective. In this survey, we systematically review recent advances from an NLP perspective, focusing on four representative tasks: multimodal sentiment analysis (MSA), multimodal emotion recognition in conversation (MERC), multimodal aspect-based sentiment analysis (MABSA), and multimodal multi-label emotion recognition (MMER). We present a unified view by comparing task formulations, benchmark datasets, and evaluation protocols, and by organizing representative methods into key paradigms, including multitask learning, pre-trained models, knowledge enhancement, and contextual modeling. We further extend the discussion to related directions, such as facial, acoustic, and physiological modalities, as well as emotion cause analysis. Finally, we highlight key challenges and outline promising future directions. To facilitate further research, we release a curated repository of relevant works and resources (https://anonymous.4open.science/r/Multimodal-Affective-Computing-Survey-9819).

[87] Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model

Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, Sunayana Sitaram

Main category: cs.CL

TL;DR: Study on continual fine-tuning for multilingual adaptation of LLMs, showing task similarity between phases affects performance preservation and proposing layer freezing/generative replay solutions.

Motivation: Address the challenge of adapting LLMs to new languages without degrading performance on existing languages, focusing on continual fine-tuning for multilingual capability expansion.

Method: Two-phase continual fine-tuning: Phase 1 (English task ability) → Phase 2 (multilingual adaptation). Analyzes task similarity effects and tests layer freezing/generative replay methods to mitigate catastrophic forgetting.

Result: Task similarity between phases determines adaptability: similar tasks preserve performance, dissimilar tasks cause deterioration. Layer freezing and generative replay effectively enhance language ability while preserving task performance.

Conclusion: Continual fine-tuning for multilingual adaptation requires careful consideration of task similarity; proposed methods can mitigate catastrophic forgetting and enable effective language expansion.

Abstract: A common challenge towards the adaptability of Large Language Models (LLMs) is their ability to learn new languages over time without hampering the model’s performance on languages in which the model is already proficient (usually English). Continual fine-tuning (CFT) is the process of sequentially fine-tuning an LLM to enable the model to adapt to downstream tasks with varying data distributions and time shifts. This paper focuses on the language adaptability of LLMs through CFT. We study a two-phase CFT process in which an English-only end-to-end fine-tuned LLM from Phase 1 (predominantly Task Ability) is sequentially fine-tuned on a multilingual dataset – comprising task data in new languages – in Phase 2 (predominantly Language Ability). We observe that the “similarity” of Phase 2 tasks with Phase 1 determines the LLM’s adaptability. For similar phase-wise datasets, the LLM after Phase 2 does not show deterioration in task ability. In contrast, when the phase-wise datasets are not similar, the LLM’s task ability deteriorates. We test our hypothesis on the open-source \mis\ and \llm\ models with multiple phase-wise dataset pairs. To address the deterioration, we analyze tailored variants of two CFT methods: layer freezing and generative replay. Our findings demonstrate their effectiveness in enhancing the language ability of LLMs while preserving task performance, in comparison to relevant baselines.
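One of the two mitigation methods, layer freezing, amounts to excluding a set of parameters from the Phase 2 optimizer. A framework-agnostic sketch of the selection step (parameter names invented; in PyTorch one would instead set `requires_grad = False` on the frozen tensors):

```python
def select_trainable(param_names, freeze_prefixes):
    # Keep only parameters whose name does not fall under a frozen
    # prefix; everything else stays fixed at its Phase 1 values.
    return [n for n in param_names
            if not any(n.startswith(p) for p in freeze_prefixes)]

# Hypothetical parameter names for a small transformer.
params = ["embed.weight", "layers.0.attn.q", "layers.1.attn.q", "head.weight"]
trainable = select_trainable(params, freeze_prefixes=["embed", "layers.0"])
```

Freezing the layers that carry Phase 1 task ability while fine-tuning the rest is what lets the multilingual Phase 2 proceed without overwriting the English task knowledge.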

[88] A Systematic Survey of Semantic Role Labeling in the Era of Pretrained Language Models

Huiyao Chen, Meishan Zhang, Jing Li, Lilja Øvrelid, Jan Hajič, Hao Fei, Min Zhang

Main category: cs.CL

TL;DR: A comprehensive survey paper on Semantic Role Labeling (SRL) that proposes a unified taxonomy, analyzes syntax feature effectiveness, examines SRL in the LLM era, and extends coverage to multimodal settings including visual, video, and speech modalities.

Motivation: Despite extensive research on SRL, there's a lack of comprehensive surveys that critically synthesize the field from a unified perspective, especially considering recent developments with large language models and multimodal extensions.

Method: The authors conducted systematic literature searches across major academic databases (ACL Anthology, IEEE Xplore, ACM Digital Library, Google Scholar) from 2000-2025, applying explicit inclusion/exclusion criteria to gather ~200 primary references. They propose a four-dimensional taxonomy covering model architectures, syntax feature modeling, application scenarios, and multimodal extensions.

Result: The survey provides critical analysis of when syntactic features help SRL, examines the complementary roles of LLMs and specialized SRL systems, extends SRL coverage to multimodal settings, and analyzes structural differences in evaluation across modalities.

Conclusion: The paper offers a comprehensive synthesis of SRL research with novel contributions including the unified taxonomy, analysis of syntax effectiveness, treatment of SRL in the LLM era, and multimodal extensions, while identifying future research directions.

Abstract: Semantic role labeling (SRL) is a central natural language processing task for understanding predicate-argument structures within texts and enabling downstream applications. Despite extensive research, comprehensive surveys that critically synthesize the field from a unified perspective remain lacking. This survey makes several contributions beyond organizing existing work. We propose a unified four-dimensional taxonomy that categorizes SRL research along model architectures, syntax feature modeling, application scenarios, and multimodal extensions. We provide a critical analysis of when and why syntactic features help, identifying conditions under which syntax-aided approaches provide consistent gains over syntax-free counterparts. We offer the first systematic treatment of SRL in the era of large language models, examining the complementary roles of LLMs and specialized SRL systems and identifying directions for hybrid approaches. We extend the scope of SRL surveys to cover multimodal settings including visual, video, and speech modalities, and analyze structural differences in evaluation across these modalities. Literature was collected through systematic searches of the ACL Anthology, IEEE Xplore, the ACM Digital Library, and Google Scholar, covering publications from 2000 to 2025 and applying explicit inclusion and exclusion criteria to yield approximately 200 primary references. SRL benchmarks, evaluation metrics, and paradigm modeling approaches are discussed alongside practical applications across domains. Future research directions are analyzed, addressing the evolving role of SRL with large language models and broader NLP impact.

[89] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

Main category: cs.CL

TL;DR: LongSpec is a speculative decoding framework optimized for long-context LLMs, addressing memory, training-inference mismatch, and attention efficiency challenges to accelerate inference on extended inputs.

Motivation: As LLMs process longer contexts for applications like LLM agents, efficient inference becomes crucial. Existing speculative decoding methods are trained on short texts (under 4k tokens) and fail in long-context scenarios due to excessive KV cache memory, training-inference mismatch, and inefficient tree attention mechanisms.

Method: LongSpec introduces three innovations: 1) memory-efficient draft model with constant-sized KV cache, 2) novel position indices to mitigate training-inference mismatch, and 3) attention aggregation combining fast prefix computation with standard tree attention for efficient decoding.

Result: Achieves up to 3.26x speedup over Flash Attention baselines across five long-context understanding datasets, and 2.25x reduction in wall-clock time on AIME24 long reasoning task with QwQ model, demonstrating significant latency improvements.

Conclusion: LongSpec effectively addresses key challenges in long-context speculative decoding, enabling efficient acceleration for LLM applications requiring extended input processing while maintaining lossless decoding quality.

Abstract: As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.
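The draft-then-verify loop that LongSpec accelerates can be sketched in its simplest greedy form: the draft model proposes k tokens, the target model accepts the longest prefix it agrees with, then emits one corrected token. The deterministic toy "models" below are invented; the actual method additionally uses probabilistic acceptance and tree attention over multiple draft branches.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    # Drafting: the cheap model extends the prefix by k greedy tokens.
    ctx = list(prefix)
    proposal = []
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # Verification: the target accepts the longest agreeing prefix,
    # then contributes one token of its own, so progress is guaranteed
    # and the output matches what the target alone would have produced.
    ctx = list(prefix)
    accepted = []
    for t in proposal:
        if target_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))
    return accepted

# Toy models over integer "tokens": the target always emits len(ctx);
# the draft agrees until the context reaches length 3, then derails.
target_next = lambda ctx: len(ctx)
draft_next = lambda ctx: len(ctx) if len(ctx) < 3 else 99
accepted = speculative_step(draft_next, target_next, prefix=[0])
```

The speedup comes from verifying all k draft tokens with a single target forward pass; the long-context challenges the paper addresses (draft KV cache size, positional mismatch, tree attention cost) all live inside those two inner loops at scale.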

[90] NativQA Framework: Enabling LLMs and VLMs with Native, Local, and Everyday Knowledge

Firoj Alam, Md Arid Hasan, Sahinur Rahman Laskar, Mucahid Kutlu, Kareem Darwish, Shammur Absar Chowdhury

Main category: cs.CL

TL;DR: NativQA framework extended to multimodal (text, image, audio, video) for creating culturally-aligned QA datasets in native languages using location-specific web search

Motivation: Address cultural bias and fairness issues in LLMs by creating resources grounded in multilingual, local, and cultural contexts for underrepresented regions

Method: Extend NativQA framework to multimodality (image, audio, video support), use search engines with user-defined seed queries to collect location-specific everyday information across 39 locations in 24 countries and 7 languages

Result: Collected over ~300K text QA pairs, ~312K images, and ~29K videos with associated audio across resource settings from extremely low-resource to high-resource

Conclusion: Framework enables scalable construction of culturally-aligned multimodal datasets for LLM benchmarking and fine-tuning, addressing cultural bias in underrepresented regions

Abstract: The rapid progress of large language models (LLMs) raises concerns about cultural bias, fairness, and performance in diverse languages and underrepresented regions. Addressing these gaps requires large-scale resources grounded in multilingual, local, and cultural contexts. We systematize and extend the earlier NativQA framework to multimodality by adding image, audio, and video support, enabling scalable construction of culturally and regionally aligned QA datasets in native languages. Given user-defined seed queries, the framework uses search engines to collect location-specific everyday information. We evaluate it across 39 locations in 24 countries and 7 languages, spanning extremely low-resource to high-resource settings, and collect over ~300K text QA pairs, ~312K images, and ~29K videos with associated audio. The developed resources can be used for LLM benchmarking and further fine-tuning. The framework has been made publicly available for the community (https://gitlab.com/nativqa/nativqa-framework). A demo video is available at https://shorturl.at/DAVn9.

[91] arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi

Main category: cs.CL

TL;DR: Automatic generation of literature review tables from scientific papers with new evaluation framework and benchmark dataset.

Motivation: Existing literature review table generation methods operate in oracle settings with unrealistic assumptions about user needs and retrieval noise. Need for more realistic evaluation that avoids leaking gold information and accounts for semantically related but out-of-scope papers.

Method: Proposes arXiv2Table benchmark with 1,957 tables referencing 7,158 papers, featuring human-verified distractors and schema-agnostic user demands. Develops iterative, batch-based generation method that co-refines paper filtering and schema over multiple rounds. Introduces lightweight, annotation-free evaluation decomposing utility into schema coverage, cell fidelity, and relational consistency.

Result: Method consistently improves over strong baselines, though absolute scores remain modest, highlighting task difficulty. Evaluation protocol validated through human audits and cross-evaluator checks. Dataset and code publicly available.

Conclusion: Presents realistic benchmark and evaluation framework for literature review table generation, addressing limitations of oracle settings and providing reproducible evaluation metrics. Task remains challenging despite methodological improvements.

Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers. In this paper, we study the automatic generation of such tables from a pool of papers to satisfy a user’s information need. Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via semantically related but out-of-scope distractor papers verified by human annotators, and (iii) introducing a lightweight, annotation-free, utilization-oriented evaluation that decomposes utility into schema coverage, unary cell fidelity, and pairwise relational consistency, while measuring paper selection through a two-way QA procedure (gold to system and system to gold) with recall, precision, and F1. To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands. We also develop an iterative, batch-based generation method that co-refines paper filtering and schema over multiple rounds. We validate the evaluation protocol with human audits and cross-evaluator checks. Extensive experiments show that our method consistently improves over strong baselines, while absolute scores remain modest, underscoring the task’s difficulty. Our data and code are available at https://github.com/JHU-CLSP/arXiv2Table.
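
The paper-selection metric, recall/precision/F1 over papers matched between gold and system tables, can be illustrated with a small sketch. The set-overlap matching below is a simplification: the actual protocol matches papers via a two-way QA procedure (gold to system and system to gold) rather than exact identity, and the function name is ours, not the benchmark's.

```python
def selection_scores(gold_papers, system_papers):
    """Precision, recall, and F1 for paper selection (set-overlap sketch)."""
    matched = gold_papers & system_papers
    precision = len(matched) / len(system_papers) if system_papers else 0.0
    recall = len(matched) / len(gold_papers) if gold_papers else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1

# One gold table cites p1-p4; the system picked p2, p3 and a distractor p5.
p, r, f1 = selection_scores({"p1", "p2", "p3", "p4"}, {"p2", "p3", "p5"})
```

Here the distractor lowers precision (2/3) while the two missed papers lower recall (1/2), which is exactly the trade-off the human-verified distractors in the benchmark are designed to expose.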

[92] Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs

Darpan Aswal, Siddharth D Jaiswal

Main category: cs.CL

TL;DR: CMP-RT reveals tokenization vulnerabilities in safety-aligned LLMs through phonetic perturbations that fragment safety-critical tokens, causing safety mechanisms to fail despite preserved input understanding.

Motivation: Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical phonetic perturbations, and the paper aims to identify the root cause of this vulnerability.

Method: Introduces CMP-RT (code-mixed phonetic perturbations for red-teaming), a diagnostic probe that uses phonetic perturbations to fragment safety-critical tokens into benign sub-words. Includes mechanistic analysis, layer-wise probing, and output equivalence enforcement to recover lost representations.

Result: The vulnerability evades standard defenses, persists across modalities and SOTA models including Gemini-3-Pro, and scales through simple supervised fine-tuning. Layer-wise probing shows perturbed and canonical representations align up to a critical depth.

Conclusion: Tokenization is identified as a critical, under-examined vulnerability in current safety pipelines, with a structural gap between pre-training and alignment. Enforcing output equivalence can robustly recover lost representations.

Abstract: Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. We introduce CMP-RT (code-mixed phonetic perturbations for red-teaming), a novel diagnostic probe that pinpoints tokenization as the root cause of this vulnerability. A mechanistic analysis reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability – causing safety mechanisms to fail despite excellent input understanding. We demonstrate that this vulnerability evades standard defenses, persists across modalities and state-of-the-art (SOTA) models including Gemini-3-Pro, and scales through simple supervised fine-tuning (SFT). Furthermore, layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth; enforcing output equivalence robustly recovers the lost representations, providing causal evidence for a structural gap between pre-training and alignment, and establishing tokenization as a critical, under-examined vulnerability in current safety pipelines.
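
The token-fragmentation effect at the center of this analysis can be illustrated with a toy greedy longest-match tokenizer. Both the vocabulary and the `greedy_tokenize` helper are hypothetical, not the models' real BPE tokenizers: the point is only that a textese-style respelling pushes the safety-critical word out of the vocabulary's reach, so it splits into benign sub-words.

```python
def greedy_tokenize(word, vocab):
    """Toy greedy longest-match tokenizer (hypothetical, not a real BPE)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):           # try the longest match first
            if word[i:j] in vocab or j == i + 1:    # single chars always tokenize
                tokens.append(word[i:j])
                i = j
                break
    return tokens

# Hypothetical vocabulary: the canonical word is one token, its perturbation is not.
vocab = {"bomb", "bo", "mb"}
canonical = greedy_tokenize("bomb", vocab)   # stays one safety-critical token
perturbed = greedy_tokenize("b0mb", vocab)   # textese spelling fragments it
```

In the paper's mechanistic analysis, it is this fragmentation that suppresses the attribution scores of safety-critical tokens while the prompt remains perfectly readable to the model.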

[93] PDF Retrieval Augmented Question Answering

Thi Thu Uyen Hoang, Meenakshi Rajendran, Kun Zhang, Yuhan Wu, Viet Anh Nguyen

Main category: cs.CL

TL;DR: A multimodal RAG-based QA system for extracting information from PDFs containing text, images, diagrams, graphs, and tables, addressing complex multimodal queries through refined processing of non-textual elements and LLM fine-tuning.

Motivation: PDFs contain rich multimodal data (text, images, diagrams, graphs, tables) that existing QA systems struggle with, as they're primarily designed for textual content. There's a need for systems that can handle complex multimodal questions combining multiple data types.

Method: Developed a comprehensive RAG-based QA system with refined approaches for processing and integrating non-textual elements from PDFs into the RAG framework, plus fine-tuning large language models to better adapt to the multimodal system.

Result: Experimental evaluation demonstrates the system’s capability to extract accurate information from different types of content across PDFs, effectively addressing complex multimodal questions.

Conclusion: The work pushes boundaries of retrieval-augmented QA systems and lays foundation for further research in multimodal data integration and processing, particularly for PDF document analysis.

Abstract: This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework to enhance information extraction from PDF files. The richness and diversity of data within PDFs, including text, images, vector diagrams, graphs, and tables, pose unique challenges for existing QA systems primarily designed for textual content. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions, where several data types are combined in the query. This is mainly achieved by refining approaches to processing and integrating non-textual elements in PDFs into the RAG framework to derive precise and relevant answers, as well as by fine-tuning large language models to better adapt to our system. We provide an in-depth experimental evaluation of our solution, demonstrating its capability to extract accurate information that can be applied to different types of content across PDFs. This work not only pushes the boundaries of retrieval-augmented QA systems but also lays a foundation for further research in multimodal data integration and processing.

[94] EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge

Klim Zaporojets, Daniel Daza, Edoardo Barba, Ira Assent, Roberto Navigli, Paul Groth

Main category: cs.CL

TL;DR: Dataset creation for benchmarking automatic knowledge graph updates from evolving textual sources using Wikidata snapshots and Wikipedia passages

Motivation: Knowledge graphs need to be updated over time as knowledge evolves, but current information extraction methods operate independently of existing KG state, creating challenges for integrating new knowledge with existing structure

Method: Constructed dataset using Wikidata KG snapshots (2019-2025) aligned with Wikipedia passages and corresponding edit operations, resulting in 233K passages paired with 1.45M KG edits across 7 yearly snapshots

Result: Created comprehensive benchmark dataset highlighting challenges in updating KGs from textual sources, particularly in integrating text knowledge with existing KG structure

Conclusion: Dataset serves as valuable benchmark for future research on automatic KG updates from evolving textual knowledge, addressing the gap between independent information extraction and context-aware KG updating

Abstract: Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we study the problem of automatically updating KGs over time in response to evolving knowledge in unstructured textual sources. Addressing this problem requires identifying a wide range of update operations based on the state of an existing KG at a given time and the information extracted from text. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 233K Wikipedia passages aligned with a total of 1.45 million KG edits over 7 different yearly snapshots of Wikidata from 2019 to 2025. Our experimental results highlight key challenges in updating KG snapshots based on emerging textual knowledge, particularly in integrating knowledge expressed in text with the existing KG structure. These findings position the dataset as a valuable benchmark for future research. Our dataset and model implementations are publicly available.
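
The edit operations underlying the benchmark can be pictured as set differences between triple snapshots. A minimal sketch with hypothetical Wikidata-style triples (the real dataset additionally pairs each edit with the Wikipedia passage that induces it, which this omits):

```python
def kg_edits(old_snapshot, new_snapshot):
    """Derive add/remove edit operations between two KG snapshots,
    treated as sets of (subject, relation, object) triples."""
    return {
        "add": sorted(new_snapshot - old_snapshot),
        "remove": sorted(old_snapshot - new_snapshot),
    }

# Hypothetical yearly snapshots: the head of government changes, population stays.
snap_a = {("Q1", "head_of_government", "Q100"), ("Q1", "population", "8M")}
snap_b = {("Q1", "head_of_government", "Q200"), ("Q1", "population", "8M")}
edits = kg_edits(snap_a, snap_b)
```

The hard part the paper highlights is not computing such diffs but predicting them from text: a system sees only the passage and the old snapshot, and must emit the add/remove operations consistent with the existing KG structure.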

[95] Enhancing Hallucination Detection via Future Context

Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo

Main category: cs.CL

TL;DR: A hallucination detection framework for black-box LLMs that samples future contexts to identify persistent hallucinations, improving detection performance across multiple methods.

Motivation: As LLMs generate text without revealing their internal processes, detecting hallucinations in black-box outputs has become critical. The paper addresses the challenge of identifying false information in generated text when the generation process is not transparent.

Method: The framework samples future contexts from the text generation process, motivated by the observation that hallucinations tend to persist once introduced. These sampled future contexts provide valuable clues for hallucination detection and can be integrated with various sampling-based methods.

Result: The approach demonstrates performance improvements across multiple hallucination detection methods when using the proposed sampling approach, showing effectiveness in identifying hallucinations in black-box LLM outputs.

Conclusion: Sampling future contexts is an effective strategy for hallucination detection in black-box LLMs, providing valuable detection clues that can enhance various existing methods for identifying false information in generated text.

Abstract: Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.
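
The core idea, that a statement's persistence across sampled future contexts carries a detection signal, can be sketched as follows. Both callables are toy stand-ins: `sample_future` for continued generation from the black-box model and `supports` for a consistency check; the paper integrates such signals with existing sampling-based detectors rather than using the raw fraction alone.

```python
def future_context_score(claim, sample_future, supports, n=5):
    """Fraction of sampled future contexts that remain consistent with the claim."""
    futures = [sample_future(claim) for _ in range(n)]
    return sum(bool(supports(claim, f)) for f in futures) / n

# Toy stand-ins: a canned pool of continuations and a keyword-based check.
pool = iter(["the tower stands in Paris", "it is in Paris", "it is in Rome"])
score = future_context_score(
    "The Eiffel Tower is in Paris",
    sample_future=lambda claim: next(pool),
    supports=lambda claim, future: "Paris" in future,
    n=3,
)
```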

[96] CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Omri Uzan, Yuval Pinter

Main category: cs.CL

TL;DR: CharBench is a large-scale benchmark for character-level reasoning tasks that reveals modern LLMs struggle with character-level operations, with tokenization properties affecting positional understanding more than counting tasks.

Motivation: Character-level reasoning tasks (counting, locating characters) remain challenging for LLMs, but the impact of tokenization (subword units vs characters) is unclear with conflicting prior research. Need comprehensive benchmark to systematically evaluate this.

Method: Introduce CharBench, a comprehensive benchmark two orders of magnitude larger than existing alternatives. Evaluate diverse open-weight and proprietary models, analyze correlation between tokenization properties (word segmentation, token length) and performance on character-level tasks.

Result: CharBench presents significant challenge: average accuracy 43.6%, some tasks only 32.3%. For counting tasks, tokenization weakly correlated with correctness; word length and actual character count more important. For positional tasks, performance negatively correlated with length of token containing queried character.

Conclusion: Tokenization impacts character-level reasoning differently: positional understanding hindered by longer tokens obscuring character position, while counting less affected. CharBench provides tools for improving model performance on character-level tasks.

Abstract: Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models’ reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% overall and as low as 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.

[97] ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura

Main category: cs.CL

TL;DR: ProMQA-Assembly: A multimodal QA dataset for assembly activities with 646 QA pairs requiring understanding of human activity videos and instruction manuals, created using semi-automated LLM generation with human verification.

Motivation: Assembly task assistants have great potential but lack evaluation resources. Current multimodal QA datasets don't adequately cover procedural assembly activities requiring understanding of both visual demonstrations and textual instructions.

Method: Created ProMQA-Assembly dataset with 646 QA pairs using semi-automated approach: LLMs generate candidate QA pairs, humans verify them. Integrated fine-grained action labels to diversify question types. Also created 81 instruction task graphs for target assembly tasks to facilitate benchmarking and verification.

Result: Dataset contains challenging multimodal questions where reasoning models show promising results. Benchmarking of competitive proprietary multimodal models demonstrates the dataset’s difficulty and utility for evaluating procedural activity understanding.

Conclusion: ProMQA-Assembly contributes to development of procedural-activity assistants by providing a challenging multimodal evaluation resource for assembly tasks, combining video understanding with instruction manual comprehension.

Abstract: Assistants on assembly tasks show great potential to benefit humans ranging from helping with everyday tasks to interacting in industrial settings. However, evaluation resources in assembly activities are underexplored. To foster system development, we propose a new multimodal QA evaluation dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 646 QA pairs that require multimodal understanding of human activity videos and their instruction manuals in an online-style manner. For cost effectiveness in the data creation, we adopt a semi-automated QA annotation approach, where LLMs generate candidate QA pairs and humans verify them. We further improve QA generation by integrating fine-grained action labels to diversify question types. Additionally, we create 81 instruction task graphs for our target assembly tasks. These newly created task graphs are used in our benchmarking experiment, as well as in facilitating the human verification process. With our dataset, we benchmark models, including competitive proprietary multimodal models. We find that ProMQA-Assembly contains challenging multimodal questions, where reasoning models showcase promising results. We believe our new evaluation dataset contributes to the further development of procedural-activity assistants.

[98] GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models

Zhaohan Zhang, Ziquan Liu, Ioannis Patras

Main category: cs.CL

TL;DR: GrACE: A generative approach for real-time confidence elicitation in LLMs using hidden state similarity to a special token, enabling scalable and reliable confidence estimation without additional sampling or auxiliary models.

Motivation: Existing methods for assessing LLM reliability through confidence elicitation are either computationally expensive or poorly calibrated, making them impractical for real-world deployment in high-stakes applications like healthcare and finance.

Method: GrACE uses a novel mechanism where the model expresses confidence through similarity between the last hidden state and the embedding of a special token appended to the vocabulary. The model is fine-tuned to calibrate confidence with accuracy targets, enabling real-time confidence estimation without additional sampling.

Result: GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks compared to existing methods. Confidence-based test-time scaling strategies improve final decision accuracy while significantly reducing required samples.

Conclusion: GrACE provides a practical solution for deploying LLMs with reliable, on-the-fly confidence estimation, addressing key limitations of existing confidence elicitation methods for real-world applications.

Abstract: Assessing the reliability of Large Language Models (LLMs) by confidence elicitation is a prominent approach to AI safety in high-stakes applications, such as healthcare and finance. Existing methods either require expensive computational overhead or suffer from poor calibration, making them impractical and unreliable for real-world deployment. In this work, we propose GrACE, a Generative Approach to Confidence Elicitation that enables scalable and reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in which the model expresses confidence by the similarity between the last hidden state and the embedding of a special token appended to the vocabulary, in real-time. We fine-tune the model for calibrating the confidence with targets associated with accuracy. Extensive experiments show that the confidence produced by GrACE achieves the best discriminative capacity and calibration on open-ended generation tasks without resorting to additional sampling or an auxiliary model. Moreover, we propose two confidence-based strategies for test-time scaling with GrACE, which not only improve the accuracy of the final decision but also significantly reduce the number of required samples, highlighting its potential as a practical solution for deploying LLMs with reliable, on-the-fly confidence estimation.
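
The elicitation mechanism, similarity between the final hidden state and a special token's embedding, can be sketched in a few lines. The cosine similarity and sigmoid squashing below are our illustrative assumptions; in the paper the signal is produced by the model itself and fine-tuned against accuracy-based targets.

```python
import math

def grace_style_confidence(last_hidden, conf_token_emb):
    """Confidence as similarity between the final hidden state and a special
    confidence token's embedding (illustrative sketch, not the paper's code)."""
    dot = sum(h * e for h, e in zip(last_hidden, conf_token_emb))
    norms = (math.sqrt(sum(h * h for h in last_hidden))
             * math.sqrt(sum(e * e for e in conf_token_emb)))
    return 1.0 / (1.0 + math.exp(-dot / norms))  # squash similarity into (0, 1)

# Identical vectors give cosine 1.0, i.e. maximal confidence under this sketch.
conf = grace_style_confidence([0.5, -1.0, 2.0], [0.5, -1.0, 2.0])
```

Because the score is read off the hidden state of a single forward pass, no extra sampling or auxiliary model is needed, which is the efficiency property the paper leans on for test-time scaling.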

[99] LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization

Junsong Li, Jie Zhou, Bihao Zhan, Yutao Yang, Qianjun Pan, Shilian Chen, Tianyu Huai, Xin Li, Qin Chen, Liang He

Main category: cs.CL

TL;DR: LifeAlign is a lifelong alignment framework for LLMs that prevents catastrophic forgetting when adapting to new preferences/domains through focalized preference optimization and short-to-long memory consolidation.

Motivation: Traditional LLM alignment methods suffer from catastrophic forgetting - models lose previously acquired knowledge when adapting to new preferences or domains. There's a need for lifelong alignment that maintains consistent human preference alignment across sequential learning tasks without forgetting previous knowledge.

Method: Two key innovations: 1) Focalized preference optimization strategy that aligns LLMs with new preferences while preventing erosion of previously acquired knowledge; 2) Short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction for efficient storage and retrieval of alignment patterns across domains.

Result: Experimental evaluation across multiple sequential alignment tasks spanning different domains and preference types shows superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches.

Conclusion: LifeAlign provides an effective framework for lifelong alignment of LLMs that addresses catastrophic forgetting and enables consistent human preference alignment across sequential learning tasks while preserving previously learned knowledge.

Abstract: Alignment plays a crucial role in adapting Large Language Models (LLMs) to human preferences on a specific task/domain. Traditional alignment methods suffer from catastrophic forgetting, where models lose previously acquired knowledge when adapting to new preferences or domains. We introduce LifeAlign, a novel framework for lifelong alignment that enables LLMs to maintain consistent human preference alignment across sequential learning tasks without forgetting previously learned knowledge. Our approach consists of two key innovations. First, we propose a focalized preference optimization strategy that aligns LLMs with new preferences while preventing the erosion of knowledge acquired from previous tasks. Second, we develop a short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction, enabling efficient storage and retrieval of alignment patterns across diverse domains. We evaluate LifeAlign across multiple sequential alignment tasks spanning different domains and preference types. Experimental results demonstrate that our method achieves superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches. The code and datasets will be released on GitHub.

[100] A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

Ziyi Liu

Main category: cs.CL

TL;DR: A training-free prompt engineering method called State-Update Multi-turn Dialogue Strategy improves LLM performance in long-horizon dialogues using State Reconstruction and History Remind mechanisms.

Motivation: LLMs struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues, which limits their effectiveness in extended conversational contexts.

Method: Proposes a training-free prompt engineering method with two key mechanisms: State Reconstruction (to reconstruct current dialogue state) and History Remind (to selectively recall relevant historical information).

Result: On HotpotQA dataset: improves core information filtering score by 32.6%, increases downstream QA score by 14.1%, reduces inference time by 73.1%, and reduces token consumption by 59.4%.

Conclusion: Provides an effective solution for optimizing LLMs in long-range interactions and offers insights for developing more robust Agents.

Abstract: Large Language Models (LLMs) struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues. To address this, we propose a training-free prompt engineering method, the State-Update Multi-turn Dialogue Strategy. It utilizes “State Reconstruction” and “History Remind” mechanisms to effectively manage dialogue history. Our strategy shows strong performance across multiple multi-hop QA datasets. For instance, on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%. Ablation studies confirm the pivotal roles of both components. Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.
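
A minimal sketch of how such a turn prompt might be assembled, assuming a summarized dialogue state and a short window of recent turns (the section labels and the `k`-turn window are illustrative, not the paper's exact templates):

```python
def build_turn_prompt(state_summary, history, user_msg, k=2):
    """Sketch of the State-Update strategy: each turn carries a reconstructed
    dialogue state plus a reminder of the last k turns instead of the full
    history, trading raw context for a compact, refreshed summary."""
    reminder = "\n".join(history[-k:])                  # "History Remind"
    return (f"[Dialogue state]\n{state_summary}\n\n"   # "State Reconstruction"
            f"[Recent turns]\n{reminder}\n\n"
            f"[User]\n{user_msg}")

prompt = build_turn_prompt(
    state_summary="Goal: book a flight; origin=SFO; dates unresolved.",
    history=["User: I want a flight from SFO.",
             "Bot: Which dates?",
             "User: Next Friday."],
    user_msg="Actually, make it Saturday.",
)
```

Dropping older turns in favor of the reconstructed state is what drives the reported token and latency savings, while the state summary is what preserves core information across the long horizon.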

[101] StateX: Enhancing RNN Recall via Post-training State Expansion

Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: StateX is a post-training framework that efficiently expands the recurrent state size of pre-trained RNNs (linear attention and state-space models) to improve long-context recall and in-context learning without increasing parameters or training costs.

Motivation: RNNs struggle with accurate recall from long contexts because they compress all contextual information into fixed-size recurrent states. While larger states improve recall, directly training RNNs with large states is expensive. There's a need for an efficient way to enhance RNNs' recall capabilities post-training.

Method: StateX introduces architectural modifications for two RNN types: linear attention and state-space models. The framework scales up state size with minimal or no parameter increase through post-training techniques, avoiding the high costs of training from scratch.

Result: Experiments on models up to 1.3B parameters show StateX efficiently enhances recall and in-context learning performance without high post-training costs or compromising other capabilities.

Conclusion: StateX provides an effective post-training solution to overcome RNNs’ limitations in long-context recall by efficiently expanding recurrent state sizes, making RNNs more capable for tasks requiring accurate contextual memory.

Abstract: Recurrent neural networks (RNNs), such as linear attention and state-space models, have gained popularity due to their constant per-token complexity when processing long contexts. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a fixed-size recurrent state. Previous studies have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with large recurrent states results in high training costs. In this paper, we introduce StateX, a post-training framework that efficiently expands the states of pre-trained RNNs. For two popular classes of RNNs, linear attention and state-space models, we design post-training architectural modifications in StateX, to scale up the state size with no or negligible increase in model parameters. Experiments on models with up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning performance of RNNs without incurring high post-training costs or compromising other capabilities.
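
The fixed-size-state bottleneck is easiest to see in a bare linear-attention recurrence (a minimal sketch, without the decay, gating, or normalization real models use):

```python
def linear_attention_step(state, k, v, q):
    """One step of a linear-attention recurrence: the fixed-size state
    accumulates outer products k v^T, and the output reads it with the
    query. Every token is folded into this one matrix, so its size bounds
    how much context can be recalled, the bottleneck StateX expands."""
    d_k = len(k)
    for i in range(d_k):                              # state += k v^T
        for j in range(len(v)):
            state[i][j] += k[i] * v[j]
    return [sum(q[i] * state[i][j] for i in range(d_k))  # q^T state
            for j in range(len(v))]

state = [[0.0, 0.0], [0.0, 0.0]]   # 2x2 recurrent state, constant per token
out = linear_attention_step(state, k=[1.0, 0.0], v=[2.0, 3.0], q=[1.0, 0.0])
```

A query matching the stored key recovers the associated value here, but as more key-value pairs are summed into the same matrix they interfere, which is why recall correlates with state size and why post-training expansion helps.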

[102] Idiom Understanding as a Tool to Measure the Dialect Gap

David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury

Main category: cs.CL

TL;DR: This paper introduces new benchmark datasets for testing dialect understanding through regional idioms in Quebec French, revealing significant performance gaps in LLMs between standard French and regional dialects.

Motivation: The paper aims to address the gap in dialect understanding by combining idiom comprehension with dialect analysis, using regional idioms as a test for dialect competence in language models.

Method: The authors propose three new benchmark datasets: QFrCoRE (4,633 Quebec idiomatic phrases), QFrCoRT (171 Quebec idiomatic words), and MFrCoE (4,938 French Metropolitan expressions). They develop a methodology for corpus construction that can be replicated for other dialects and test 111 LLMs on these benchmarks.

Result: Experiments reveal a critical disparity: while models perform well on French Metropolitan expressions, 65.77% perform significantly worse on Quebec idioms, with only 9.0% favoring the regional dialect over standard French.

Conclusion: The benchmarks reliably quantify dialect gaps in language models, demonstrating that proficiency in prestige languages doesn’t guarantee understanding of regional dialects, highlighting the need for better dialectal competence in NLP systems.

Abstract: The tasks of idiom understanding and dialect understanding are both well-established benchmarks in natural language processing. In this paper, we propose combining them, and using regional idioms as a test of dialect understanding. Towards this end, we propose three new benchmark datasets: QFrCoRE, which contains 4,633 idiomatic phrases from the Quebec dialect of French; QFrCoRT, which comprises 171 Quebec regional idiomatic words; and MFrCoE, a benchmark of 4,938 French Metropolitan expressions. We explain how to construct these corpora, so that our methodology can be replicated for other dialects. Our experiments with 111 LLMs reveal a critical disparity in dialectal competence: while models perform well on French Metropolitan expressions, 65.77% of them perform significantly worse on Quebec idioms, with only 9.0% favoring the regional dialect. These results confirm that our benchmarks are a reliable tool for quantifying the dialect gap and that prestige-language proficiency does not guarantee regional dialect understanding.

[103] Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera

Main category: cs.CL

TL;DR: GQR is a test-time optimization method that refines vision-centric multimodal retrieval using guidance from a lightweight text retriever, achieving state-of-the-art performance with significantly improved efficiency.

DetailsMotivation: Vision-centric multimodal document retrieval models face deployment challenges due to large representation sizes and potential modality gaps, while existing hybrid methods fail to exploit rich interactions between different retrieval models' representation spaces.

Method: Guided Query Refinement (GQR) refines a primary vision-centric retriever’s query embedding at test time using guidance from a complementary text retriever’s scores, enabling fine-grained interaction between different retrieval modalities.

Result: GQR allows vision-centric models to match performance of models with significantly larger representations while being up to 14x faster and requiring 54x less memory, pushing the Pareto frontier for performance and efficiency in multimodal retrieval.

Conclusion: GQR demonstrates that lightweight text retrievers can effectively enhance stronger vision-centric models through test-time optimization, offering a practical solution for efficient multimodal document retrieval deployment.

Abstract: Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model’s representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever’s query embedding using guidance from a complementary retriever’s scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
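The core idea of GQR can be pictured as a tiny test-time optimization loop. The sketch below is an illustrative assumption, not the paper's exact objective: it nudges the vision-centric query embedding `q` so that its dot-product scores against document embeddings move toward a blend that includes the text retriever's scores. The function name, the blend weight `alpha`, and the squared-error loss are all hypothetical.

```python
import numpy as np

def gqr_refine(q, docs_vis, scores_text, alpha=0.5, lr=0.1, steps=50):
    """Hypothetical sketch of test-time Guided Query Refinement (GQR).

    q          : query embedding of the primary (vision-centric) retriever
    docs_vis   : document embeddings in the primary retriever's space
    scores_text: scores from the complementary text retriever (guidance)

    Each step blends the primary scores with the text scores into a
    target, then takes one least-squares gradient step on q toward it.
    """
    q = q.copy()
    for _ in range(steps):
        scores_vis = docs_vis @ q                        # primary scores
        target = (1 - alpha) * scores_vis + alpha * scores_text
        # gradient of 0.5 * ||docs_vis @ q - target||^2 w.r.t. q
        grad = docs_vis.T @ (docs_vis @ q - target)
        q -= lr * grad / len(docs_vis)
    return q
```

With `alpha=0`, the loop is a no-op (the target equals the current scores); larger `alpha` pulls the refined query's ranking toward the text retriever's view while staying in the primary embedding space.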

[104] Unlocking the Potential of Diffusion Language Models through Template Infilling

Junhoo Lee, Seungyeon Kim, Nojun Kwak

Main category: cs.CL

TL;DR: Template Infilling (TI) is a new conditioning method for Diffusion Language Models that uses structural anchors across the entire response space instead of just prefix prompting, enabling better reasoning and generation quality.

DetailsMotivation: Current Diffusion Language Models (DLMs) use limited inference strategies inherited from autoregressive models, primarily relying on prefix-based prompting. This constrains their ability to reason globally and structure responses effectively.

Method: Proposes Template Infilling (TI), a conditioning methodology that flexibly aligns structural anchors across the entire target response space. It establishes a global blueprint first, then fills in masked segments, enabling better reasoning and generation.

Result: Achieves consistent improvements of 9.40% over baseline on diverse benchmarks including mathematical reasoning, code generation, and trip planning. Also provides advantages in multi-token generation settings, enabling effective speedup while maintaining quality and robustness.

Conclusion: TI facilitates System-2 reasoning by enforcing global constraints, empowering DLMs to deliberate within structurally defined solution spaces, offering a more effective alternative to traditional prefix prompting.

Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs. Unlike conventional prefix prompting, TI flexibly aligns structural anchors across the entire target response space, establishing a global blueprint before filling in the masked segments. We demonstrate the effectiveness of our approach on diverse benchmarks, including mathematical reasoning, code generation, and trip planning, achieving consistent improvements of 9.40% over the baseline. Furthermore, we observe that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality and robustness. By enforcing these global constraints, TI ultimately facilitates System-2 reasoning, empowering the model to deliberate within a structurally defined solution space.
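The difference from prefix prompting can be made concrete with a toy template builder. This is a minimal sketch under assumed formatting, not the paper's actual template syntax: structural anchors are laid out across the whole response as a global blueprint, with masked spans left for the diffusion model to infill.

```python
def build_template(anchors, mask_token="[MASK]", span_len=8):
    """Illustrative Template Infilling (TI) conditioning.

    Rather than conditioning only on a prefix, the entire target
    response is pre-structured: each anchor (e.g. a step header) is
    followed by a masked span the DLM fills in. Anchor strings, the
    mask token, and span_len are hypothetical choices.
    """
    parts = []
    for anchor in anchors:
        parts.append(anchor)
        parts.append(" ".join([mask_token] * span_len))
    return "\n".join(parts)
```

For a trip-planning prompt one might pass anchors like `["Day 1:", "Day 2:", "Budget:"]`, so the model sees the whole solution skeleton before denoising any span.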

[105] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Zhongyuan Wang, Jichen Zhang, Shirui Pan, Xindong Wu

Main category: cs.CL

TL;DR: KRLM is a Knowledge Reasoning Language Model that coordinates LLM knowledge with KG context using a specialized instruction format, tokenizer, attention layer, and structure-aware predictor to improve inductive knowledge graph reasoning while reducing hallucinations.

DetailsMotivation: Existing LLM-based knowledge graph foundation models suffer from LLM knowledge distortion when sparse KG context overshadows intrinsic LLM knowledge, and struggle to constrain generative hallucinations, limiting reasoning credibility.

Method: Proposes KRLM with: 1) Knowledge Reasoning Language instruction format and tokenizer to align LLM knowledge with KG representations, 2) KRL attention layer with dynamic knowledge memory mechanism to coordinate intrinsic LLM knowledge with KG context, 3) structure-aware next-entity predictor to constrain results within trustworthy knowledge domain.

Result: Extensive experiments on 25 real-world inductive KGR datasets demonstrate significant superiority in both zero-shot reasoning and fine-tuning scenarios.

Conclusion: KRLM effectively addresses LLM knowledge distortion and hallucination issues in inductive knowledge graph reasoning through unified coordination between LLM knowledge and KG context.

Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source codes are available at https://anonymous.4open.science/r/KRLM-EA36
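The "structure-aware next-entity predictor" idea, constraining predictions to a trustworthy entity set, can be sketched as a simple logit mask before the softmax. This is a generic illustration of constrained decoding, not KRLM's actual predictor; the function name and interface are assumptions.

```python
import numpy as np

def constrained_entity_probs(logits, allowed_ids):
    """Mask next-entity logits outside a trusted KG entity set.

    logits     : raw scores over the full entity vocabulary
    allowed_ids: indices of entities present in the KG context

    Entities outside allowed_ids receive probability exactly 0,
    so the model cannot hallucinate an out-of-graph entity.
    """
    allowed = list(allowed_ids)
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[allowed] = logits[allowed]
    z = masked - masked[allowed].max()   # stabilize; exp(-inf) -> 0
    p = np.exp(z)
    return p / p.sum()
```

Renormalizing over the allowed set keeps the predictor a proper distribution while hard-excluding untrusted entities.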

[106] RLAIF-SPA: Structured AI Feedback for Semantic-Prosodic Alignment in Speech Synthesis

Qing Yang, Zhenghao Liu, Yangfan Du, Pengcheng Huang, Tong Xiao

Main category: cs.CL

TL;DR: RLAIF-SPA: A reinforcement learning framework using AI feedback to optimize emotional expressiveness and intelligibility in TTS without human supervision, achieving state-of-the-art results across multiple datasets.

DetailsMotivation: Current TTS systems achieve near-human quality in neutral styles but lack emotional expressiveness, often requiring costly emotion annotations or using surrogate objectives that fail to capture perceptual emotional quality.

Method: Uses Reinforcement Learning from AI Feedback (RLAIF) with ASR for semantic accuracy feedback and structured reward modeling for prosodic-emotional consistency. Enables control over four dimensions: Structure, Emotion, Speed, and Tone.

Result: Outperforms Chat-TTS on Libri-Speech with 26.1% reduction in word error rate, 9.1% improvement in SIM-O, and over 10% gains in human subjective evaluations. Consistent gains across Libri-Speech, MELD, and Mandarin ESD datasets.

Conclusion: RLAIF-SPA successfully addresses emotional expressiveness limitations in TTS without human supervision, providing precise control over expressive speech generation while maintaining intelligibility.

Abstract: Recent advances in Text-To-Speech (TTS) synthesis have achieved near-human speech quality in neutral speaking styles. However, most existing approaches either depend on costly emotion annotations or optimize surrogate objectives that fail to adequately capture perceptual emotional quality. As a result, the generated speech, while semantically accurate, often lacks expressive and emotionally rich characteristics. To address these limitations, we propose RLAIF-SPA, a novel framework that integrates Reinforcement Learning from AI Feedback (RLAIF) to directly optimize both emotional expressiveness and intelligibility without human supervision. Specifically, RLAIF-SPA incorporates Automatic Speech Recognition (ASR) to provide semantic accuracy feedback, while leveraging structured reward modeling to evaluate prosodic-emotional consistency. RLAIF-SPA enables more precise and nuanced control over expressive speech generation along four structured evaluation dimensions: Structure, Emotion, Speed, and Tone. Extensive experiments on Libri-Speech, MELD, and Mandarin ESD datasets demonstrate consistent gains across clean read speech, conversational dialogue, and emotional speech. On Libri-Speech, RLAIF-SPA consistently outperforms Chat-TTS, achieving a 26.1% reduction in word error rate, a 9.1% improvement in SIM-O, and over 10% gains in human subjective evaluations.
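A composite reward of the kind described, semantic accuracy from ASR plus structured prosodic scores, can be sketched as follows. This is a hedged illustration assuming a simple WER-based semantic term and a mean over the four stated dimensions; the actual RLAIF-SPA reward model is not specified here.

```python
def word_error_rate(ref, hyp):
    """Word-level Levenshtein distance, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def rlaif_spa_reward(ref_text, asr_text, dim_scores, w_sem=0.5):
    """Hypothetical composite reward: ASR intelligibility + prosody.

    dim_scores: dict of AI-feedback scores in [0, 1] over the four
    dimensions named in the paper (Structure, Emotion, Speed, Tone).
    The weighting w_sem is an assumed hyperparameter.
    """
    semantic = 1.0 - min(word_error_rate(ref_text, asr_text), 1.0)
    prosody = sum(dim_scores.values()) / len(dim_scores)
    return w_sem * semantic + (1 - w_sem) * prosody
```

A perfectly transcribed, fully expressive sample scores 1.0; ASR errors and flat prosody each pull the reward down, giving the RL policy a single scalar to optimize.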

[107] Telling Speculative Stories to Help Humans Imagine the Harms of Healthcare AI

Xingmeng Zhao, Dan Schumacher, Veronica Rammouz, Anthony Rios

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.14718.

[108] Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Md Zarif Hossain, Ahmed Imteaj

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2407.14971.

[109] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng, Cho-Jui Hsieh, Kai-Wei Chang, Nanyun Peng

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.14949.

[110] CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification

Noor Islam S. Mohammad

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.17018.

[111] Fairness Evaluation and Inference Level Mitigation in LLMs

Afrozah Nadeem, Mark Dras, Usman Naseem

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.18914.

[112] The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

Noah Amsel, David Persson, Christopher Musco, Robert M. Gower

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2505.16932.

[113] Advancing AI Research Assistants with Expert-Involved Learning

Tianyu Liu, Simeng Han, Hanchen Wang, Xiao Luo, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, Yufeng Liu, Xinyue Cui, Aviv Yaish, Yuhang Chen, Minsheng Hao, Chuhan Li, Kexing Li, Yinsheng Lu, Xinyu Wei, Qinzhe Xing, Antonia Panescu, Mengbo Wang, Vibha Annaswamy, Alicia Sanchez, Jack Cloherty, Arman Cohan, Hua Xu, Mark Gerstein, James Zou, Hongyu Zhao

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2505.04638.

[114] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

Kailin Jiang, Ning Jiang, Yuntao Du, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Bin Li, Lei Liu, Qing Li

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.19457.

[115] Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2505.19770.

[116] Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2511.07969.

[117] LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering

Yuanjie Zhu, Liangwei Yang, Ke Xu, Weizhi Zhang, Zihe Song, Jindong Wang, Philip S. Yu

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2511.15424.

[118] MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Mercy Asiedu, Ines Mezerreg, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2507.05201.

[119] PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

Pawel Batorski, Paul Swoboda

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2512.11013.

[120] Automatic Replication of LLM Mistakes in Medical Conversations

Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2512.20983.

[121] Multiplayer Nash Preference Optimization

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2509.23102.

[122] Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

Ruikang Zhang, Shuo Wang, Qi Su

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2601.02978.

Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2509.25454.

[124] BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Hexiang Tan, Wanli Yang, Junwei Zhang, Xin Chen, Rui Tang, Du Su, Jingang Wang, Yuanzhuo Wang, Fei Sun, Xueqi Cheng

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2601.03042.

[125] Hypothesis-Driven Feature Manifold Analysis in LLMs via Supervised Multi-Dimensional Scaling

Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.01025.

[126] Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs

Biswesh Mohapatra, Théo Charlot, Giovanni Duca, Mayank Palan, Laurent Romary, Justine Cassell

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2601.09365.

Mahammad Namazov, Tomáš Koref, Ivan Habernal

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2601.12419.

[128] DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems

Meiru Zhang, Philipp Borchert, Milan Gritta, Gerasimos Lampouras

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2510.10815.

[129] Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration

Víctor Yeste, Paolo Rosso

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2602.00913.

[130] What If We Allocate Test-Time Compute Adaptively?

Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin, Dean Hougen

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2602.01070.

[131] From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models

Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick

Main category: cs.CL

Summary unavailable: the arXiv API returned HTTP 429 (rate limited) while fetching 2602.08336.

[132] The Company You Keep: How LLMs Respond to Dark Triad Traits

Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: Summary unavailable for 2603.04299: the arXiv API returned HTTP 429 (rate limited).

[133] OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Yuping Yan, Yuhan Xie, Yuanshuai Li, Yingchao Yu, Lingjuan Lyu, Yaochu Jin

Main category: cs.CL

TL;DR: Summary unavailable for 2511.10287: the arXiv API returned HTTP 429 (rate limited).

[134] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

Main category: cs.CL

TL;DR: Summary unavailable for 2603.16105: the arXiv API returned HTTP 429 (rate limited).

[135] Emergent Introspection in AI is Content-Agnostic

Harvey Lederman, Kyle Mahowald

Main category: cs.CL

TL;DR: Summary unavailable for 2603.05414: the arXiv API returned HTTP 429 (rate limited).

[136] Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality

Mengyu Bu, Yang Feng

Main category: cs.CL

TL;DR: Summary unavailable for 2603.17512: the arXiv API returned HTTP 429 (rate limited).

[137] AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Liang Ding

Main category: cs.CL

TL;DR: Summary unavailable for 2603.21357: the arXiv API returned HTTP 429 (rate limited).

[138] Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis

Zice Wang, Zhenyu Zhang

Main category: cs.CL

TL;DR: Summary unavailable for 2603.19282: the arXiv API returned HTTP 429 (rate limited).

[139] Triangulating Temporal Dynamics in Multilingual Swiss Online News

Bros Victor, Dufraisse Evan, Popescu Adrian, Gatica-Perez Daniel

Main category: cs.CL

TL;DR: Summary unavailable for 2603.21519: the arXiv API returned HTTP 429 (rate limited).

[140] The illusion of reasoning: step-level evaluation reveals decorative chain-of-thought in frontier language models

Abhinaba Basu, Pavan Chakraborty

Main category: cs.CL

TL;DR: Summary unavailable for 2603.22816: the arXiv API returned HTTP 429 (rate limited).

[141] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu, Shanshan Wu, Qi Zhao, Wenhao Huang

Main category: cs.CL

TL;DR: Summary unavailable for 2604.02368: the arXiv API returned HTTP 429 (rate limited).

[142] Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation

Soufiane Jhilal, Martina Galletti

Main category: cs.CL

TL;DR: Summary unavailable for 2603.24536: the arXiv API returned HTTP 429 (rate limited).

[143] Learning to Predict Future-Aligned Research Proposals with Language Models

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji

Main category: cs.CL

TL;DR: Summary unavailable for 2603.27146: the arXiv API returned HTTP 429 (rate limited).

[144] MemFactory: Unified Inference & Training Framework for Agent Memory

Ziliang Guo, Ziheng Li, Bo Tang, Feiyu Xiong, Zhiyu Li

Main category: cs.CL

TL;DR: Summary unavailable for 2603.29493: the arXiv API returned HTTP 429 (rate limited).

[145] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

Ramon Ferrer-i-Cancho

Main category: cs.CL

TL;DR: Summary unavailable for 2604.01938: the arXiv API returned HTTP 429 (rate limited).

[146] Adam’s Law: Textual Frequency Law on Large Language Models

Hongyuan Adam Lu, Z.L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam

Main category: cs.CL

TL;DR: Summary unavailable for 2604.02176: the arXiv API returned HTTP 429 (rate limited).

[147] A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire

Main category: cs.CL

TL;DR: Summary unavailable for 2604.04168: the arXiv API returned HTTP 429 (rate limited).

[148] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

Main category: cs.CL

TL;DR: Summary unavailable for 2511.04570: the arXiv API returned HTTP 429 (rate limited).

[149] How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Gregory N. Frank

Main category: cs.CL

TL;DR: Summary unavailable for 2604.04385: the arXiv API returned HTTP 429 (rate limited).

[150] Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection

Yang Li, Qiang Sheng, Zhengjia Wang, Yehan Yang, Danding Wang, Juan Cao

Main category: cs.CL

TL;DR: Summary unavailable for 2604.04932: the arXiv API returned HTTP 429 (rate limited).

[151] Quantization-Robust LLM Unlearning via Low-Rank Adaptation

João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü

Main category: cs.CL

TL;DR: Summary unavailable for 2602.13151: the arXiv API returned HTTP 429 (rate limited).

[152] MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing

Main category: cs.CL

TL;DR: Summary unavailable for 2502.06556: the arXiv API returned HTTP 429 (rate limited).

[153] Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning

Zhimin Zhao

Main category: cs.CL

TL;DR: Summary unavailable for 2602.13934: the arXiv API returned HTTP 429 (rate limited).

[154] ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking

Xianming Li, Aamir Shakir, Rui Huang, Julius Lipp, Benjamin Clavié, Jing Li

Main category: cs.CL

TL;DR: Summary unavailable for 2506.03487: the arXiv API returned HTTP 429 (rate limited).

[155] MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Zilin Xiao, Qi Ma, Mengting Gu, Chun-cheng Jason Chen, Xintao Chen, Vicente Ordonez, Vijai Mohan

Main category: cs.CL

TL;DR: Summary unavailable for 2509.18095: the arXiv API returned HTTP 429 (rate limited).

[156] Horizon-LM: A RAM-Centric Architecture for LLM Training

Zhengqing Yuan, Lichao Sun, Yanfang Ye

Main category: cs.CL

TL;DR: Summary unavailable for 2602.04816: the arXiv API returned HTTP 429 (rate limited).

[157] Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano, Ekaterina Vasyagina, Kevin Zhu, Javier Ferrando, Maheep Chaudhary

Main category: cs.CL

TL;DR: Summary unavailable for 2602.15195: the arXiv API returned HTTP 429 (rate limited).

[158] Moral Mazes in the Era of LLMs

Dang Nguyen, Harvey Yiyun Fu, Peter West, Ari Holtzman, Chenhao Tan

Main category: cs.CL

TL;DR: Summary unavailable for 2603.20231: the arXiv API returned HTTP 429 (rate limited).

[159] Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Cheng Jiayang, Xin Liu, Zhihan Zhang, Haoyang Wen, Zixuan Zhang, Qingyu Yin, Shiyang Li, Priyanka Nigam, Bing Yin, Chao Zhang, Yangqiu Song

Main category: cs.CL

TL;DR: Summary unavailable for 2603.24709: the arXiv API returned HTTP 429 (rate limited).

[160] UnWeaving the knots of GraphRAG – turns out VectorRAG is almost enough

Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz, Adam Kozakiewicz, Tomasz Ziętkiewicz

Main category: cs.CL

TL;DR: Summary unavailable for 2603.29875: the arXiv API returned HTTP 429 (rate limited).

[161] Darkness Visible: Reading the Exception Handler of a Language Model

Peter Balogh

Main category: cs.CL

TL;DR: Summary unavailable for 2604.04756: the arXiv API returned HTTP 429 (rate limited).

[162] Vero: An Open RL Recipe for General Visual Reasoning

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

Main category: cs.CL

TL;DR: Summary unavailable for 2604.04917: the arXiv API returned HTTP 429 (rate limited).

cs.CV

[163] Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

Abhishek Dharmaratnakar, Srivaths Ranganathan, Debanshu Das, Anushree Sinha

Main category: cs.CV

TL;DR: Survey paper analyzing the evolution of automatic video trailer generation from heuristic-based methods to deep generative synthesis using LLMs, MLLMs, and diffusion models.

Motivation: To provide a comprehensive technical review of the paradigm shift in automatic video trailer generation, from traditional heuristic-based approaches to modern generative AI techniques, and establish a new taxonomy for AI-driven trailer generation in the era of foundation models.

Method: Survey methodology analyzing architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluation of LLM-orchestrated pipelines, text-to-video foundation models (Sora, Veo), and synthesis of insights from recent literature.

Result: Establishes a new taxonomy for AI-driven trailer generation, identifies the transition from extractive selection to controllable generative editing and semantic reconstruction, and analyzes economic implications on UGC platforms and ethical challenges of neural synthesis.

Conclusion: Future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers, with foundation models enabling coherent, emotionally resonant narrative construction.

Abstract: The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI’s Sora and Google’s Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.

[164] Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction

Ahmet Rasim Emirdagi, Süleyman Aslan, Mısra Yavuz, Görkay Aydemir, Yunus Bilge Kurt, Nasrin Rahimi, Burak Can Biner, M. Akın Yılmaz

Main category: cs.CV

TL;DR: Adapting vision-language diffusion foundation models via LoRA for CT metal artifact reduction, achieving state-of-the-art performance with only 16-128 training examples through multi-reference conditioning.

Motivation: Metal artifacts from implants degrade CT image quality, but standard deep learning methods require extensive paired training data. The paper proposes reframing artifact reduction as an in-context reasoning task using foundation models to dramatically reduce data requirements.

Method: Adapts a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). Uses multi-reference conditioning where clean anatomical exemplars from unrelated subjects are provided alongside corrupted input, enabling the model to exploit category-specific context for artifact reduction.

Result: Achieves state-of-the-art performance on the AAPM CT-MAR benchmark with perceptual and radiological-feature metrics. Reduces data requirements by two orders of magnitude (only 16-128 paired training examples needed).

Conclusion: Foundation models, when appropriately adapted with domain adaptation and multi-reference conditioning, offer a scalable alternative for interpretable, data-efficient medical image reconstruction, establishing a new paradigm for artifact reduction.

Abstract: Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics. This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at https://github.com/ahmetemirdagi/CT-EditMAR.
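The data efficiency claimed above hinges on LoRA's low-rank update, which freezes the pretrained weight and trains only two small factor matrices. A minimal NumPy sketch of that mechanism (toy sizes and variable names are ours, purely illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 4, 8  # toy sizes; rank r is much smaller than d

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # base path plus scaled low-rank update; only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# with B zero-initialized, the adapted layer matches the frozen base exactly
assert np.allclose(lora_forward(x), W @ x)

# trainable parameters vs. full fine-tuning of this layer
full, lora = W.size, A.size + B.size
print(f"trainable params: {lora} vs {full} ({lora / full:.1%})")
```

The zero-initialized up-projection is the standard LoRA design choice: adaptation starts from the unmodified base model, so the small paired dataset only has to learn the artifact-reduction delta.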

[165] RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models

Jianwei Zhang, Chaoning Zhang, Sihan Cao, Wang Liu, Pengcheng Zheng, Jiaxin Huang, Caiyan Qin, Yalan Ye, Wei Dong, Yang Yang

Main category: cs.CV

TL;DR: RCP is a novel pruning framework for Large Vision-Language Models that reduces visual tokens by up to 88.9% with minimal performance degradation using cumulative pruning and delayed repair mechanisms.

Motivation: LVLMs suffer from high inference costs due to processing massive visual tokens. Existing pruning methods cause significant performance degradation from irreversible token removal and distribution shifts away from pre-trained regimes.

Method: Proposes Representation Consistency Pruner (RCP) with two key components: 1) Cross-attention pruner using LLM’s intrinsic attention to predict cumulative masks for consistent token reduction, and 2) Delayed Repair Adapter (DRA) that caches pruned token essence and applies FiLM-based modulation to answer generation tokens. Uses repair loss to match pruned representations with full-token teacher statistics.

Result: RCP removes up to 88.9% of visual tokens, reduces FLOPs by up to 85.7% with only marginal average accuracy drop. Outperforms prior methods that avoid fine-tuning the original model on several widely used benchmarks.

Conclusion: RCP provides an efficient pruning framework for LVLMs that maintains performance while significantly reducing computational costs through cumulative pruning and delayed repair mechanisms.

Abstract: Large Vision-Language Models (LVLMs) suffer from prohibitive inference costs due to the massive number of visual tokens processed by the language decoder. Existing pruning methods often lead to significant performance degradation because the irreversible removal of visual tokens causes a distribution shift in the hidden states that deviates from the pre-trained full-token regime. To address this, we propose Representation Consistency Pruner, which we refer to as RCP, as a novel framework that integrates cumulative visual token pruning with a delayed repair mechanism. Specifically, we introduce a cross-attention pruner that leverages the intrinsic attention of the LLM as a baseline to predict cumulative masks, ensuring consistent and monotonic token reduction across layers. To compensate for the resulting information loss, we design a delayed repair adapter denoted as DRA, which caches the essence of pruned tokens and applies FiLM-based modulation specifically to the answer generation tokens. We employ a repair loss to match the first and second-order statistics of the pruned representations with a full-token teacher. RCP is highly efficient because it trains only lightweight plug-in modules while allowing for physical token discarding at inference. Extensive experiments on LVLM benchmarks demonstrate that RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with only a marginal average accuracy drop, and outperforms prior methods that avoid fine-tuning the original model on several widely used benchmarks.
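The abstract names two concrete ingredients: FiLM-based modulation of the answer-generation tokens and a repair loss matching the first- and second-order statistics of a full-token teacher. A hedged NumPy sketch of both under toy dimensions (function names and the affine toy teacher are our assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

def film(h, gamma, beta):
    # FiLM: feature-wise affine modulation of hidden states
    return gamma * h + beta

def repair_loss(pruned_h, teacher_h):
    # match first- and second-order statistics (per-feature mean and variance)
    mu_p, mu_t = pruned_h.mean(axis=0), teacher_h.mean(axis=0)
    var_p, var_t = pruned_h.var(axis=0), teacher_h.var(axis=0)
    return float(((mu_p - mu_t) ** 2).mean() + ((var_p - var_t) ** 2).mean())

pruned = rng.standard_normal((32, d))   # stand-in for post-pruning hidden states
teacher = 2.0 * pruned + 1.0            # toy teacher: a per-feature affine shift

# before repair, the statistics diverge; a FiLM with gamma=2, beta=1 closes the gap
assert repair_loss(pruned, teacher) > 0.1
assert repair_loss(film(pruned, 2.0, 1.0), teacher) < 1e-9
```

In this toy setup the optimal FiLM parameters are known by construction; in RCP they would be produced by the trained DRA from cached pruned-token information.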

[166] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Xiawu Zheng, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He

Main category: cs.CV

TL;DR: Video-MME-v2 is a comprehensive video understanding benchmark with progressive tri-level hierarchy and group-based non-linear evaluation to assess model robustness and faithfulness beyond inflated leaderboard scores.

Motivation: Existing video understanding benchmarks are becoming saturated, creating a gap between inflated leaderboard scores and real-world model capabilities. There's a need for rigorous evaluation of model robustness and faithfulness in video comprehension.

Method: 1) Progressive tri-level hierarchy: multi-point visual information aggregation → temporal dynamics modeling → complex multimodal reasoning. 2) Group-based non-linear evaluation: enforces consistency across related queries and coherence in multi-step reasoning. 3) Rigorous human annotation pipeline with 12 annotators, 50 reviewers, 3,300 human-hours, and up to 5 rounds of quality assurance.

Result: Substantial gap between current best model (Gemini-3-Pro) and human experts. Clear hierarchical bottleneck where errors in visual aggregation and temporal modeling propagate to limit high-level reasoning. Thinking-based reasoning highly dependent on textual cues, improving with subtitles but sometimes degrading in purely visual settings.

Conclusion: Video-MME-v2 exposes limitations of current video MLLMs and establishes a demanding new testbed for next-generation video multimodal large language models, highlighting the need for more robust and faithful video understanding capabilities.

Abstract: With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. Moreover, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline, involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
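The abstract does not give the exact scoring rule, but a group-based non-linear evaluation in this spirit can be sketched as follows: a group of related queries earns credit only when every query in it is answered correctly, so fragmented or guess-based correctness scores zero. The function name and input format are assumptions:

```python
def group_score(groups):
    """Group-based non-linear scoring sketch. `groups` is a list of
    lists of booleans: per-question correctness, grouped by related
    queries. A group is credited only if all of its answers are right."""
    credited = sum(all(g) for g in groups)
    return credited / len(groups)
```

Under per-question accuracy, three groups with correctness [[T, T], [T, F], [F]] would score 3/5; under this non-linear rule only the fully consistent group counts, giving 1/3.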

[167] Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Jun Gao, Weiming Hu

Main category: cs.CV

TL;DR: OACIR introduces a new fine-grained image retrieval task focused on instance-level consistency rather than semantic matching, with a benchmark dataset and AdaFocal framework using context-aware attention modulation.

Motivation: Current Composed Image Retrieval (CIR) prioritizes semantic matching but struggles with instance-level fidelity. Many real-world applications require strict instance consistency rather than broad semantic similarity.

Method: Proposes Object-Anchored Composed Image Retrieval (OACIR) task with bounding box anchors. Introduces AdaFocal framework with Context-Aware Attention Modulator that adaptively intensifies attention within specified instance regions while balancing with broader compositional context.

Result: Created OACIRR benchmark with 160K+ quadruples across multiple domains with hard-negative distractors. AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity.

Conclusion: OACIR addresses the limitation of semantic-focused CIR by enabling instance-aware retrieval. AdaFocal establishes a strong baseline for this challenging task, opening directions for more flexible, instance-aware multimodal retrieval systems.

Abstract: Composed Image Retrieval (CIR) has demonstrated significant potential by enabling flexible multimodal queries that combine a reference image and modification text. However, CIR inherently prioritizes semantic matching, struggling to reliably retrieve a user-specified instance across contexts. In practice, emphasizing concrete instance fidelity over broad semantics is often more consequential. In this work, we propose Object-Anchored Composed Image Retrieval (OACIR), a novel fine-grained retrieval task that mandates strict instance-level consistency. To advance research on this task, we construct OACIRR (OACIR on Real-world images), the first large-scale, multi-domain benchmark comprising over 160K quadruples and four challenging candidate galleries enriched with hard-negative instance distractors. Each quadruple augments the compositional query with a bounding box that visually anchors the object in the reference image, providing a precise and flexible way to ensure instance preservation. To address the OACIR task, we propose AdaFocal, a framework featuring a Context-Aware Attention Modulator that adaptively intensifies attention within the specified instance region, dynamically balancing focus between the anchored instance and the broader compositional context. Extensive experiments demonstrate that AdaFocal substantially outperforms existing compositional retrieval models, particularly in maintaining instance-level fidelity, thereby establishing a robust baseline for this challenging task while opening new directions for more flexible, instance-aware retrieval systems.
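The Context-Aware Attention Modulator intensifies attention within the anchored instance region while keeping the broader compositional context in play. A hypothetical numpy sketch of one simple mechanism in this spirit, an additive logit boost inside the region mask before softmax; the boost value and the additive scheme are assumptions, not the paper's formulation:

```python
import numpy as np

def modulated_attention(logits, region_mask, boost=2.0):
    """Raise attention logits for tokens inside the anchored instance
    region (region_mask is 1 in-region, 0 elsewhere), then renormalize
    with a numerically stable softmax so context tokens keep nonzero
    weight."""
    biased = logits + boost * region_mask
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With uniform logits, in-region tokens end up with strictly higher attention weight while the distribution still sums to one, which is the "intensify but do not exclude" balance the paper describes.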

[168] ID-Sim: An Identity-Focused Similarity Metric

Julia Chae, Nicholas Kolkin, Jui-Hsien Wang, Richard Zhang, Sara Beery, Cusuh Ham

Main category: cs.CV

TL;DR: ID-Sim is a feed-forward metric for evaluating identity-focused vision tasks that aims to match human selective sensitivity to identities across diverse contexts.

Motivation: Vision models struggle to match human selective sensitivity to identities across different contexts (viewpoints, lighting), and progress in identity-focused tasks like personalized image generation is hindered by lack of proper evaluation metrics.

Method: Propose ID-Sim metric trained on curated real-world images augmented with generative synthetic data providing controlled fine-grained identity and contextual variations. Create unified evaluation benchmark for identity-focused recognition, retrieval, and generative tasks.

Result: ID-Sim metric designed to faithfully reflect human selective sensitivity to identities across diverse contexts, with evaluation on new benchmark for consistency with human annotations.

Conclusion: ID-Sim facilitates progress in identity-focused vision tasks by providing a metric that better aligns with human perception of identity across varying contexts.

Abstract: Humans have remarkable selective sensitivity to identities – easily distinguishing between highly similar identities, even across significantly different contexts such as diverse viewpoints or lighting. Vision models have struggled to match this capability, and progress toward identity-focused tasks such as personalized image generation is slowed by a lack of identity-focused evaluation metrics. To help facilitate progress, we propose ID-Sim, a feed-forward metric designed to faithfully reflect human selective sensitivity. To build ID-Sim, we curate a high-quality training set of images spanning diverse real-world domains, augmented with generative synthetic data that provides controlled, fine-grained identity and contextual variations. We evaluate our metric on a new unified evaluation benchmark for assessing consistency with human annotations across identity-focused recognition, retrieval, and generative tasks.

[169] DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma

Main category: cs.CV

TL;DR: DetailVerifyBench: A challenging benchmark for precise hallucination localization in long image captions from MLLMs, featuring 1,000 images across 5 domains with token-level annotations and 200+ word captions.

Motivation: As MLLMs generate comprehensive, long-form image captions (hundreds of words), existing benchmarks lack the fine granularity and domain diversity needed to evaluate precise hallucination detection and localization at the token/span level rather than just response-level inconsistencies.

Method: Created DetailVerifyBench with 1,000 high-quality images across five distinct domains, featuring long captions (average 200+ words) and dense token-level annotations of multiple hallucination types to enable rigorous evaluation of hallucination localization capabilities.

Result: Established the most challenging benchmark to date for precise hallucination localization in long image captioning, providing a comprehensive evaluation framework with fine-grained annotations across diverse domains.

Conclusion: DetailVerifyBench addresses a critical gap in evaluating MLLMs’ ability to detect and localize hallucinations in long-form image captions, enabling more reliable caption generation systems through better evaluation of fine-grained hallucination detection capabilities.

Abstract: Accurately detecting and localizing hallucinations is a critical task for ensuring high reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often spanning hundreds of words. This shift exponentially increases the challenge: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flag response-level inconsistencies. However, existing benchmarks lack the fine granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark for precise hallucination localization in the field of long image captioning to date. Our benchmark is available at https://zyx-hhnkh.github.io/DetailVerifyBench/.
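Token-level hallucination localization is naturally scored with span-overlap metrics. A sketch of one plausible scoring function, F1 over the token positions covered by predicted and gold spans; the benchmark's official metric is not specified here and may differ:

```python
def span_f1(pred_spans, gold_spans):
    """Localization score sketch: spans are (start, end) half-open
    token-index ranges; the score is F1 over the sets of token
    positions they cover."""
    cover = lambda spans: {i for s, e in spans for i in range(s, e)}
    p, g = cover(pred_spans), cover(gold_spans)
    if not p or not g:
        return 0.0
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
```

A prediction that only partially overlaps a hallucinated span is penalized on both precision and recall, which is what distinguishes localization from response-level flagging.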

[170] R3PM-Net: Real-time, Robust, Real-world Point Matching Network

Yasaman Kashefbahrami, Erkut Akdag, Panagiotis Meletis, Evgeniya Balmashnova, Dip Goswami, Egor Bondarau

Main category: cs.CV

TL;DR: R3PM-Net is a lightweight, global-aware point cloud registration network that achieves state-of-the-art accuracy with unmatched speed, specifically designed for real-world industrial applications with imperfect scans.

Motivation: Existing deep learning methods for point cloud registration are developed on clean synthetic datasets, limiting their generalizability to real-world industrial scenarios with noisy, incomplete scans. There's a need for methods that prioritize both generalizability and real-time efficiency.

Method: Proposes R3PM-Net, a lightweight global-aware object-level point matching network. Introduces two new datasets (Sioux-Cranfield and Sioux-Scans) for evaluating registration of imperfect photogrammetric and event-camera scans to CAD models.

Result: Achieves perfect fitness score of 1 and inlier RMSE of 0.029 cm on ModelNet40 in only 0.007s (7x faster than RegTR). Maintains similar performance on Sioux-Cranfield and successfully resolves edge cases on challenging Sioux-Scans dataset in under 50 ms.

Conclusion: R3PM-Net offers a robust, high-speed solution for critical industrial applications where precision and real-time performance are essential, bridging the gap between synthetic training and real-world deployment.

Abstract: Accurate Point Cloud Registration (PCR) is an important task in 3D data processing, involving the estimation of a rigid transformation between two point clouds. While deep-learning methods have addressed key limitations of traditional non-learning approaches, such as sensitivity to noise, outliers, occlusion, and initialization, they are developed and evaluated on clean, dense, synthetic datasets, limiting their generalizability to real-world industrial scenarios. This paper introduces R3PM-Net, a lightweight, global-aware, object-level point matching network designed to bridge this gap by prioritizing both generalizability and real-time efficiency. To support this transition, two datasets, Sioux-Cranfield and Sioux-Scans, are proposed. They provide an evaluation ground for registering imperfect photogrammetric and event-camera scans to digital CAD models, and have been made publicly available. Extensive experiments demonstrate that R3PM-Net achieves competitive accuracy with unmatched speed. On ModelNet40, it reaches a perfect fitness score of 1 and an inlier RMSE of 0.029 cm in only 0.007 s, approximately 7 times faster than the state-of-the-art method RegTR. This performance carries over to the Sioux-Cranfield dataset, maintaining a fitness of 1 and an inlier RMSE of 0.030 cm with similarly low latency. Furthermore, on the highly challenging Sioux-Scans dataset, R3PM-Net successfully resolves edge cases in under 50 ms. These results confirm that R3PM-Net offers a robust, high-speed solution for critical industrial applications, where precision and real-time performance are indispensable. The code and datasets are available at https://github.com/YasiiKB/R3PM-Net.
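For background on the registration problem itself: given matched points, the rigid transformation has a classical SVD-based closed-form solution (the Kabsch algorithm). A numpy sketch of that generic baseline, not of R3PM-Net:

```python
import numpy as np

def rigid_align(P, Q):
    """Estimate the rigid transform (R, t) with Q ~= P @ R.T + t from
    corresponded point sets P, Q of shape (n, 3), via the standard
    SVD-based (Kabsch) solution."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```

With exact correspondences and no noise this recovers the ground-truth transform; learned matchers like R3PM-Net exist precisely because real scans provide neither.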

[171] EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching

Takara Taniguchi, Ryohei Shimizu, Minh-Duc Vo, Kota Izumi, Shiqi Yang, Teppei Suzuki

Main category: cs.CV

TL;DR: EDGE-Shield is a scalable content filter for text-to-image models that blocks copyright-violating and deepfake content during the denoising process using embedding-based matching and x-pred transformation for early detection.

Motivation: Text-to-image models pose risks of copyright violation and deepfake generation. Existing reference-based filters lack scalability with numerous references and require waiting for complete image generation, creating a need for efficient, training-free content filters that work during generation.

Method: Proposes EDGE-Shield with two key innovations: 1) Embedding-based matching for efficient reference comparison, and 2) x-pred transformation that converts noisy intermediate latents into pseudo-estimated clean latents at later denoising stages, enabling earlier classification of violative content.

Result: EDGE-Shield significantly reduces processing time by approximately 79% for Z-Image-Turbo and 50% for Qwen-Image compared to traditional reference-based methods, while maintaining filtering accuracy across different model architectures.

Conclusion: EDGE-Shield provides a scalable, efficient solution for content filtering during image generation, addressing the limitations of existing reference-based approaches while maintaining practical latency and accuracy.

Abstract: The advent of Text-to-Image generative models poses significant risks of copyright violation and deepfake generation. Since new copyrighted works and private individuals constantly emerge, reference-based training-free content filters are essential for providing up-to-date protection without the constraints of a fixed knowledge cutoff. However, existing reference-based approaches often lack scalability when handling numerous references and require waiting for image generation to finish. To solve these problems, we propose EDGE-Shield, a scalable content filter that operates during the denoising process and maintains practical latency while effectively blocking violative content. We leverage embedding-based matching for efficient reference comparison. Additionally, we introduce an x-pred transformation that converts the model’s noisy intermediate latent into a pseudo-estimate of the clean latent, enhancing the classification accuracy of violative content at earlier denoising stages. We conduct experiments on violative content filtering with two generative models, Z-Image-Turbo and Qwen-Image. EDGE-Shield significantly outperforms traditional reference-based methods in terms of latency; it achieves an approximately 79% reduction in processing time for Z-Image-Turbo and an approximately 50% reduction for Qwen-Image, while maintaining filtering accuracy across different model architectures.
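The x-pred transformation builds on the standard DDPM identity relating a noisy latent, the predicted noise, and the clean sample: since x_t = sqrt(a_bar_t)·x0 + sqrt(1 − a_bar_t)·eps, the clean latent can be estimated as x0 ≈ (x_t − sqrt(1 − a_bar_t)·eps_hat) / sqrt(a_bar_t). A numpy sketch of that textbook identity; EDGE-Shield's exact usage may add further steps:

```python
import numpy as np

def x_pred(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean latent x0 from a noisy latent x_t and the
    model's noise prediction, using the standard DDPM forward-process
    identity x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
```

When the true noise is supplied, the identity inverts the forward process exactly; with a model's imperfect noise prediction, the result is the pseudo-estimated clean latent the filter classifies against references.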

[172] SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

Zhongyu Yang, Zuhao Yang, Shuo Zhan, Tan Yue, Wei Pang, Yingfang Yuan

Main category: cs.CV

TL;DR: SVAgent is a storyline-guided cross-modal multi-agent framework for VideoQA that uses narrative reasoning to improve video understanding and answer consistency.

Motivation: Current VideoQA methods focus on locating relevant frames rather than reasoning through evolving storylines like humans do, which is crucial for robust and contextually grounded predictions.

Method: A multi-agent framework with: 1) storyline agent that constructs narrative representations, 2) refinement suggestion agent that analyzes historical failures to suggest frames, 3) cross-modal decision agents that independently predict answers from visual and textual modalities guided by the storyline, and 4) meta-agent that aligns cross-modal predictions.

Result: SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

Conclusion: The storyline-guided multi-agent approach effectively improves VideoQA by incorporating narrative reasoning and cross-modal alignment, leading to better performance and interpretability.

Abstract: Video question answering (VideoQA) is a challenging task that requires integrating spatial, temporal, and semantic information to capture the complex dynamics of video sequences. Although recent advances have introduced various approaches for video understanding, most existing methods still rely on locating relevant frames to answer questions rather than reasoning through the evolving storyline as humans do. Humans naturally interpret videos through coherent storylines, an ability that is crucial for making robust and contextually grounded predictions. To address this gap, we propose SVAgent, a storyline-guided cross-modal multi-agent framework for VideoQA. The storyline agent progressively constructs a narrative representation based on frames suggested by a refinement suggestion agent that analyzes historical failures. In addition, cross-modal decision agents independently predict answers from visual and textual modalities under the guidance of the evolving storyline. Their outputs are then evaluated by a meta-agent to align cross-modal predictions and enhance reasoning robustness and answer consistency. Experimental results demonstrate that SVAgent achieves superior performance and interpretability by emulating human-like storyline reasoning in video understanding.

[173] Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

Junbin Zhang, Meng Cao, Feng Tan, Yikai Lin, Yuexian Zou

Main category: cs.CV

TL;DR: Graph-PiT introduces a graph-based framework for fine-grained visual generation that models spatial-semantic relationships between parts using hierarchical graph neural networks to improve structural coherence in multi-part image synthesis.

Motivation: Existing part-based visual generation frameworks treat user-provided parts as unordered sets, ignoring their intrinsic spatial and semantic relationships, which leads to compositions lacking structural integrity and coherence.

Method: Proposes Graph-PiT framework that models visual parts as nodes and their relationships as edges in a graph. Uses Hierarchical Graph Neural Network (HGNN) for bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, with graph Laplacian smoothness loss and edge-reconstruction loss for relation-aware embeddings.

Result: Quantitative experiments on controlled synthetic domains (character, product, indoor layout, jigsaw) and qualitative transfer to real web images show Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with original IP-Prior pipeline. Ablation confirms explicit relational reasoning is crucial for enforcing adjacency constraints.

Conclusion: Graph-PiT enhances plausibility of generated concepts and offers scalable, interpretable mechanism for complex multi-part image synthesis by explicitly modeling structural dependencies between visual components through graph-based relational reasoning.

Abstract: Achieving fine-grained and structurally sound controllability is a cornerstone of advanced visual generation. Existing part-based frameworks treat user-provided parts as an unordered set and therefore ignore their intrinsic spatial and semantic relationships, which often results in compositions that lack structural integrity. To bridge this gap, we propose Graph-PiT, a framework that explicitly models the structural dependencies of visual components using a graph prior. Specifically, we represent visual parts as nodes and their spatial-semantic relationships as edges. At the heart of our method is a Hierarchical Graph Neural Network (HGNN) module that performs bidirectional message passing between coarse-grained part-level super-nodes and fine-grained IP+ token sub-nodes, refining part embeddings before they enter the generative pipeline. We also introduce a graph Laplacian smoothness loss and an edge-reconstruction loss so that adjacent parts acquire compatible, relation-aware embeddings. Quantitative experiments on controlled synthetic domains (character, product, indoor layout, and jigsaw), together with qualitative transfer to real web images, show that Graph-PiT improves structural coherence over vanilla PiT while remaining compatible with the original IP-Prior pipeline. Ablation experiments confirm that explicit relational reasoning is crucial for enforcing user-specified adjacency constraints. Our approach not only enhances the plausibility of generated concepts but also offers a scalable and interpretable mechanism for complex, multi-part image synthesis. The code is available at https://github.com/wolf-bailang/Graph-PiT.
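The graph Laplacian smoothness loss has the standard form tr(Xᵀ L X) with L = D − A, which equals ½·Σᵢⱼ Aᵢⱼ‖xᵢ − xⱼ‖² for an undirected adjacency A. A numpy sketch of this generic loss; Graph-PiT's full objective also includes an edge-reconstruction term not shown here:

```python
import numpy as np

def laplacian_smoothness(X, A):
    """Graph Laplacian smoothness tr(X^T L X), L = D - A, for node
    embeddings X of shape (n, d) and symmetric adjacency A of shape
    (n, n). Penalizes embedding differences across connected parts."""
    L = np.diag(A.sum(axis=1)) - A
    return np.trace(X.T @ L @ X)
```

The trace form and the pairwise form agree, which is why minimizing this loss pulls adjacent parts toward compatible, relation-aware embeddings.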

[174] Simultaneous Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

Jorge Alberto Garza-Abdala, Gerardo A. Fumagal-González, Eduardo de Avila-Armenta, Sadam Hussain, Jasiel H. Toscano-Martínez, Diana S. M. Rosales Gurmendi, Alma A. Pedro-Pérez, Jose G. Tamez-Pena

Main category: cs.CV

TL;DR: A three-channel denoising diffusion probabilistic model generates paired craniocaudal (CC) and mediolateral oblique (MLO) mammogram views simultaneously, using a third channel for absolute difference to maintain anatomical consistency.

Motivation: Many breast cancer screening datasets lack complete paired CC and MLO views, limiting algorithm development that requires cross-view consistency. There's a need for synthetic dual-view mammogram generation to augment datasets and enable cross-view-aware AI applications.

Method: A three-channel denoising diffusion probabilistic model (DDPM) with separate channels for CC and MLO views, plus a third channel encoding their absolute difference to guide anatomical relationship learning. A pretrained DDPM was fine-tuned on private screening data.

Result: The difference-based encoding preserved global breast structure across views, producing synthetic CC-MLO pairs resembling real acquisitions. Evaluation showed geometric consistency via automated breast mask segmentation and distributional comparison with real images.

Conclusion: The work demonstrates feasibility of simultaneous dual-view mammogram synthesis using difference-guided DDPM, highlighting potential for dataset augmentation and future cross-view-aware AI applications in breast imaging.

Abstract: Breast cancer screening relies heavily on mammography, where the craniocaudal (CC) and mediolateral oblique (MLO) views provide complementary information for diagnosis. However, many datasets lack complete paired views, limiting the development of algorithms that depend on cross-view consistency. To address this gap, we propose a three-channel denoising diffusion probabilistic model capable of simultaneously generating CC and MLO views of a single breast. In this configuration, the two mammographic views are stored in separate channels, while a third channel encodes their absolute difference to guide the model toward learning coherent anatomical relationships between projections. A pretrained DDPM from Hugging Face was fine-tuned on a private screening dataset and used to synthesize dual-view pairs. Evaluation included geometric consistency via automated breast mask segmentation and distributional comparison with real images, along with qualitative inspection of cross-view alignment. The results show that the difference-based encoding helps preserve the global breast structure across views, producing synthetic CC-MLO pairs that resemble real acquisitions. This work demonstrates the feasibility of simultaneous dual-view mammogram synthesis using a difference-guided DDPM, highlighting its potential for dataset augmentation and future cross-view-aware AI applications in breast imaging.
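The three-channel input described in the abstract can be sketched directly: CC and MLO views in separate channels, with their absolute difference as the third, consistency-guiding channel:

```python
import numpy as np

def pack_views(cc, mlo):
    """Build the three-channel model input described in the paper:
    channel 0 = CC view, channel 1 = MLO view, channel 2 = |CC - MLO|,
    the last guiding the model toward coherent cross-view anatomy.
    cc and mlo are single-channel images of identical shape."""
    return np.stack([cc, mlo, np.abs(cc - mlo)], axis=0)
```

Because the third channel is a deterministic function of the first two, it adds no new information at training time; its role is to make the cross-view relationship an explicit target the diffusion model must keep consistent.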

[175] Forgery-aware Layer Masking and Multi-Artifact Subspace Decomposition for Generalizable Deepfake Detection

Xiang Zhang, Wenliang Weng, Daoyong Fu, Beijing Chen, Ziqiang Li, Ziwen He, Zhangjie Fu

Main category: cs.CV

TL;DR: FMSD: A deepfake detection framework using forgery-aware layer masking and multi-artifact subspace decomposition to improve cross-dataset generalization while preserving pretrained semantic representations.

Motivation: Deepfake detection struggles with cross-dataset generalization due to varying artifact patterns across forgery methods. Existing approaches often overemphasize forgery-specific cues and disturb semantic representations, weakening generalization.

Method: Two-stage approach: 1) Forgery-aware Layer Masking identifies forgery-sensitive layers via gradient bias-variance analysis for selective updating; 2) Multi-Artifact Subspace Decomposition uses SVD to decompose selected layer weights into semantic subspace and multiple learnable artifact subspaces with orthogonality and spectral consistency constraints.

Result: The framework effectively models diverse forgery patterns while preserving pretrained semantic representations, improving cross-dataset generalization in deepfake detection.

Conclusion: FMSD addresses limitations of existing deepfake detection methods by selectively updating forgery-sensitive layers and decomposing representations to capture heterogeneous artifacts without compromising semantic understanding.

Abstract: Deepfake detection remains highly challenging, particularly in cross-dataset scenarios and complex real-world settings. This challenge mainly arises because artifact patterns vary substantially across different forgery methods, whereas adapting pretrained models to such artifacts often overemphasizes forgery-specific cues and disturbs semantic representations, thereby weakening generalization. Existing approaches typically rely on full-parameter fine-tuning or auxiliary supervision to improve discrimination. However, they often struggle to model diverse forgery artifacts without compromising pretrained representations. To address these limitations, we propose FMSD, a deepfake detection framework built upon Forgery-aware Layer Masking and Multi-Artifact Subspace Decomposition. Specifically, Forgery-aware Layer Masking evaluates the bias-variance characteristics of layer-wise gradients to identify forgery-sensitive layers, thereby selectively updating them while reducing unnecessary disturbance to pretrained representations. Building upon this, Multi-Artifact Subspace Decomposition further decomposes the selected layer weights via Singular Value Decomposition (SVD) into a semantic subspace and multiple learnable artifact subspaces. These subspaces are optimized to capture heterogeneous and complementary forgery artifacts, enabling effective modeling of diverse forgery patterns while preserving pretrained semantic representations. Furthermore, orthogonality and spectral consistency constraints are imposed to regularize the artifact subspaces, reducing redundancy across them while preserving the overall spectral structure of pretrained weights.
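The SVD-based decomposition can be illustrated with a minimal sketch: split a layer weight into a rank-r "semantic" subspace spanned by the top singular directions, leaving a residual in which learnable artifact subspaces could live. The rank choice and the artifact parameterization here are assumptions, not FMSD's exact scheme:

```python
import numpy as np

def decompose_weight(W, r):
    """Decompose a layer weight W into a rank-r semantic component
    (top-r singular directions, to be preserved) and a residual
    component (candidate space for learnable artifact subspaces)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    semantic = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]
    residual = W - semantic
    return semantic, residual
```

The two components sum back to W exactly, so constraining updates to the residual leaves the dominant (semantic) spectral structure of the pretrained weight untouched, which is the intuition behind the spectral consistency constraint.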

[176] Watch Before You Answer: Learning from Visually Grounded Post-Training

Yuxuan Zhang, EunJeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shenhui Zhang, Ping Nie, Peter West, Kelsey R. Allen

Main category: cs.CV

TL;DR: VidGround addresses linguistic bias in video understanding benchmarks by curating visually-grounded questions for post-training, improving VLM performance while using less data.

Motivation: Current video understanding benchmarks contain 40-60% of questions answerable from text alone, undermining visual grounding evaluation and post-training effectiveness for VLMs.

Method: Introduces VidGround - curates post-training data using only visually-grounded questions without linguistic biases, combined with RL-based post-training algorithms.

Result: Improves performance by up to 6.2 points using only 69.1% of original data; data curation outperforms complex post-training techniques, showing data quality is key bottleneck.

Conclusion: Highlights importance of curating post-training data and benchmarks that truly require visual grounding to advance capable VLMs; data quality is major bottleneck for video understanding.

Abstract: It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
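The curation idea reduces to a simple filter: drop any question a text-only (blind) pass already answers correctly, keeping only visually grounded ones. A hypothetical sketch, with the blind pass precomputed as a boolean per question (in practice it would come from running the model without frames):

```python
def curate(questions, blind_correct):
    """Keep only visually grounded questions: discard any question
    that a text-only (blind) model already answers correctly.
    `blind_correct[i]` is True if the blind pass got question i right."""
    return [q for q, blind_ok in zip(questions, blind_correct)
            if not blind_ok]
```

If 40-60% of questions are text-answerable, this filter yields a post-training set in the same ballpark as the paper's retained 69.1% of data, while removing exactly the examples that teach nothing about visual grounding.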

[177] Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging

Md Rahatul Islam Udoy, Diego Ferrer, Wantong Li, Kai Ni, Sumeet Kumar Gupta, Ahmedullah Aziz

Main category: cs.CV

TL;DR: SecurePix: CMOS pixel sensor with in-pixel encryption using ferroelectric transistors for hardware-level image security

Motivation: Need for end-to-end security in image sensors as visual data can be exposed through multiple stages of imaging pipeline; requires encryption before pixel values appear on readout lines

Method: Compact CMOS-compatible pixel architecture using ferroelectric field-effect transistors with programmable non-volatile multidomain polarization states for symmetric key encryption; designed and simulated in HSPICE with 45nm CMOS process layout

Result: Pixel pitch of 2.33 × 3.01 μm²; ResNet-18 accuracy drops from 99.29% to 9.58% on MNIST and from 91.33% to 6.98% on CIFAR-10 after encryption; per-pixel programming power-delay product of 17 μW·μs and sensing power-delay product of 1.25 μW·μs

Conclusion: SecurePix demonstrates effective hardware-level image protection with low overhead, enabling secure image capture at the pixel level

Abstract: Ensuring end-to-end security in image sensors has become essential as visual data can be exposed through multiple stages of the imaging pipeline. Advanced protection requires encryption to occur before pixel values appear on any readout lines. This work introduces a secure pixel sensor (SecurePix), a compact CMOS-compatible pixel architecture that performs true in-pixel encryption using a symmetric key realized through programmable, non-volatile multidomain polarization states of a ferroelectric field-effect transistor. The pixel and array operations are designed and simulated in HSPICE, while a 45 nm CMOS process design kit is used for layout drawing. The resulting layout confirms a pixel pitch of 2.33 x 3.01 um^2. Each pixel’s non-volatile programming level defines its analog transfer characteristic, enabling the photodiode voltage to be converted into an encrypted analog output within the pixel. Full-image evaluation shows that ResNet-18 recognition accuracy drops from 99.29 percent to 9.58 percent on MNIST and from 91.33 percent to 6.98 percent on CIFAR-10 after encryption, indicating strong resistance to neural-network-based inference. Lookup-table-based inverse mapping enables recovery for authorized receivers using the same symmetric key. Based on HSPICE simulation, the SecurePix achieves a per-pixel programming power-delay product of 17 uW us and a per-pixel sensing power-delay product of 1.25 uW us, demonstrating low-overhead hardware-level protection.
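The key-dependent analog transfer can be mimicked numerically. A toy sketch, not the paper's FeFET circuit: each pixel's programmed non-volatile level selects one of K monotonic transfer curves (modeled here as power laws, an assumption), and the authorized receiver inverts with the same symmetric key, mirroring the lookup-table recovery described in the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                   # number of programmable key levels
levels = np.linspace(0.5, 2.0, K)       # hypothetical per-level transfer exponents

def encrypt(img, key):
    # img in [0, 1]; the per-pixel exponent plays the role of the
    # key-dependent analog transfer characteristic
    return img ** levels[key]

def decrypt(enc, key):
    # inverse mapping with the shared symmetric key (LUT-style recovery)
    return enc ** (1.0 / levels[key])

img = rng.random((8, 8))                # toy "photodiode voltages"
key = rng.integers(0, K, size=(8, 8))   # per-pixel symmetric key
enc = encrypt(img, key)                 # what appears on the readout lines
rec = decrypt(enc, key)                 # authorized reconstruction
```

A classifier seeing only `enc` faces pixel values scrambled per-position, which is the effect behind the reported accuracy collapse.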

[178] Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

Mingjie Li, Edward Kim, Yue Zhao, Ehsan Adeli, Kilian M. Pohl

Main category: cs.CV

TL;DR: NeuroQuant is a modality-aware 3D VQ-VAE for multi-modal brain MRI reconstruction that learns shared anatomical representations while preserving modality-specific features through factorized attention and dual-stream encoding.

Motivation: Existing brain VAEs focus on single-modality data (T1-weighted MRI), missing the complementary diagnostic value of other modalities like T2-weighted MRIs. There's a need for robust multi-modal brain MRI reconstruction for medical image analysis applications.

Method: Proposes NeuroQuant: a modality-aware 3D VQ-VAE with factorized multi-axis attention for shared latent representation, dual-stream 3D encoder separating anatomical structures from modality-dependent appearance, discretization via shared codebook, and FiLM-based decoding with joint 2D/3D training strategy.

Result: Extensive experiments on two multi-modal brain MRI datasets show superior reconstruction fidelity compared to existing VAEs, enabling scalable foundation for downstream generative modeling and cross-modal brain image analysis.

Conclusion: NeuroQuant effectively learns shared anatomical representations across modalities while preserving modality-specific features, providing a robust foundation for multi-modal brain MRI analysis and generative modeling tasks.

Abstract: Learning a robust Variational Autoencoder (VAE) is a fundamental step for many deep learning applications in medical image analysis, such as MRI synthesis. Existing brain VAEs predominantly focus on single-modality data (i.e., T1-weighted MRI), overlooking the complementary diagnostic value of other modalities like T2-weighted MRIs. Here, we propose a modality-aware and anatomically grounded 3D vector-quantized VAE (VQ-VAE) for reconstructing multi-modal brain MRIs. Called NeuroQuant, it first learns a shared latent representation across modalities using factorized multi-axis attention, which can capture relationships between distant brain regions. It then employs a dual-stream 3D encoder that explicitly separates the encoding of modality-invariant anatomical structures from modality-dependent appearance. Next, the anatomical encoding is discretized using a shared codebook and combined with modality-specific appearance features via Feature-wise Linear Modulation (FiLM) during the decoding phase. This entire approach is trained using a joint 2D/3D strategy in order to account for the slice-based acquisition of 3D MRI data. Extensive experiments on two multi-modal brain MRI datasets demonstrate that NeuroQuant achieves superior reconstruction fidelity compared to existing VAEs, enabling a scalable foundation for downstream generative modeling and cross-modal brain image analysis.
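Two of the building blocks named above have compact definitions. A toy numpy sketch (shapes and the codebook are illustrative; this is not the paper's 3D network): nearest-neighbor vector quantization against a shared codebook, and FiLM conditioning, which scales and shifts each channel with modality-specific parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(16, 4))          # 16 codes of dimension 4

def vector_quantize(z):
    # replace each feature vector by its nearest codebook entry
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def film(x, gamma, beta):
    # Feature-wise Linear Modulation: per-channel scale and shift
    return gamma * x + beta

z = rng.normal(size=(5, 4))                  # toy anatomical features
zq, idx = vector_quantize(z)
# gamma/beta would come from the modality-specific appearance branch
gamma, beta = np.array([2.0, 1.0, 1.0, 0.5]), np.zeros(4)
out = film(zq, gamma, beta)
```

In NeuroQuant the quantized codes carry the shared anatomy while the FiLM parameters inject T1- or T2-specific appearance during decoding.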

[179] MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

Ziqian Liu, Stephan Alaniz

Main category: cs.CV

TL;DR: MIRAGE is a training-free framework for precise multi-instance image editing that addresses over-editing and spatial misalignment in existing models by using vision-language parsing and multi-branch parallel denoising.

Motivation: Current instruction-guided image editing models (like FLUX.2 and Qwen-Image-Edit) struggle with complex scenarios involving multiple similar instances requiring individual edits, suffering from over-editing and spatial misalignment when faced with multiple identical instances and composite instructions.

Method: Proposes MIRAGE (Multi-Instance Regional Alignment via Guided Editing), a training-free framework that: 1) uses a vision-language model to parse complex instructions into regional subsets, 2) employs multi-branch parallel denoising strategy to inject latent representations of target regions into global representation space, 3) maintains background integrity through reference trajectory.

Result: Extensive evaluations on MIRA-Bench and RefEdit-Bench demonstrate that MIRAGE significantly outperforms existing methods in achieving precise instance-level modifications while preserving background consistency.

Conclusion: MIRAGE addresses key limitations in multi-instance image editing through a novel training-free framework that enables precise, localized editing while maintaining background integrity, with benchmark and code publicly available.

Abstract: Instruction-guided image editing has seen remarkable progress with models like FLUX.2 and Qwen-Image-Edit, yet they still struggle with complex scenarios with multiple similar instances each requiring individual edits. We observe that state-of-the-art models suffer from severe over-editing and spatial misalignment when faced with multiple identical instances and composite instructions. To this end, we introduce a comprehensive benchmark specifically designed to evaluate fine-grained consistency in multi-instance and multi-instruction settings. To address the failures of existing methods observed in our benchmark, we propose Multi-Instance Regional Alignment via Guided Editing (MIRAGE), a training-free framework that enables precise, localized editing. By leveraging a vision-language model to parse complex instructions into regional subsets, MIRAGE employs a multi-branch parallel denoising strategy. This approach injects latent representations of target regions into the global representation space while maintaining background integrity through a reference trajectory. Extensive evaluations on MIRA-Bench and RefEdit-Bench demonstrate that our framework significantly outperforms existing methods in achieving precise instance-level modifications while preserving background consistency. Our benchmark and code are available at https://github.com/ZiqianLiu666/MIRAGE.
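The region-wise injection with a background reference trajectory can be pictured as a masked blend at each step. A toy sketch (the arrays stand in for latents; the real method performs this inside a diffusion model's multi-branch denoising loop):

```python
import numpy as np

def inject(reference, branch_latents, masks):
    """Each regional branch's latent overwrites the global latent only
    inside its mask; the background keeps the reference trajectory."""
    out = reference.copy()
    for lat, m in zip(branch_latents, masks):
        out = np.where(m, lat, out)
    return out

ref = np.zeros((4, 4))                        # unedited reference "latent"
branch = [np.ones((4, 4)), np.full((4, 4), 2.0)]
m1 = np.zeros((4, 4), bool); m1[:2, :2] = True   # region for edit 1
m2 = np.zeros((4, 4), bool); m2[2:, 2:] = True   # region for edit 2
fused = inject(ref, branch, [m1, m2])
```

The masks would come from the vision-language parsing step that maps composite instructions to regional subsets.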

[180] LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

Zhengqin Li, Cheng Zhang, Jakob Engel, Zhao Dong

Main category: cs.CV

TL;DR: LSRM scales transformer context windows for high-fidelity 3D reconstruction and inverse rendering using sparse attention, achieving SOTA results matching dense-view optimization methods.

Motivation: Current object-centric feed-forward 3D reconstruction methods lag behind dense-view optimization in recovering fine-grained texture and appearance. The paper investigates how scaling transformer context windows can bridge this gap.

Method: Uses native sparse attention with three key innovations: (1) coarse-to-fine pipeline predicting sparse high-resolution residuals, (2) 3D-aware spatial routing using geometric distances for 2D-3D correspondences, and (3) block-aware sequence parallelism with All-gather-KV protocol for GPU workload balancing.

Result: Handles 20x more object tokens and >2x more image tokens than prior SOTA, achieving 2.5 dB higher PSNR and 40% lower LPIPS on novel-view synthesis benchmarks. For inverse rendering, matches or exceeds SOTA dense-view optimization LPIPS scores.

Conclusion: Scaling transformer context windows with sparse attention enables high-fidelity 3D reconstruction and inverse rendering that rivals dense-view optimization methods, demonstrating the importance of context window scaling for visual understanding tasks.

Abstract: We introduce the Large Sparse Reconstruction Model to study how scaling transformer context windows impacts feed-forward 3D reconstruction. Although recent object-centric feed-forward methods deliver robust, high-quality reconstruction, they still lag behind dense-view optimization in recovering fine-grained texture and appearance. We show that expanding the context window – by substantially increasing the number of active object and image tokens – remarkably narrows this gap and enables high-fidelity 3D object reconstruction and inverse rendering. To scale effectively, we adapt native sparse attention in our architecture design, unlocking its capacity for 3D reconstruction with three key contributions: (1) an efficient coarse-to-fine pipeline that focuses computation on informative regions by predicting sparse high-resolution residuals; (2) a 3D-aware spatial routing mechanism that establishes accurate 2D-3D correspondences using explicit geometric distances rather than standard attention scores; and (3) a custom block-aware sequence parallelism strategy utilizing an All-gather-KV protocol to balance dynamic, sparse workloads across GPUs. As a result, LSRM handles 20x more object tokens and >2x more image tokens than prior state-of-the-art (SOTA) methods. Extensive evaluations on standard novel-view synthesis benchmarks show substantial gains over the current SOTA, yielding 2.5 dB higher PSNR and 40% lower LPIPS. Furthermore, when extending LSRM to inverse rendering tasks, qualitative and quantitative evaluations on widely-used benchmarks demonstrate consistent improvements in texture and geometry details, achieving an LPIPS that matches or exceeds that of SOTA dense-view optimization methods. Code and model will be released on our project page.
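The 3D-aware spatial routing replaces learned attention scores with explicit geometric distances. A minimal sketch of that selection rule (shapes and the top-k policy are illustrative assumptions, not LSRM's block layout):

```python
import numpy as np

def route(token_pos, block_centers, k):
    """Route a token to the k geometrically nearest blocks,
    instead of scoring blocks with standard attention."""
    d = np.linalg.norm(block_centers - token_pos, axis=1)
    return np.argsort(d)[:k]

# toy 3D block centers and one image token's back-projected position
centers = np.array([[0.0, 0, 0], [1.0, 0, 0], [5.0, 0, 0], [2.0, 0, 0]])
sel = route(np.array([0.9, 0.0, 0.0]), centers, k=2)
```

Routing by distance keeps the 2D-3D correspondences sparse, which is what makes the enlarged context window tractable.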

[181] OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models

Ali Aliev, Kamil Garifullin, Nikolay Yudin, Vera Soboleva, Alexander Molozhavenko, Ivan Oseledets, Aibek Alanov, Maxim Rakhuba

Main category: cs.CV

TL;DR: Training-free method for merging orthogonal adapters in generative models using geometric properties of structured orthogonal matrices and spectra restoration.

Motivation: There is strong practical interest in parameter-efficient fine-tuning and adapter merging, but combining subject and style adapters for generative models remains unresolved. The paper asks how to merge several adapters trained for different tasks into a single adapter that performs adequately on all of them, without additional training.

Method: Uses orthogonal fine-tuning (OFT) with structured orthogonal parametrization. Derives the manifold structure of Group-and-Shuffle (GS) orthogonal matrices and obtains efficient formulas for geodesics approximation between two points. Proposes a spectra restoration transform to restore spectral properties of merged adapters for higher-quality fusion.

Result: The technique successfully merges GS orthogonal matrices to unite concept and style features of different adapters in subject-driven generation tasks. This is the first training-free method for merging multiplicative orthogonal adapters.

Conclusion: The paper presents a novel training-free approach for merging orthogonal adapters using geometric properties of structured orthogonal matrices, enabling effective combination of subject and style features in generative models without additional training.

Abstract: In a rapidly growing field of model training there is a constant practical interest in parameter-efficient fine-tuning and various techniques that use a small amount of training data to adapt the model to a narrow task. However, there is an open question: how to combine several adapters tuned for different tasks into one which is able to yield adequate results on both tasks? Specifically, merging subject and style adapters for generative models remains unresolved. In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and its geometric properties to get the formulas for training-free adapter merging. In particular, we derive the structure of the manifold formed by the recently proposed Group-and-Shuffle ($\mathcal{GS}$) orthogonal matrices, and obtain efficient formulas for the geodesics approximation between two points. Additionally, we propose a $\text{spectra restoration}$ transform that restores spectral properties of the merged adapter for higher-quality fusion. We conduct experiments in subject-driven generation tasks showing that our technique to merge two $\mathcal{GS}$ orthogonal matrices is capable of uniting concept and style features of different adapters. To the best of our knowledge, this is the first training-free method for merging multiplicative orthogonal adapters. Code is available via the $\href{https://github.com/ControlGenAI/OrthoFuse}{link}$.
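The geodesic-based merge has a familiar closed form on the orthogonal group: Q(t) = Q1 expm(t · logm(Q1ᵀQ2)). A sketch specialized to 2×2 rotations, where the matrix exponential and logarithm are available in closed form (the paper derives the analogous, more efficient formulas on the structured GS manifold, which are not reproduced here):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def log_rot(R):
    # matrix log of a 2x2 rotation: angle times the skew generator
    return np.arctan2(R[1, 0], R[0, 0]) * np.array([[0.0, -1.0], [1.0, 0.0]])

def exp_skew(A):
    # matrix exp of a 2x2 skew matrix [[0,-a],[a,0]] is rot(a)
    return rot(A[1, 0])

def geodesic(Q1, Q2, t):
    # Q(t) = Q1 expm(t * logm(Q1^T Q2)), the geodesic between two
    # orthogonal matrices; t=0.5 gives a training-free "merge"
    return Q1 @ exp_skew(t * log_rot(Q1.T @ Q2))

mid = geodesic(rot(0.2), rot(1.0), 0.5)   # midpoint is rot(0.6), still orthogonal
```

Interpolating along the manifold (rather than averaging entries) guarantees the merged adapter remains orthogonal, which the spectra-restoration transform then refines.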

[182] Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

Muhammad Adil, Mehmood Ahmed, Muhammad Aqib, Vicente A. Gonzalez, Gaang Lee, Qipei Mei

Main category: cs.CV

TL;DR: A detection-guided small vision-language model framework improves construction hazard identification by integrating object detection with multimodal reasoning, achieving better accuracy and explanation quality with minimal computational overhead.

Motivation: Large VLMs have strong contextual reasoning but high computational costs, while small VLMs are efficient but suffer from accuracy issues and hallucinations in complex construction scenes. There's a need for an efficient yet accurate solution for real-time construction hazard detection.

Method: Proposes a detection-guided sVLM framework that first uses YOLOv11n detector to localize workers and machinery, then embeds detected entities into structured prompts to guide sVLM reasoning for spatially grounded hazard assessment. Evaluated six sVLMs in zero-shot settings on construction site images.

Result: The approach consistently improved hazard detection across all models. The best model (Gemma-3 4B) achieved an F1-score of 50.6% vs. a 34.5% baseline, and explanation quality improved significantly (BERTScore F1 from 0.61 to 0.82), with minimal overhead of only 2.5 ms per image during inference.

Conclusion: Integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection, balancing accuracy and computational efficiency.

Abstract: Accurate and timely identification of construction hazards around workers is essential for preventing workplace accidents. While large vision-language models (VLMs) demonstrate strong contextual reasoning capabilities, their high computational requirements limit their applicability in near real-time construction hazard detection. In contrast, small vision-language models (sVLMs) with fewer than 4 billion parameters offer improved efficiency but often suffer from reduced accuracy and hallucination when analyzing complex construction scenes. To address this trade-off, this study proposes a detection-guided sVLM framework that integrates object detection with multimodal reasoning for contextual hazard identification. The framework first employs a YOLOv11n detector to localize workers and construction machinery within the scene. The detected entities are then embedded into structured prompts to guide the reasoning process of sVLMs, enabling spatially grounded hazard assessment. Within this framework, six sVLMs (Gemma-3 4B, Qwen-3-VL 2B/4B, InternVL-3 1B/2B, and SmolVLM-2B) were evaluated in zero-shot settings on a curated dataset of construction site images with hazard annotations and explanatory rationales. The proposed approach consistently improved hazard detection performance across all models. The best-performing model, Gemma-3 4B, achieved an F1-score of 50.6%, compared to 34.5% in the baseline configuration. Explanation quality also improved significantly, with BERTScore F1 increasing from 0.61 to 0.82. Despite incorporating object detection, the framework introduces minimal overhead, adding only 2.5 ms per image during inference. These results demonstrate that integrating lightweight object detection with small VLM reasoning provides an effective and efficient solution for context-aware construction safety hazard detection.
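The detection-guided prompting step amounts to serializing detector outputs into the sVLM's input. A sketch with an illustrative prompt template (the wording and data format are assumptions, not the paper's exact template, and the YOLOv11n call is replaced by hard-coded boxes):

```python
def build_prompt(detections):
    """Embed detected entities and their boxes into a structured prompt
    that grounds the sVLM's hazard reasoning spatially."""
    lines = [f"- {d['label']} at bbox {d['bbox']}" for d in detections]
    return (
        "Detected entities:\n" + "\n".join(lines) +
        "\nGiven these locations, identify safety hazards involving "
        "worker-machinery proximity."
    )

# stand-in for detector output on one construction-site image
dets = [
    {"label": "worker", "bbox": (120, 80, 180, 240)},
    {"label": "excavator", "bbox": (150, 60, 400, 300)},
]
prompt = build_prompt(dets)
```

Grounding the prompt in explicit coordinates is what the paper credits with reducing hallucination in the small models.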

[183] Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

Daniel DeTone, Tianwei Shen, Fan Zhang, Lingni Ma, Julian Straub, Richard Newcombe, Jakob Engel

Main category: cs.CV

TL;DR: Boxer is a 3D object localization system that lifts 2D open-vocabulary object detections to 3D bounding boxes using posed images and optional depth data.

Motivation: 3D object localization for open-world categories remains challenging and underexplored compared to 2D detection. Existing methods require costly 3D bounding box annotations, creating a need for approaches that can leverage more readily available 2D detection data.

Method: BoxerNet, a transformer-based network that lifts 2D bounding box proposals into 3D, followed by multi-view fusion and geometric filtering. It leverages existing 2D detection algorithms (DETIC, OWLv2, SAM3) for 2D localization, incorporates aleatoric uncertainty for robust regression, uses median depth patch encoding for sparse depth inputs, and is trained on over 1.2 million unique 3DBBs.

Result: BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting: 0.532 vs. 0.010 mAP in egocentric settings without dense depth, and 0.412 vs. 0.250 mAP on CA-1M with dense depth available.

Conclusion: Boxer effectively addresses 3D object localization for open-world categories by leveraging 2D detection capabilities, reducing the need for expensive 3D annotations while achieving superior performance over existing methods.

Abstract: Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).
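BoxerNet itself is a learned transformer, but the geometric core of any 2D-to-3D lift is pinhole back-projection. A minimal sketch: unproject a 2DBB center at a (median) patch depth into camera coordinates; the intrinsics here are made up for illustration:

```python
import numpy as np

def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth through a pinhole
    camera with focal lengths (fx, fy) and principal point (cx, cy)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.array([x, y, depth])

# a 2D box center at the principal point, observed 2 m away
center3d = unproject(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

The paper's median depth patch encoding supplies the `depth` value when only sparse points are available; multi-view fusion then reconciles the per-view lifts.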

[184] Hierarchical Mesh Transformers with Topology-Guided Pretraining for Morphometric Analysis of Brain Structures

Yujian Xiong, Mohammad Farazi, Yanxi Chen, Wenhui Zhu, Xuanzhao Dong, Natasha Lepore, Yi Su, Raza Mushtaq, Stephen Foldes, Andrew Yang, Yalin Wang

Main category: cs.CV

TL;DR: Hierarchical transformer for heterogeneous neuroimaging mesh analysis that handles both volumetric and surface meshes with diverse vertex-level features, achieving SOTA on Alzheimer’s and cortical dysplasia tasks.

Motivation: Current neuroimaging mesh analysis methods either ignore clinically informative vertex-level features (cortical thickness, curvature, etc.) or only support single mesh topologies, limiting cross-pipeline applicability.

Method: Hierarchical transformer framework using spatially adaptive tree partitions from simplicial complexes of arbitrary order, with feature projection module to map variable-length clinical descriptors into spatial hierarchy, and self-supervised pretraining via masked reconstruction.

Result: Achieved state-of-the-art results on Alzheimer’s disease classification and amyloid burden prediction using ADNI volumetric brain meshes, and focal cortical dysplasia detection on MELD cortical surface meshes.

Conclusion: The framework successfully handles heterogeneous mesh analysis across different neuroimaging modalities while incorporating diverse clinical features, demonstrating strong transferability across tasks and mesh types.

Abstract: Representation learning on large-scale unstructured volumetric and surface meshes poses significant challenges in neuroimaging, especially when models must incorporate diverse vertex-level morphometric descriptors, such as cortical thickness, curvature, sulcal depth, and myelin content, which carry subtle disease-related signals. Current approaches either ignore these clinically informative features or support only a single mesh topology, restricting their use across imaging pipelines. We introduce a hierarchical transformer framework designed for heterogeneous mesh analysis that operates on spatially adaptive tree partitions constructed from simplicial complexes of arbitrary order. This design accommodates both volumetric and surface discretizations within a single architecture, enabling efficient multi-scale attention without topology-specific modifications. A feature projection module maps variable-length per-vertex clinical descriptors into the spatial hierarchy, separating geometric structure from feature dimensionality and allowing seamless integration of different neuroimaging feature sets. Self-supervised pretraining via masked reconstruction of both coordinates and morphometric channels on large unlabeled cohorts yields a transferable encoder backbone applicable to diverse downstream tasks and mesh modalities. We validate our approach on Alzheimer’s disease classification and amyloid burden prediction using volumetric brain meshes from ADNI, as well as focal cortical dysplasia detection on cortical surface meshes from the MELD dataset, achieving state-of-the-art results across all benchmarks.
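The masked-reconstruction pretraining signal is simple to state even though the hierarchical encoder is not. A toy sketch (the identity "reconstruction" is a placeholder for a real model): mask a subset of vertices, zero their tokens, and compute the loss only on the masked coordinates and morphometric channels:

```python
import numpy as np

rng = np.random.default_rng(0)
verts = rng.normal(size=(10, 7))       # xyz + 4 morphometric descriptors
mask = np.zeros(10, dtype=bool)
mask[[1, 4, 7]] = True                 # mask ~30% of vertices

inputs = verts.copy()
inputs[mask] = 0.0                     # what the encoder would actually see

recon = inputs                         # stand-in for a model's reconstruction
# reconstruction loss is taken only over the masked vertices
loss = float(((recon[mask] - verts[mask]) ** 2).mean())
```

Because both coordinates and morphometric channels are masked, the pretrained encoder must learn geometry and clinical features jointly, which is what makes it transferable across tasks.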

[185] Active Measurement of Two-Point Correlations

Max Hamilton, Daniel Sheldon, Subhransu Maji

Main category: cs.CV

TL;DR: A human-in-the-loop framework for efficiently estimating two-point correlation functions of target sources using adaptive sampling guided by a pre-trained classifier to reduce annotation effort while maintaining statistical rigor.

Motivation: Measuring two-point correlation functions (2PCF) for specific subsets of points (like star clusters in astronomy) requires extensive manual labeling which is time-consuming. Current approaches need careful catalog construction, creating a bottleneck in large-scale astronomical analysis.

Method: Proposes a human-in-the-loop framework with: 1) pre-trained classifier to guide adaptive sampling of most informative points, 2) novel unbiased estimator for pair counts across multiple distance bins, 3) sampling strategy that reduces variance compared to Monte Carlo, and 4) confidence interval construction for statistical grounding.

Result: The method achieves substantially lower variance than simple Monte Carlo approaches while significantly reducing annotation effort. It enables scalable and statistically grounded measurement of two-point correlations in astronomy datasets.

Conclusion: The framework provides an efficient, statistically rigorous approach to measuring two-point correlation functions for target subsets in large datasets, particularly valuable for astronomy where manual annotation is a bottleneck.

Abstract: Two-point correlation functions (2PCF) are widely used to characterize how points cluster in space. In this work, we study the problem of measuring the 2PCF over a large set of points, restricted to a subset satisfying a property of interest. An example comes from astronomy, where scientists measure the 2PCF of star clusters, which make up only a tiny subset of possible sources within a galaxy. This task typically requires careful labeling of sources to construct catalogs, which is time-consuming. We present a human-in-the-loop framework for efficient estimation of 2PCF of target sources. By leveraging a pre-trained classifier to guide sampling, our approach adaptively selects the most informative points for human annotation. After each annotation, it produces unbiased estimates of pair counts across multiple distance bins simultaneously. Compared to simple Monte Carlo approaches, our method achieves substantially lower variance while significantly reducing annotation effort. We introduce a novel unbiased estimator, sampling strategy, and confidence interval construction that together enable scalable and statistically grounded measurement of two-point correlations in astronomy datasets.
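The quantity being estimated is the histogram of pairwise distances over the target subset. A sketch of exact binned pair counting on a tiny, fully labeled point set (the paper's contribution is estimating these counts without labeling everything; that adaptive sampling machinery is not reproduced here):

```python
import numpy as np

def pair_counts(points, bins):
    """Count unique point pairs falling into each distance bin."""
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(points), k=1)      # unique pairs only
    counts, _ = np.histogram(d[iu], bins=bins)
    return counts

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
counts = pair_counts(pts, bins=[0.0, 1.2, 2.0])  # two short pairs, one long
```

The human-in-the-loop estimator produces unbiased estimates of exactly these per-bin counts, restricted to the labeled-as-target subset.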

[186] Protecting and Preserving Protest Dynamics for Responsible Analysis

Cohen Archbold, Usman Hassan, Nazmus Sakib, Sen-ching Cheung, Abdullah-Al-Zubaer Imran

Main category: cs.CV

TL;DR: A responsible computing framework for protest analysis that uses synthetic imagery to protect privacy while enabling collective pattern analysis

Motivation: Protest-related social media data is high-risk due to surveillance and privacy concerns, and current AI systems can identify individuals and enable cross-platform tracking, posing dangers to protesters. Existing approaches lack integrated privacy risk assessment, analysis, and fairness considerations.

Method: Proposes a responsible computing framework that replaces sensitive protest imagery with well-labeled synthetic reproductions using conditional image synthesis, enabling analysis of collective patterns without exposing identifiable individuals.

Result: The approach produces realistic and diverse synthetic imagery while balancing downstream analytical utility with reductions in privacy risk, and assesses demographic fairness in the generated data.

Conclusion: The method adopts a pragmatic, harm-mitigating approach that enables socially sensitive analysis while acknowledging residual risks, rather than offering absolute privacy guarantees.

Abstract: Protest-related social media data are valuable for understanding collective action but inherently high-risk due to concerns surrounding surveillance, repression, and individual privacy. Contemporary AI systems can identify individuals, infer sensitive attributes, and cross-reference visual information across platforms, enabling surveillance that poses risks to protesters and bystanders. In such contexts, large foundation models trained on protest imagery risk memorizing and disclosing sensitive information, leading to cross-platform identity leakage and retroactive participant identification. Existing approaches to automated protest analysis do not provide a holistic pipeline that integrates privacy risk assessment, downstream analysis, and fairness considerations. To address this gap, we propose a responsible computing framework for analyzing collective protest dynamics while reducing risks to individual privacy. Our framework replaces sensitive protest imagery with well-labeled synthetic reproductions using conditional image synthesis, enabling analysis of collective patterns without direct exposure of identifiable individuals. We demonstrate that our approach produces realistic and diverse synthetic imagery while balancing downstream analytical utility with reductions in privacy risk. We further assess demographic fairness in the generated data, examining whether synthetic representations disproportionately affect specific subgroups. Rather than offering absolute privacy guarantees, our method adopts a pragmatic, harm-mitigating approach that enables socially sensitive analysis while acknowledging residual risks.

[187] Coverage Optimization for Camera View Selection

Timothy Chen, Adam Dai, Maximilian Adang, Grace Gao, Mac Schwager

Main category: cs.CV

TL;DR: COVER is a lightweight coverage-based view selection metric for active 3D reconstruction that selects informative camera poses by minimizing a tractable approximation of the Fisher Information Gain, which favors views covering insufficiently observed geometry.

Motivation: The quality of data used for 3D reconstruction is crucial for efficient and accurate scene modeling. Current active view selection methods often require expensive computations and are sensitive to noise and training dynamics.

Method: Develops a principled analysis yielding a simple interpretable criterion based on minimizing a tractable approximation of Fisher Information Gain. This reduces to favoring viewpoints that cover geometry insufficiently observed by past cameras, avoiding expensive transmittance estimation.

Result: COVER consistently improves reconstruction quality compared to state-of-the-art active view selection methods across multiple datasets and radiance-field baselines in both fixed and embodied data acquisition scenarios.

Conclusion: The proposed coverage-based view selection metric (COVER) is lightweight, robust to noise and training dynamics, and effectively improves 3D reconstruction quality by selecting informative camera poses.

Abstract: What makes a good viewpoint? The quality of the data used to learn 3D reconstructions is crucial for enabling efficient and accurate scene modeling. We study the active view selection problem and develop a principled analysis that yields a simple and interpretable criterion for selecting informative camera poses. Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. This leads to a lightweight coverage-based view selection metric that avoids expensive transmittance estimation and is robust to noise and training dynamics. We call this metric COVER (Camera Optimization for View Exploration and Reconstruction). We integrate our method into the Nerfstudio framework and evaluate it on real datasets within fixed and embodied data acquisition scenarios. Across multiple datasets and radiance-field baselines, our method consistently improves reconstruction quality compared to state-of-the-art active view selection methods. Additional visualizations and our Nerfstudio package can be found at https://chengine.github.io/nbv_gym/.
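The coverage criterion reduces to a greedy "most newly covered geometry" rule. A toy sketch with a boolean visibility table standing in for the real visibility model (an assumption; COVER's actual score is derived from the Fisher Information Gain approximation rather than hard binary coverage):

```python
import numpy as np

def select_view(visibility, covered):
    """Pick the candidate view that sees the most scene points
    not yet covered by past cameras.

    visibility: (num_views, num_points) bool; covered: (num_points,) bool
    """
    gains = (visibility & ~covered).sum(axis=1)
    return int(gains.argmax())

vis = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=bool)
covered = np.array([1, 1, 0, 0], dtype=bool)  # points seen so far
best = select_view(vis, covered)              # view 1 adds two new points
```

Because the rule needs only visibility bookkeeping, it avoids the per-candidate transmittance estimation that makes other information-based selectors expensive.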

[188] Region-R1: Reinforcing Query-Side Region Cropping for Multi-Modal Re-Ranking

Chan-Wei Hu, Zhengzhong Tu

Main category: cs.CV

TL;DR: Region-R1: A query-side region cropping framework for MM-RAG that learns to dynamically crop question-relevant regions from query images to improve re-ranking by reducing visual distractors.

Motivation: Standard multimodal RAG re-rankers process full query images as global embeddings, making them vulnerable to visual distractors like background clutter that skew similarity scores and reduce retrieval accuracy.

Method: Proposes Region-R1 framework that formulates region selection as a decision-making problem during re-ranking. Uses region-aware group relative policy optimization (r-GRPO) to learn a policy for dynamically cropping discriminative question-relevant regions from query images before scoring retrieved candidates.

Result: Achieves state-of-the-art performance on E-VQA and InfoSeek benchmarks, increasing conditional Recall@1 by up to 20%. Shows consistent gains across challenging multimodal retrieval tasks.

Conclusion: Query-side adaptation through region cropping is a simple but effective way to strengthen MM-RAG re-ranking by focusing on relevant visual regions and reducing noise from visual distractors.

Abstract: Multi-modal retrieval-augmented generation (MM-RAG) relies heavily on re-rankers to surface the most relevant evidence for image-question queries. However, standard re-rankers typically process the full query image as a global embedding, making them susceptible to visual distractors (e.g., background clutter) that skew similarity scores. We propose Region-R1, a query-side region cropping framework that formulates region selection as a decision-making problem during re-ranking, allowing the system to learn to retain the full image or focus only on a question-relevant region before scoring the retrieved candidates. Region-R1 learns a policy with a novel region-aware group relative policy optimization (r-GRPO) to dynamically crop a discriminative region. Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performances by increasing conditional Recall@1 by up to 20%. These results show the great promise of query-side adaptation as a simple but effective way to strengthen MM-RAG re-ranking.
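The summary does not detail r-GRPO itself, but the group-relative advantage at the core of GRPO-family methods is simple to sketch: each sampled action (here, a candidate crop) is scored against the mean and standard deviation of its own group, with no learned value baseline. Function name and reward values below are hypothetical:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sample's reward against
    the mean/std of its own group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group of 4 candidate crops, scored by some re-ranking reward.
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
# Advantages are zero-mean within the group: above-average crops are
# reinforced, below-average ones are discouraged.
```

The region-aware part of r-GRPO (how crops are proposed and how the re-ranking reward is computed) is specific to the paper and not reproduced here.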

[189] Toward Unified Fine-Grained Vehicle Classification and Automatic License Plate Recognition

Gabriel E. Lima, Valfride Nascimento, Eduardo Santos, Eduil Nascimento, Rayson Laroca, David Menotti

Main category: cs.CV

TL;DR: UFPR-VeSV dataset for fine-grained vehicle classification with 13 colors, 26 makes, 136 models, 14 types from real-world surveillance images, integrated with license plate recognition.

Motivation: Existing vehicle information extraction systems often assume controlled conditions, explore limited attributes, and overlook integration of fine-grained vehicle classification with automatic license plate recognition in real-world surveillance scenarios.

Method: Created UFPR-VeSV dataset with 24,945 images of 16,297 unique vehicles from Brazilian police surveillance, annotated for 13 colors, 26 makes, 136 models, 14 types, validated using license plate info, benchmarked with 5 deep learning models and 2 OCR models.

Result: Dataset confirmed as challenging with real-world conditions (partial occlusions, nighttime infrared, varying lighting), benchmark revealed specific challenges with multicolored vehicles, infrared images, and platform-sharing models; integration of FGVC and ALPR showed potential.

Conclusion: UFPR-VeSV dataset addresses gaps in real-world vehicle surveillance, demonstrates value of integrating fine-grained vehicle classification with license plate recognition for intelligent transportation systems.

Abstract: Extracting vehicle information from surveillance images is essential for intelligent transportation systems, enabling applications such as traffic monitoring and criminal investigations. While Automatic License Plate Recognition (ALPR) is widely used, Fine-Grained Vehicle Classification (FGVC) offers a complementary approach by identifying vehicles based on attributes such as color, make, model, and type. Although there have been advances in this field, existing studies often assume well-controlled conditions, explore limited attributes, and overlook FGVC integration with ALPR. To address these gaps, we introduce UFPR-VeSV, a dataset comprising 24,945 images of 16,297 unique vehicles with annotations for 13 colors, 26 makes, 136 models, and 14 types. Collected from the Military Police of Paraná (Brazil) surveillance system, the dataset captures diverse real-world conditions, including partial occlusions, nighttime infrared imaging, and varying lighting. All FGVC annotations were validated using license plate information, with text and corner annotations also being provided. A qualitative and quantitative comparison with established datasets confirmed the challenging nature of our dataset. A benchmark using five deep learning models further validated this, revealing specific challenges such as handling multicolored vehicles, infrared images, and distinguishing between vehicle models that share a common platform. Additionally, we apply two optical character recognition models to license plate recognition and explore the joint use of FGVC and ALPR. The results highlight the potential of integrating these complementary tasks for real-world applications. The UFPR-VeSV dataset is publicly available at: https://github.com/Lima001/UFPR-VeSV-Dataset.

[190] From Measurement to Mitigation: Quantifying and Reducing Identity Leakage in Image Representation Encoders with Linear Subspace Removal

Daniel George, Charles Yeh, Daniel Lee, Yifei Zhang

Main category: cs.CV

TL;DR: Paper proposes identity sanitization projection (ISP) to remove identity leakage from frozen visual embeddings while preserving utility for visual search/retrieval, with comprehensive privacy benchmarking.

Motivation: Visual embeddings like CLIP, DINOv2/v3, and SSCD are widely used for retrieval but have unmeasured identity leakage when applied to face data, lacking deployable privacy mitigations.

Method: Proposes ISP - a one-shot linear projector that removes estimated identity subspace while preserving complementary utility space; includes comprehensive benchmarking with open-set verification, diffusion-based template inversion, and face-context attribution.

Result: CLIP shows higher identity leakage than DINOv2/v3 and SSCD; ISP achieves near-chance identity recognition while retaining high non-biometric utility, with good cross-dataset transfer.

Conclusion: First attacker-calibrated facial privacy audit of non-face recognition encoders shows linear subspace removal provides strong privacy guarantees while preserving visual search utility.

Abstract: Frozen visual embeddings (e.g., CLIP, DINOv2/v3, SSCD) power retrieval and integrity systems, yet their use on face-containing data is constrained by unmeasured identity leakage and a lack of deployable mitigations. We take an attacker-aware view and contribute: (i) a benchmark of visual embeddings that reports open-set verification at low false-accept rates, a calibrated diffusion-based template inversion check, and face-context attribution with equal-area perturbations; and (ii) propose a one-shot linear projector that removes an estimated identity subspace while preserving the complementary space needed for utility, which for brevity we denote as the identity sanitization projection ISP. Across CelebA-20 and VGGFace2, we show that these encoders are robust under open-set linear probes, with CLIP exhibiting relatively higher leakage than DINOv2/v3 and SSCD, robust to template inversion, and are context-dominant. In addition, we show that ISP drives linear access to near-chance while retaining high non-biometric utility, and transfers across datasets with minor degradation. Our results establish the first attacker-calibrated facial privacy audit of non-FR encoders and demonstrate that linear subspace removal achieves strong privacy guarantees while preserving utility for visual search and retrieval.
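The ISP described above is a one-shot linear projector that removes an estimated identity subspace. The standard construction for such a projector is P = I − U Uᵀ, with U an orthonormal basis of the subspace to remove; a minimal sketch (the function name and the random "identity directions" are assumptions, and how the paper actually estimates the subspace is not shown):

```python
import numpy as np

def identity_sanitizing_projector(identity_dirs):
    """Build P = I - U U^T, which zeroes out the span of the estimated
    identity directions and leaves the orthogonal (utility) subspace
    untouched."""
    # Orthonormalize the estimated identity directions (columns of U).
    U, _ = np.linalg.qr(np.asarray(identity_dirs, dtype=float).T)
    d = U.shape[0]
    return np.eye(d) - U @ U.T

rng = np.random.default_rng(0)
dirs = rng.normal(size=(3, 16))   # 3 estimated identity directions in R^16
P = identity_sanitizing_projector(dirs)
y = P @ rng.normal(size=16)       # sanitized embedding
# y now has zero component along every estimated identity direction,
# and P is idempotent (P @ P == P), as a projector should be.
```

Because the projector is linear and fixed, it can be folded into the embedding pipeline with negligible cost, which matches the "deployable mitigation" framing above.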

[191] SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration

Xueming Fu, Lixia Han

Main category: cs.CV

TL;DR: SmokeGS-R is a pipeline for 3D reconstruction in smoky scenes that decouples geometry recovery from appearance correction using physics-guided pseudo-clean supervision and 3D Gaussian Splatting with post-render harmonization.

Motivation: Real-world smoke causes three main problems for 3D reconstruction: attenuation of scene radiance, addition of airlight, and destabilization of multi-view appearance consistency, making robust reconstruction particularly challenging.

Method: The method decouples geometry recovery from appearance correction by: 1) generating physics-guided pseudo-clean supervision using refined dark channel prior and guided filtering, 2) training a sharp clean-only 3D Gaussian Splatting source model, and 3) harmonizing renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing.

Result: On the official challenge testing leaderboard: PSNR=15.217, SSIM=0.666. After public release of RealX3D, re-evaluation on seven released scenes: PSNR=15.209, SSIM=0.644, LPIPS=0.551, outperforming the strongest official baseline by +3.68 dB PSNR.

Conclusion: A geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective approach for real-world multi-view smoke restoration.

Abstract: Real-world smoke simultaneously attenuates scene radiance, adds airlight, and destabilizes multi-view appearance consistency, making robust 3D reconstruction particularly difficult. We present SmokeGS-R, a practical pipeline developed for the NTIRE 2026 3D Restoration and Reconstruction Track 2 challenge. The key idea is to decouple geometry recovery from appearance correction: we generate physics-guided pseudo-clean supervision with a refined dark channel prior and guided filtering, train a sharp clean-only 3D Gaussian Splatting source model, and then harmonize its renderings with a donor ensemble using geometric-mean reference aggregation, LAB-space Reinhard transfer, and light Gaussian smoothing. On the official challenge testing leaderboard, the final submission achieved PSNR = 15.217 and SSIM = 0.666. After the public release of RealX3D, we re-evaluated the same frozen result on the seven released challenge scenes without retraining and obtained PSNR = 15.209, SSIM = 0.644, and LPIPS = 0.551, outperforming the strongest official baseline average on the same scenes by +3.68 dB PSNR. These results suggest that a geometry-first reconstruction strategy combined with stable post-render appearance harmonization is an effective recipe for real-world multi-view smoke restoration. The code is available at https://github.com/windrise/3drr_Track2_SmokeGS-R.
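The LAB-space Reinhard transfer used in the harmonization step is the classic per-channel statistics match: shift and scale each source channel so its mean and standard deviation equal the reference's. A minimal sketch, assuming inputs already converted to LAB (the color-space conversion itself is omitted):

```python
import numpy as np

def reinhard_transfer(src_lab, ref_lab):
    """Per-channel Reinhard statistics transfer.
    src_lab, ref_lab: float arrays of shape (H, W, 3), already in LAB."""
    out = np.empty_like(src_lab, dtype=float)
    for c in range(3):
        s, r = src_lab[..., c], ref_lab[..., c]
        # Normalize the source channel, then re-scale to the reference stats.
        out[..., c] = (s - s.mean()) / (s.std() + 1e-8) * r.std() + r.mean()
    return out

rng = np.random.default_rng(1)
src = rng.normal(50, 5, size=(8, 8, 3))
ref = rng.normal(70, 12, size=(8, 8, 3))
out = reinhard_transfer(src, ref)
# out now matches ref's per-channel mean and (up to the eps) std.
```

Working in LAB rather than RGB keeps luminance and chroma adjustments decoupled, which is why this transfer is a common choice for appearance harmonization.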

[192] Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting

Monica Tang, Avideh Zakhor

Main category: cs.CV

TL;DR: 3D object detection and segmentation in Gaussian Splatting scenes using multi-view mask association with semantic and spatial constraints

Motivation: To enable accurate object-level detection and segmentation of indoor assets in 3D Gaussian Splatting scenes reconstructed from drone imagery, addressing challenges in multi-view mask consistency.

Method: Introduces a 3D object codebook that leverages mask semantics and spatial information of Gaussian primitives to guide multi-view mask association. Integrates 2D object detection/segmentation models with semantically and spatially constrained merging procedures to aggregate masks into coherent 3D object instances

Result: Experiments on two large indoor scenes show reliable multi-view mask consistency (65% F1 score improvement over SOTA) and accurate object-level 3D indoor asset detection (11% mAP gain over baselines)

Conclusion: The approach effectively addresses multi-view mask association challenges in 3DGS scenes and enables accurate 3D object detection and segmentation for indoor assets

Abstract: We present an approach for object-level detection and segmentation of target indoor assets in 3D Gaussian Splatting (3DGS) scenes, reconstructed from 360° drone-captured imagery. We introduce a 3D object codebook that jointly leverages mask semantics and spatial information of their corresponding Gaussian primitives to guide multi-view mask association and indoor asset detection. By integrating 2D object detection and segmentation models with semantically and spatially constrained merging procedures, our method aggregates masks from multiple views into coherent 3D object instances. Experiments on two large indoor scenes demonstrate reliable multi-view mask consistency, improving F1 score by 65% over state-of-the-art baselines, and accurate object-level 3D indoor asset detection, achieving an 11% mAP gain over baseline methods.
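The semantically and spatially constrained merging the abstract mentions can be caricatured as: a per-view mask joins an existing 3D instance only if its semantic label matches and its back-projected centroid is close. This greedy sketch is an assumption for illustration; the paper's codebook-guided association over Gaussian primitives is more involved:

```python
import numpy as np

def merge_masks(masks, dist_thresh=0.5):
    """Greedy semantic+spatial merge. masks: (label, 3D centroid) pairs,
    one per per-view mask. Returns a list of (label, centroid list)
    instances."""
    instances = []
    for label, c in masks:
        c = np.asarray(c, dtype=float)
        for inst_label, cents in instances:
            # Merge only if semantics agree AND the centroid is near
            # the instance's running mean position.
            if inst_label == label and \
                    np.linalg.norm(np.mean(cents, axis=0) - c) < dist_thresh:
                cents.append(c)
                break
        else:
            instances.append((label, [c]))
    return instances

masks = [("chair", (0.0, 0.0, 0.0)),
         ("chair", (0.1, 0.0, 0.0)),   # same chair, another view
         ("chair", (3.0, 0.0, 0.0)),   # a different chair
         ("table", (0.05, 0.0, 0.0))]  # same spot, different semantics
inst = merge_masks(masks)              # three distinct 3D instances
```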

[193] VLA-InfoEntropy: A Training-Free Vision-Attention Information Entropy Approach for Vision-Language-Action Models Inference Acceleration and Success

Chuhang Liu, Yayun He, Zuheng Kang, Xiaoyang Qu, Jianzong Wang

Main category: cs.CV

TL;DR: VLA-InfoEntropy improves Vision-Language-Action model efficiency by using image entropy and attention entropy to dynamically focus on informative regions, reducing computational overhead while maintaining performance.

Motivation: VLA models suffer from high computational overhead and low inference efficiency due to joint processing of high-dimensional visual features, complex language inputs, and continuous action sequences, hindering real-time deployment.

Method: Uses image entropy to quantify grayscale distribution of visual tokens and attention entropy to capture attention score distribution over task-related text. Combines these with timestep information for dynamic transition from global visual features to attention-guided local informative regions.

Result: Method reduces inference parameters, accelerates inference speed, and outperforms existing approaches in extensive experiments.

Conclusion: VLA-InfoEntropy effectively integrates spatial, semantic, and temporal cues to reduce redundancy while preserving critical content, enabling more efficient VLA model deployment.

Abstract: Vision-Language-Action (VLA) models integrate visual perception, language understanding, and action decision-making for cross-modal semantic alignment, exhibiting broad application potential. However, the joint processing of high-dimensional visual features, complex linguistic inputs, and continuous action sequences incurs significant computational overhead and low inference efficiency, thereby hindering real-time deployment and reliability. To address this issue, we use image entropy to quantify the grayscale distribution characteristics of each visual token and introduce attention entropy to capture the distribution of attention scores over task-related text. Visual entropy identifies texture-rich or structurally informative regions, while attention entropy pinpoints semantically relevant tokens. Combined with timestep information, these metrics enable a dynamic transition strategy that shifts the model’s focus from global visual features to attention-guided local informative regions. Thus, the resulting VLA-InfoEntropy method integrates spatial, semantic, and temporal cues to reduce redundancy while preserving critical content. Extensive experiments show that our method reduces inference parameters, accelerates inference speed, and outperforms existing approaches.
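The two entropy signals the abstract describes are both Shannon entropies: one over a visual token's grayscale histogram (high for texture-rich regions), one over its attention distribution to the text (low when attention is sharply focused). A minimal sketch of both, with illustrative function names and bin counts:

```python
import numpy as np

def image_entropy(gray_patch, bins=32):
    """Shannon entropy of a patch's grayscale histogram; flat regions
    score near 0, texture-rich regions score high."""
    hist, _ = np.histogram(gray_patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def attention_entropy(attn_scores):
    """Entropy of a token's attention distribution over text tokens;
    low entropy = sharply focused, semantically pinned tokens."""
    p = np.asarray(attn_scores, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

flat = np.full((16, 16), 0.5)                          # uniform gray patch
textured = np.random.default_rng(2).uniform(size=(16, 16))
# image_entropy(flat) is 0; image_entropy(textured) is much larger,
# so pruning low-entropy tokens keeps the informative regions.
```

How these scores are combined with the timestep to schedule the global-to-local transition is the paper's contribution and is not reproduced here.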

[194] Unsupervised Multi-agent and Single-agent Perception from Cooperative Views

Haochen Yang, Baolu Li, Lei Li, Delin Ren, Jiacheng Guo, Minghai Qin, Tianyun Zhang, Hongkai Yu

Main category: cs.CV

TL;DR: UMS is an unsupervised framework that simultaneously solves multi-agent and single-agent 3D perception using cooperative LiDAR data sharing without human annotations.

Motivation: Current LiDAR-based perception lacks methods that simultaneously solve multi-agent and single-agent perception in an unsupervised way. The paper aims to leverage multi-agent cooperation without human annotations to improve both perception tasks.

Method: Proposes UMS framework with: 1) Proposal Purifying Filter for better classification after multi-agent point cloud density cooperation, 2) Progressive Proposal Stabilizing module using easy-to-hard curriculum learning for reliable pseudo labels, and 3) Cross-View Consensus Learning to use multi-agent cooperative views to guide single-agent detection.

Result: UMS achieved significantly higher 3D detection performance than state-of-the-art methods on both V2V4Real and OPV2V datasets for multi-agent and single-agent perception in unsupervised settings.

Conclusion: The paper demonstrates that multi-agent cooperation can be effectively leveraged for unsupervised 3D perception, simultaneously improving both multi-agent and single-agent tasks without human annotations.

Abstract: The LiDAR-based multi-agent and single-agent perception has shown promising performance in environmental understanding for robots and automated vehicles. However, there is no existing method that simultaneously solves both multi-agent and single-agent perception in an unsupervised way. By sharing sensor data between multiple agents via communication, this paper discovers two key insights: 1) Improved point cloud density after the data sharing from cooperative views could benefit unsupervised object classification, 2) Cooperative view of multiple agents can be used as unsupervised guidance for the 3D object detection in the single view. Based on these two discovered insights, we propose an Unsupervised Multi-agent and Single-agent (UMS) perception framework that leverages multi-agent cooperation without human annotations to simultaneously solve multi-agent and single-agent perception. UMS combines a learning-based Proposal Purifying Filter to better classify the candidate proposals after multi-agent point cloud density cooperation, followed by a Progressive Proposal Stabilizing module to yield reliable pseudo labels by the easy-to-hard curriculum learning. Furthermore, we design a Cross-View Consensus Learning to use multi-agent cooperative view to guide detection in single-agent view. Experimental results on two public datasets V2V4Real and OPV2V show that our UMS method achieved significantly higher 3D detection performance than the state-of-the-art methods on both multi-agent and single-agent perception tasks in an unsupervised setting.

[195] GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

Yang Yi, Xieyuanli Chen, Jinpu Zhang, Hui Shen, Dewen Hu

Main category: cs.CV

TL;DR: Multi-cue guided local feature learning framework using semantic and geometric cues to enhance detection robustness and descriptor discriminability in computer vision.

Motivation: Existing local feature methods rely on single appearance cues, leading to unstable keypoints and insufficient descriptor discriminability. The paper aims to address these limitations by leveraging multiple cues.

Method: Proposes a framework with: 1) Joint semantic-normal prediction head using shared 3D vector field, 2) Depth stability prediction head for geometric consistency, 3) Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection, and 4) Unified Triple-Cue Fusion (UTCF) module for descriptor construction with semantic-scheduled gating.

Result: Extensive experiments on four benchmarks validate the effectiveness of the proposed framework in improving feature detection robustness and descriptor discriminability.

Conclusion: The multi-cue guided approach successfully enhances local feature learning by synergistically leveraging semantic and geometric information, outperforming single-cue methods.

Abstract: Robust local feature detection and description are foundational tasks in computer vision. Existing methods primarily rely on single appearance cues for modeling, leading to unstable keypoints and insufficient descriptor discriminability. In this paper, we propose a multi-cue guided local feature learning framework that leverages semantic and geometric cues to synergistically enhance detection robustness and descriptor discriminability. Specifically, we construct a joint semantic-normal prediction head and a depth stability prediction head atop a lightweight backbone. The former leverages a shared 3D vector field to deeply couple semantic and normal cues, thereby resolving optimization interference from heterogeneous inconsistencies. The latter quantifies the reliability of local regions from a geometric consistency perspective, providing deterministic guidance for robust keypoint selection. Based on these predictions, we introduce the Semantic-Depth Aware Keypoint (SDAK) mechanism for feature detection. By coupling semantic reliability with depth stability, SDAK reweights keypoint responses to suppress spurious features in unreliable regions. For descriptor construction, we design a Unified Triple-Cue Fusion (UTCF) module, which employs a semantic-scheduled gating mechanism to adaptively inject multi-attribute features, improving descriptor discriminability. Extensive experiments on four benchmarks validate the effectiveness of the proposed framework. The source code and pre-trained model will be available at: https://github.com/yiyscut/GESS.git.

[196] Rethinking IRSTD: Single-Point Supervision Guided Encoder-only Framework is Enough for Infrared Small Target Detection

Rixiang Ni, Boyang Li, Jun Chen, Yonghao Li, Feiyu Ren, Yuji Wang, Haoyang Yuan, Wujiao He, Wei An

Main category: cs.CV

TL;DR: SPIRE reformulates infrared small target detection as centroid regression using single-point supervision and probabilistic response encoding, achieving competitive performance with low false alarms.

Motivation: Traditional pixel-level supervision methods for infrared small target detection fail because small targets occupy few pixels with blurred boundaries, making precise segmentation difficult. The authors argue that target localization should be prioritized over complete region separation.

Method: Proposes SPIRE with Point-Response Prior Supervision (PRPS) that transforms single-point annotations into probabilistic response maps matching infrared target characteristics, and a High-Resolution Probabilistic Encoder (HRPE) for encoder-only regression without decoder reconstruction.

Result: SPIRE achieves competitive target-level detection performance with consistently low false alarm rates and significantly reduced computational cost across various IRSTD benchmarks including SIRST-UAVB and SIRST4.

Conclusion: Reformulating IRSTD as centroid regression with single-point supervision and probabilistic encoding is more effective than traditional segmentation approaches, offering better localization with reduced computational requirements.

Abstract: Infrared small target detection (IRSTD) aims to separate small targets from clutter backgrounds. Extensive research is dedicated to the pixel-level supervision-guided “encoder-decoder” segmentation paradigm. Although having achieved promising performance, they neglect the fact that small targets only occupy a few pixels and are usually accompanied with blurred boundary caused by clutter backgrounds. Based on this observation, we argue that the first principle of IRSTD should be target localization instead of separating all target region accompanied with indistinguishable background noise. In this paper, we reformulate IRSTD as a centroid regression task and propose a novel Single-Point Supervision guided Infrared Probabilistic Response Encoding method (namely, SPIRE), which is indeed challenging due to the mismatch between reduced supervision network and equivalent output. Specifically, we first design a Point-Response Prior Supervision (PRPS), which transforms single-point annotations into probabilistic response map consistent with infrared point-target response characteristics, with a High-Resolution Probabilistic Encoder (HRPE) that enables encoder-only, end-to-end regression without decoder reconstruction. By preserving high-resolution features and increasing effective supervision density, SPIRE alleviates optimization instability under sparse target distributions. Finally, extensive experiments on various IRSTD benchmarks, including SIRST-UAVB and SIRST4 demonstrate that SPIRE achieves competitive target-level detection performance with consistently low false alarm rate (Fa) and significantly reduced computational cost. Code is publicly available at: https://github.com/NIRIXIANG/SPIRE-IRSTD.
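Turning a single-point annotation into a probabilistic response map, as PRPS does, is conceptually a 2D Gaussian centered on the annotated centroid. A minimal sketch of that conversion (the function name and the fixed sigma are assumptions; the paper matches the response to infrared point-target characteristics rather than using a plain isotropic Gaussian):

```python
import numpy as np

def point_response_map(h, w, cy, cx, sigma=2.0):
    """Convert a single-point annotation (cy, cx) into a dense 2D
    Gaussian response map, peaking at the annotated centroid."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

m = point_response_map(32, 32, cy=10, cx=20)
# m peaks at exactly (10, 20) with value 1.0 and decays smoothly,
# giving the encoder-only regressor a dense target instead of one pixel.
```

This densification is what "increasing effective supervision density" refers to: every pixel near the target contributes gradient, not just the annotated point.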

[197] 3DTurboQuant: Training-Free Near-Optimal Quantization for 3D Reconstruction Models

Jae Joong Lee

Main category: cs.CV

TL;DR: 3DTurboQuant: A training-free compression method for 3D reconstruction models using random rotation and precomputed quantization, achieving 3.5-7.9x compression with minimal quality loss.

Motivation: Existing 3D model compression methods require per-scene fine-tuning and data-dependent codebook learning, which is computationally expensive and time-consuming. The authors aim to develop a training-free compression approach that works without scene-specific optimization.

Method: Uses random rotation to transform parameter vectors into coordinates with known Beta distribution, enabling precomputed Lloyd-Max quantization. Includes dimension-dependent quantization criteria, norm-separation bounds, entry-grouping for hash grid features, and a composable pruning-quantization pipeline.

Result: Achieves 3.5x compression for 3DGS with only 0.02dB PSNR loss on NeRF Synthetic, and 7.9x compression for DUSt3R KV caches with 39.7dB pointmap fidelity. Compression takes seconds without training or calibration data.

Conclusion: Demonstrates that data-independent quantization with random rotation is near-optimal for compressing 3D reconstruction models, eliminating the need for per-scene fine-tuning while maintaining high quality.

Abstract: Every existing method for compressing 3D Gaussian Splatting, NeRF, or transformer-based 3D reconstructors requires learning a data-dependent codebook through per-scene fine-tuning. We show this is unnecessary. The parameter vectors that dominate storage in these models, 45-dimensional spherical harmonics in 3DGS and 1024-dimensional key-value vectors in DUSt3R, fall in a dimension range where a single random rotation transforms any input into coordinates with a known Beta distribution. This makes precomputed, data-independent Lloyd-Max quantization near-optimal, within a factor of 2.7 of the information-theoretic lower bound. We develop 3DTurboQuant, deriving (1) a dimension-dependent criterion that predicts which parameters can be quantized and at what bit-width before running any experiment, (2) norm-separation bounds connecting quantization MSE to rendering PSNR per scene, (3) an entry-grouping strategy extending rotation-based quantization to 2-dimensional hash grid features, and (4) a composable pruning-quantization pipeline with a closed-form compression ratio. On NeRF Synthetic, 3DTurboQuant compresses 3DGS by 3.5x with 0.02dB PSNR loss and DUSt3R KV caches by 7.9x with 39.7dB pointmap fidelity. No training, no codebook learning, no calibration data. Compression takes seconds. The code will be released (https://github.com/JaeLee18/3DTurboQuant).
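The rotate-then-scalar-quantize pipeline can be sketched end to end. One caveat: for brevity this sketch fits the Lloyd-Max codebook to the data with the classic alternating iteration, whereas 3DTurboQuant's point is that the codebook can be precomputed once from the known post-rotation distribution. All names and sizes below are illustrative:

```python
import numpy as np

def lloyd_max_1d(samples, n_levels=8, iters=25):
    """1D Lloyd-Max: alternate nearest-level assignment and centroid
    updates to approach the MSE-optimal scalar codebook."""
    levels = np.quantile(samples, np.linspace(0.05, 0.95, n_levels))
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return levels

rng = np.random.default_rng(3)
vecs = rng.normal(size=(1000, 45))              # stand-in for 45-dim SH vectors
Q, _ = np.linalg.qr(rng.normal(size=(45, 45)))  # one shared random rotation
rotated = vecs @ Q                               # coordinates w/ predictable marginals
levels = lloyd_max_1d(rotated.ravel())           # one scalar codebook for all coords
idx = np.abs(rotated.ravel()[:, None] - levels[None, :]).argmin(axis=1)
recon = levels[idx].reshape(rotated.shape) @ Q.T  # dequantize, rotate back
mse = float(((recon - vecs) ** 2).mean())         # small vs. unit signal variance
```

Because the rotation is orthogonal, quantization error in the rotated frame equals reconstruction error in the original frame, which is why a single shared codebook suffices.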

[198] UAVReason: A Unified, Large-Scale Benchmark for Multimodal Aerial Scene Reasoning and Generation

Jintao Sun, Hu Zhang, Donglin Di, Gangyi Ding, Zhedong Zheng

Main category: cs.CV

TL;DR: UAVReason: First unified large-scale multimodal benchmark for nadir-view UAV scenarios with 273K VQA pairs, evaluating 22 reasoning types and generation across RGB, depth, and segmentation modalities.

Motivation: Vision-Language models fail on high-altitude UAVs due to domain shift from tiny dense objects, repetitive textures, and ambiguous top-down orientations, disrupting semantic grounding and generation capabilities.

Method: Created UAVReason benchmark from high-fidelity UAV simulation with 273K VQA pairs (23.6K single frames, 68.2K 2-frame sequences, 188.8K generation samples). Established unified baseline via multi-task learning across 22 reasoning types.

Result: Extensive experiments show unified multi-task learning substantially improves UAV-native performance across metrics (EM/F1 for VQA, mIoU for segmentation, CLIP Score for generation), revealing limitations of general-domain VLMs.

Conclusion: UAVReason bridges critical gap in UAV multimodal understanding, providing comprehensive benchmark and showing unified approach outperforms general VLMs in nadir-view scenarios.

Abstract: Vision-Language models (VLMs) have demonstrated remarkable capability in ground-view visual understanding but often fracture when deployed on high-altitude Unmanned Aerial Vehicles (UAVs). The failure largely stems from a pronounced domain shift, characterized by tiny and densely packed objects, repetitive textures, and ambiguous top-down orientations. These factors severely disrupt semantic grounding and hinder both spatial reasoning and controllable generation. To bridge this critical gap, we introduce UAVReason, the first unified large-scale multi-modal benchmark dedicated to nadir-view UAV scenarios, derived from a high-fidelity UAV simulation platform. In contrast to existing UAV benchmarks, which are largely siloed and focus on single tasks like object detection or segmentation, UAVReason uniquely consolidates over 273K Visual Question Answering (VQA) pairs, including 23.6K single frames with detailed captions, 68.2K 2-frame temporal sequences, and 188.8K cross-modal generation samples. The benchmark probes 22 diverse reasoning types across spatial and temporal axes while simultaneously evaluating high-fidelity generation across RGB, depth, and segmentation modalities. We further establish a strong, unified baseline model via multi-task learning. Extensive experiments validate the efficacy of our unified approach across diverse metrics, such as EM/F1 for VQA, mIoU for segmentation, and CLIP Score for generation. These results indicate limitations of general-domain vision-language models and show that unified multi-task learning substantially improves UAV-native performance. All data, code, and evaluation tools will be publicly released to advance UAV multimodal research.
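The EM/F1 metrics cited for the VQA tasks are presumably the standard SQuAD-style pair: exact match on the whole answer, plus token-level F1 for partial credit. A minimal sketch of that convention (an assumption; the benchmark's exact normalization rules are not given in the summary):

```python
def em_f1(pred, gold):
    """SQuAD-style Exact Match and token-level F1 for a short answer."""
    p, g = pred.lower().split(), gold.lower().split()
    em = float(p == g)
    # Token overlap, counting duplicates at most min-multiplicity.
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return em, 0.0
    prec, rec = common / len(p), common / len(g)
    return em, 2 * prec * rec / (prec + rec)

# A partially correct answer gets F1 credit but no EM credit:
em, f1 = em_f1("red car", "a red car")   # em = 0.0, f1 = 0.8
```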

[199] LUMOS: Universal Semi-Supervised OCT Retinal Layer Segmentation with Hierarchical Reliable Mutual Learning

Yizhou Fang, Jian Zhong, Li Lin, Xiaoying Tang

Main category: cs.CV

TL;DR: LUMOS is a semi-supervised universal OCT retinal layer segmentation framework that addresses annotation scarcity and heterogeneous label granularities through dual-decoder architecture with hierarchical prompting and progressive multi-granularity learning.

Motivation: OCT layer segmentation suffers from annotation scarcity and heterogeneous label granularities across datasets. Existing semi-supervised methods assume fixed granularity and fail to exploit cross-granularity supervision effectively.

Method: Proposes LUMOS with Dual-Decoder Network with Hierarchical Prompting Strategy (DDN-HPS) to suppress pseudo-label noise propagation, and Reliable Progressive Multi-granularity Learning (RPML) with region-level reliability weighing and progressive training from easier to more difficult tasks.

Result: Experiments on six OCT datasets show LUMOS outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.

Conclusion: LUMOS effectively addresses OCT segmentation challenges through innovative architecture and learning strategy that leverages cross-granularity supervision while handling annotation scarcity.

Abstract: Optical Coherence Tomography (OCT) layer segmentation faces challenges due to annotation scarcity and heterogeneous label granularities across datasets. While semi-supervised learning helps alleviate label scarcity, existing methods typically assume a fixed granularity, failing to fully exploit cross-granularity supervision. This paper presents LUMOS, a semi-supervised universal OCT retinal layer segmentation framework based on a Dual-Decoder Network with a Hierarchical Prompting Strategy (DDN-HPS) and Reliable Progressive Multi-granularity Learning (RPML). DDN-HPS combines a dual-branch architecture with a multi-granularity prompting strategy to effectively suppress pseudo-label noise propagation. Meanwhile, RPML introduces region-level reliability weighting and a progressive training approach that guides the model from easier to more difficult tasks, ensuring the reliable selection of cross-granularity consistency targets, thereby achieving stable cross-granularity alignment. Experiments on six OCT datasets demonstrate that LUMOS largely outperforms existing methods and exhibits exceptional cross-domain and cross-granularity generalization capability.
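
The region-level reliability weighting idea can be sketched as a toy (the function name, agreement rule, and region scoring below are illustrative assumptions, not LUMOS's exact formulation): regions where the two decoders agree confidently receive high weight, so noisy pseudo-labels are down-weighted in the consistency target.

```python
import numpy as np

def region_reliability_weights(p_a, p_b, region_size=4):
    """Toy region-level reliability weighting: split two decoders'
    probability maps into regions and score each region by how
    confidently the decoders agree on the binary prediction."""
    h, w = p_a.shape
    weights = np.zeros((h // region_size, w // region_size))
    for i in range(0, h, region_size):
        for j in range(0, w, region_size):
            a = p_a[i:i + region_size, j:j + region_size]
            b = p_b[i:i + region_size, j:j + region_size]
            agree = (a > 0.5) == (b > 0.5)                 # prediction agreement
            conf = 2.0 * np.minimum(np.abs(a - 0.5), np.abs(b - 0.5))  # in [0, 1]
            weights[i // region_size, j // region_size] = np.mean(agree * conf)
    return weights
```

A consistency loss would then be multiplied region-wise by these weights, so disagreeing or low-confidence regions contribute little supervision.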

[200] LSGS-Loc: Towards Robust 3DGS-Based Visual Localization for Large-Scale UAV Scenarios

Xiang Zhang, Tengfei Wang, Fang Xu, Xin Wang, Zongqian Zhan

Main category: cs.CV

TL;DR: LSGS-Loc: A visual localization pipeline for large-scale UAV scenarios using 3D Gaussian Splatting with scale-aware pose initialization and Laplacian-based reliability masking to handle reconstruction artifacts.

DetailsMotivation: Visual localization in large-scale UAV scenarios is challenging due to geometric complexity and environmental variations. Existing 3DGS-based methods struggle with robust pose initialization and sensitivity to rendering artifacts in large-scale settings.

Method: Proposes LSGS-Loc with two key components: 1) Scale-aware pose initialization combining scene-agnostic relative pose estimation with explicit 3DGS scale constraints, and 2) Laplacian-based reliability masking mechanism to guide photometric refinement toward high-quality regions and mitigate reconstruction artifacts.

Result: Extensive experiments on large-scale UAV benchmarks show state-of-the-art accuracy and robustness for unordered image queries, significantly outperforming existing 3DGS-based approaches.

Conclusion: LSGS-Loc provides an effective solution for visual localization in large-scale 3DGS scenes, addressing key limitations of existing methods through novel scale-aware initialization and artifact-aware refinement techniques.

Abstract: Visual localization in large-scale UAV scenarios is a critical capability for autonomous systems, yet it remains challenging due to geometric complexity and environmental variations. While 3D Gaussian Splatting (3DGS) has emerged as a promising scene representation, existing 3DGS-based visual localization methods struggle with robust pose initialization and sensitivity to rendering artifacts in large-scale settings. To address these limitations, we propose LSGS-Loc, a novel visual localization pipeline tailored for large-scale 3DGS scenes. Specifically, we introduce a scale-aware pose initialization strategy that combines scene-agnostic relative pose estimation with explicit 3DGS scale constraints, enabling geometrically grounded localization without scene-specific training. Furthermore, during pose refinement, to mitigate the impact of reconstruction artifacts such as blur and floaters, we develop a Laplacian-based reliability masking mechanism that guides photometric refinement toward high-quality regions. Extensive experiments on large-scale UAV benchmarks demonstrate that our method achieves state-of-the-art accuracy and robustness for unordered image queries, significantly outperforming existing 3DGS-based approaches. Code is available at: https://github.com/xzhang-z/LSGS-Loc
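
The Laplacian-based masking idea can be sketched as follows; this is an illustrative toy (the 4-neighbour Laplacian, threshold, and loss form are assumptions, not the paper's exact operator): pixels with a strong Laplacian response carry reliable texture, while blurry or floater-dominated renders respond weakly and are masked out of the photometric loss.

```python
import numpy as np

def laplacian_reliability_mask(img, thresh=0.05):
    """Flag pixels whose 4-neighbour Laplacian magnitude exceeds a
    threshold; low-response regions (blur, floaters) are excluded."""
    lap = (-4.0 * img
           + np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1))
    return np.abs(lap) > thresh

def masked_photometric_loss(rendered, observed, mask):
    """Mean squared photometric error restricted to reliable pixels."""
    m = mask.astype(float)
    return float(np.sum(m * (rendered - observed) ** 2) / max(m.sum(), 1.0))
```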

[201] Weather-Conditioned Branch Routing for Robust LiDAR-Radar 3D Object Detection

Hongsheng Li, Lingfeng Zhang, Zexian Yang, Liang Li, Rong Yin, Xiaoshuai Hao, Wenbo Ding

Main category: cs.CV

TL;DR: A 3D object detection framework that dynamically adapts sensor fusion between LiDAR and 4D radar based on weather conditions using a lightweight router and parallel feature streams.

DetailsMotivation: Existing LiDAR-4D radar fusion methods use fixed or weakly adaptive pipelines that fail to dynamically adjust modality preferences as environmental conditions change, limiting robustness in adverse weather scenarios.

Method: Reformulates multi-modal perception as weather-conditioned branch routing with three parallel 3D feature streams: pure LiDAR, pure 4D radar, and condition-gated fusion. Uses a lightweight router with condition tokens from visual/semantic prompts to predict sample-specific weights for soft aggregation. Includes weather-supervised learning with auxiliary classification and diversity regularization to prevent branch collapse.

Result: Achieves state-of-the-art performance on the K-Radar benchmark and provides explicit, interpretable insights into modality preferences, showing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios.

Conclusion: The framework enables dynamic adaptation of sensor fusion strategies based on weather conditions, improving robustness in adverse weather 3D object detection while providing transparent insights into modality preferences.

Abstract: Robust 3D object detection in adverse weather is highly challenging due to the varying reliability of different sensors. While existing LiDAR-4D radar fusion methods improve robustness, they predominantly rely on fixed or weakly adaptive pipelines, failing to dynamically adjust modality preferences as environmental conditions change. To bridge this gap, we reformulate multi-modal perception as a weather-conditioned branch routing problem. Instead of computing a single fused output, our framework explicitly maintains three parallel 3D feature streams: a pure LiDAR branch, a pure 4D radar branch, and a condition-gated fusion branch. Guided by a condition token extracted from visual and semantic prompts, a lightweight router dynamically predicts sample-specific weights to softly aggregate these representations. Furthermore, to prevent branch collapse, we introduce a weather-supervised learning strategy with auxiliary classification and diversity regularization to enforce distinct, condition-dependent routing behaviors. Extensive experiments on the K-Radar benchmark demonstrate that our method achieves state-of-the-art performance. Furthermore, it provides explicit and highly interpretable insights into modality preferences, transparently revealing how adaptive routing robustly shifts reliance between LiDAR and 4D radar across diverse adverse-weather scenarios. The source code will be released.
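
The routing step can be sketched minimally (the branch names and the linear blend are illustrative assumptions; the paper's router is a learned network over condition tokens): a condition embedding is mapped to three branch weights that softly aggregate the parallel feature streams.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def route_features(f_lidar, f_radar, f_fused, condition_logits):
    """Soft aggregation of three parallel feature streams using
    router weights derived from a weather-condition embedding."""
    w = softmax(np.asarray(condition_logits, float))  # [lidar, radar, fused]
    blended = w[0] * f_lidar + w[1] * f_radar + w[2] * f_fused
    return blended, w
```

With logits that strongly favor one branch (say, radar in heavy fog), the blended feature collapses toward that stream while remaining differentiable end to end.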

[202] CRISP: Rank-Guided Iterative Squeezing for Robust Medical Image Segmentation under Domain Shift

Yizhou Fang, Pujin Cheng, Yixiang Liu, Xiaoying Tang, Longxi Zhou

Main category: cs.CV

TL;DR: CRISP introduces a parameter-free, model-agnostic framework for medical image segmentation that addresses distribution shift by leveraging rank stability of positive regions rather than probability thresholds.

DetailsMotivation: Distribution shift in medical imaging causes severe performance degradation in unseen environments and exacerbates health inequities. Existing domain adaptation methods are limited by requiring predefined simulated shifts or pseudo-supervision, which fail in the unpredictable real world with effectively infinite distribution shifts.

Method: Based on the empirical “Rank Stability of Positive Regions” law, CRISP uses latent feature perturbation to simulate distribution shifts. It identifies two stable patterns: regions consistently retaining high probabilities (destined positives) and low-probability regions (safe negatives). These form high-precision (HP) and high-recall (HR) priors that are recursively refined under perturbation, then progressively “squeezed” to final segmentation through iterative training.

Result: CRISP significantly outperforms state-of-the-art methods on multi-center cardiac MRI and CT-based lung vessel segmentation, achieving striking HD95 reductions of up to 0.14 (7.0% improvement), 1.90 (13.1% improvement), and 8.39 (38.9% improvement) pixels across multi-center, demographic, and modality shifts respectively.

Conclusion: CRISP provides a novel, parameter-free approach to address distribution shift in medical imaging by leveraging rank stability rather than probability thresholds, demonstrating superior robustness across various shift types without requiring target-domain information.

Abstract: Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we introduce an empirical law called “Rank Stability of Positive Regions”, which states that the relative rank of predicted probabilities for positive voxels remains stable under distribution shift. Guided by this principle, we propose CRISP, a parameter-free and model-agnostic framework requiring no target-domain information. CRISP is the first framework to make segmentation based on rank rather than probabilities. CRISP simulates model behavior under distribution shift via latent feature perturbation, where voxel probability rankings exhibit two stable patterns: regions that consistently retain high probabilities (destined positives according to the principle) and those that remain low-probability (can be safely classified as negatives). Based on these patterns, we construct high-precision (HP) and high-recall (HR) priors and recursively refine them under perturbation. We then design an iterative training framework, making HP and HR progressively “squeeze” to the final segmentation. Extensive evaluations on multi-center cardiac MRI and CT-based lung vessel segmentation demonstrate CRISP’s superior robustness, significantly outperforming state-of-the-art methods with striking HD95 reductions of up to 0.14 (7.0% improvement), 1.90 (13.1% improvement), and 8.39 (38.9% improvement) pixels across multi-center, demographic, and modality shifts, respectively.
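
The HP/HR prior construction can be sketched as a toy (perturbation model, rank cutoffs, and parameter names below are illustrative assumptions, not CRISP's actual procedure): perturb the probability map repeatedly, then keep voxels that stay in the top ranks every time (high-precision set) or never fall into the bottom ranks (high-recall set).

```python
import numpy as np

def hp_hr_priors(prob, n_perturb=20, noise=0.1, top_frac=0.2, bot_frac=0.4, seed=0):
    """Rank-stability sketch: HP = voxels always ranked in the top
    fraction under perturbation; HR = voxels never ranked in the
    bottom fraction. HP is a subset of HR by construction."""
    rng = np.random.default_rng(seed)
    flat = prob.ravel()
    n = flat.size
    k_top, k_bot = int(n * top_frac), int(n * bot_frac)
    always_top = np.ones(n, bool)
    never_bottom = np.ones(n, bool)
    for _ in range(n_perturb):
        ranks = np.argsort(np.argsort(flat + rng.normal(0, noise, n)))  # 0 = lowest
        always_top &= ranks >= n - k_top
        never_bottom &= ranks >= k_bot
    return always_top.reshape(prob.shape), never_bottom.reshape(prob.shape)
```

Iterative training would then "squeeze" the segmentation between these two sets: supervise positives from HP and negatives from the complement of HR.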

[203] Learning to Synergize Semantic and Geometric Priors for Limited-Data Wheat Disease Segmentation

Shijie Wang, Zijian Wang, Yadan Luo, Scott Chapman, Xin Yu, Zi Huang

Main category: cs.CV

TL;DR: SGPer is a framework for wheat disease segmentation that synergizes semantic priors from DINOv2 with geometric priors from SAM to handle temporal appearance variations with limited data.

DetailsMotivation: Wheat disease segmentation faces challenges from significant intra-class temporal variations across growth stages, making representative dataset collection labor-intensive and impractical. Existing methods struggle with appearance shifts in data-constrained scenarios.

Method: SGPer treats wheat disease segmentation as a coupled task of disease-specific semantic perception and boundary localization. It uses DINOv2 for semantic priors, converts features into dense point prompts for SAM, and dynamically filters prompts using cross-referencing between SAM’s mask confidence and DINOv2’s semantic consistency.

Result: Extensive evaluations show SGPer consistently achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in data-constrained scenarios.

Conclusion: SGPer effectively addresses wheat disease segmentation challenges by synergizing semantic and geometric priors, demonstrating robustness to temporal appearance changes with limited training data.

Abstract: Wheat disease segmentation is fundamental to precision agriculture but faces severe challenges from significant intra-class temporal variations across growth stages. Such substantial appearance shifts make collecting a representative dataset for training from scratch both labor-intensive and impractical. To address this, we propose SGPer, a Semantic-Geometric Prior Synergization framework that treats wheat disease segmentation under limited data as a coupled task of disease-specific semantic perception and disease boundary localization. Our core insight is that pretrained DINOv2 provides robust category-aware semantic priors to handle appearance shifts, which can be converted into coarse spatial prompts to guide SAM for the precise localization of disease boundaries. Specifically, SGPer designs disease-sensitive adapters with multiple disease-friendly filters and inserts them into both DINOv2 and SAM to align their pretrained representations with disease-specific characteristics. To operationalize this synergy, SGPer transforms DINOv2-derived features into dense, category-specific point prompts to ensure comprehensive spatial coverage of all disease regions. To subsequently eliminate prompt redundancy and ensure highly accurate mask generation, it dynamically filters these dense candidates by cross-referencing SAM’s iterative mask confidence with the category-specific semantic consistency derived from DINOv2. Ultimately, SGPer distills a highly informative set of prompts to activate SAM’s geometric priors, achieving precise and robust segmentation that remains strictly invariant to temporal appearance changes. Extensive evaluations demonstrate that SGPer consistently achieves state-of-the-art performance on wheat disease and organ segmentation benchmarks, especially in data-constrained scenarios.
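
The prompt-filtering step can be sketched in a few lines (the scoring rule below, a simple product against a threshold, is an illustrative assumption rather than SGPer's exact cross-referencing scheme): a dense point prompt survives only when the segmenter's mask confidence and the semantic consistency of its source feature jointly clear a bar.

```python
def filter_prompts(prompts, sam_conf, sem_consistency, tau=0.5):
    """Keep a point prompt only if mask confidence times semantic
    consistency reaches the threshold tau; drops redundant or
    ambiguous candidates before the final mask generation."""
    return [p for p, c, s in zip(prompts, sam_conf, sem_consistency) if c * s >= tau]
```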

[204] VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

Honghao Fu, Miao Xu, Yiwei Wang, Dailing Zhang, Liu Jun, Yujun Cai

Main category: cs.CV

TL;DR: VideoStir: A structured, intent-aware RAG framework for long-video understanding that organizes videos as spatio-temporal graphs and uses MLLM-based intent scoring for better retrieval.

DetailsMotivation: Current video RAG methods flatten videos into independent segments, breaking spatio-temporal structure, and rely on explicit semantic matching that misses implicitly relevant cues for query intent.

Method: 1) Structures videos as spatio-temporal graphs at clip level; 2) Performs multi-hop retrieval across distant but contextually related events; 3) Uses MLLM-backed intent-relevance scorer to retrieve frames based on query reasoning intent alignment.

Result: VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, demonstrating the promise of structured, intent-aware reasoning over flattened semantic matching.

Conclusion: The framework successfully shifts long-video RAG from flattened semantic matching to structured, intent-aware reasoning, with released code and checkpoints.

Abstract: Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query’s intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It firstly structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query’s reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame-query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available on GitHub.
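
The multi-hop retrieval over a clip graph can be sketched as a bounded breadth-first expansion (the seeding rule and hop limit below are illustrative assumptions, not VideoStir's exact algorithm): seed with the clips scoring highest against the query, then hop along spatio-temporal edges to pull in contextually linked clips that direct matching would miss.

```python
from collections import deque

def multi_hop_retrieve(scores, edges, k_seed=2, max_hops=2):
    """BFS from the top-scoring seed clips, expanding at most
    max_hops steps along the clip graph's edges."""
    seeds = sorted(scores, key=scores.get, reverse=True)[:k_seed]
    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        clip, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nb in edges.get(clip, []):
            if nb not in visited:
                visited.add(nb)
                frontier.append((nb, hops + 1))
    return visited
```

Note how clip "e" below has near-zero query similarity yet is retrieved because it is two hops from a seed, which is exactly the kind of implicitly relevant evidence flat matching drops.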

[205] Cross-Stage Attention Propagation for Efficient Semantic Segmentation

Beoungwoo Kang

Main category: cs.CV

TL;DR: CSAP is a lightweight semantic segmentation decoder that propagates attention maps from deep to shallow feature scales to reduce redundancy and computational cost while maintaining performance.

DetailsMotivation: Most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since attention distributions across scales are strongly correlated. This leads to unnecessary computational overhead in lightweight segmentation models.

Method: Proposes Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention only at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely.

Result: CSAP-Tiny achieves 42.9% mIoU on ADE20K with 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer FLOPs.

Conclusion: CSAP preserves multi-scale contextual reasoning while substantially reducing computational cost, offering an efficient decoder design for lightweight semantic segmentation models.

Abstract: Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi-scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross-Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query-key computation at those stages entirely. This design preserves multi-scale contextual reasoning while substantially reducing the decoder’s computational cost. CSAP-Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO-Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt-Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating-point operations.
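
The propagation idea can be sketched in flattened-token form; this is a simplified toy (shapes are reduced, and the shallow stage's values are assumed pre-pooled to the deep h×w grid, which is an assumption, not the paper's exact design): attention is computed once at the deepest scale, and shallower stages reuse the nearest-neighbour-upsampled attention rows instead of running their own query-key step.

```python
import numpy as np

def csap_decode(q_deep, k_deep, v_deep, v_shallow, up=2):
    """Compute attention once at the deepest (h*h)-token scale, then
    propagate each row to the (h*up)^2 shallow positions that map
    onto it, skipping query-key computation at the shallow stage."""
    logits = q_deep @ k_deep.T / np.sqrt(q_deep.shape[-1])
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)          # (hw, hw), rows sum to 1
    out_deep = A @ v_deep
    # nearest-neighbour upsampling of the attention map: every shallow
    # position reuses the attention row of its corresponding deep position
    h = int(np.sqrt(A.shape[0]))
    idx = np.repeat(np.repeat(np.arange(h * h).reshape(h, h), up, 0), up, 1).ravel()
    out_shallow = A[idx] @ v_shallow               # no query-key cost here
    return out_deep, out_shallow
```

The saving is that the shallow stage, which has 4x the tokens, pays only for the attention-value product, not for forming its own (much larger) attention matrix.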

[206] Few-Shot Semantic Segmentation Meets SAM3

Yi-Jen Tsai, Yen-Yu Lin, Chien-Yao Wang

Main category: cs.CV

TL;DR: Training-free few-shot semantic segmentation using SAM3’s promptable concept segmentation via spatial concatenation of support and query images

DetailsMotivation: Existing few-shot segmentation methods require extensive episodic training which is computationally expensive and sensitive to distribution shifts. The authors explore using modern vision foundation models like SAM3 as a training-free solution.

Method: Repurpose SAM3’s Promptable Concept Segmentation (PCS) capability by spatially concatenating support and query images into a shared canvas, allowing frozen SAM3 to perform segmentation without fine-tuning or architectural changes.

Result: Achieves state-of-the-art performance on PASCAL-5^i and COCO-20^i datasets, outperforming many heavily engineered methods. Also discovers that negative prompts can be counterproductive in few-shot settings.

Conclusion: Strong cross-image reasoning can emerge from simple spatial formulations with foundation models, but current models have limitations in handling conflicting prompt signals.

Abstract: Few-Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training-free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine-tuning or architectural changes. Experiments on PASCAL-$5^i$ and COCO-$20^i$ show that this minimal design already achieves state-of-the-art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few-shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross-image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS-SAM3
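
The spatial-concatenation strategy can be sketched as simple canvas assembly (the exemplar-highlighting rule and gap width below are illustrative assumptions; the paper prompts SAM3 rather than brightening pixels): support and query are pasted side by side so a frozen promptable segmenter sees both in one image.

```python
import numpy as np

def build_canvas(support, support_mask, query, gap=4):
    """Place a (highlighted) support image and the query image on a
    shared canvas, separated by a blank gap column band."""
    sup = support.copy().astype(float)
    sup[support_mask] = 0.5 * sup[support_mask] + 0.5  # brighten the exemplar region
    h = max(support.shape[0], query.shape[0])
    w = support.shape[1] + gap + query.shape[1]
    canvas = np.zeros((h, w) + support.shape[2:])
    canvas[:support.shape[0], :support.shape[1]] = sup
    canvas[:query.shape[0], support.shape[1] + gap:] = query
    return canvas
```

The predicted mask is then simply cropped back to the query's half of the canvas, with no fine-tuning or architectural change.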

[207] Human Interaction-Aware 3D Reconstruction from a Single Image

Gwanghyun Kim, Junghun James Kim, Suh Yoon Jeon, Jason Park, Se Young Chun

Main category: cs.CV

TL;DR: HUG3D: A holistic framework for reconstructing textured 3D human models from single images of multi-human scenes, addressing occlusion, interaction, and geometric distortion issues through group-instance modeling and physics-based priors.

DetailsMotivation: Existing 3D human reconstruction methods fail in multi-human scenes due to unrealistic overlaps, missing geometry in occluded regions, and distorted interactions when naively composing individual reconstructions. There's a need for approaches that incorporate group-level context and interaction priors.

Method: 1) Transform input to canonical orthographic space to mitigate perspective distortions. 2) Human Group-Instance Multi-View Diffusion (HUG-MVD) generates complete multi-view normals/images by jointly modeling individuals and group context. 3) Human Group-Instance Geometric Reconstruction (HUG-GR) optimizes geometry using physics-based interaction priors for physical plausibility. 4) Fuse multi-view images into high-fidelity texture.

Result: HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image.

Conclusion: The proposed holistic framework successfully addresses multi-human 3D reconstruction challenges by incorporating group-level context and physics-based interaction priors, enabling realistic reconstruction of interacting humans from single images.

Abstract: Reconstructing textured 3D human models from a single image is fundamental for AR/VR and digital human applications. However, existing methods mostly focus on single individuals and thus fail in multi-human scenes, where naive composition of individual reconstructions often leads to artifacts such as unrealistic overlaps, missing geometry in occluded regions, and distorted interactions. These limitations highlight the need for approaches that incorporate group-level context and interaction priors. We introduce a holistic method that explicitly models both group- and instance-level information. To mitigate perspective-induced geometric distortions, we first transform the input into a canonical orthographic space. Our primary component, Human Group-Instance Multi-View Diffusion (HUG-MVD), then generates complete multi-view normals and images by jointly modeling individuals and group context to resolve occlusions and proximity. Subsequently, the Human Group-Instance Geometric Reconstruction (HUG-GR) module optimizes the geometry by leveraging explicit, physics-based interaction priors to enforce physical plausibility and accurately model inter-human contact. Finally, the multi-view images are fused into a high-fidelity texture. Together, these components form our complete framework, HUG3D. Extensive experiments show that HUG3D significantly outperforms both single-human and existing multi-human methods, producing physically plausible, high-fidelity 3D reconstructions of interacting people from a single image. Project page: https://jongheean11.github.io/HUG3D_project
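
One simple form a physics-based interaction prior can take is a contact/penetration penalty; the quadratic hinge below is an illustrative assumption, not HUG-GR's actual energy term: surface points of two people that come closer than a margin are penalised, and an optimiser minimising this term pushes the geometry apart until interpenetration disappears.

```python
import math

def contact_penalty(points_a, points_b, min_dist=0.05):
    """Quadratic hinge penalty on pairs of surface points (one per
    person) that violate the minimum contact distance."""
    pen = 0.0
    for ax, ay, az in points_a:
        for bx, by, bz in points_b:
            d = math.sqrt((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2)
            if d < min_dist:
                pen += (min_dist - d) ** 2
    return pen
```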

[208] Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning

Kang Ding, Hongsong Wang, Jie Gui, Lei He

Main category: cs.CV

TL;DR: GameAD: A novel framework that models end-to-end autonomous driving as a risk-aware game problem with prioritized interactions between agents, achieving superior safety performance.

DetailsMotivation: Existing end-to-end autonomous driving models treat all agents equally, failing to distinguish real collision threats from complex backgrounds. The authors argue that autonomous driving is fundamentally a dynamic multi-agent game requiring risk-prioritized decision making.

Method: Proposes GameAD framework with four key components: Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization. Also introduces Planning Risk Exposure metric to quantify cumulative risk intensity of trajectories.

Result: Extensive experiments on nuScenes and Bench2Drive datasets show GameAD significantly outperforms state-of-the-art methods, especially in trajectory safety metrics.

Conclusion: Modeling autonomous driving as a risk-prioritized game problem with specialized attention mechanisms leads to safer decision making and superior performance compared to traditional equal-treatment approaches.

Abstract: The core of end-to-end autonomous driving resides not in the integration of perception and planning, but in the dynamic multi-agent game within a unified representation space. Most existing end-to-end models treat all agents equally, hindering the decoupling of real collision threats from complex backgrounds. To address this issue, we introduce the concept of Risk-Prioritized Game Planning and propose GameAD, a novel framework that models end-to-end autonomous driving as a risk-aware game problem. GameAD integrates Risk-Aware Topology Anchoring, Strategic Payload Adapter, Minimax Risk-Aware Sparse Attention, and Risk Consistent Equilibrium Stabilization to enable game theoretic decision making with risk prioritized interactions. We also present the Planning Risk Exposure metric, which quantifies the cumulative risk intensity of planned trajectories over a long horizon for safe autonomous driving. Extensive experiments on the nuScenes and Bench2Drive datasets show that our approach significantly outperforms state-of-the-art methods, especially in terms of trajectory safety.
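
A Planning-Risk-Exposure-style score can be sketched as follows; the Gaussian proximity kernel and the temporal discount are illustrative assumptions, not the paper's published definition: risk from every nearby agent is accumulated along the planned horizon, with near-term exposure weighted most heavily.

```python
import math

def planning_risk_exposure(traj, agents, sigma=2.0, gamma=0.95):
    """Discounted sum over the horizon of Gaussian proximity risk
    between each planned waypoint and every surrounding agent."""
    pre = 0.0
    for t, (x, y) in enumerate(traj):
        risk_t = sum(math.exp(-((x - ax) ** 2 + (y - ay) ** 2) / (2 * sigma ** 2))
                     for ax, ay in agents)
        pre += (gamma ** t) * risk_t
    return pre
```

A planner comparing candidate trajectories would prefer the one with the lower exposure, all else being equal.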

[209] A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

Kidus Zewde, Yuchen Zhou, Dennis Ng, Neo Tiangratanakul, Tommy Duong, Ankit Raj, Yuxin Zhang, Xingyu Shen, Simiao Ren

Main category: cs.CV

TL;DR: A pipeline for generating synthetic labeled eye movement video data using 3D eye movement simulation and browser automation, applied to script-reading detection in video interviews.

DetailsMotivation: Address the scarcity of behavioral data (gestures, eye movements, social signals) which is expensive to annotate and privacy-sensitive, by creating synthetic generation infrastructure as an alternative to real data collection.

Method: Extract real human iris trajectories from reference videos and replay them on a 3D eye movement simulator via headless browser automation to generate synthetic labeled eye movement video data.

Result: Created final_dataset_v1 with 144 sessions (12 hours of synthetic eye movement video at 25fps) for script-reading detection. Generated trajectories preserve temporal dynamics (KS D < 0.14). Found bounded sensitivity in 3D simulator due to absence of coupled head movement.

Conclusion: The pipeline, dataset, and tools enable downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems, with insights for future simulator design.

Abstract: Large vision-language models have achieved remarkable capabilities by training on massive internet-scale data, yet a fundamental asymmetry persists: while LLMs can leverage self-supervised pretraining on abundant text and image data, the same is not true for many behavioral modalities. Video-based behavioral data – gestures, eye movements, social signals – remains scarce, expensive to annotate, and privacy-sensitive. A promising alternative is simulation: replace real data collection with controlled synthetic generation to produce automatically labeled data at scale. We introduce infrastructure for this paradigm applied to eye movement, a behavioral signal with applications across vision-language modeling, virtual reality, robotics, accessibility systems, and cognitive science. We present a pipeline for generating synthetic labeled eye movement video by extracting real human iris trajectories from reference videos and replaying them on a 3D eye movement simulator via headless browser automation. Applying this to the task of script-reading detection during video interviews, we release final_dataset_v1: 144 sessions (72 reading, 72 conversation) totaling 12 hours of synthetic eye movement video at 25fps. Evaluation shows that generated trajectories preserve the temporal dynamics of the source data (KS D < 0.14 across all metrics). A matched frame-by-frame comparison reveals that the 3D simulator exhibits bounded sensitivity at reading-scale movements, attributable to the absence of coupled head movement – a finding that informs future simulator design. The pipeline, dataset, and evaluation tools are released to support downstream behavioral classifier development at the intersection of behavioral modeling and vision-language systems.
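
The fidelity check behind the reported "KS D < 0.14" is the two-sample Kolmogorov-Smirnov statistic, which can be computed directly (this is the standard definition, applied here only as an illustration of the evaluation; the paper's exact per-metric samples are not reproduced):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS D: the maximum absolute gap between the two
    empirical CDFs, evaluated at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

D ranges from 0 (identical distributions) to 1 (fully separated), so a bound of D < 0.14 across all metrics indicates the replayed trajectories closely track the source dynamics.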

[210] Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

Pu Wang, Zhixuan Mao, Jialu Li, Zhuoran Zheng, Dianjie Lu, Youshan Zhang

Main category: cs.CV

TL;DR: A novel diagnostic system for canine pneumothorax combining vision-language model-guided flow matching for precise segmentation with random matrix theory for statistical detection of pathological signals.

DetailsMotivation: Address challenges in automatic canine pneumothorax diagnosis including data scarcity and need for trustworthy models by creating a public dataset and developing a synergistic approach combining precise localization with statistical detection.

Method: Two-stage approach: 1) Localization using VLM-guided iterative Flow Matching for superior boundary accuracy in segmentation, 2) Detection using Random Matrix Theory to analyze isolated lesion features, modeling healthy tissue as predictable noise and identifying pathological signals through outlier eigenvalues.

Result: Developed a highly accurate and interpretable diagnostic system with high-fidelity localization crucial for purifying pathological signals and maximizing detection sensitivity.

Conclusion: Synergy of generative segmentation (Flow Matching) and first-principles statistical analysis (RMT) yields an effective diagnostic paradigm for canine pneumothorax, with potential broader applications.

Abstract: Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu-Wang-alt/Canine-pneumothorax).
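
The RMT-style detection step can be sketched with the Marchenko-Pastur law (the noise-variance assumption and the plain covariance estimator below are simplifications, not the paper's exact detector): eigenvalues of the lesion-feature covariance that exceed the MP upper edge are treated as a non-random pathological signal rather than healthy-tissue noise.

```python
import numpy as np

def spectral_outliers(features, sigma2=1.0):
    """Return covariance eigenvalues above the Marchenko-Pastur
    upper edge sigma2 * (1 + sqrt(p/n))^2 for n samples of p
    features, together with that edge."""
    n, p = features.shape
    cov = features.T @ features / n
    eigvals = np.linalg.eigvalsh(cov)
    lambda_plus = sigma2 * (1 + np.sqrt(p / n)) ** 2  # MP upper edge
    return eigvals[eigvals > lambda_plus], lambda_plus
```

Pure noise keeps its spectrum (up to finite-sample fluctuation) inside the MP bulk, while a planted correlated component produces an eigenvalue that escapes the edge, which is exactly the "statistically significant outlier" the detector looks for.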

[211] A Weak-Signal-Aware Framework for Subsurface Defect Detection: Mechanisms for Enhancing Low-SCR Hyperbolic Signatures

Wenbo Zhang, Zekun Long, Zican Liu, Yangchen Zeng, Keyi Hu

Main category: cs.CV

TL;DR: WSA-Net is a lightweight GPR framework that enhances detection of faint subsurface defects by preserving weak signals, suppressing clutter, reconstructing geometric features, and resolving semantic ambiguities through specialized attention mechanisms.

DetailsMotivation: Ground Penetrating Radar subsurface defect detection faces challenges with "weak signals" - faint diffraction hyperbolas with low signal-to-clutter ratios, high wavefield similarity, and geometric degradation. Existing lightweight detectors prioritize efficiency over sensitivity, failing to preserve low-frequency structures or decouple heterogeneous clutter.

Method: WSA-Net integrates four mechanisms: 1) Signal preservation using partial convolutions, 2) Clutter suppression via heterogeneous grouping attention, 3) Geometric reconstruction to sharpen hyperbolic arcs, and 4) Context anchoring to resolve semantic ambiguities. The framework moves beyond simple parameter reduction to enhance faint signatures through physical-feature reconstruction.

Result: On the RTST dataset, WSA-Net achieves 0.6958 mAP@0.5 and 164 FPS with only 2.412 M parameters. The results prove that signal-centric awareness in lightweight architectures effectively reduces false negatives in infrastructure inspection.

Conclusion: The proposed WSA-Net framework demonstrates that signal-centric awareness in lightweight architectures can effectively enhance detection of faint subsurface defects in GPR data, reducing false negatives while maintaining computational efficiency for practical infrastructure inspection applications.

Abstract: Subsurface defect detection via Ground Penetrating Radar is challenged by “weak signals”: faint diffraction hyperbolas with low signal-to-clutter ratios, high wavefield similarity, and geometric degradation. Existing lightweight detectors prioritize efficiency over sensitivity, failing to preserve low-frequency structures or decouple heterogeneous clutter. We propose WSA-Net, a framework designed to enhance faint signatures through physical-feature reconstruction. Moving beyond simple parameter reduction, WSA-Net integrates four mechanisms: Signal preservation using partial convolutions; Clutter suppression via heterogeneous grouping attention; Geometric reconstruction to sharpen hyperbolic arcs; Context anchoring to resolve semantic ambiguities. Evaluations on the RTST dataset show WSA-Net achieves 0.6958 mAP@0.5 and 164 FPS with only 2.412 M parameters. Results prove that signal-centric awareness in lightweight architectures effectively reduces false negatives in infrastructure inspection.

[212] CLIP-Guided Data Augmentation for Night-Time Image Dehazing

Xining Ge, Weijun Yuan, Gengjia Chang, Xuyang Li, Shuhong Liu

Main category: cs.CV

TL;DR: A unified framework for nighttime image dehazing that combines domain-aligned data construction, two-stage training of NAFNet, and inference-time enhancement techniques to address complex degradation patterns with limited supervision.

DetailsMotivation: Nighttime image dehazing faces more complex degradation than daytime due to haze scattering coupled with low illumination, non-uniform lighting, and strong light interference. Limited supervision aggravates domain drift and training instability since target-domain samples are scarce, and naive introduction of external data may weaken adaptation due to distribution mismatch.

Method: 1) Uses pre-trained CLIP visual encoder to screen candidate external samples by similarity to construct domain-aligned training data; 2) Trains NAFNet in two stages: first adapting to target domain, then expanding to broader degradation patterns; 3) At inference, combines TLC, x8 self-ensemble, and weighted snapshot fusion for output stability.

Result: The framework offers a practical and effective pipeline for nighttime image dehazing without complex network redesign, addressing the challenge of limited supervision and domain drift in complex nighttime degradation scenarios.

Conclusion: Rather than relying on complex network redesign, the proposed framework provides a practical solution for nighttime image dehazing by integrating domain-aligned data construction, stage-wise training, and inference-time enhancement techniques.

Abstract: Nighttime image dehazing faces a more complex degradation pattern than its daytime counterpart, as haze scattering couples with low illumination, non-uniform lighting, and strong light interference. Under limited supervision, this complexity aggravates domain drift and training instability, since target-domain samples are scarce while naively introducing external data may weaken adaptation due to distribution mismatch. This paper presents our solution to the NTIRE 2026 Night Time Image Dehazing Challenge, built as a unified framework that integrates domain-aligned data construction, stage-wise training, and inference-time enhancement. Specifically, a pre-trained CLIP visual encoder screens candidate external samples by similarity to construct training data closer to the target domain. NAFNet is then trained in two stages, first adapting to the target domain and then expanding to broader degradation patterns. At inference time, TLC, x8 self-ensemble, and weighted snapshot fusion are combined to improve output stability. Rather than relying on complex network redesign, the proposed framework offers a practical and effective pipeline for nighttime image dehazing.
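The CLIP-based screening step amounts to cosine-similarity ranking of external candidates against a target-domain anchor. Below, random vectors stand in for real CLIP embeddings, and the mean-anchor and top-k choices are assumptions; in the paper a pre-trained CLIP visual encoder supplies the features:

```python
import numpy as np

def screen_by_similarity(target_emb, candidate_emb, top_k=2):
    """Rank external candidates by cosine similarity to the mean target
    embedding; returns (indices of top_k candidates, all similarities)."""
    anchor = target_emb.mean(axis=0)
    anchor /= np.linalg.norm(anchor)
    cand = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    sims = cand @ anchor
    return np.argsort(sims)[::-1][:top_k], sims

rng = np.random.default_rng(1)
target = rng.normal(loc=1.0, size=(8, 16))        # stand-in for target-domain features
aligned = rng.normal(loc=1.0, size=(3, 16))       # candidates near the target domain
misaligned = rng.normal(loc=-1.0, size=(3, 16))   # distribution-mismatched candidates
candidates = np.vstack([aligned, misaligned])

keep, sims = screen_by_similarity(target, candidates, top_k=3)
print(sorted(keep.tolist()))  # the aligned candidates (indices 0-2) rank first
```

Thresholding `sims` instead of taking a fixed top-k is the other common variant; either way the effect is to keep external data close to the target distribution.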

[213] Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Yanming Xiu, Zhengayuan Jiang, Neil Zhenqiang Gong, Maria Gorlatova

Main category: cs.CV

TL;DR: ContrAR: A benchmark for evaluating vision-language models’ robustness against contradictory virtual content attacks in augmented reality environments.

DetailsMotivation: As AR becomes more integrated into daily life, security and reliability become critical challenges. Contradictory virtual content attacks pose unique risks by misleading users, creating semantic confusion, or delivering harmful information through malicious virtual elements in AR environments.

Method: Systematically model contradictory virtual content attacks and create ContrAR benchmark containing 312 real-world AR videos validated by 10 human participants. Benchmark 11 VLMs (commercial and open-source) to evaluate their robustness against virtual content manipulation and contradiction.

Result: Current VLMs show reasonable understanding of contradictory virtual content but need improvement in detecting and reasoning about adversarial content manipulations in AR. Balancing detection accuracy and latency remains challenging.

Conclusion: ContrAR provides a valuable benchmark for evaluating VLM robustness in AR security contexts, revealing current limitations and highlighting the need for improved models that can handle contradictory virtual content attacks while maintaining practical latency constraints.

Abstract: Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user’s view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

[214] Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation

Chenxin Yuan, Shoupeng Chen, Haojiang Ye, Yiming Miao, Limei Peng, Pin-Han Ho

Main category: cs.CV

TL;DR: GCNV-Net is a novel 3D medical segmentation framework that integrates a Tri-directional Dynamic Nonvoid Voxel Transformer, Geometrical Cross-Attention module, and Nonvoid Voxelization to achieve both high accuracy and computational efficiency across diverse medical imaging modalities.

DetailsMotivation: Existing 3D medical segmentation methods often fail to achieve both high accuracy and computational efficiency across diverse anatomies and imaging modalities, creating a need for a balanced solution suitable for clinical deployment.

Method: The framework combines three key components: 1) Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT) that partitions relevant voxels along three orthogonal anatomical planes to model 3D spatial dependencies; 2) Geometrical Cross-Attention (GCA) module that incorporates geometric positional information during multi-scale feature fusion; 3) Nonvoid Voxelization that processes only informative regions to reduce redundant computation.

Result: GCNV-Net achieves state-of-the-art performance across multiple benchmarks (BraTS2021, ACDC, MSD Prostate, MSD Pancreas, AMOS2022), outperforming existing methods by 0.65% on Dice, 0.63% on IoU, 1% on NSD, and 14.5% on HD95. It reduces FLOPs by 56.13% and inference latency by 68.49% compared to conventional voxelization.

Conclusion: GCNV-Net effectively balances accuracy and efficiency for 3D medical segmentation, demonstrating robustness across diverse organs, disease conditions, and imaging modalities with strong potential for clinical deployment.

Abstract: Accurate segmentation of 3D medical scans is crucial for clinical diagnostics and treatment planning, yet existing methods often fail to achieve both high accuracy and computational efficiency across diverse anatomies and imaging modalities. To address these challenges, we propose GCNV-Net, a novel 3D medical segmentation framework that integrates a Tri-directional Dynamic Nonvoid Voxel Transformer (3DNVT), a Geometrical Cross-Attention module (GCA), and Nonvoid Voxelization. The 3DNVT dynamically partitions relevant voxels along the three orthogonal anatomical planes, namely the transverse, sagittal, and coronal planes, enabling effective modeling of complex 3D spatial dependencies. The GCA mechanism explicitly incorporates geometric positional information during multi-scale feature fusion, significantly enhancing fine-grained anatomical segmentation accuracy. Meanwhile, Nonvoid Voxelization processes only informative regions, greatly reducing redundant computation without compromising segmentation quality, and achieves a 56.13% reduction in FLOPs and a 68.49% reduction in inference latency compared to conventional voxelization. We evaluate GCNV-Net on multiple widely used benchmarks: BraTS2021, ACDC, MSD Prostate, MSD Pancreas, and AMOS2022. Our method achieves state-of-the-art segmentation performance across all datasets, outperforming the best existing methods by 0.65% on Dice, 0.63% on IoU, 1% on NSD, and relatively 14.5% on HD95. All results demonstrate that GCNV-Net effectively balances accuracy and efficiency, and its robustness across diverse organs, disease conditions, and imaging modalities highlights strong potential for clinical deployment.
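Nonvoid voxelization is easy to illustrate: index only the voxels that carry signal and let downstream stages operate on that sparse set. A toy sketch, where the threshold and the small "organ" volume are invented for illustration:

```python
import numpy as np

def nonvoid_voxelize(volume, threshold=0.0):
    """Keep only informative (non-void) voxels of a 3D scan, returning their
    coordinates and values so later stages skip the empty grid."""
    mask = volume > threshold
    coords = np.argwhere(mask)        # (n_nonvoid, 3) voxel indices
    values = volume[mask]             # matching intensities
    return coords, values

rng = np.random.default_rng(0)
volume = np.zeros((32, 32, 32))
volume[10:14, 10:14, 10:14] = rng.uniform(0.5, 1.0, size=(4, 4, 4))  # tiny organ

coords, values = nonvoid_voxelize(volume)
dense_voxels = volume.size
print(len(values), dense_voxels)  # 64 vs 32768: ~99.8% of voxels are skipped
```

The reported FLOP and latency reductions come from exactly this sparsity: most of a medical volume is background, so dense processing wastes the bulk of the compute.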

[215] Cross-Resolution Diffusion Models via Network Pruning

Jiaxuan Ren, Junhan Zhu, Huan Wang

Main category: cs.CV

TL;DR: CR-Diff improves cross-resolution consistency in diffusion models by pruning adverse weights and amplifying pruned outputs, enhancing image quality at unseen resolutions while preserving default resolution performance.

DetailsMotivation: Diffusion models trained at fixed resolutions degrade when generating images at out-of-training resolutions due to resolution-dependent parameter behaviors where weights that work well at default resolutions become adverse at different spatial scales, weakening semantic alignment and causing structural instability.

Method: Two-stage approach: 1) Block-wise pruning to selectively eliminate adverse weights that cause resolution-dependent issues, 2) Pruned output amplification to further purify the pruned predictions and enhance quality.

Result: Extensive experiments show CR-Diff improves perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions while largely preserving performance at default resolutions. Also supports prompt-specific refinement for on-demand quality enhancement.

Conclusion: CR-Diff effectively addresses resolution-dependent parameter issues in diffusion models through targeted pruning and amplification, enabling consistent image generation quality across different resolutions without retraining.

Abstract: Diffusion models have demonstrated impressive image synthesis performance, yet many UNet-based models are trained at certain fixed resolutions. Their quality tends to degrade when generating images at out-of-training resolutions. We trace this issue to resolution-dependent parameter behaviors, where weights that function well at the default resolution can become adverse when spatial scales shift, weakening semantic alignment and causing structural instability in the UNet architecture. Based on this analysis, this paper introduces CR-Diff, a novel method that improves the cross-resolution visual consistency by pruning some parameters of the diffusion model. Specifically, CR-Diff has two stages. It first performs block-wise pruning to selectively eliminate adverse weights. Then, a pruned output amplification is conducted to further purify the pruned predictions. Empirically, extensive experiments suggest that CR-Diff can improve perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions, while largely preserving the performance at default resolutions. Additionally, CR-Diff supports prompt-specific refinement, enabling quality enhancement on demand.
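The two CR-Diff stages — score-based weight pruning followed by output amplification — can be mimicked on a single linear layer. Everything here (the per-weight "adversity" scores, the pruning fraction, the amplification factor) is an invented stand-in for the paper's learned, block-wise procedure:

```python
import numpy as np

def prune_and_amplify(W, x, scores, prune_frac=0.2, amp=1.1):
    """Stage 1: zero the prune_frac of weights with the highest adversity
    scores. Stage 2: amplify the pruned layer's output."""
    k = max(1, int(prune_frac * W.size))
    cutoff = np.sort(scores.ravel())[-k]   # score of the k-th most adverse weight
    mask = scores < cutoff                 # keep only low-score weights
    return amp * ((W * mask) @ x), mask

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
scores = rng.uniform(size=W.shape)         # hypothetical per-weight adversity

y, mask = prune_and_amplify(W, x, scores)
print(mask.sum(), W.size)  # 103 128 -> 25 weights pruned at prune_frac=0.2
```

In CR-Diff the scores are resolution-dependent (identifying weights that hurt generation at unseen spatial scales), which is what makes the method training-free at deployment.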

[216] Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang, Mengmeng Li, Ziyi Yang, Yifan Sun, Yongqi Sun, Hanyun Wang

Main category: cs.CV

TL;DR: STSF-Net is a multimodal change detection framework for optical and SAR remote sensing images that jointly models modality-specific and spatio-temporal features to enhance change representations and suppress pseudo-changes.

DetailsMotivation: Existing multimodal change detection approaches have limitations in cross-modal interaction and exploiting modality-specific characteristics, leading to insufficient modeling of fine-grained change information and hindering precise detection of semantic changes in multimodal data.

Method: Proposes STSF-Net that jointly models modality-specific features (to capture genuine semantic change signals) and spatio-temporal common features (to suppress pseudo-changes). Introduces an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors from pre-trained foundational models for semantic-guided adaptive fusion.

Result: Outperforms state-of-the-art methods on Delta-SN6, BRIGHT, and Wuhan-Het datasets by 3.21%, 1.08%, and 1.32% in mIoU respectively. Also introduces Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark with VHR fully polarimetric SAR and optical images.

Conclusion: STSF-Net effectively addresses limitations in multimodal change detection by enhancing cross-modal interaction and exploiting modality-specific characteristics, demonstrating superior performance on multiple datasets.

Abstract: Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing (RS) data, demonstrating significant application value in land use monitoring, disaster assessment, and urban sustainable development. However, existing MMCD approaches in the literature exhibit limitations in cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes in multimodal data. To address the above problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts feature importance based on semantic priors obtained from pre-trained foundational models, enabling semantic-guided adaptive fusion of multi-modal information. In addition, we introduce the Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution (VHR) fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan-Het datasets demonstrate that our method outperforms the state-of-the-art (SOTA) by 3.21%, 1.08%, and 1.32% in mIoU, respectively. The associated code and Delta-SN6 dataset will be released at: https://github.com/liuxuanguang/STSF-Net.

[217] EchoAgent: Towards Reliable Echocardiography Interpretation with “Eyes”, “Hands” and “Minds”

Qin Wang, Zhiqing He, Yu Liu, Bowen Guo, Zeju Li, Miao Zhao, Wenhao Ju, Zhiling Luo, Xianhong Shu, Yi Guo, Yuanyuan Wang

Main category: cs.CV

TL;DR: EchoAgent is an agentic system for end-to-end echocardiography interpretation that coordinates visual observation, manual measurement, and expert reasoning like a cardiac sonographer.

DetailsMotivation: Current deep learning approaches for echocardiography analysis are limited to specific skills (either visual segmentation or reasoning), lacking the comprehensive coordination of eyes, hands, and minds needed for reliable clinical interpretation.

Method: Three components: 1) Expertise-driven cognition engine that assimilates Echo guidelines into structured knowledge, 2) Hierarchical collaboration toolkit for parsing videos, identifying views, segmentation, and measurement, 3) Orchestrated reasoning hub integrating multimodal evidence with knowledge base for explainable inference.

Result: Achieves optimal performance across diverse structure analyses on CAMUS and MIMIC-EchoQA datasets, with overall accuracy up to 80.00% across 48 echocardiographic views spanning 14 cardiac anatomical regions.

Conclusion: EchoAgent enables a single system to learn, observe, operate, and reason like an echocardiologist, showing great promise for reliable echocardiography interpretation through coordinated multimodal capabilities.

Abstract: Reliable interpretation of echocardiography (Echo) is crucial for assessing cardiac function, which demands that clinicians synchronously orchestrate multiple capabilities, including visual observation (eyes), manual measurement (hands), and expert knowledge learning and reasoning (minds). While current task-specific deep-learning approaches and multimodal large language models have demonstrated promise in assisting Echo analysis through automated segmentation or reasoning, they remain focused on restricted skills, i.e., eyes-hands or eyes-minds, thereby limiting clinical reliability and utility. To address these issues, we propose EchoAgent, an agentic system tailored for end-to-end Echo interpretation, which achieves a fully coordinated eyes-hands-minds workflow that learns, observes, operates, and reasons like a cardiac sonographer. First, we introduce an expertise-driven cognition engine where our agent can automatically assimilate credible Echo guidelines into a structured knowledge base, thus constructing an Echo-customized mind. Second, we devise a hierarchical collaboration toolkit to endow EchoAgent with eyes-hands, which can automatically parse Echo video streams, identify cardiac views, perform anatomical segmentation, and quantitative measurement. Third, we integrate the perceived multimodal evidence with the exclusive knowledge base into an orchestrated reasoning hub to conduct explainable inferences. We evaluate EchoAgent on CAMUS and MIMIC-EchoQA datasets, which cover 48 distinct echocardiographic views spanning 14 cardiac anatomical regions. Experimental results show that EchoAgent achieves optimal performance across diverse structure analyses, yielding overall accuracy of up to 80.00%. Importantly, EchoAgent empowers a single system with abilities to learn, observe, operate and reason like an echocardiologist, which holds great promise for reliable Echo interpretation.

[218] Evaluation Before Generation: A Paradigm for Robust Multimodal Sentiment Analysis with Missing Modalities

Rongfei Chen, Tingting Zhang, Xiaoyu Shen, Wei Zhang

Main category: cs.CV

TL;DR: A Prompt-based Missing Modality Adaptation framework for multimodal sentiment analysis that dynamically evaluates missing modalities, disentangles prompts, weights them adaptively, and connects them globally to handle missing data robustly.

DetailsMotivation: The missing modality problem significantly degrades accuracy in real-world multimodal sentiment analysis. Existing approaches have limitations: lack of rigorous evaluation for generating missing modalities, and insufficient exploration of structural dependencies among multimodal prompts and their global coherence.

Method: Proposes ProMMA framework with four key components: 1) Missing Modality Evaluator to dynamically assess importance of missing modalities using pretrained models and pseudo labels; 2) Modality-invariant Prompt Disentanglement to decompose shared prompts into modality-specific private prompts; 3) Dynamic Prompt Weighting to compute mutual information-based weights from cross-attention outputs; 4) Multi-level Prompt Dynamic Connection to integrate shared prompts with self-attention outputs through residual connections.

Result: Extensive experiments on CMU MOSI, CMU MOSEI, and CH-SIMS benchmarks demonstrate state-of-the-art performance and stable results under diverse missing modality settings.

Conclusion: The proposed ProMMA framework effectively addresses missing modality challenges in multimodal sentiment analysis through dynamic evaluation, prompt disentanglement, adaptive weighting, and global connection, achieving robust performance across various missing modality scenarios.

Abstract: The missing modality problem poses a fundamental challenge in multimodal sentiment analysis, significantly degrading model accuracy and generalization in real-world scenarios. Existing approaches primarily improve robustness through prompt learning and pre-trained models. However, two limitations remain. First, the necessity of generating missing modalities lacks rigorous evaluation. Second, the structural dependencies among multimodal prompts and their global coherence are insufficiently explored. To address these issues, a Prompt-based Missing Modality Adaptation framework is proposed. A Missing Modality Evaluator is introduced at the input stage to dynamically assess the importance of missing modalities using pretrained models and pseudo labels, thereby avoiding low-quality data imputation. Building on this, a Modality-invariant Prompt Disentanglement module decomposes shared prompts into modality-specific private prompts to capture intrinsic local correlations and improve representation quality. In addition, a Dynamic Prompt Weighting module computes mutual-information-based weights from cross-attention outputs to adaptively suppress interference from missing modalities. To enhance global consistency, a Multi-level Prompt Dynamic Connection module integrates shared prompts with self-attention outputs through residual connections, leveraging global prompt priors to strengthen key guidance features. Extensive experiments on three public benchmarks, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, demonstrate that the proposed framework achieves state-of-the-art performance and stable results under diverse missing modality settings. The implementation is available at https://github.com/rongfei-chen/ProMMA

[219] Physics-Aligned Spectral Mamba: Decoupling Semantics and Dynamics for Few-Shot Hyperspectral Target Detection

Luqi Gong, Qixin Xie, Yue Chen, Ziqiang Chen, Fanda Fan, Shuai Zhao, Chao Li

Main category: cs.CV

TL;DR: SpecMamba: A parameter-efficient few-shot hyperspectral target detection framework using frequency-aware Mamba adapters and spectral priors for cross-domain generalization.

DetailsMotivation: Existing meta-learning approaches for hyperspectral target detection face challenges with inefficient full-parameter fine-tuning, overfitting, and ignoring the frequency-domain structure and spectral band continuity of hyperspectral data, limiting spectral adaptation and cross-domain generalization.

Method: Proposes SpecMamba with three key components: 1) Discrete Cosine Transform Mamba Adapter (DCTMA) that projects spectral features into frequency domain using DCT and leverages Mamba’s linear-complexity state-space recursion; 2) Prior-Guided Tri-Encoder (PGTE) that incorporates laboratory spectral priors to guide adapter optimization; 3) Self-Supervised Pseudo-Label Mapping (SSPLM) for test-time adaptation with uncertainty-aware sampling and dual-path consistency constraints.

Result: Extensive experiments on multiple public datasets demonstrate that SpecMamba consistently outperforms state-of-the-art methods in detection accuracy and cross-domain generalization.

Conclusion: SpecMamba effectively addresses the challenges of few-shot hyperspectral target detection by decoupling stable semantic representation from agile spectral adaptation, leveraging frequency-domain processing and parameter-efficient adaptation strategies.

Abstract: Meta-learning facilitates few-shot hyperspectral target detection (HTD), but adapting deep backbones remains challenging. Full-parameter fine-tuning is inefficient and prone to overfitting, and existing methods largely ignore the frequency-domain structure and spectral band continuity of hyperspectral data, limiting spectral adaptation and cross-domain generalization. To address these challenges, we propose SpecMamba, a parameter-efficient and frequency-aware framework that decouples stable semantic representation from agile spectral adaptation. Specifically, we introduce a Discrete Cosine Transform Mamba Adapter (DCTMA) on top of frozen Transformer representations. By projecting spectral features into the frequency domain via DCT and leveraging Mamba’s linear-complexity state-space recursion, DCTMA explicitly captures global spectral dependencies and band continuity while avoiding the redundancy of full fine-tuning. Furthermore, to address prototype drift caused by limited sample sizes, we design a Prior-Guided Tri-Encoder (PGTE) that allows laboratory spectral priors to guide the optimization of the learnable adapter without disrupting the stable semantic feature space. Finally, a Self-Supervised Pseudo-Label Mapping (SSPLM) strategy is developed for test-time adaptation, enabling efficient decision boundary refinement through uncertainty-aware sampling and dual-path consistency constraints. Extensive experiments on multiple public datasets demonstrate that SpecMamba consistently outperforms state-of-the-art methods in detection accuracy and cross-domain generalization.
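The DCT projection at the heart of DCTMA is an orthonormal change of basis along the spectral axis: smooth band-to-band continuity shows up as energy concentrated in low-frequency coefficients. A numpy-only sketch, where the band count and synthetic spectra are purely illustrative:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th frequency atom."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def spectral_to_frequency(features):
    """Project (n_pixels, n_bands) spectral vectors into the DCT domain."""
    C = dct_matrix(features.shape[-1])
    return features @ C.T

rng = np.random.default_rng(0)
bands = np.linspace(0.0, 1.0, 32)
spectra = np.sin(2 * np.pi * bands)[None, :] + 0.01 * rng.standard_normal((5, 32))

coeffs = spectral_to_frequency(spectra)
low_energy = np.sum(coeffs[:, :8] ** 2) / np.sum(coeffs ** 2)
print(round(low_energy, 3))  # smooth spectra put nearly all energy in low frequencies

recon = coeffs @ dct_matrix(32)   # the orthonormal basis inverts exactly
```

Operating on such frequency coefficients lets an adapter model global spectral dependencies without touching the frozen backbone weights.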

[220] High-Resolution Single-Shot Polarimetric Imaging Made Easy

Shuangfan Zhou, Chu Zhou, Heng Guo, Youwei Lyu, Boxin Shi, Zhanyu Ma, Imari Sato

Main category: cs.CV

TL;DR: EasyPolar: A multi-view polarimetric imaging system using three synchronized RGB cameras to capture polarization information without sacrificing spatial resolution, overcoming limitations of traditional DoFP sensors.

DetailsMotivation: Existing Division-of-Focal-Plane (DoFP) polarization sensors suffer from reduced spatial resolution and artifacts due to spatial multiplexing. There's a need for single-shot polarization capture without sacrificing spatial quality for practical applications.

Method: A triple-camera setup with three synchronized RGB cameras captures one unpolarized view and two polarized views with distinct orientations. A confidence-guided polarization reconstruction network performs multi-modal feature fusion with confidence-aware physical guidance to address misalignment and enforce geometric constraints.

Result: The method achieves high-quality polarization reconstruction results and benefits various downstream tasks, demonstrating effectiveness in overcoming DoFP sensor limitations.

Conclusion: EasyPolar provides a practical solution for single-shot polarimetric imaging without sacrificing spatial resolution, offering richer physical cues beyond RGB images through multi-view fusion and learned reconstruction.

Abstract: Polarization-based vision has gained increasing attention for providing richer physical cues beyond RGB images. While achieving single-shot capture is highly desirable for practical applications, existing Division-of-Focal-Plane (DoFP) sensors inherently suffer from reduced spatial resolution and artifacts due to their spatial multiplexing mechanism. To overcome these limitations without sacrificing the snapshot capability, we propose EasyPolar, a multi-view polarimetric imaging framework. Our system is grounded in the physical insight that three independent intensity measurements are sufficient to fully characterize linear polarization. Guided by this, we design a triple-camera setup consisting of three synchronized RGB cameras that capture one unpolarized view and two polarized views with distinct orientations. Building upon this hardware design, we further propose a confidence-guided polarization reconstruction network to address the potential misalignment in multi-view fusion. The network performs multi-modal feature fusion under a confidence-aware physical guidance mechanism, which effectively suppresses warping-induced artifacts and enforces explicit geometric constraints on the solution space. Experimental results demonstrate that our method achieves high-quality results and benefits various downstream tasks.
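The physical insight — three independent intensity measurements fully determine linear polarization — is the standard Stokes-vector argument: an unpolarized camera reads S0, and a polarizer at angle θ reads ½(S0 + S1·cos 2θ + S2·sin 2θ), so one unpolarized plus two polarized views give a solvable 3×3 linear system per pixel. A sketch with illustrative angles and Stokes values (not the paper's calibration):

```python
import numpy as np

def measure(S, thetas):
    """Build the measurement matrix and intensities: an unpolarized camera
    sees S0; a linear polarizer at theta sees 0.5*(S0 + S1 cos2t + S2 sin2t)."""
    rows = [[1.0, 0.0, 0.0]]  # unpolarized view
    rows += [[0.5, 0.5 * np.cos(2 * t), 0.5 * np.sin(2 * t)] for t in thetas]
    A = np.array(rows)
    return A, A @ np.asarray(S)

S_true = np.array([1.0, 0.3, -0.2])            # linear Stokes vector (S0, S1, S2)
A, intensities = measure(S_true, thetas=[0.0, np.pi / 4])
S_rec = np.linalg.solve(A, intensities)

dolp = np.hypot(S_rec[1], S_rec[2]) / S_rec[0]  # degree of linear polarization
aolp = 0.5 * np.arctan2(S_rec[2], S_rec[1])     # angle of linear polarization
print(np.allclose(S_rec, S_true))               # True
```

The reconstruction network in the paper handles what this sketch ignores: the three views are not co-located, so per-pixel correspondence must be estimated before the system can be solved.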

[221] WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

Yizhuo Xu, Chaojian Yu, Yuanjie Shao, Tongliang Liu, Qinmu Peng, Xinge You

Main category: cs.CV

TL;DR: WRF4CIR introduces weight-regularized fine-tuning with adversarial perturbations to address overfitting in composed image retrieval when training data is limited.

DetailsMotivation: Current CIR methods based on vision-language pre-trained models suffer from severe overfitting, especially with limited triplet data, leading to poor generalization across models and datasets.

Method: WRF4CIR applies adversarial perturbations to the model weights during fine-tuning, generated in the opposite direction of gradient descent; this makes the training data harder to fit and thereby mitigates overfitting.

Result: Extensive experiments show WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods on benchmark datasets.

Conclusion: Weight regularization through adversarial perturbations effectively addresses overfitting in CIR with limited supervision, improving generalization performance.

Abstract: Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
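The weight perturbation is simple to sketch: before each descent step, nudge the weights opposite to the descent direction (i.e. along the normalized gradient), then compute the update at the perturbed point. A toy linear-regression version — the learning rate, perturbation radius, and quadratic loss are stand-ins for the paper's VLP fine-tuning setup:

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Squared-error loss for a linear model; returns (loss, gradient)."""
    resid = X @ w - y
    return 0.5 * np.mean(resid ** 2), X.T @ resid / len(y)

def perturbed_update(w, X, y, lr=0.1, eps=0.02):
    """Perturb weights along the normalized gradient (opposite to descent),
    then take the descent step using the gradient at the perturbed point."""
    _, g = loss_and_grad(w, X, y)
    w_adv = w + eps * g / (np.linalg.norm(g) + 1e-12)
    _, g_adv = loss_and_grad(w_adv, X, y)
    return w - lr * g_adv

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true

w = np.zeros(4)
for _ in range(200):
    w = perturbed_update(w, X, y)
loss, _ = loss_and_grad(w, X, y)
print(round(loss, 4))  # small residual loss: the perturbation prevents an exact fit
```

Notice the regularizing effect even in this toy: the loss plateaus near, but not at, zero, because the perturbation keeps penalizing sharp fits to the training data.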

[222] Purify-then-Align: Towards Robust Human Sensing under Modality Missing with Knowledge Distillation from Noisy Multimodal Teacher

Pengcheng Weng, Yanyu Qian, Yangxin Xu, Fei Wang

Main category: cs.CV

TL;DR: PTA is a “Purify-then-Align” framework that addresses missing modalities in multimodal sensing by first purifying knowledge sources using meta-learning, then aligning modalities through diffusion-based knowledge distillation.

DetailsMotivation: Robust multimodal human sensing faces challenges from missing modalities, specifically the Representation Gap between heterogeneous data and Contamination Effect from low-quality modalities, which are causally linked and impede representation alignment.

Method: PTA uses a two-stage approach: 1) Meta-learning-driven weighting to dynamically down-weight noisy modalities and purify knowledge sources, 2) Diffusion-based knowledge distillation where a clean teacher (from purified consensus) refines each student modality’s features.

Result: PTA achieves state-of-the-art performance on MM-Fi and XRF55 datasets under pronounced Representation Gap and Contamination Effect, significantly improving robustness of single-modality models in diverse missing-modality scenarios.

Conclusion: The “Purify-then-Align” strategy creates powerful single-modality encoders with cross-modal knowledge, effectively addressing the causal dependency between representation gaps and contamination effects in multimodal sensing.

Abstract: Robust multimodal human sensing must overcome the critical challenge of missing modalities. Two principal barriers are the Representation Gap between heterogeneous data and the Contamination Effect from low-quality modalities. These barriers are causally linked, as the corruption introduced by contamination fundamentally impedes the reduction of representation disparities. In this paper, we propose PTA, a novel “Purify-then-Align” framework that solves this causal dependency through a synergistic integration of meta-learning and knowledge diffusion. To purify the knowledge source, PTA first employs a meta-learning-driven weighting mechanism that dynamically learns to down-weight the influence of noisy, low-contributing modalities. Subsequently, to align different modalities, PTA introduces a diffusion-based knowledge distillation paradigm in which an information-rich clean teacher, formed from this purified consensus, refines the features of each student modality. The ultimate payoff of this “Purify-then-Align” strategy is the creation of exceptionally powerful single-modality encoders imbued with cross-modal knowledge. Comprehensive experiments on the large-scale MM-Fi and XRF55 datasets, under pronounced Representation Gap and Contamination Effect, demonstrate that PTA achieves state-of-the-art performance and significantly improves the robustness of single-modality models in diverse missing-modality scenarios.

[223] BPC-Net: Annotation-Free Skin Lesion Segmentation via Boundary Probability Calibration

Yujie Yao, Yuhaohang He, Junjie Huang, Zhou Liu, Jiangzhao Li, Yan Qiao, Wen Xiao, Yunsen Liang, Xiaofan Li

Main category: cs.CV

TL;DR: BPC-Net: A boundary probability calibration framework for annotation-free skin lesion segmentation that addresses under-confident lesion boundaries through Gaussian Probability Smoothing and auxiliary designs for noisy pseudo-supervision and cross-domain transfer.

DetailsMotivation: Annotation-free skin lesion segmentation faces challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most methods focus on pseudo-label denoising, while boundary probability compression affecting contour completeness receives less attention.

Method: Proposes BPC-Net with Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident boundaries. Adds a feature-decoupled decoder for context suppression, detail recovery, and boundary refinement, plus an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the image-only segmentation path.

Result: Achieves state-of-the-art performance among unsupervised methods on ISIC-2017, ISIC-2018, and PH2 datasets with macro-average Dice coefficient of 85.80% and Jaccard index of 76.97%, approaching supervised reference performance on PH2.

Conclusion: BPC-Net effectively addresses boundary probability under-confidence in annotation-free skin lesion segmentation through probability-space calibration and specialized architectural designs, demonstrating strong performance without manual annotations.

Abstract: Annotation-free skin lesion segmentation is attractive for low-resource dermoscopic deployment. However, its performance remains constrained by three coupled challenges: noisy pseudo-label supervision, unstable transfer under limited target-domain data, and boundary probability under-confidence. Most existing annotation-free methods primarily focus on pseudo-label denoising. In contrast, the effect of compressed boundary probabilities on final mask quality has received less explicit attention, although it directly affects contour completeness and cannot be adequately corrected by global threshold adjustment alone. To address this issue, we propose BPC-Net, a boundary probability calibration framework for annotation-free skin lesion segmentation. The core of the framework is Gaussian Probability Smoothing (GPS), which performs localized probability-space calibration before thresholding to recover under-confident lesion boundaries without inducing indiscriminate foreground expansion. To support this calibration under noisy pseudo-supervision and cross-domain transfer, we further incorporate two auxiliary designs: a feature-decoupled decoder that separately handles context suppression, detail recovery, and boundary refinement, and an interaction-branch adaptation strategy that updates only the pseudo-label interaction branch while preserving the deployed image-only segmentation path. Under a strictly annotation-free protocol, no manual masks are used during training or target-domain adaptation, and validation labels, when available, are used only for final operating-point selection. Experiments on ISIC-2017, ISIC-2018, and PH2 show that the proposed framework achieves state-of-the-art performance among published unsupervised methods, reaching a macro-average Dice coefficient and Jaccard index of 85.80% and 76.97%, respectively, while approaching supervised reference performance on PH2.
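
The core GPS step, smoothing the probability map before thresholding, can be illustrated in a few lines of NumPy. The separable Gaussian kernel, `sigma`, and the toy probability map are assumptions for illustration, not the paper's exact calibration:

```python
import numpy as np

def gaussian_probability_smoothing(prob, sigma=1.0, radius=3):
    """Blur a [0,1] probability map before thresholding (GPS-style sketch).

    Under-confident boundary pixels borrow confidence from neighboring
    high-probability pixels, so a contour broken by thresholding alone
    can be recovered.
    """
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    # Separable convolution: rows, then columns (edge-padded)
    padded = np.pad(prob, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

# A lesion with one under-confident boundary pixel: plain thresholding
# breaks the contour; GPS recovers it without flipping nearby background.
prob = np.zeros((9, 9))
prob[2:7, 2:7] = 0.9       # confident lesion
prob[4, 2] = 0.4           # under-confident boundary pixel
mask_raw = prob > 0.5      # drops the boundary pixel
mask_gps = gaussian_probability_smoothing(prob) > 0.5
```

In this toy example the smoothed value at the weak boundary pixel rises above the 0.5 threshold, while the adjacent background pixel stays below it, matching the stated goal of boundary recovery without indiscriminate foreground expansion.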

[224] ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji

Main category: cs.CV

TL;DR: ID-Selection: A token pruning method for large vision-language models that balances importance and diversity by coupling importance scoring with iterative diversity-aware selection to preserve informative tokens while reducing redundancy.

DetailsMotivation: Existing visual token pruning methods for LVLMs struggle to balance token importance and diversity: importance-based methods retain redundant tokens, while diversity-based methods may overlook informative ones. This trade-off is especially problematic under high reduction ratios, where only a small subset of tokens can be preserved.

Method: ID-Selection couples importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, then high-scoring tokens are selected one by one while progressively suppressing scores of similar tokens to reduce redundancy while preserving informativeness.

Result: Extensive experiments across 5 LVLM backbones and 16 benchmarks show ID-Selection achieves superior performance and efficiency, especially under extreme pruning ratios. On LLaVA-1.5-7B, it prunes 97.2% of visual tokens (retaining only 16), reduces inference FLOPs by over 97%, and preserves 91.8% of original performance without additional training.

Conclusion: ID-Selection provides an effective token selection strategy for efficient LVLM inference that successfully balances importance and diversity, enabling extreme token pruning while maintaining performance.

Abstract: Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
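
The iterative importance-diversity selection can be sketched as follows. The cosine-similarity suppression rule and the `alpha` coefficient are illustrative assumptions; the paper's exact scoring and update may differ:

```python
import numpy as np

def id_select(feats, scores, k, alpha=0.7):
    """Importance-Diversity token selection (sketch of the ID-Selection idea).

    Iteratively pick the highest-scoring token, then suppress the scores of
    tokens similar to it, so later picks favor diversity while the first
    picks favor importance.
    """
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    s = scores.astype(float).copy()
    keep = []
    for _ in range(k):
        i = int(np.argmax(s))
        keep.append(i)
        sim = np.clip(normed @ normed[i], 0.0, 1.0)  # similarity to the pick
        s *= 1.0 - alpha * sim                       # suppress look-alikes
        s[i] = -np.inf                               # never re-pick
    return keep

# Toy: tokens 0 and 1 are near-duplicates with the top scores; token 2 is
# different but mid-scoring. Pure importance would keep {0, 1};
# importance-diversity selection keeps {0, 2}.
feats = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]])
scores = np.array([1.0, 0.95, 0.6, 0.1])
picked = id_select(feats, scores, k=2)
```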

[225] Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization

Dustin Eisenhardt, Timothy Schaumlöffel, Alperen Kantarci, Gemma Roig

Main category: cs.CV

TL;DR: Systematic study of style transfer for domain generalization in computer vision, identifying key design principles and proposing StyleMixDG augmentation method

DetailsMotivation: Address poor generalization of vision models when deployed in real-world settings, especially with synthetic-to-real (Sim2Real) gap. Resolve contradictions in literature about style transfer design choices for domain generalization.

Method: Systematic empirical study isolating three key factors: style pool diversity, texture complexity, and style source choice. Based on findings, propose StyleMixDG - lightweight, model-agnostic augmentation recipe using style mixing.

Result: Findings show: (1) expanding style pool yields larger gains than repeated augmentation with few styles, (2) texture complexity has no significant effect with large pool, (3) diverse artistic styles outperform domain-aligned alternatives. StyleMixDG demonstrates consistent improvements on GTAV→{BDD100k, Cityscapes, Mapillary Vistas} benchmark.

Conclusion: Empirically identified design principles for style transfer in domain generalization translate to practical gains. StyleMixDG provides effective, lightweight augmentation strategy without architectural modifications.

Abstract: Deep learning models for computer vision often suffer from poor generalization when deployed in real-world settings, especially when trained on synthetic data due to the well-known Sim2Real gap. Despite the growing popularity of style transfer as a data augmentation strategy for domain generalization, the literature contains unresolved contradictions regarding three key design axes: the diversity of the style pool, the role of texture complexity, and the choice of style source. We present a systematic empirical study that isolates and evaluates each of these factors for driving scene understanding, resolving inconsistencies in prior work. Our findings show that (i) expanding the style pool yields larger gains than repeated augmentation with few styles, (ii) texture complexity has no significant effect when the pool is sufficiently large, and (iii) diverse artistic styles outperform domain-aligned alternatives. Guided by these insights, we derive StyleMixDG (Style-Mixing for Domain Generalization), a lightweight, model-agnostic augmentation recipe that requires no architectural modifications or additional losses. Evaluated on the GTAV $\rightarrow$ {BDD100k, Cityscapes, Mapillary Vistas} benchmark, StyleMixDG demonstrates consistent improvements over strong baselines, confirming that the empirically identified design principles translate into practical gains. The code will be released on GitHub.
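
As a reference point for the style-mixing idea, here is a generic MixStyle/AdaIN-flavored sketch: interpolate the feature statistics of a content sample with those of a style source, then re-normalize. This is not the paper's StyleMixDG recipe; `lam` and the per-map (rather than per-channel) statistics are simplifying assumptions:

```python
import numpy as np

def mix_style(content_feat, style_feat, lam=0.5):
    """Mix the mean/std of a content feature map with a style source.

    Generic statistic-mixing sketch: shift the content's first- and
    second-order statistics toward the style's, which is the mechanism
    style-transfer-based augmentation typically relies on.
    """
    mu_c, sd_c = content_feat.mean(), content_feat.std()
    mu_s, sd_s = style_feat.mean(), style_feat.std()
    mu = lam * mu_c + (1 - lam) * mu_s      # mixed mean
    sd = lam * sd_c + (1 - lam) * sd_s      # mixed std
    return (content_feat - mu_c) / (sd_c + 1e-6) * sd + mu

rng = np.random.default_rng(0)
content = rng.normal(size=(8, 8))                    # e.g., a synthetic-domain feature map
style = rng.normal(loc=2.0, scale=0.5, size=(8, 8))  # e.g., an artistic style source
mixed = mix_style(content, style, lam=0.3)
```

The output keeps the content's spatial pattern but carries the interpolated statistics, which is why a large and diverse style pool translates directly into a wider augmentation distribution.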

[226] Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

Chenyu Xue, Yiran Liu, Mian Zhou, Jionglong Su, Zhixiang Lu

Main category: cs.CV

TL;DR: STGR framework combines LLM reasoning with vision foundation models for language-guided medical image segmentation, using text-to-vision intent distillation and graph reasoning with minimal parameter updates.

DetailsMotivation: Existing multimodal models struggle with semantic ambiguity in clinical reports and anatomical overlaps in low-contrast medical scans, while full fine-tuning on limited medical data causes overfitting.

Method: Proposes the Semantic-Topological Graph Reasoning (STGR) framework, which synergizes the LLaMA-3-V LLM with the MedSAM vision foundation model. Uses Text-to-Vision Intent Distillation (TVID) for precise diagnostic guidance, formulates mask selection as dynamic graph reasoning, and employs Selective Asymmetric Fine-Tuning (SAFT), which updates less than 1% of the parameters.

Result: Achieves 81.5% Dice Similarity Coefficient on LIDC-IDRI dataset, outperforming LISA by over 5%. SAFT strategy provides exceptional cross-fold stability with only 0.6% DSC variance.

Conclusion: STGR framework establishes new SOTA for language-guided pulmonary screening, demonstrating robust performance with minimal parameter updates, enabling context-aware clinical deployment.

Abstract: Medical image segmentation driven by free-text clinical instructions is a critical frontier in computer-aided diagnosis. However, existing multimodal and foundation models struggle with the semantic ambiguity of clinical reports and fail to disambiguate complex anatomical overlaps in low-contrast scans. Furthermore, fully fine-tuning these massive architectures on limited medical datasets invariably leads to severe overfitting. To address these challenges, we propose a novel Semantic-Topological Graph Reasoning (STGR) framework for language-guided pulmonary screening. Our approach elegantly synergizes the reasoning capabilities of large language models (LLaMA-3-V) with the zero-shot delineation of vision foundation models (MedSAM). Specifically, we introduce a Text-to-Vision Intent Distillation (TVID) module to extract precise diagnostic guidance. To resolve anatomical ambiguity, we formulate mask selection as a dynamic graph reasoning problem, where candidate lesions are modeled as nodes and edges capture spatial and semantic affinities. To ensure deployment feasibility, we introduce a Selective Asymmetric Fine-Tuning (SAFT) strategy that updates less than 1% of the parameters. Rigorous 5-fold cross-validation on the LIDC-IDRI and LNDb datasets demonstrates that our framework establishes a new state-of-the-art. Notably, it achieves an 81.5% Dice Similarity Coefficient (DSC) on LIDC-IDRI, outperforming leading LLM-based tools like LISA by over 5%. Crucially, our SAFT strategy acts as a powerful regularizer, yielding exceptional cross-fold stability (0.6% DSC variance) and paving the way for robust, context-aware clinical deployment.

[227] FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath

Main category: cs.CV

TL;DR: FunRec reconstructs functional 3D digital twins from egocentric RGB-D interaction videos, automatically discovering articulated parts, estimating kinematics, and creating simulation-compatible meshes for indoor scenes.

DetailsMotivation: Existing articulated reconstruction methods require controlled setups, multi-state captures, or CAD priors, limiting real-world applicability. There's a need for methods that work directly on in-the-wild human interaction sequences to create interactable 3D scenes.

Method: FunRec operates on egocentric RGB-D interaction videos to automatically discover articulated parts, estimate kinematic parameters, track 3D motion, and reconstruct static and moving geometry in canonical space, producing simulation-compatible meshes.

Result: FunRec significantly outperforms prior work, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and higher reconstruction accuracy on real and simulated benchmarks.

Conclusion: FunRec enables practical reconstruction of functional 3D digital twins from real-world interaction videos, with applications in simulation, affordance mapping, and robot-scene interaction.

Abstract: We present FunRec, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. Unlike existing methods on articulated reconstruction, which rely on controlled setups, multi-state captures, or CAD priors, FunRec operates directly on in-the-wild human interaction sequences to recover interactable 3D scenes. It automatically discovers articulated parts, estimates their kinematic parameters, tracks their 3D motion, and reconstructs static and moving geometry in canonical space, yielding simulation-compatible meshes. Across new real and simulated benchmarks, FunRec surpasses prior work by a large margin, achieving up to +50 mIoU improvement in part segmentation, 5-10 times lower articulation and pose errors, and significantly higher reconstruction accuracy. We further demonstrate applications on URDF/USD export for simulation, hand-guided affordance mapping and robot-scene interaction.

[228] A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting

Yongchuan Cui, Peng Liu

Main category: cs.CV

TL;DR: LLaRS is a unified foundation model for multi-modal, multi-task remote sensing image restoration that uses optimal transport for band alignment and mixture-of-experts architecture for spatial, spectral, and global processing.

DetailsMotivation: Remote sensing imagery suffers from various degradations (clouds, haze, noise, resolution limits, sensor heterogeneity), but existing approaches train separate models per degradation type, lacking a unified solution.

Method: Uses Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, then routes features through three complementary mixture-of-experts layers: convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context. Stabilizes training via step-level dynamic weight adjustment.

Result: LLaRS consistently outperforms seven competitive models across eleven restoration tasks. Parameter-efficient finetuning demonstrates strong transfer capability and adaptation efficiency on unseen data.

Conclusion: LLaRS is the first unified foundation model for multi-modal, multi-task remote sensing low-level vision, showing superior performance and strong generalization capabilities through its novel architecture and large-scale training dataset.

Abstract: Remote sensing imagery suffers from clouds, haze, noise, resolution limits, and sensor heterogeneity. Existing restoration and fusion approaches train separate models per degradation type. In this work, we present Language-conditioned Large-scale Remote Sensing restoration model (LLaRS), the first unified foundation model for multi-modal and multi-task remote sensing low-level vision. LLaRS employs Sinkhorn-Knopp optimal transport to align heterogeneous bands into semantically matched slots, routes features through three complementary mixture-of-experts layers (convolutional experts for spatial patterns, channel-mixing experts for spectral fidelity, and attention experts with low-rank adapters for global context), and stabilizes joint training via step-level dynamic weight adjustment. To train LLaRS, we construct LLaRS1M, a million-scale multi-task dataset spanning eleven restoration and enhancement tasks, integrating real paired observations and controlled synthetic degradations with diverse natural language prompts. Experiments show LLaRS consistently outperforms seven competitive models, and parameter-efficient finetuning experiments demonstrate strong transfer capability and adaptation efficiency on unseen data. Repo: https://github.com/yc-cui/LLaRS
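
The Sinkhorn-Knopp step named in the abstract is a standard algorithm; a generic entropic-OT sketch of soft band-to-slot assignment follows (the uniform marginals, toy cost matrix, and `eps` are assumptions, not LLaRS's configuration):

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Sinkhorn-Knopp: soft assignment of input bands to semantic slots.

    Alternate row/column normalization of exp(-cost/eps) until the
    transport plan is (approximately) doubly stochastic; smaller eps
    gives a sharper, more permutation-like assignment.
    """
    K = np.exp(-cost / eps)
    for _ in range(n_iters):
        K /= K.sum(axis=1, keepdims=True)  # rows sum to 1
        K /= K.sum(axis=0, keepdims=True)  # columns sum to 1
    return K

# Toy cost: band i is cheapest to place in slot i
cost = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])
plan = sinkhorn(cost)
```

The resulting plan concentrates mass on the diagonal, i.e., each heterogeneous band is matched to its semantically closest slot while remaining differentiable.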

[229] SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection

Letian Bai, Chengyu Tao, Juan Du

Main category: cs.CV

TL;DR: SGANet is a unified framework for multimodal multi-view anomaly detection that combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities.

DetailsMotivation: Existing unsupervised methods for multi-view anomaly detection suffer from feature inconsistency due to viewpoint variations and modality discrepancies, which hinders accurate defect identification on complex objects.

Method: SGANet consists of three components: 1) Selective Cross-view Feature Refinement Module for cross-view feature interaction, 2) Semantic-Structural Patch Alignment for semantic alignment across modalities while maintaining structural consistency, and 3) Multi-View Geometric Alignment for aligning geometrically corresponding patches across viewpoints.

Result: Extensive experiments on SiM3D and Eyecandies datasets show SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating effectiveness in realistic industrial scenarios.

Conclusion: SGANet effectively addresses feature inconsistency in multimodal multi-view anomaly detection by jointly modeling feature interaction, semantic/structural consistency, and geometric correspondence, leading to improved performance in industrial applications.

Abstract: Multi-view anomaly detection aims to identify surface defects on complex objects using observations captured from multiple viewpoints. However, existing unsupervised methods often suffer from feature inconsistency arising from viewpoint variations and modality discrepancies. To address these challenges, we propose a Semantic and Geometric Alignment Network (SGANet), a unified framework for multimodal multi-view anomaly detection that effectively combines semantic and geometric alignment to learn physically coherent feature representations across viewpoints and modalities. SGANet consists of three key components. The Selective Cross-view Feature Refinement Module (SCFRM) selectively aggregates informative patch features from adjacent views to enhance cross-view feature interaction. The Semantic-Structural Patch Alignment (SSPA) enforces semantic alignment across modalities while maintaining structural consistency under viewpoint transformations. The Multi-View Geometric Alignment (MVGA) further aligns geometrically corresponding patches across viewpoints. By jointly modeling feature interaction, semantic and structural consistency, and global geometric correspondence, SGANet effectively enhances anomaly detection performance in multimodal multi-view settings. Extensive experiments on the SiM3D and Eyecandies datasets demonstrate that SGANet achieves state-of-the-art performance in both anomaly detection and localization, validating its effectiveness in realistic industrial scenarios.

[230] Towards Athlete Fatigue Assessment from Association Football Videos

Xavier Bou, Nathan Correger, Alexandre Cloots, Cédric Gavage, Silvio Giancola, Cédric Schwartz, François Delvaux, Rudi Cloots, Marc Van Droogenbroeck, Anthony Cioppa

Main category: cs.CV

TL;DR: Player kinematics for fatigue analysis in soccer can be extracted from monocular broadcast video using Game State Reconstruction and a novel processing algorithm, though the estimates are sensitive to trajectory noise and calibration errors.

DetailsMotivation: Current fatigue monitoring in soccer relies on subjective self-reports, lab biomarkers, or intrusive sensors. The paper explores whether monocular broadcast videos can provide sufficient spatio-temporal signals for objective fatigue analysis as a low-cost alternative.

Method: Uses state-of-the-art Game State Reconstruction methods to extract player trajectories in pitch coordinates, proposes a novel kinematics processing algorithm for temporally consistent speed/acceleration estimates, and constructs acceleration-speed profiles for fatigue analysis.

Result: Monocular GSR can recover kinematic patterns compatible with acceleration-speed profiling, but reveals sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. Evaluation on the SoccerNet-GSR benchmark covers both 30-second clips and a complete 45-minute half, assessing short-term reliability and longer-term temporal consistency.

Conclusion: Monocular broadcast video provides a low-cost basis for fatigue analysis in soccer, but methodological challenges remain regarding trajectory noise and calibration errors that need addressing in future research.

Abstract: Fatigue monitoring is central in association football due to its links with injury risk and tactical performance. However, objective fatigue-related indicators are commonly derived from subjective self-reported metrics, biomarkers derived from laboratory tests, or, more recently, intrusive sensors such as heart monitors or GPS tracking data. This paper studies whether monocular broadcast videos can provide spatio-temporal signals of sufficient quality to support fatigue-oriented analysis. Building on state-of-the-art Game State Reconstruction methods, we extract player trajectories in pitch coordinates and propose a novel kinematics processing algorithm to obtain temporally consistent speed and acceleration estimates from reconstructed tracks. We then construct acceleration–speed (A-S) profiles from these signals and analyze their behavior as fatigue-related performance indicators. We evaluate the full pipeline on the public SoccerNet-GSR benchmark, considering both 30-second clips and a complete 45-minute half to examine short-term reliability and longer-term temporal consistency. Our results indicate that monocular GSR can recover kinematic patterns that are compatible with A-S profiling while also revealing sensitivity to trajectory noise, calibration errors, and temporal discontinuities inherent to broadcast footage. These findings support monocular broadcast video as a low-cost basis for fatigue analysis and delineate the methodological challenges for future research.
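
The trajectory-to-kinematics processing described above can be sketched as follows. The moving-average smoother, window size, and frame rate are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def kinematics(xy, fps=25.0, win=5):
    """Speed and acceleration from a pitch-coordinate track (sketch).

    Smooth the (x, y) trajectory with a moving average to tame tracking
    noise, then take finite differences: speed per frame, and the
    derivative of speed as a tangential-acceleration estimate. These are
    the signals an acceleration-speed (A-S) profile is built from.
    """
    k = np.ones(win) / win
    smooth = np.column_stack(
        [np.convolve(xy[:, d], k, mode="valid") for d in range(2)]
    )
    dt = 1.0 / fps
    vel = np.gradient(smooth, dt, axis=0)  # m/s per axis
    speed = np.linalg.norm(vel, axis=1)
    accel = np.gradient(speed, dt)         # m/s^2
    return speed, accel

# Toy track: a constant 4 m/s run along x, sampled at 25 Hz for 4 s
t = np.arange(0, 4, 1 / 25.0)
xy = np.column_stack([4.0 * t, np.zeros_like(t)])
speed, accel = kinematics(xy)
```

In practice the smoothing window trades responsiveness for noise rejection, which is exactly the sensitivity to trajectory noise the paper highlights.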

[231] PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

Ruilin Tang, Yang Zhou, Zhong Ye, Wenxi Liu, Yan Huang, Shengfeng He

Main category: cs.CV

TL;DR: PanopticQuery is a framework for unified query-time reasoning in 4D scenes that combines 4D Gaussian Splatting for dynamic reconstruction with multi-view semantic consensus to ground natural language queries in complex dynamic environments.

DetailsMotivation: Current methods for understanding dynamic 4D environments through natural language queries lack robust contextual reasoning for complex semantics like interactions, temporal actions, and spatial relations. The challenge is transforming noisy, view-dependent predictions into globally consistent 4D interpretations.

Method: Builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that aggregates 2D semantic predictions across multiple views and time frames. Filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization.

Result: PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. The paper also introduces Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes.

Conclusion: PanopticQuery provides a unified framework for query-time reasoning in 4D scenes that addresses the limitations of current methods by combining accurate reconstruction with robust semantic grounding across space, time, and viewpoints.

Abstract: Understanding dynamic 4D environments through natural language queries requires not only accurate scene reconstruction but also robust semantic grounding across space, time, and viewpoints. While recent methods using neural representations have advanced 4D reconstruction, they remain limited in contextual reasoning, especially for complex semantics such as interactions, temporal actions, and spatial relations. A key challenge lies in transforming noisy, view-dependent predictions into globally consistent 4D interpretations. We introduce PanopticQuery, a framework for unified query-time reasoning in 4D scenes. Our approach builds on 4D Gaussian Splatting for high-fidelity dynamic reconstruction and introduces a multi-view semantic consensus mechanism that grounds natural language queries by aggregating 2D semantic predictions across multiple views and time frames. This process filters inconsistent outputs, enforces geometric consistency, and lifts 2D semantics into structured 4D groundings via neural field optimization. To support evaluation, we present Panoptic-L4D, a new benchmark for language-based querying in dynamic scenes. Experiments demonstrate that PanopticQuery sets a new state of the art on complex language queries, effectively handling attributes, actions, spatial relationships, and multi-object interactions. A video demonstration is available in the supplementary materials.

[232] Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

Peixi Peng, Housheng Xie, Yanling Wei, Guangcong Ruan, Xiaoyang Zou, Qian Cao, Yongjian Nian, Guoyan Zheng

Main category: cs.CV

TL;DR: RATNet is a foundation model for gastrointestinal endoscopy imaging that uses analogical reasoning to transfer knowledge across heterogeneous datasets, improving generalization, robustness, and adaptability for AI-assisted diagnosis.

DetailsMotivation: Current AI models for gastrointestinal endoscopy diagnosis lack generalizability, adaptability, robustness, and scalability due to limited medical data, domain shift, and heterogeneous annotations across different medical sites and datasets.

Method: RATNet employs a cyclic pre-training strategy with an encoder, relevance-knowledge acquisition and transfer (RAT) module, projector, and multi-task head. It uses analogical reasoning to match image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis.

Result: RATNet outperforms existing foundation models (GastroNet, GastroVision) across six scenarios: common disease diagnosis, few-shot learning for rare diseases, zero-shot transfer to new sites, robustness under long-tailed distributions, adaptation to novel diseases, and privacy-preserving federated learning.

Conclusion: RATNet provides a practical, open, and cost-effective foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings, by enabling automatic integration of heterogeneous annotations without manual label unification and reducing data acquisition costs.

Abstract: Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

[233] Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

Jonas Muth, Zdravko Marinov, Simon Reiß

Main category: cs.CV

TL;DR: The TaCo framework uses contrastive learning to embed 30 medical vision tasks from 39 datasets into a shared representation space to analyze task relationships and structure.

Motivation: To explore intrinsic relationships between medical vision tasks and understand how they relate, overlap, or differ on a representational level, which remains largely unexplored despite focus on task-specific performance.

Method: Task-Contrastive Learning (TaCo) framework that uses contrastive learning to embed heterogeneous medical vision tasks (semantic, generative, transformation) from different modalities into a joint representation space.
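As a minimal sketch of the kind of contrastive objective such a framework could build on (InfoNCE is a standard choice; the paper's exact loss, encoders, and pairing scheme are not given in this summary, so everything below is illustrative):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (a standard contrastive objective; the paper's
    actual objective may differ): pull each task embedding toward its
    positive pair and away from all other tasks in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature             # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # diagonal = matching pairs
```

With well-separated, matching pairs the loss approaches zero; mismatched or entangled task embeddings drive it up, which is what makes the learned space usable for the kind of task-similarity analysis described above.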

Result: Maps 30 tasks across 39 medical imaging datasets into shared space, revealing which tasks are distinctly represented, which blend together, and how iterative task alterations are reflected in embeddings.

Conclusion: Provides foundation for understanding intrinsic structure of medical vision tasks and offers deeper understanding of task similarities and interconnected properties in embedding spaces.

Abstract: While much of the medical computer vision community has focused on advancing performance for specific tasks, the underlying relationships between tasks, i.e., how they relate, overlap, or differ on a representational level, remain largely unexplored. Our work explores these intrinsic relationships between medical vision tasks; specifically, we investigate 30 tasks, such as semantic tasks (e.g., segmentation and detection), image generative tasks (e.g., denoising, inpainting, or colorization), and image transformation tasks (e.g., geometric transformations). Our goal is to probe whether a data-driven representation space can capture an underlying structure of tasks across a variety of 39 datasets from wildly different medical imaging modalities, including computed tomography, magnetic resonance, electron microscopy, X-ray, ultrasound, and more. By revealing how tasks relate to one another, we aim to provide insights into their fundamental properties and interconnectedness. To this end, we introduce Task-Contrastive Learning (TaCo), a contrastive learning framework designed to embed tasks into a shared representation space. Through TaCo, we map these heterogeneous tasks from different modalities into a joint space and analyze their properties: identifying which tasks are distinctly represented, which blend together, and how iterative alterations to tasks are reflected in the embedding space. Our work provides a foundation for understanding the intrinsic structure of medical vision tasks, offering a deeper understanding of task similarities and their interconnected properties in embedding spaces.

[234] SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

Main category: cs.CV

TL;DR: SnapFlow is a self-distillation method that compresses multi-step denoising in flow-matching VLA models into a single forward pass, achieving 9.6x denoising speedup while maintaining or exceeding original performance.

Motivation: Current VLA models using flow matching require iterative denoising (typically 10 ODE steps) which introduces substantial latency, accounting for 80% of inference time. Simply reducing step count degrades performance due to uncalibrated velocity fields for single-step jumps.

Method: SnapFlow uses self-distillation mixing standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model’s own marginal velocity predictions. It employs a zero-initialized target-time embedding to let the network switch between local velocity estimation and global one-step generation within a single architecture.
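The two-step Euler shortcut target can be pictured as follows (a toy illustration of the idea, not the paper's implementation; `model` stands for any callable predicting a marginal velocity field v(x, t), and the names are hypothetical):

```python
import numpy as np

def two_step_shortcut_target(model, x, t, t_end=1.0):
    """Shortcut velocity whose single Euler step from t to t_end lands where
    two half-size Euler steps would land, computed from the model's own
    velocity predictions (the self-distillation target described above)."""
    h = (t_end - t) / 2.0
    v1 = model(x, t)                     # marginal velocity at (x, t)
    x_mid = x + h * v1                   # first half-size Euler step
    v2 = model(x_mid, t + h)             # velocity at the midpoint
    x_end = x_mid + h * v2               # second half-size Euler step
    return (x_end - x) / (t_end - t)     # algebraically (v1 + v2) / 2
```

Training the one-step student against such targets (mixed with standard flow-matching samples) is what lets a single forward pass stand in for the full ODE trajectory.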

Result: On pi0.5 (3B) across 40 tasks, SnapFlow achieves 98.75% average success, slightly exceeding the 10-step teacher's 97.75%, with a 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms. On SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration.

Conclusion: SnapFlow enables efficient single-step inference for flow-matching VLA models without architecture changes or external teachers, achieving significant speedups while maintaining performance, and is orthogonal to other optimization approaches.

Abstract: Vision-Language-Action (VLA) models based on flow matching – such as pi0, pi0.5, and SmolVLA – achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model’s own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success – matching the 10-step teacher at 97.75% and slightly exceeding it – with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.

[235] 3D Smoke Scene Reconstruction Guided by Vision Priors from Multimodal Large Language Models

Xinye Zheng, Fei Wang, Yiqi Nie, Kun Li, Junjie Chen, Jiaqi Zhao, Yanyan Wei, Zhiliang Wu

Main category: cs.CV

TL;DR: A framework combining image enhancement (Nano-Banana-Pro) and 3D Gaussian Splatting (Smoke-GS) for reconstructing 3D scenes from smoke-degraded multi-view images, enabling clear novel view synthesis in challenging smoke environments.

Motivation: Smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency, making 3D scene reconstruction from smoke-degraded multi-view images particularly difficult.

Method: Proposes a framework integrating visual priors with efficient 3D scene modeling: 1) Uses Nano-Banana-Pro to enhance smoke-degraded images for clearer visual observations, and 2) Develops Smoke-GS, a medium-aware 3D Gaussian Splatting framework with a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke.

Result: The method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation, demonstrating effectiveness for generating consistent and visually clear novel views in challenging smoke environments.

Conclusion: The proposed framework successfully addresses the challenges of 3D scene reconstruction from smoke-degraded images by combining image enhancement with specialized 3D modeling techniques, enabling high-quality novel view synthesis in smoke environments.

Abstract: Reconstructing 3D scenes from smoke-degraded multi-view images is particularly difficult because smoke introduces strong scattering effects, view-dependent appearance changes, and severe degradation of cross-view consistency. To address these issues, we propose a framework that integrates visual priors with efficient 3D scene modeling. We employ Nano-Banana-Pro to enhance smoke-degraded images and provide clearer visual observations for reconstruction and develop Smoke-GS, a medium-aware 3D Gaussian Splatting framework for smoke scene reconstruction and restoration-oriented novel view synthesis. Smoke-GS models the scene using explicit 3D Gaussians and introduces a lightweight view-dependent medium branch to capture direction-dependent appearance variations caused by smoke. Our method preserves the rendering efficiency of 3D Gaussian Splatting while improving robustness to smoke-induced degradation. Results demonstrate the effectiveness of our method for generating consistent and visually clear novel views in challenging smoke environments.

[236] CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng

Main category: cs.CV

TL;DR: CRFT is a unified coarse-to-fine framework for robust cross-modal image registration using feature flow learning and transformer architecture.

Motivation: The paper addresses the challenge of robust cross-modal image registration under large affine and scale variations, where traditional methods struggle with modality-independent feature representation and maintaining structural coherence across different imaging modalities.

Method: CRFT uses a transformer-based architecture with modality-independent feature flow learning. A coarse stage establishes global correspondences via multi-scale feature correlation, and a fine stage performs local refinement via hierarchical feature fusion. An iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform recurrently refines the flow fields.

Result: Extensive experiments on diverse cross-modal datasets show CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness.

Conclusion: CRFT provides an effective framework for cross-modal image registration and offers a generalizable paradigm for multimodal spatial correspondence with applications in remote sensing, autonomous navigation, and medical imaging.

Abstract: We present Consistent-Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework based on feature flow learning for robust cross-modal image registration. CRFT learns a modality-independent feature flow representation within a transformer-based architecture that jointly performs feature alignment and flow estimation. The coarse stage establishes global correspondences through multi-scale feature correlation, while the fine stage refines local details via hierarchical feature fusion and adaptive spatial reasoning. To enhance geometric adaptability, an iterative discrepancy-guided attention mechanism with a Spatial Geometric Transform (SGT) recurrently refines the flow field, progressively capturing subtle spatial inconsistencies and enforcing feature-level consistency. This design enables accurate alignment under large affine and scale variations while maintaining structural coherence across modalities. Extensive experiments on diverse cross-modal datasets demonstrate that CRFT consistently outperforms state-of-the-art registration methods in both accuracy and robustness. Beyond registration, CRFT provides a generalizable paradigm for multimodal spatial correspondence, offering broad applicability to remote sensing, autonomous navigation, and medical imaging. Code and datasets are publicly available at https://github.com/NEU-Liuxuecong/CRFT.

[237] Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

Chongyu Wang, Ting Huang, Chunyu Sun, Xinyu Ning, Di Wang, Hao Tang

Main category: cs.CV

TL;DR: GUIDE is a progressive geometric priors injection framework that enhances MLLMs’ 3D spatial awareness by multi-level sampling and step-by-step fusion of geometric features with early MLLM layers.

Motivation: Current MLLMs have limited physical spatial awareness in real-world visual tasks, and existing geometry-aware approaches suffer from flattened fusion that loses local geometric details and causes semantic mismatches in early layers.

Method: Progressive geometric priors injection framework with multi-level sampling in geometric encoder to capture features from local edges to global topologies, rigorous alignment and step-by-step fusion with MLLM early layers, and context-aware gating to fetch relevant spatial cues.
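The context-aware gating step can be sketched in a few lines (an illustrative sketch only: the module's real parameterization is not specified in this summary, and `W_gate` plus the additive fusion are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_geometric_injection(sem_tokens, geo_tokens, W_gate):
    """Hypothetical context-aware gate: the semantic tokens decide, per token
    and channel, how much of the aligned geometric prior to let through,
    suppressing redundant geometric noise when the gate saturates near 0."""
    gate = sigmoid(sem_tokens @ W_gate)      # (N, D) values in (0, 1)
    return sem_tokens + gate * geo_tokens    # fuse only the requested cues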

Result: GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for 3D geometric priors integration.

Conclusion: GUIDE successfully addresses the limitations of current geometry-aware MLLMs by enabling progressive learning of 2D-to-3D transitions and efficient utilization of spatial priors.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topologies. Subsequently, we rigorously align and fuse these multi-level geometric priors step-by-step with the early layers of the MLLM. Building upon the injection of multi-granularity geometric information, this design guides the model to progressively learn the 2D-to-3D transitional process. Furthermore, we introduce a context-aware gating that enables the model to fetch requisite spatial cues based on current semantics, thereby maximizing the utilization efficiency of spatial priors and effectively suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.

[238] In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

Wenhui Xiao, Ethan Goan, Rodrigo Santa Cruz, David Ahmedt-Aristizabal, Olivier Salvado, Clinton Fookes, Leo Lebrat

Main category: cs.CV

TL;DR: A framework for integrating noisy monocular depth priors into 3D Gaussian Splatting to improve rendering quality and geometric accuracy without requiring specialized depth acquisition systems.

Motivation: Accurate depth priors help 3D Gaussian Splatting but require specialized systems. Monocular depth estimation is cheaper but suffers from scale ambiguity, inconsistency, and inaccuracies that degrade rendering when used naively.

Method: Introduces a training framework that integrates scale-ambiguous and noisy depth priors into geometric supervision. Learns from weakly aligned depth variations and isolates ill-posed geometry for selective monocular depth regularization to prevent propagation of depth inaccuracies.
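The scale-ambiguity part of the problem is often handled with a least-squares scale-and-shift fit before any depth loss is applied (a standard trick in monocular-depth supervision; whether the paper uses exactly this form is not stated, so the sketch below is generic):

```python
import numpy as np

def align_scale_shift(d_mono, d_ref, mask=None):
    """Least-squares alignment for scale-ambiguous monocular depth:
    find s, t minimizing ||s * d_mono + t - d_ref||^2 over valid pixels,
    then supervise against the aligned map rather than the raw prediction."""
    d_mono, d_ref = d_mono.ravel(), d_ref.ravel()
    if mask is not None:                       # e.g. drop ill-posed pixels
        keep = mask.ravel()
        d_mono, d_ref = d_mono[keep], d_ref[keep]
    A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return s * d_mono + t, (s, t)
```

The optional mask mirrors the selective-regularization idea above: pixels flagged as ill-posed geometry can simply be excluded from the fit and the loss.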

Result: Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones.

Conclusion: The proposed framework reliably leverages monocular depth priors for Gaussian Splatting enhancement, addressing challenges of scale ambiguity and inconsistency while improving both geometric accuracy and rendering quality.

Abstract: Using accurate depth priors in 3D Gaussian Splatting helps mitigate artifacts caused by sparse training data and textureless surfaces. However, acquiring accurate depth maps requires specialized acquisition systems. Foundation monocular depth estimation models offer a cost-effective alternative, but they suffer from scale ambiguity, multi-view inconsistency, and local geometric inaccuracies, which can degrade rendering performance when applied naively. This paper addresses the challenge of reliably leveraging monocular depth priors for Gaussian Splatting (GS) rendering enhancement. To this end, we introduce a training framework integrating scale-ambiguous and noisy depth priors into geometric supervision. We highlight the importance of learning from weakly aligned depth variations. We introduce a method to isolate ill-posed geometry for selective monocular depth regularization, restricting the propagation of depth inaccuracies into well-reconstructed 3D structures. Extensive experiments across diverse datasets show consistent improvements in geometric accuracy, leading to more faithful depth estimation and higher rendering quality across different GS variants and monocular depth backbones tested.

[239] MPM: Mutual Pair Merging for Efficient Vision Transformers

Simon Ravé, Pejman Rasti, David Rousseau

Main category: cs.CV

TL;DR: MPM is a training-free token merging method for vision transformers that forms mutual nearest-neighbor pairs in cosine space for semantic segmentation, achieving up to 60% latency reduction on edge devices while maintaining segmentation accuracy.

Motivation: Token reduction methods for transformers often target classification tasks and report proxy metrics rather than actual end-to-end latency. For semantic segmentation, token reduction is challenging due to the need to reconstruct dense pixel-aligned features, and existing methods often have overhead that erases expected speed gains.

Method: Mutual Pair Merging (MPM) is a training-free token aggregation module that: 1) forms mutual nearest-neighbor pairs in cosine similarity space, 2) averages each pair, and 3) records a merge map enabling gather-based reconstruction before the decoder. It introduces no learned parameters and uses discrete insertion schedules rather than continuous compression knobs.
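In code, the pairing step might look like this (an illustrative sketch under the description above; the actual module operates on ViT token sequences and handles batching, insertion schedules, and the decoder-side gather differently):

```python
import numpy as np

def mutual_pair_merge(tokens):
    """Sketch of mutual nearest-neighbor token merging: tokens (N, D) whose
    nearest neighbors in cosine space point at each other are averaged into
    one token; a merge map records where each input went, enabling a
    gather-based reconstruction (merged[merge_map]) before the decoder."""
    x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)            # exclude self-matches
    nn = sim.argmax(axis=1)                   # each token's nearest neighbor
    merged, used = [], set()
    merge_map = np.empty(len(tokens), dtype=int)
    for i, j in enumerate(nn):
        if i in used or j in used:
            continue
        if nn[j] == i and i < j:              # mutual pair: merge once
            merge_map[i] = merge_map[j] = len(merged)
            merged.append((tokens[i] + tokens[j]) / 2)
            used.update((i, j))
    for i in range(len(tokens)):              # unpaired tokens kept as-is
        if i not in used:
            merge_map[i] = len(merged)
            merged.append(tokens[i])
    return np.stack(merged), merge_map
```

Note there is no keep-rate or threshold anywhere: how many tokens disappear is determined purely by how many mutual pairs exist, which is why the method's speed-accuracy trade-off is set by where merges are inserted rather than by a continuous knob.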

Result: On ADE20K dataset, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping mIoU drop below 3%. The method shows practical wall-clock gains when overhead is explicitly accounted for.

Conclusion: Simple, reconstruction-aware, training-free token merging can translate into practical speed improvements for semantic segmentation tasks when implemented with careful attention to overhead and reconstruction requirements.

Abstract: Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.

[240] GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

Weiqi Zhang, Junsheng Zhou, Haotian Geng, Kanle Shi, Shenkun Xu, Yi Fang, Yu-Shen Liu

Main category: cs.CV

TL;DR: GaussianGrow generates 3D Gaussians from point clouds using text-guided iterative growth with multi-view diffusion models and novel view synthesis

Motivation: 3D Gaussian Splatting has excellent rendering but lacks proper generation methods; existing approaches rely on unreliable geometric estimates, leading to poor quality generations

Method: Uses text-guided Gaussian growing from point clouds with multi-view diffusion for appearance supervision, novel view constraints for artifact reduction, and iterative camera pose detection with 2D diffusion inpainting for hard-to-observe regions

Result: Successfully generates 3D Gaussians from both synthetic and real-scanned point clouds with text guidance, demonstrating effective generation quality

Conclusion: GaussianGrow provides a robust approach for 3D Gaussian generation from point clouds by leveraging diffusion models and iterative growth strategies

Abstract: 3D Gaussian Splatting has demonstrated superior performance in rendering efficiency and quality, yet the generation of 3D Gaussians still remains a challenge without proper geometric priors. Existing methods have explored predicting point maps as geometric references for inferring Gaussian primitives, while the unreliable estimated geometries may lead to poor generations. In this work, we introduce GaussianGrow, a novel approach that generates 3D Gaussians by learning to grow them from easily accessible 3D point clouds, naturally enforcing geometric accuracy in Gaussian generation. Specifically, we design a text-guided Gaussian growing scheme that leverages a multi-view diffusion model to synthesize consistent appearances from input point clouds for supervision. To mitigate artifacts caused by fusing neighboring views, we constrain novel views generated at non-preset camera poses identified in overlapping regions across different views. For completing the hard-to-observe regions, we propose to iteratively detect the camera pose by observing the largest un-grown regions in point clouds and completing them by inpainting the rendered view with a pretrained 2D diffusion model. The process continues until complete Gaussians are generated. We extensively evaluate GaussianGrow on text-guided Gaussian generation from synthetic and even real-scanned point clouds. Project Page: https://weiqi-zhang.github.io/GaussianGrow

[241] Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

Yusung Ro, Jaehyun Choi, Junmo Kim

Main category: cs.CV

TL;DR: SAE features in CLIP vision encoders vary in information scope from local to global; proposed Contextual Dependency Score quantifies this; different scopes affect CLIP predictions differently.

Motivation: Existing SAE analyses focus on semantic meaning of individual features, but lack understanding of how broadly features aggregate visual evidence (information scope). Need complementary dimension of interpretability.

Method: Introduce information scope concept and propose Contextual Dependency Score (CDS) to quantify feature scope. CDS separates positionally stable local scope features from positionally variant global scope features through spatial perturbation experiments.
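As a toy illustration of the perturbation idea (the paper's actual CDS formula is not reproduced in this summary; the shift scheme, the statistic, and the normalization below are all assumptions):

```python
import numpy as np

def contextual_dependency_score(feature_fn, image, rng, n_shifts=8, max_shift=4):
    """Hypothetical CDS-style statistic: measure how much a feature's
    activation varies under small spatial perturbations of the input.
    Low variability -> positionally stable (local scope);
    high variability -> positionally variant (global scope)."""
    base = feature_fn(image)
    acts = []
    for _ in range(n_shifts):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        acts.append(feature_fn(shifted))
    return np.std(np.array(acts)) / (abs(base) + 1e-8)
```

In the real setting `feature_fn` would be a single SAE latent evaluated on CLIP patch features rather than a raw-pixel statistic; the sketch only conveys the stable-versus-variant distinction the score is meant to separate.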

Result: Features with different information scopes systematically influence CLIP’s predictions and confidence. Some features respond consistently across spatial perturbations (local scope), while others shift unpredictably (global scope).

Conclusion: Information scope is a critical new axis for understanding CLIP representations, providing deeper diagnostic view of SAE-derived features beyond just semantic meaning.

Abstract: Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpreting the internal representations of CLIP vision encoders, yet existing analyses largely focus on the semantic meaning of individual features. We introduce information scope as a complementary dimension of interpretability that characterizes how broadly an SAE feature aggregates visual evidence, ranging from localized, patch-specific cues to global, image-level signals. We observe that some SAE features respond consistently across spatial perturbations, while others shift unpredictably with minor input changes, indicating a fundamental distinction in their underlying scope. To quantify this, we propose the Contextual Dependency Score (CDS), which separates positionally stable local scope features from positionally variant global scope features. Our experiments show that features of different information scopes exert systematically different influences on CLIP’s predictions and confidence. These findings establish information scope as a critical new axis for understanding CLIP representations and provide a deeper diagnostic view of SAE-derived features.

[242] Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising

Ying Liu, Junchao Zhang, Caiyun Wu

Main category: cs.CV

TL;DR: SADM integrates signal attenuation mechanism into diffusion process for single-stage low-light image enhancement with simultaneous brightness adjustment and noise suppression

Motivation: Existing diffusion-based LLIE methods use two-stage pipelines or auxiliary correction networks that sever the link between enhancement and denoising, leading to suboptimal performance due to inconsistent optimization objectives

Method: Signal Attenuation Diffusion Model (SADM) integrates signal attenuation coefficient into diffusion pipeline to simulate low-light degradation in forward process, encoding physical priors to guide reverse denoising for concurrent brightness recovery and noise suppression
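One way to picture such a forward process (a hedged sketch: the schedule names and the exact placement of the attenuation coefficient are assumptions following a generic DDPM-style parameterization, not the paper's formulation):

```python
import numpy as np

def attenuated_forward(x0, t, alpha_bar, atten, rng):
    """Hypothetical forward step: a signal attenuation coefficient atten[t]
    scales the clean signal, mimicking low-light degradation, alongside the
    usual noise injection of a DDPM-style forward process."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * atten[t] * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

Because darkening and noising happen in the same forward step, the reverse process is trained to undo both at once, which is the single-stage property the summary highlights.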

Result: Maintains consistency with DDIM via multi-scale pyramid sampling, balancing interpretability, restoration quality, and computational efficiency

Conclusion: SADM enables single-stage low-light image enhancement without extra correction modules or staged training, improving performance through unified optimization

Abstract: Diffusion models excel at image restoration via probabilistic modeling of forward noise addition and reverse denoising, and their ability to handle complex noise while preserving fine details makes them well-suited for Low-Light Image Enhancement (LLIE). Mainstream diffusion-based LLIE methods either adopt a two-stage pipeline or an auxiliary correction network to refine U-Net outputs, which severs the intrinsic link between enhancement and denoising and leads to suboptimal performance owing to inconsistent optimization objectives. To address these issues, we propose the Signal Attenuation Diffusion Model (SADM), a novel diffusion process that integrates the signal attenuation mechanism into the diffusion pipeline, enabling simultaneous brightness adjustment and noise suppression in a single stage. Specifically, the signal attenuation coefficient simulates the inherent signal attenuation of low-light degradation in the forward noise addition process, encoding the physical priors of low-light degradation to explicitly guide reverse denoising toward the concurrent optimization of brightness recovery and noise suppression, thereby eliminating the need for extra correction modules or staged training relied on by existing methods. We validate that our design maintains consistency with Denoising Diffusion Implicit Models (DDIM) via multi-scale pyramid sampling, balancing interpretability, restoration quality, and computational efficiency.

[243] FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang, Wenwu Wang, Zhifeng Xie

Main category: cs.CV

TL;DR: FoleyDesigner is a novel framework for automated spatio-temporally aligned Foley sound generation for films, using multi-agent analysis, latent diffusion models, and LLM-driven mechanisms, with a new professional stereo audio dataset called FilmStereo.

Motivation: Manual Foley sound creation for films is labor-intensive and requires precise spatio-temporal alignment. There's a need for automated solutions that can generate high-quality, spatially aligned audio while maintaining compatibility with professional film production standards.

Method: The framework integrates film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing. It uses multi-agent architecture for analysis, latent diffusion models trained on spatio-temporal cues from video frames, and LLM-driven hybrid mechanisms that emulate post-production practices. Introduces FilmStereo dataset with spatial metadata, timestamps, and semantic annotations.

Result: The method achieves superior spatio-temporal alignment compared to existing baselines and maintains seamless compatibility with professional film production standards including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards.

Conclusion: FoleyDesigner provides an effective automated solution for Foley sound generation that integrates with professional film workflows, offering extensive creative flexibility while reducing manual labor.

Abstract: Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/ .

[244] ASSR-Net: Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion

Qiya Song, Hongzhi Zhou, Lishan Tan, Renwei Dian, Shutao Li

Main category: cs.CV

TL;DR: ASSR-Net is a two-stage network for hyperspectral image fusion that addresses anisotropic spatial structure reconstruction and spectral distortion through directional perception and spectral recalibration.

Motivation: Existing hyperspectral image fusion methods struggle with reconstructing anisotropic spatial structures (leading to blurred details) and suffer from spectral distortion during fusion, compromising both spatial and spectral quality.

Method: Two-stage approach: (1) Anisotropic Structure-Aware Spatial Enhancement (ASSE) with directional perception fusion module to capture structural features along multiple orientations; (2) Hierarchical Prior-Guided Spectral Calibration (HPSC) using original low-resolution HSI as spectral prior to correct spectral deviations.

Result: Extensive experiments on benchmark datasets show ASSR-Net consistently outperforms state-of-the-art methods, achieving superior spatial detail preservation and spectral consistency.

Conclusion: ASSR-Net effectively addresses key challenges in hyperspectral image fusion through its anisotropic structure-aware and spectrally recalibrated design, setting new state-of-the-art performance.

Abstract: Hyperspectral image fusion aims to reconstruct high-spatial-resolution hyperspectral images (HR-HSI) by integrating complementary information from multi-source inputs. Despite recent progress, existing methods still face two critical challenges: (1) inadequate reconstruction of anisotropic spatial structures, resulting in blurred details and compromised spatial quality; and (2) spectral distortion during fusion, which hinders fine-grained spectral representation. To address these issues, we propose ASSR-Net: an Anisotropic Structure-Aware and Spectrally Recalibrated Network for Hyperspectral Image Fusion. ASSR-Net adopts a two-stage fusion strategy comprising anisotropic structure-aware spatial enhancement (ASSE) and hierarchical prior-guided spectral calibration (HPSC). In the first stage, a directional perception fusion module adaptively captures structural features along multiple orientations, effectively reconstructing anisotropic spatial patterns. In the second stage, a spectral recalibration module leverages the original low-resolution HSI as a spectral prior to explicitly correct spectral deviations in the fused results, thereby enhancing spectral fidelity. Extensive experiments on various benchmark datasets demonstrate that ASSR-Net consistently outperforms state-of-the-art methods, achieving superior spatial detail preservation and spectral consistency.

[245] On the Robustness of Diffusion-Based Image Compression to Bit-Flip Errors

Amit Vaisman, Gal Pomerants, Raz Lapid

Main category: cs.CV

TL;DR: Diffusion-based image compressors using Reverse Channel Coding are more robust to bit flips than classical/learned codecs, with Turbo-DDCM variant improving robustness while maintaining rate-distortion-perception trade-off.

Motivation: Current image compression methods focus on rate-distortion-perception trade-offs but lack examination of robustness to bit-level corruption. The paper aims to address this gap by investigating robustness of compression methods to bit flips.

Method: Uses diffusion-based compressors built on Reverse Channel Coding (RCC) paradigm. Introduces a more robust variant of Turbo-DDCM that improves robustness while minimally affecting rate-distortion-perception trade-off.

Result: RCC-based compressors are substantially more robust to bit flips than classical and learned codecs. The Turbo-DDCM variant significantly improves robustness with minimal impact on compression performance.

Conclusion: RCC-based compression yields more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments, offering a new dimension for compression method evaluation.

Abstract: Modern image compression methods are typically optimized for the rate–distortion–perception trade-off, whereas their robustness to bit-level corruption is rarely examined. We show that diffusion-based compressors built on the Reverse Channel Coding (RCC) paradigm are substantially more robust to bit flips than classical and learned codecs. We further introduce a more robust variant of Turbo-DDCM that significantly improves robustness while only minimally affecting the rate–distortion–perception trade-off. Our findings suggest that RCC-based compression can yield more resilient compressed representations, potentially reducing reliance on error-correcting codes in highly noisy environments.
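The bit-flip robustness protocol is easy to sketch generically; here `encode`/`decode` stand in for any codec, and the harness below is an illustrative test rig, not the paper's implementation:

```python
import random

def flip_bits(payload: bytes, n_flips: int, seed: int = 0) -> bytes:
    """Flip n random bits in a compressed payload to simulate channel noise."""
    rng = random.Random(seed)
    data = bytearray(payload)
    for _ in range(n_flips):
        i = rng.randrange(len(data))        # pick a byte
        data[i] ^= 1 << rng.randrange(8)    # flip one of its bits
    return bytes(data)

def robustness_curve(encode, decode, image, flip_counts, distortion):
    """Measure distortion as a function of the number of flipped bits."""
    payload = encode(image)
    return [distortion(image, decode(flip_bits(payload, k)))
            for k in flip_counts]
```

Sweeping `flip_counts` over increasing corruption levels yields the kind of robustness curve on which RCC-based codecs are reported to degrade more gracefully than classical or learned codecs.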

[246] SVC 2026: the Second Multimodal Deception Detection Challenge and the First Domain Generalized Remote Physiological Measurement Challenge

Dongliang Zhu, Zhiyi Niu, Bo Zhao, Jiajian Huang, Shuo Ye, Xun Lin, Hui Ma, Taorui Wang, Jiayu Zhang, Chunmei Zhu, Junzhe Cao, Yingjie Ma, Rencheng Song, Albert Clapés, Sergio Escalera, Dan Guo, Zitong Yu

Main category: cs.CV

TL;DR: The paper introduces the Subtle Visual Challenge focusing on learning robust representations for subtle visual signals, featuring two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation.

Motivation: Subtle visual signals contain important information for applications like biometric security, multimedia forensics, medical diagnosis, and affective computing, but existing models struggle with robustness, representation ability, and generalization when handling these weak signals in real-world environments.

Method: Organized a challenge competition with two specific tasks: 1) cross-domain multimodal deception detection, and 2) remote photoplethysmography (rPPG) estimation, providing baseline models and a platform for 22 participating teams to develop and evaluate their approaches.

Result: 22 teams submitted final results to the workshop competition, with baseline models released on the MMDD2026 platform, demonstrating community engagement in addressing subtle visual signal challenges.

Conclusion: The challenge aims to encourage development of more robust and generalizable models for subtle visual understanding, advancing research in computer vision and multimodal learning through focused tasks on deception detection and physiological signal estimation.

Abstract: Subtle visual signals, although difficult to perceive with the naked eye, contain important information that can reveal hidden patterns in visual data. These signals play a key role in many applications, including biometric security, multimedia forensics, medical diagnosis, industrial inspection, and affective computing. With the rapid development of computer vision and representation learning techniques, detecting and interpreting such subtle signals has become an emerging research direction. However, existing studies often focus on specific tasks or modalities, and models still face challenges in robustness, representation ability, and generalization when handling subtle and weak signals in real-world environments. To promote research in this area, we organize the Subtle Visual Challenge, which aims to learn robust representations for subtle visual signals. The challenge includes two tasks: cross-domain multimodal deception detection and remote photoplethysmography (rPPG) estimation. We hope that this challenge will encourage the development of more robust and generalizable models for subtle visual understanding, and further advance research in computer vision and multimodal learning. A total of 22 teams submitted their final results to this workshop competition, and the corresponding baseline models have been released on the MMDD2026 platform (https://sites.google.com/view/svc-cvpr26).

[247] Improving Controllable Generation: Faster Training and Better Performance via $x_0$-Supervision

Amadou S. Sangare, Adrien Maglo, Mohamed Chaouch, Bertrand Luvison

Main category: cs.CV

TL;DR: The paper proposes x0-supervision, a new training objective for controllable diffusion models that uses direct supervision on clean target images to accelerate convergence by up to 2× while improving visual quality and conditioning accuracy.

Motivation: Current text-to-image diffusion models struggle with precise layout control since natural language alone cannot reliably express spatial arrangements. While controllable generation methods add conditions, they use the same training loss as initial models, leading to slow convergence.

Method: Revisits training objectives for controllable diffusion models through denoising dynamics analysis. Proposes x0-supervision - direct supervision on clean target images, or an equivalent re-weighting of diffusion loss, to accelerate convergence.

Result: Experiments show 2× faster convergence (measured by mAUCC), improved visual quality, and better conditioning accuracy across multiple control settings compared to standard training approaches.

Conclusion: x0-supervision provides a more effective training objective for controllable diffusion models, offering significant convergence acceleration while maintaining or improving generation quality and control accuracy.

Abstract: Text-to-Image (T2I) diffusion/flow models have recently achieved remarkable progress in visual fidelity and text alignment. However, they remain limited when users need to precisely control image layouts, something that natural language alone cannot reliably express. Controllable generation methods augment the initial T2I model with additional conditions that more easily describe the scene. Prior works straightforwardly train the augmented network with the same loss as the initial network. Although natural at first glance, this can lead to very long training times in some cases before convergence. In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. We show that direct supervision on the clean target image, dubbed $x_0$-supervision, or an equivalent re-weighting of the diffusion loss, yields faster convergence. Experiments on multiple control settings demonstrate that our formulation accelerates convergence by up to 2$\times$ according to our novel metric (mean Area Under the Convergence Curve - mAUCC), while also improving both visual quality and conditioning accuracy. Our code is available at https://github.com/CEA-LIST/x0-supervision
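The claimed equivalence between $x_0$-supervision and a re-weighted diffusion loss can be made concrete under the standard DDPM forward process (this is a textbook identity, not a derivation taken from the paper):

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
\quad\Rightarrow\quad
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\hat{\epsilon}}{\sqrt{\bar\alpha_t}},
\qquad
\|x_0 - \hat{x}_0\|^2 \;=\; \frac{1-\bar\alpha_t}{\bar\alpha_t}\,\|\epsilon - \hat{\epsilon}\|^2 .
```

Supervising the clean-image estimate $\hat{x}_0$ directly is thus the $\epsilon$-prediction loss scaled by $(1-\bar\alpha_t)/\bar\alpha_t$, which up-weights high-noise timesteps, consistent with the abstract's "equivalent re-weighting of the diffusion loss".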

[248] Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner

Main category: cs.CV

TL;DR: BADAS-2.0 advances collision anticipation using V-JEPA2 fine-tuning on ego-centric dashcam data, with improvements in long-tail scenario handling, knowledge distillation for edge deployment, and explainability through attention heatmaps and vision-language reasoning.

Motivation: To improve collision anticipation systems for autonomous driving by addressing long-tail safety-critical scenarios, enabling real-time edge deployment, and providing explainable predictions.

Method: Fine-tunes V-JEPA2 on large-scale ego-centric dashcam data (178,500 labeled videos), uses active learning with BADAS-1.0 to surface rare scenarios, performs domain-specific self-supervised pre-training on 2.25M unlabeled videos, and distills knowledge into compact models (86M and 22M parameters). Adds explainability through object-centric attention heatmaps and vision-language reasoning (BADAS-Reason).

Result: Achieves consistent gains across all subgroups with largest improvements on hardest long-tail cases. Compact models achieve 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. Provides real-time attention heatmaps and structured textual reasoning.

Conclusion: BADAS-2.0 advances collision anticipation along three axes: long-tail scenario handling, efficient edge deployment, and explainability, establishing new state-of-the-art performance.

Abstract: We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar’s Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.
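The edge-deployment step relies on knowledge distillation; a minimal sketch of the generic soft-label recipe (the temperature and loss form are illustrative, not BADAS-2.0's actual objective):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over a logit vector."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions, scaled by T^2 (the standard Hinton-style recipe)."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))
```

Minimizing this loss on unlabeled driving video is how a large teacher's behavior can be compressed into compact students like the 86M and 22M models described above.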

[249] PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization

Shicai Wei, Chunbo Luo, Qiang Zhu, Yang Luo

Main category: cs.CV

TL;DR: PDMP is a multimodal learning strategy that prioritizes the performance-dominant modality rather than balancing modalities, achieving better multimodal performance through gradient modulation.

Motivation: Addresses the under-optimization problem in multimodal learning where multimodal models underperform compared to unimodal counterparts, challenging the conventional wisdom that balanced learning is optimal.

Method: Proposes Performance-Dominant Modality Prioritization (PDMP): 1) identifies performance-dominant modality via unimodal model rankings, 2) uses asymmetric coefficients to modulate gradients, allowing dominant modality to drive optimization, 3) is architecture-agnostic and independent of fusion methods.

Result: Extensive experiments on various datasets demonstrate PDMP’s superiority over existing methods, validating that prioritizing performance-dominant modalities improves multimodal learning outcomes.

Conclusion: Imbalanced learning driven by performance-dominant modalities is more effective than balanced learning for multimodal optimization, and PDMP provides a practical, architecture-independent solution to multimodal under-optimization.

Abstract: Multimodal learning has attracted increasing attention due to its practicality. However, it often suffers from insufficient optimization, where the multimodal model underperforms even compared to its unimodal counterparts. Existing methods attribute this problem to the imbalanced learning between modalities and solve it by gradient modulation. This paper argues that balanced learning is not the optimal setting for multimodal learning. On the contrary, imbalanced learning driven by the performance-dominant modality that has superior unimodal performance can contribute to better multimodal performance. And the under-optimization problem is caused by insufficient learning of the performance-dominant modality. To this end, we propose the Performance-Dominant Modality Prioritization (PDMP) strategy to assist multimodal learning. Specifically, PDMP firstly mines the performance-dominant modality via the performance ranking of the independently trained unimodal model. Then PDMP introduces asymmetric coefficients to modulate the gradients of each modality, enabling the performance-dominant modality to dominate the optimization. Since PDMP only relies on the unimodal performance ranking, it is independent of the structures and fusion methods of the multimodal model and has great potential for practical scenarios. Finally, extensive experiments on various datasets validate the superiority of PDMP.
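The asymmetric gradient modulation at the heart of PDMP can be sketched generically (the boost/damp coefficients and the plain-list gradient representation are illustrative assumptions, not the paper's values):

```python
def modulate_gradients(grads, unimodal_acc, boost=1.5, damp=0.5):
    """Scale each modality's gradients so the performance-dominant
    modality (highest unimodal accuracy) drives the optimization.

    grads: dict mapping modality name -> list of gradient values
    unimodal_acc: dict mapping modality name -> accuracy of its
                  independently trained unimodal model
    """
    dominant = max(unimodal_acc, key=unimodal_acc.get)
    return {m: [g * (boost if m == dominant else damp) for g in gs]
            for m, gs in grads.items()}
```

Because the rule only needs the unimodal performance ranking, it is agnostic to the multimodal model's architecture and fusion method, matching the claim in the abstract.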

[250] Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

Yu Xue, Longjun Gao, Yuanqi Su, HaoAng Lu, Xiaoning Zhang

Main category: cs.CV

TL;DR: VoxSAMNet: A sparsity-aware 3D semantic scene completion framework that addresses voxel imbalance through dummy shortcuts and foreground modulation, achieving SOTA on SemanticKITTI benchmarks.

Motivation: Monocular SSC faces challenges with extreme voxel imbalance (93% empty, rare foreground classes), leading to redundant processing of uninformative voxels and poor generalization on long-tailed categories.

Method: Proposes VoxSAMNet with two key components: 1) DSFR module that bypasses empty voxels via shared dummy node while refining occupied ones with deformable attention; 2) Foreground Modulation Strategy combining Foreground Dropout and Text-Guided Image Filter to prevent overfitting and enhance class-relevant features.

Result: Achieves state-of-the-art performance on SemanticKITTI (18.2% mIoU) and SSCBench-KITTI-360 (20.2% mIoU), surpassing prior monocular and stereo baselines.

Conclusion: Sparsity-aware and semantics-guided design is crucial for efficient and accurate 3D scene completion, offering promising direction for future research in autonomous driving and robotics applications.

Abstract: Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

[251] RHVI-FDD: A Hierarchical Decoupling Framework for Low-Light Image Enhancement

Junhao Yang, Bo Yang, Hongwei Ge, Yanchun Liang, Heow Pueh Lee, Chunguo Wu

Main category: cs.CV

TL;DR: RHVI-FDD: A hierarchical decoupling framework for low-light image enhancement that separates luminance-chrominance at macro level and decomposes chrominance into frequency bands for specialized processing.

Motivation: Low-light images suffer from noise, detail loss, and color distortion that hinder multimedia analysis tasks. Existing methods struggle with the complex degradation where luminance and chrominance are coupled, and noise/details are entangled within chrominance.

Method: Two-level hierarchical approach: 1) Macro: RHVI transform for robust luminance-chrominance decoupling, 2) Micro: Frequency-Domain Decoupling (FDD) module using DCT to separate chrominance into low, mid, high-frequency bands representing tone, details, and noise, processed by expert networks and fused adaptively.

Result: Extensive experiments on multiple low-light datasets show consistent outperformance over state-of-the-art methods in both objective metrics and subjective visual quality.

Conclusion: The hierarchical decoupling framework effectively addresses complex low-light degradation by separating entangled components at different levels, enabling simultaneous color correction, noise suppression, and detail preservation.

Abstract: Low-light images often suffer from severe noise, detail loss, and color distortion, which hinder downstream multimedia analysis and retrieval tasks. The degradation in low-light images is complex: luminance and chrominance are coupled, while within the chrominance, noise and details are deeply entangled, preventing existing methods from simultaneously correcting color distortion, suppressing noise, and preserving fine details. To tackle the above challenges, we propose a novel hierarchical decoupling framework (RHVI-FDD). At the macro level, we introduce the RHVI transform, which mitigates the estimation bias caused by input noise and enables robust luminance-chrominance decoupling. At the micro level, we design a Frequency-Domain Decoupling (FDD) module with three branches for further feature separation. Using the Discrete Cosine Transform, we decompose chrominance features into low, mid, and high-frequency bands that predominantly represent global tone, local details, and noise components, which are then processed by tailored expert networks in a divide-and-conquer manner and fused via an adaptive gating module for content-aware fusion. Extensive experiments on multiple low-light datasets demonstrate that our method consistently outperforms existing state-of-the-art approaches in both objective metrics and subjective visual quality.
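The FDD module's frequency split can be illustrated with a plain DCT-based band decomposition (square inputs assumed; the band thresholds are illustrative, and the paper's learned expert branches and gating are not reproduced here):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def split_bands(x, t_low=0.25, t_mid=0.6):
    """Split a square 2-D map into low/mid/high DCT frequency bands
    (roughly: global tone, local details, noise). Thresholds are
    illustrative fractions of the normalized frequency index."""
    n = x.shape[0]
    D = dct_matrix(n)
    coeff = D @ x @ D.T                          # forward 2-D DCT
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    r = (u + v) / (2 * (n - 1))                  # normalized frequency
    bands = []
    for lo, hi in [(0.0, t_low), (t_low, t_mid), (t_mid, 1.0 + 1e-9)]:
        mask = (r >= lo) & (r < hi)
        bands.append(D.T @ (coeff * mask) @ D)   # inverse DCT of the band
    return bands
```

Since the three masks partition the coefficient grid, the bands sum back to the input, so each expert branch can process its band independently before fusion.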

[252] Sparse Gain Radio Map Reconstruction With Geometry Priors and Uncertainty-Guided Measurement Selection

Zhihan Zeng, Ning Wei, Muhammad Baqer Mollah, Kaihe Wang, Phee Lep Yeoh, Fei Xu, Yue Xiu, Zhongpei Zhang

Main category: cs.CV

TL;DR: A geometry-aware uncertainty-quantifying neural network for sparse radio map reconstruction in complex urban environments, with active sensing guided by predicted uncertainty.

Motivation: Dense radio map construction is challenging with limited measurements in complex urban environments with blockages and irregular geometry. Existing methods insufficiently exploit geometric priors or overlook predictive uncertainty for active sensing.

Method: Proposes GeoUQ-GFNet, a lightweight network that jointly predicts dense gain radio maps and spatial uncertainty maps from sparse measurements and structured scene priors. Also introduces UrbanRT-RM benchmark with diverse urban layouts and sampling modes. Uses predicted uncertainty to guide active measurement selection.

Result: GeoUQ-GFNet achieves strong and consistent reconstruction performance across different scenes and transmitter placements. Uncertainty-guided querying provides more effective reconstruction improvement than non-adaptive sampling under the same additional measurement budget.

Conclusion: Combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation is effective for sparse radio map reconstruction in complex urban environments.

Abstract: Radio maps are important for environment-aware wireless communication, network planning, and radio resource optimization. However, dense radio map construction remains challenging when only a limited number of measurements are available, especially in complex urban environments with strong blockages, irregular geometry, and restricted sensing accessibility. Existing methods have explored interpolation, low-rank cartography, deep completion, and channel knowledge map (CKM) construction, but many of these methods insufficiently exploit explicit geometric priors or overlook the value of predictive uncertainty for subsequent sensing. In this paper, we study sparse gain radio map reconstruction from a geometry-aware and active sensing perspective. We first construct UrbanRT-RM, a controllable ray-tracing benchmark with diverse urban layouts, multiple base-station deployments, and multiple sparse sampling modes. We then propose GeoUQ-GFNet, a lightweight network that jointly predicts a dense gain radio map and a spatial uncertainty map from sparse measurements and structured scene priors. The predicted uncertainty is further used to guide active measurement selection under limited sensing budgets. Extensive experiments show that our proposed GeoUQ-GFNet method achieves strong and consistent reconstruction performance across different scenes and transmitter placements generated using UrbanRT-RM. Moreover, uncertainty-guided querying provides more effective reconstruction improvement than non-adaptive sampling under the same additional measurement budget. These results demonstrate the effectiveness of combining geometry-aware learning, uncertainty estimation, and benchmark-driven evaluation for sparse radio map reconstruction in complex urban environments.
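In its simplest form, uncertainty-guided querying reduces to greedy top-k acquisition over the predicted uncertainty map; a minimal sketch (the greedy rule is an assumption for illustration, not necessarily the paper's exact policy):

```python
import numpy as np

def select_measurements(uncertainty, measured_mask, budget):
    """Pick the `budget` unmeasured grid locations with the highest
    predicted uncertainty.

    uncertainty: 2-D array, the network's spatial uncertainty map
    measured_mask: 2-D bool array, True where a measurement exists
    returns: (budget, 2) array of (row, col) coordinates to query next
    """
    u = uncertainty.astype(float)
    u[measured_mask] = -np.inf               # never re-query known points
    flat = np.argsort(u, axis=None)[::-1][:budget]
    return np.array(np.unravel_index(flat, u.shape)).T
```

The selected coordinates are then measured, added to the mask, and the map is re-predicted, repeating until the sensing budget is exhausted.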

[253] EfficientMonoHair: Fast Strand-Level Reconstruction from Monocular Video via Multi-View Direction Fusion

Da Li, Dominik Engel, Deng Luo, Ivan Viola

Main category: cs.CV

TL;DR: EfficientMonoHair: A fast framework for strand-level hair geometry reconstruction from monocular video using implicit neural networks with multi-view geometric fusion and parallel hair-growing strategy.

Motivation: Existing hair reconstruction methods face accuracy-efficiency trade-offs: implicit neural representations capture global shape but lose fine details, while explicit optimization approaches achieve high fidelity but are computationally heavy and poorly scalable.

Method: Combines an implicit neural network with multi-view geometric fusion. Introduces a fusion-patch-based multi-view optimization to reduce optimization iterations for point cloud direction, and a novel parallel hair-growing strategy that relaxes voxel occupancy constraints for stable large-scale strand tracing even with inaccurate orientation fields.

Result: Extensive experiments on real-world hairstyles show robust reconstruction of high-fidelity strand geometries. On synthetic benchmarks, achieves reconstruction quality comparable to state-of-the-art methods while improving runtime efficiency by nearly an order of magnitude.

Conclusion: EfficientMonoHair addresses the accuracy-efficiency trade-off in hair reconstruction, enabling fast and accurate strand-level geometry reconstruction from monocular video through innovative fusion optimization and parallel tracing techniques.

Abstract: Strand-level hair geometry reconstruction is a fundamental problem in virtual human modeling and the digitization of hairstyles. However, existing methods still suffer from a significant trade-off between accuracy and efficiency. Implicit neural representations can capture the global hair shape but often fail to preserve fine-grained strand details, while explicit optimization-based approaches achieve high-fidelity reconstructions at the cost of heavy computation and poor scalability. To address this issue, we propose EfficientMonoHair, a fast and accurate framework that combines the implicit neural network with multi-view geometric fusion for strand-level reconstruction from monocular video. Our method introduces a fusion-patch-based multi-view optimization that reduces the number of optimization iterations for point cloud direction, as well as a novel parallel hair-growing strategy that relaxes voxel occupancy constraints, allowing large-scale strand tracing to remain stable and robust even under inaccurate or noisy orientation fields. Extensive experiments on representative real-world hairstyles demonstrate that our method can robustly reconstruct high-fidelity strand geometries with accuracy. On synthetic benchmarks, our method achieves reconstruction quality comparable to state-of-the-art methods, while improving runtime efficiency by nearly an order of magnitude.

[254] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang

Main category: cs.CV

TL;DR: WikiSeeker is a novel multimodal RAG framework for KB-VQA that introduces a multimodal retriever and redefines VLMs as specialized agents (Refiner and Inspector) to improve retrieval and answer generation.

Motivation: Current multimodal RAG methods for KB-VQA primarily rely on images as retrieval keys and underutilize VLMs, failing to fully leverage their potential for improving retrieval and answer generation.

Method: Proposes WikiSeeker with: 1) Multimodal retriever using both image and text, 2) VLM as Refiner to rewrite textual queries based on input images, 3) VLM as Inspector to selectively route reliable retrieved context to LLM for answer generation or use VLM’s internal knowledge when retrieval is unreliable.

Result: Achieves state-of-the-art performance on EVQA, InfoSeek, and M2KR benchmarks with substantial improvements in both retrieval accuracy and answer quality.

Conclusion: WikiSeeker effectively bridges gaps in multimodal RAG by better utilizing VLMs and multimodal retrieval, demonstrating significant performance improvements in KB-VQA tasks.

Abstract: Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM’s internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.
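The two-agent pipeline above can be sketched as plain routing logic; every callable here is a hypothetical stand-in for the corresponding component, not WikiSeeker's actual API:

```python
def answer(question, image, refiner, retriever, inspector, llm, vlm):
    """Illustrative control flow for the two VLM agents: the Refiner
    rewrites the query from the image, and the Inspector routes
    reliable retrieval to the LLM, falling back to the VLM's internal
    knowledge otherwise. All callables are hypothetical stand-ins."""
    refined = refiner(question, image)       # VLM-as-Refiner: query rewrite
    docs = retriever(refined, image)         # multimodal retrieval
    if inspector(question, image, docs):     # VLM-as-Inspector: reliability check
        return llm(question, docs)           # decoupled generation via LLM
    return vlm(question, image)              # VLM internal-knowledge fallback
```

The key design point is the decoupling: retrieval quality is judged before generation, so an unreliable context never contaminates the final answer.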

[255] Learn to Rank: Visual Attribution by Learning Importance Ranking

David Schinagl, Christian Fruhwirth-Reisinger, Alexander Prutsch, Samuel Schulter, Horst Possegger

Main category: cs.CV

TL;DR: Proposes a learning-based explainer for vision models that directly optimizes deletion/insertion metrics via differentiable permutation learning using Gumbel-Sinkhorn, producing pixel-level attributions with boundary-aligned explanations.

Motivation: Existing visual attribution methods face a three-way trade-off: propagation-based approaches are efficient but biased/architecture-specific; perturbation-based methods are causally grounded but expensive and coarse for vision transformers; learning-based explainers are fast but optimize surrogate objectives. Need a method that directly optimizes interpretability metrics while producing high-quality, pixel-level explanations.

Method: Frames attribution generation as permutation learning, using Gumbel-Sinkhorn to create a differentiable relaxation of hard sorting/ranking needed for deletion/insertion metrics. Enables end-to-end training through attribution-guided perturbations of the target model. During inference, produces dense pixel-level attributions in single forward pass with optional gradient refinement.

Result: Demonstrates consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models. Produces high-quality pixel-level attributions that better align with object boundaries.

Conclusion: Proposed method successfully addresses limitations of existing approaches by directly optimizing interpretability metrics through differentiable permutation learning, resulting in improved attribution quality especially for modern vision transformer architectures.

Abstract: Interpreting the decisions of complex computer vision models is crucial to establish trust and accountability, especially in safety-critical domains. An established approach to interpretability is generating visual attribution maps that highlight regions of the input most relevant to the model’s prediction. However, existing methods face a three-way trade-off. Propagation-based approaches are efficient, but they can be biased and architecture-specific. Meanwhile, perturbation-based methods are causally grounded, yet they are expensive and for vision transformers often yield coarse, patch-level explanations. Learning-based explainers are fast but usually optimize surrogate objectives or distill from heuristic teachers. We propose a learning scheme that instead optimizes deletion and insertion metrics directly. Since these metrics depend on non-differentiable sorting and ranking, we frame them as permutation learning and replace the hard sorting with a differentiable relaxation using Gumbel-Sinkhorn. This enables end-to-end training through attribution-guided perturbations of the target model. During inference, our method produces dense, pixel-level attributions in a single forward pass with optional, few-step gradient refinement. Our experiments demonstrate consistent quantitative improvements and sharper, boundary-aligned explanations, particularly for transformer-based vision models.
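The Gumbel-Sinkhorn operator the method builds on is a general published relaxation (Mena et al., 2018) rather than something introduced by this paper, so its core can be sketched independently. Below is a minimal NumPy illustration; the match-scores-to-sorted-anchors construction and the temperature value are illustrative choices, not the authors' implementation.

```python
import numpy as np

def sinkhorn(log_alpha, n_iters=30):
    """Alternately normalize rows and columns in log space so that
    exp(log_alpha) approaches a doubly-stochastic matrix."""
    for _ in range(n_iters):
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=1, keepdims=True)
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=0, keepdims=True)
    return np.exp(log_alpha)

def gumbel_sinkhorn(scores, tau=0.05, noise_scale=1.0, rng=None):
    """Soft-sort: map importance scores (n,) to an (n, n) near-permutation
    matrix by soft-matching each score to the sorted anchor positions.
    Gumbel noise makes the assignment stochastic; lower tau makes it harder."""
    rng = np.random.default_rng(rng)
    anchors = np.sort(scores)  # target rank positions (ascending)
    log_alpha = -(scores[:, None] - anchors[None, :]) ** 2
    if noise_scale > 0:
        gumbel = -np.log(-np.log(rng.uniform(size=log_alpha.shape)))
        log_alpha = log_alpha + noise_scale * gumbel
    return sinkhorn(log_alpha / tau)

# Noise-free sanity check: argmax of each row is that score's ascending rank.
scores = np.array([0.3, 2.0, -1.0])
P = gumbel_sinkhorn(scores, noise_scale=0.0)
print(P.argmax(axis=1))  # 0.3 -> rank 1, 2.0 -> rank 2, -1.0 -> rank 0
```

Because every step is differentiable, a deletion/insertion-style objective computed through the soft permutation can be backpropagated to the attribution scores, which is what enables the end-to-end training described above.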

[256] Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu

Main category: cs.CV

TL;DR: Etch is a black-box attack framework that exploits text-to-image models’ text rendering capabilities to embed harmful textual payloads in visually benign images, bypassing safety filters through semantic camouflage, visual-spatial anchoring, and typographic encoding.

Motivation: The paper identifies a new vulnerability in modern text-to-image models: their ability to render legible text enables "inscriptive jailbreaks" where adversaries can embed harmful textual content in visually benign images, bypassing current safety mechanisms designed for visual content filtering.

Method: Etch decomposes adversarial prompts into three orthogonal layers: semantic camouflage (hiding harmful intent in benign descriptions), visual-spatial anchoring (controlling text placement), and typographic encoding (ensuring character fidelity). It uses a zero-order optimization loop where a vision-language model critiques generated images, localizes failures to specific layers, and prescribes targeted revisions.

Result: Extensive evaluations across 7 models on 2 benchmarks show Etch achieves average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. This reveals a critical blind spot in current T2I safety alignments.

Conclusion: The paper demonstrates a fundamental vulnerability in text-to-image safety mechanisms and underscores the urgent need for typography-aware multimodal defense mechanisms that can detect and prevent inscriptive jailbreaks.

Abstract: Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.

[257] Neural Network Pruning via QUBO Optimization

Osama Orabi, Artur Zagitov, Hadi Salloum, Viktor A. Lobachev, Kasymkhan Khubiev, Yaroslav Kholodov

Main category: cs.CV

TL;DR: Hybrid QUBO framework for neural network pruning combines gradient-aware sensitivity metrics with activation similarity, using Tensor-Train refinement for optimization.

Motivation: Existing neural network pruning methods rely on greedy heuristics that ignore complex filter interactions, while formal optimization methods like QUBO underperform due to oversimplified objectives based on metrics like L1-norm.

Method: Proposes a unified Hybrid QUBO framework integrating gradient-aware sensitivity metrics (first-order Taylor and second-order Fisher information) in the linear term and data-driven activation similarity in the quadratic term. Includes dynamic capacity-driven search for strict sparsity enforcement and a two-stage pipeline with Tensor-Train Refinement for gradient-free optimization.

Result: Experiments on SIDD image denoising dataset show Hybrid QUBO significantly outperforms both greedy Taylor pruning and traditional L1-based QUBO, with Tensor-Train Refinement providing further consistent gains at appropriate combinatorial scales.

Conclusion: Highlights the potential of hybrid combinatorial formulations for robust, scalable, and interpretable neural network compression.

Abstract: Neural network pruning can be formulated as a combinatorial optimization problem, yet most existing approaches rely on greedy heuristics that ignore complex interactions between filters. Formal optimization methods such as Quadratic Unconstrained Binary Optimization (QUBO) provide a principled alternative but have so far underperformed due to oversimplified objective formulations based on metrics like the L1-norm. In this work, we propose a unified Hybrid QUBO framework that bridges heuristic importance estimation with global combinatorial optimization. Our formulation integrates gradient-aware sensitivity metrics - specifically first-order Taylor and second-order Fisher information - into the linear term, while utilizing data-driven activation similarity in the quadratic term. This allows the QUBO objective to jointly capture individual filter relevance and inter-filter functional redundancy. We further introduce a dynamic capacity-driven search to strictly enforce target sparsity without distorting the optimization landscape. Finally, we employ a two-stage pipeline featuring a Tensor-Train (TT) Refinement stage - a gradient-free optimizer that fine-tunes the QUBO-derived solution directly against the true evaluation metric. Experiments on the SIDD image denoising dataset demonstrate that the proposed Hybrid QUBO significantly outperforms both greedy Taylor pruning and traditional L1-based QUBO, with TT Refinement providing further consistent gains at appropriate combinatorial scales. This highlights the potential of hybrid combinatorial formulations for robust, scalable, and interpretable neural network compression.
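The linear-plus-quadratic structure of the objective can be made concrete with a toy instance. All numbers below (importance scores, similarities, the penalty weight, and the brute-force solver) are illustrative stand-ins, not the paper's values or its Tensor-Train optimizer.

```python
import itertools
import numpy as np

# Hypothetical per-filter importance (e.g. first-order Taylor scores) and
# pairwise activation similarity -- illustrative numbers, not the paper's.
importance = np.array([0.90, 0.10, 0.85, 0.15])
similarity = np.array([[1.0, 0.2, 0.9, 0.1],
                       [0.2, 1.0, 0.1, 0.3],
                       [0.9, 0.1, 1.0, 0.2],
                       [0.1, 0.3, 0.2, 1.0]])

# QUBO matrix: keeping filter i (x_i = 1) earns -importance[i] on the
# diagonal; keeping two similar filters together pays a redundancy penalty.
lam = 0.5
Q = lam * similarity.copy()
np.fill_diagonal(Q, -importance)

def energy(mask):
    x = np.asarray(mask, dtype=float)
    return x @ Q @ x

# Exhaustive search over masks keeping exactly k filters (a stand-in for
# the paper's capacity-constrained QUBO solver).
k = 2
best = min((m for m in itertools.product([0, 1], repeat=4) if sum(m) == k),
           key=energy)
print(best)  # (1, 0, 0, 1)
```

Note that greedy selection by importance alone would keep filters 0 and 2, which the quadratic term rejects as mutually redundant (similarity 0.9); the QUBO optimum keeps filters 0 and 3 instead, which is exactly the kind of interaction greedy heuristics miss.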

[258] Automatic dental superimposition of 3D intraorals and 2D photographs for human identification

Antonio D. Villegas-Yeguas, Xavier Abreau-Freire, Guillermo R-García, Andrea Valsecchi, Teresa Pinho, Daniel Pérez-Mongiovi, Oscar Ibáñez, Oscar Cordón

Main category: cs.CV

TL;DR: A 3D-2D computer vision approach for dental identification using post-mortem 3D scans and ante-mortem photos, addressing perspective distortion and providing objective morphological comparison scores.

Motivation: Dental comparison is crucial for identification but faces challenges with missing ante-mortem records and perspective distortion in photos. Social media photos with visible teeth offer potential but current methods lack proper modeling of perspective distortion and objective quantification.

Method: Two automatic approaches: 1) using paired landmarks between 3D post-mortem scans and 2D ante-mortem photos, and 2) using tooth region segmentation to estimate camera parameters. Both replicate ante-mortem images with 3D models for morphological comparison.

Result: Tested on 20,164 cross comparisons from 142 samples, achieving mean ranking values of 1.6 and 1.5 respectively. Outperforms automatic dental chart comparison approaches and provides objective, quantitative scores with visualizable superimposed images.

Conclusion: The 3D-2D approach enables automatic, objective dental morphological comparison using photos, addressing perspective distortion and providing interpretable quantitative scores for forensic identification.

Abstract: Dental comparison is considered a primary identification method, at the level of fingerprints and DNA profiling. One crucial but time-consuming step of this method is the morphological comparison. One of the main challenges in applying this method is the lack of ante-mortem medical records, especially in scenarios such as migrant deaths at the border and/or in countries where there is no universal healthcare. The availability of photos on social media where teeth are visible has led many odontologists to consider morphological comparison using them. However, state-of-the-art proposals have significant limitations, including the lack of proper modeling of perspective distortion and the absence of objective approaches that quantify morphological differences. Our proposal involves a 3D (post-mortem scan) - 2D (ante-mortem photos) approach. Using computer vision and optimization techniques, we replicate the ante-mortem image with the 3D model to perform the morphological comparison. Two automatic approaches have been developed: i) using paired landmarks and ii) using a segmentation of the teeth region to estimate camera parameters. Both obtain very promising results over 20,164 cross comparisons from 142 samples, with mean ranking values of 1.6 and 1.5, respectively. These results clearly outperform the filtering capabilities of automatic dental chart comparison approaches, while providing an automatic, objective, and quantitative score of the morphological correspondence that is easy to interpret and analyze by visualizing superimposed images.
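The abstract does not specify the solver behind the landmark-based variant; the classical Direct Linear Transform (DLT) is one standard way to estimate a camera projection matrix from paired 3D-2D landmarks, sketched here on synthetic data (not the authors' pipeline).

```python
import numpy as np

def dlt_projection(pts3d, pts2d):
    """Estimate a 3x4 projection matrix from >= 6 paired 3D-2D landmarks
    via the Direct Linear Transform, solved with SVD (defined up to scale)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return Vt[-1].reshape(3, 4)  # right singular vector of the smallest singular value

def project(P, pts3d):
    """Apply a projection matrix to 3D points (homogeneous divide)."""
    Xh = np.hstack([pts3d, np.ones((len(pts3d), 1))])
    xh = Xh @ P.T
    return xh[:, :2] / xh[:, 2:3]

# Synthetic round trip: project points with a known camera, then recover it.
rng = np.random.default_rng(0)
P_true = np.array([[800.0, 0.0, 320.0, 10.0],
                   [0.0, 800.0, 240.0, 20.0],
                   [0.0, 0.0, 1.0, 5.0]])
pts3d = rng.uniform(-1, 1, size=(8, 3))
pts2d = project(P_true, pts3d)
P_est = dlt_projection(pts3d, pts2d)
reproj_err = np.abs(project(P_est, pts3d) - pts2d).max()
print(reproj_err)  # ~0 on noise-free data
```

Once the camera is estimated, rendering the 3D scan through it yields a synthetic 2D view that can be superimposed on the ante-mortem photo, which is the comparison step the abstract describes.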

[259] Physics-Aware Video Instance Removal Benchmark

Zirui Li, Xinghao Chen, Lingyu Jiang, Dengzhe Hou, Fangzhou Lin, Kazunori Yamada, Xiangbo Gao, Zhengzhong Tu

Main category: cs.CV

TL;DR: PVIR benchmark for physics-aware video instance removal with 95 videos, evaluating methods on semantic, visual, and spatial dimensions, showing current methods struggle with complex physical interactions.

Motivation: Current video instance removal benchmarks focus on visual plausibility but overlook physical causalities like shadows and reflections triggered by object removal, creating a need for physics-aware evaluation.

Method: Introduces PVIR benchmark with 95 high-quality videos annotated with instance masks and removal prompts, partitioned into Simple and Hard subsets. Evaluates four methods (PISCO-Removal, UniVideo, DiffuEraser, CoCoCo) using decoupled human evaluation across instruction following, rendering quality, and edit exclusivity dimensions.

Result: PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser introduces blurring artifacts and CoCoCo struggles with instruction following. Performance drops significantly on Hard subset highlighting challenges with complex physical side effects.

Conclusion: Physics-aware video instance removal remains challenging, especially for complex physical interactions. The PVIR benchmark provides a more comprehensive evaluation framework that considers physical causalities beyond visual plausibility.

Abstract: Video Instance Removal (VIR) requires removing target objects while maintaining background integrity and physical consistency, such as specular reflections and illumination interactions. Despite advancements in text-guided editing, current benchmarks primarily assess visual plausibility, often overlooking the physical causalities, such as lingering shadows, triggered by object removal. We introduce the Physics-Aware Video Instance Removal (PVIR) benchmark, featuring 95 high-quality videos annotated with instance-accurate masks and removal prompts. PVIR is partitioned into Simple and Hard subsets, the latter explicitly targeting complex physical interactions. We evaluate four representative methods, PISCO-Removal, UniVideo, DiffuEraser, and CoCoCo, using a decoupled human evaluation protocol across three dimensions: instruction following, rendering quality, and edit exclusivity, which respectively isolate semantic, visual, and spatial failures. Our results show that PISCO-Removal and UniVideo achieve state-of-the-art performance, while DiffuEraser frequently introduces blurring artifacts and CoCoCo struggles significantly with instruction following. The persistent performance drop on the Hard subset highlights the ongoing challenge of recovering complex physical side effects.

[260] AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Dong She, Xianrong Yao, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin

Main category: cs.CV

TL;DR: AICA-Bench benchmark for affective image content analysis with three tasks: emotion understanding, reasoning, and generation, plus GAT prompting framework to improve VLMs’ affective capabilities

Motivation: Current Vision-Language Models (VLMs) have strong perception capabilities but lack holistic Affective Image Content Analysis (AICA) that integrates perception, reasoning, and generation into a unified framework for emotional understanding and generation

Method: Created AICA-Bench benchmark with three core tasks (Emotion Understanding, Emotion Reasoning, Emotion-Guided Content Generation), evaluated 23 VLMs, and proposed Grounded Affective Tree (GAT) Prompting - a training-free framework combining visual scaffolding with hierarchical reasoning

Result: Identified two major limitations in current VLMs: weak intensity calibration and shallow open-ended descriptions. GAT prompting reduces intensity errors and improves descriptive depth, providing a strong baseline for affective multimodal understanding

Conclusion: AICA-Bench addresses the gap in holistic affective analysis, and GAT prompting effectively improves VLMs’ emotional understanding and generation capabilities without additional training

Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

[261] Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang

Main category: cs.CV

TL;DR: CLIP models exhibit center bias - disproportionately focusing on central image regions while overlooking important objects near boundaries, which persists in recent variants and limits fine-grained understanding.

Motivation: Despite improvements, contrastive vision-language models like CLIP still lack fine-grained visual understanding. The paper identifies a specific failure mode called "center bias" where models focus excessively on central image regions and miss important boundary objects, fundamentally limiting their ability to perform sophisticated tasks that depend on recognizing all relevant objects.

Method: The authors conduct analyses from representation and attention perspectives using interpretability methods: embedding decomposition and attention map analysis. They investigate how information loss during visual embedding aggregation, particularly through pooling mechanisms, causes relevant concepts associated with off-center objects to vanish from final representations.

Result: The analysis reveals that center bias persists in CLIP family models, including recent variants. The bias stems from information loss during visual embedding aggregation where pooling mechanisms cause off-center object concepts to disappear from final representations. The paper shows this bias can be alleviated with training-free strategies like visual prompting and attention redistribution.

Conclusion: Center bias is a fundamental limitation in CLIP models that affects their fine-grained visual understanding. The bias can be mitigated through training-free interventions like visual prompting and attention redistribution, which redirect model attention to off-center regions without requiring retraining.

Abstract: Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental, as failure to recognize relevant objects makes it difficult to perform any sophisticated tasks that depend on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts, especially those associated with off-center objects, vanish from the model’s final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution, by redirecting the model’s attention to off-center regions.
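The exact redistribution scheme is not given in the abstract; the sketch below is a toy stand-in for the general idea of training-free attention redistribution: upweight CLS-to-patch attention by distance from the image center, then renormalize. The grid size, toy attention values, and boost factor are all illustrative assumptions.

```python
import numpy as np

def redistribute_attention(attn, grid, boost=2.0):
    """attn: (grid*grid,) CLS-to-patch attention. Upweight each patch by its
    normalized distance from the image center, then renormalize.
    A toy training-free intervention; the boost schedule is an assumption."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    c = (grid - 1) / 2.0
    dist = np.sqrt((ys - c) ** 2 + (xs - c) ** 2)
    dist = dist / dist.max()
    w = attn * (1.0 + (boost - 1.0) * dist)
    return w / w.sum()

# Toy 4x4 grid with center-heavy attention (rows 1-2, cols 1-2 dominate).
attn = np.full(16, 1.0)
attn[[5, 6, 9, 10]] = 5.0
attn = attn / attn.sum()
out = redistribute_attention(attn, grid=4)
corners = [0, 3, 12, 15]
print(out[corners].sum() > attn[corners].sum())  # more mass at the boundary
```

Because pooling weighted by the redistributed attention gives boundary patches a larger share of the final embedding, off-center concepts are less likely to vanish from the pooled representation, without retraining the model.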

[262] Selective Aggregation of Attention Maps Improves Diffusion-Based Visual Interpretation

Jungwon Park, Jungmin Ko, Dongnam Byun, Wonjong Rhee

Main category: cs.CV

TL;DR: Selective aggregation of cross-attention maps from relevant heads improves interpretability and segmentation performance in text-to-image models.

Motivation: While cross-attention maps are widely used in text-to-image models for interpretability and applications, the distinct characteristics of attention maps from different attention heads remain underexplored. The paper aims to investigate how selective aggregation of attention maps from heads most relevant to target concepts can enhance visual interpretability.

Method: The approach involves identifying and selectively aggregating cross-attention maps from attention heads that are most relevant to specific target concepts. This selective aggregation is compared against existing methods like DAAM (diffusion-based segmentation method) for evaluating segmentation performance.

Result: The method achieves higher mean IoU scores compared to DAAM. The most relevant heads capture concept-specific features more accurately than the least relevant ones, and selective aggregation helps diagnose prompt misinterpretations in text-to-image generation.

Conclusion: Attention head selection offers a promising direction for improving both the interpretability and controllability of text-to-image generation by leveraging the distinct characteristics of different attention heads.

Abstract: Numerous studies on text-to-image (T2I) generative models have utilized cross-attention maps to boost application performance and interpret model behavior. However, the distinct characteristics of attention maps from different attention heads remain relatively underexplored. In this study, we show that selectively aggregating cross-attention maps from heads most relevant to a target concept can improve visual interpretability. Compared to the diffusion-based segmentation method DAAM, our approach achieves higher mean IoU scores. We also find that the most relevant heads capture concept-specific features more accurately than the least relevant ones, and that selective aggregation helps diagnose prompt misinterpretations. These findings suggest that attention head selection offers a promising direction for improving the interpretability and controllability of T2I generation.
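The head-relevance criterion itself is not specified in the abstract; below is a minimal sketch of the selective-aggregation idea, scoring each head's map by correlation with a stand-in concept mask and averaging only the top-k. The scoring rule and toy data are assumptions for illustration.

```python
import numpy as np

def aggregate_relevant_heads(attn_maps, reference, k=3):
    """attn_maps: (H, P) cross-attention over P patches for one text token,
    one row per head. reference: (P,) relevance signal for the concept
    (a stand-in mask here; the paper's head scoring may differ).
    Returns the mean map of the k heads most correlated with the reference,
    plus the indices of the selected heads."""
    scores = np.array([np.corrcoef(m, reference)[0, 1] for m in attn_maps])
    top = np.argsort(scores)[-k:]
    return attn_maps[top].mean(axis=0), top

rng = np.random.default_rng(1)
mask = (np.arange(16) < 4).astype(float)       # concept occupies patches 0-3
good = mask + 0.05 * rng.random((3, 16))       # 3 heads that track the concept
noisy = rng.random((5, 16))                    # 5 heads attending elsewhere
attn = np.vstack([good, noisy])
agg, picked = aggregate_relevant_heads(attn, mask, k=3)
print(sorted(picked.tolist()))  # only the concept-tracking heads are kept
```

Averaging over all eight heads would dilute the concept signal with the five unrelated maps; keeping only the relevant heads is what sharpens the resulting attribution.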

[263] Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction

Yangyi Xiao, Siting Zhu, Baoquan Yang, Tianchen Deng, Yongbo Chen, Hesheng Wang

Main category: cs.CV

TL;DR: ADM-GS is a framework for multi-traversal scene reconstruction that decomposes static background appearance into traversal-invariant material properties and traversal-dependent illumination using Gaussian splatting and neural light fields.

Motivation: Multi-traversal scene reconstruction is crucial for autonomous driving simulation and digital twins, but suffers from appearance inconsistencies across different traversals due to varying illumination and environmental conditions, despite shared underlying geometry.

Method: Proposes ADM-GS with explicit appearance decomposition of static background into traversal-invariant material and traversal-dependent illumination. Uses a neural light field with frequency-separated hybrid encoding strategy incorporating surface normals and explicit reflection vectors to separately capture low-frequency diffuse illumination and high-frequency specular reflections.

Result: Achieves +0.98 dB PSNR improvement over existing latent-based baselines on Argoverse 2 and Waymo Open datasets, producing more consistent appearance across traversals.

Conclusion: ADM-GS effectively addresses appearance inconsistency in multi-traversal reconstruction through explicit appearance decomposition, enabling better scene consistency for autonomous driving applications.

Abstract: Multi-traversal scene reconstruction is important for high-fidelity autonomous driving simulation and digital twin construction. This task involves integrating multiple sequences captured from the same geographical area at different times. In this context, a primary challenge is the significant appearance inconsistency across traversals caused by varying illumination and environmental conditions, despite the shared underlying geometry. This paper presents ADM-GS (Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction), a framework that applies an explicit appearance decomposition to the static background to alleviate appearance entanglement across traversals. For the static background, we decompose the appearance into traversal-invariant material, representing intrinsic material properties, and traversal-dependent illumination, capturing lighting variations. Specifically, we propose a neural light field that utilizes a frequency-separated hybrid encoding strategy. By incorporating surface normals and explicit reflection vectors, this design separately captures low-frequency diffuse illumination and high-frequency specular reflections. Quantitative evaluations on the Argoverse 2 and Waymo Open datasets demonstrate the effectiveness of ADM-GS. In multi-traversal experiments, our method achieves a +0.98 dB PSNR improvement over existing latent-based baselines while producing more consistent appearance across traversals. Code will be available at https://github.com/IRMVLab/ADM-GS.
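Two ingredients named above are standard and easy to make concrete: the explicit reflection vector r = 2(n·v)n − v and sinusoidal encodings at separated frequency bands. The specific bands below are illustrative, not the paper's.

```python
import numpy as np

def reflect(v, n):
    """Reflect view direction v about unit surface normal n: r = 2(n.v)n - v."""
    return 2.0 * np.dot(n, v) * n - v

def fourier_encode(x, freqs):
    """Sinusoidal positional encoding of a 3-vector at given frequency bands."""
    return np.concatenate([f(x * w) for w in freqs for f in (np.sin, np.cos)])

# Frequency separation (bands are illustrative): normals get low frequencies
# (smooth diffuse illumination), reflection vectors get high frequencies
# (sharp specular detail).
n = np.array([0.0, 0.0, 1.0])
v = np.array([0.6, 0.0, 0.8])
r = reflect(v, n)
diffuse_feat = fourier_encode(n, freqs=[1.0, 2.0])
specular_feat = fourier_encode(r, freqs=[4.0, 8.0, 16.0])
print(r, diffuse_feat.shape, specular_feat.shape)
```

Feeding the low-frequency features and high-frequency features to separate branches of a neural light field is one way such a frequency-separated hybrid encoding can keep diffuse and specular illumination disentangled.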

[264] Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

Jingbo Sun, Qichao Zhang, Songjun Tu, Xing Fang, Yupeng Zheng, Haoran Li, Ke Chen, Dongbin Zhao

Main category: cs.CV

TL;DR: SRCP improves zero-shot generalization in visual unsupervised RL by addressing SR limitations through saliency-guided dynamics representation and consistency policy learning.

DetailsMotivation: Successor representations struggle in high-dimensional visual environments due to suboptimal representations that focus on irrelevant regions and hinder skill controllability.

Method: SRCP decouples representation learning via saliency-guided dynamics tasks and integrates fast-sampling consistency policy with URL-specific classifier-free guidance.

Result: Achieves state-of-the-art zero-shot generalization on 16 tasks across 4 datasets from ExORL benchmark, compatible with various SR methods.

Conclusion: SRCP effectively addresses SR limitations in visual URL, improving representation quality and skill controllability for better zero-shot generalization.

Abstract: Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.

[265] MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li

Main category: cs.CV

TL;DR: MMEmb-R1 is an adaptive reasoning-based multimodal embedding framework that selectively applies chain-of-thought reasoning only when beneficial for query-target alignment, addressing structural misalignment and unnecessary computation issues in MLLM-based embeddings.

Motivation: Current MLLMs underutilize generative reasoning capabilities for multimodal embedding tasks. Direct chain-of-thought reasoning integration faces two challenges: structural misalignment between instance-level reasoning and pairwise contrastive supervision (leading to shortcut learning), and the fact that reasoning isn't universally beneficial for all embedding tasks (introducing unnecessary computation and potentially obscuring semantic signals for simple cases).

Method: Proposes MMEmb-R1 framework that formulates reasoning as a latent variable and introduces pair-aware reasoning selection using counterfactual intervention to identify beneficial reasoning paths. Employs reinforcement learning to selectively invoke reasoning only when necessary, reducing overhead and latency.

Result: Achieves state-of-the-art score of 71.2 on MMEB-V2 benchmark with only 4B parameters, while significantly reducing reasoning overhead and inference latency compared to approaches that always apply reasoning.

Conclusion: The adaptive reasoning approach effectively addresses structural misalignment and computational inefficiency in MLLM-based multimodal embeddings, demonstrating that selective reasoning invocation can achieve superior performance with reduced overhead.

Abstract: MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for embedding tasks. Enforcing reasoning for all inputs may introduce unnecessary computation and latency, and can even obscure salient semantic signals for simple cases. To address these issues, we propose MMEmb-R1, an adaptive reasoning-based multimodal embedding framework. We formulate reasoning as a latent variable and introduce pair-aware reasoning selection that employs counterfactual intervention to identify reasoning paths beneficial for query-target alignment. Furthermore, we adopt reinforcement learning to selectively invoke reasoning only when necessary. Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.

[266] SonoSelect: Efficient Ultrasound Perception via Active Probe Exploration

Yixin Zhang, Yunzhong Hou, Longqi Li, Zhenyue Qin, Yang Liu, Yue Yao

Main category: cs.CV

TL;DR: SonoSelect: Active view exploration method for ultrasound that adaptively guides probe movement using 3D spatial memory and ultrasound-specific objectives to reduce redundancy while maintaining diagnostic accuracy.

Motivation: Ultrasound diagnosis requires multiple scan views, but exhaustive scanning introduces redundancy and increases costs. Not all probe views are equally informative, so there's a need for intelligent, adaptive view selection to optimize scanning efficiency.

Method: Formulates ultrasound active view exploration as sequential decision-making problem. Each new 2D view is fused into 3D spatial memory of observed anatomy, which guides next probe position. Uses ultrasound-specific objective favoring movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning.

Result: Achieves promising multi-view organ classification accuracy using only 2 out of N views. For kidney cyst detection task, reaches 54.56% kidney coverage and 35.13% cyst coverage with short trajectories consistently centered on target cyst.

Conclusion: SonoSelect demonstrates effective adaptive view selection for ultrasound, reducing scanning redundancy while maintaining diagnostic capability through intelligent probe guidance based on 3D spatial memory and ultrasound-specific objectives.

Abstract: Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments on an ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 54.56% kidney coverage and 35.13% cyst coverage, with short trajectories consistently centered on the target cyst.

[267] Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

Tianyi Liu, Yiming Li, Wenqian Wang, Jiaojiao Wang, Chen Cai, Yi Wang, Kim-Hui Yap

Main category: cs.CV

TL;DR: MoME framework with HTL strategy enables adaptive multimodal fusion through modality-specific experts and holistic token learning for improved action recognition.

Motivation: Existing multimodal methods use fixed fusion modules that can't adapt to changing modality reliability or capture fine-grained action cues, limiting robust multimodal visual analytics.

Method: Proposes Mixture-of-Modality-Experts (MoME) framework with Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves intra-expert refinement and inter-expert knowledge transfer using class tokens and spatio-temporal tokens.
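Adaptive collaboration among modality experts is typically realized with input-dependent gating. A minimal sketch of that idea, assuming a softmax gate over expert outputs (the `mome_fuse` name and gating form are our illustration, not the paper's exact module):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def mome_fuse(expert_outputs, gate_logits):
    """Fuse modality-expert outputs with input-dependent gates, so the
    mixture can down-weight a modality when it is unreliable for a given
    input (e.g. a dark RGB frame vs. a clean depth stream)."""
    w = softmax(gate_logits)                       # (n_experts,)
    return np.tensordot(w, np.stack(expert_outputs), axes=1)
```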

Result: Outperforms representative single-modal and multimodal baselines on driver action recognition benchmark. HTL improves subtle multimodal understanding and offers better interpretability.

Conclusion: Forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion, validated on driver action recognition.

Abstract: Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.

[268] Multi-Modal Landslide Detection from Sentinel-1 SAR and Sentinel-2 Optical Imagery Using Multi-Encoder Vision Transformers and Ensemble Learning

Ioannis Nasios

Main category: cs.CV

TL;DR: A multi-model framework combining Sentinel-2 optical and Sentinel-1 SAR data with vision transformers and ensemble learning for robust landslide detection, achieving state-of-the-art F1 score of 0.919.

Motivation: Landslides are major geohazards requiring accurate and timely detection for disaster risk reduction. Current approaches need improved robustness and accuracy through better fusion of complementary optical and radar data modalities.

Method: Modular multi-model framework using multi-encoder vision transformers with separate lightweight pretrained encoders for each modality (optical and SAR). Integrates ensemble learning combining neural networks with gradient boosting models (LightGBM, XGBoost), and includes derived spectral indices like NDVI for enhanced vegetation/surface change sensitivity.
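The derived-index and ensembling steps are easy to sketch. The NDVI formula is standard ((NIR − Red)/(NIR + Red), Sentinel-2 bands B8 and B4); the equal-weight averaging below is an assumption for illustration, not necessarily the paper's exact blend:

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index; eps guards against
    division by zero over water or shadow pixels."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

def ensemble_probs(prob_list, weights=None):
    """Blend per-model patch-level landslide probabilities (e.g. from the
    transformer, LightGBM, and XGBoost heads); defaults to equal weights."""
    probs = np.stack([np.asarray(p, dtype=float) for p in prob_list])
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    return np.tensordot(np.asarray(weights, dtype=float), probs, axes=1)
```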

Result: Achieved state-of-the-art F1 score of 0.919 on landslide detection, demonstrated top performance in machine learning competition with strong precision-recall balance. Works without pre-event Sentinel-2 data in non-classical change detection setting.

Conclusion: The framework effectively leverages complementary strengths of optical and radar data, offers flexible configurations (optical-only, SAR-only, or combined), and provides transferable approach for broader natural hazard monitoring and environmental change applications.

Abstract: Landslides represent a major geohazard with severe impacts on human life, infrastructure, and ecosystems, underscoring the need for accurate and timely detection approaches to support disaster risk reduction. This study proposes a modular, multi-model framework that fuses Sentinel-2 optical imagery with Sentinel-1 Synthetic Aperture Radar (SAR) data, for robust landslide detection. The methodology leverages multi-encoder vision transformers, where each data modality is processed through separate lightweight pretrained encoders, achieving strong performance in landslide detection. In addition, the integration of multiple models, particularly the combination of neural networks and gradient boosting models (LightGBM and XGBoost), demonstrates the power of ensemble learning to further enhance accuracy and robustness. Derived spectral indices, such as NDVI, are integrated alongside original bands to enhance sensitivity to vegetation and surface changes. The proposed methodology achieves a state-of-the-art F1 score of 0.919 on landslide detection, addressing a patch-based classification task rather than pixel-level segmentation and operating without pre-event Sentinel-2 data, highlighting its effectiveness in a non-classical change detection setting. It also demonstrated top performance in a machine learning competition, achieving a strong balance between precision and recall and highlighting the advantages of explicitly leveraging the complementary strengths of optical and radar data. The conducted experiments and research also emphasize scalability and operational applicability, enabling flexible configurations with optical-only, SAR-only, or combined inputs, and offering a transferable framework for broader natural hazard monitoring and environmental change applications. Full training and inference code can be found at https://github.com/IoannisNasios/sentinel-landslide-cls.

[269] HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation

Tao Hu, Varun Jampani

Main category: cs.CV

TL;DR: HumANDiff: A framework for human video generation using articulated motion-consistent noise sampling and joint appearance-motion learning to enhance motion control and physical consistency.

Motivation: Current generative video diffusion models struggle to capture realistic human motion dynamics and physics, particularly for motion-dependent effects like clothing wrinkles. There's a need for better motion control in human video generation.

Method: Three key designs: 1) Articulated motion-consistent noise sampling using 3D articulated noise on human body template; 2) Joint appearance-motion learning predicting both pixel appearances and physical motions; 3) Geometric motion consistency loss in articulated noise space.
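The core idea behind design 1, noise that follows the body instead of being redrawn per frame, can be sketched as follows. This is our interpretation, not the released code; `latent_to_surface` is a hypothetical precomputed assignment of each latent position to its nearest body-surface sample:

```python
import numpy as np

def articulated_noise(n_surface_pts, latent_to_surface, d, seed=0):
    """Draw one Gaussian vector per body-surface sample, then gather it per
    frame through each latent's surface assignment. The same surface point
    carries the same noise in every frame, so the noise field moves with
    the articulated body rather than being i.i.d. across space and time."""
    rng = np.random.default_rng(seed)
    surf_noise = rng.standard_normal((n_surface_pts, d))
    # latent_to_surface: (n_frames, n_latents) integer surface indices
    return surf_noise[latent_to_surface]    # (n_frames, n_latents, d)
```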

Result: Achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Enables image-to-video generation with intrinsic motion control without additional modules.

Conclusion: HumANDiff provides an effective framework for controllable human video generation that is agnostic to diffusion model design and requires no architectural modifications, enabling scalable motion-controlled video synthesis.

Abstract: Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/

[270] OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

Yukun Wang, Ruihuang Li, Jiale Tao, Shiyuan Yang, Liyi Chen, Zhantao Yang, Handz, Yulan Guo, Shuai Shao, Qinglin Lu

Main category: cs.CV

TL;DR: OmniCamera is a unified framework for video generation that explicitly disentangles scene content from camera motion, enabling independent control over both dimensions through arbitrary pairings of camera and content conditions.

Motivation: Existing video generation models often entangle scene content and camera motion, limiting independent control. The authors aim to create a framework that explicitly disentangles these two crucial axes of video to enable unprecedented creative control.

Method: 1) Constructed OmniCAM, a hybrid dataset combining curated real-world videos with synthetic data for diverse paired examples. 2) Proposed Dual-level Curriculum Co-Training strategy: condition-level (progressive introduction of control modalities by difficulty) and data-level (train on synthetic data for precise control, then adapt to real data for photorealism).
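The two curriculum levels can be expressed as simple schedules. Function names and milestone values below are hypothetical, purely to make the two-level structure concrete:

```python
def active_conditions(step, schedule):
    """Condition-level curriculum: each control modality is switched on
    once training reaches its milestone step, easiest modalities first."""
    return [cond for cond, start in schedule if step >= start]

def data_source(step, synthetic_steps):
    """Data-level curriculum: train for precise control on synthetic data
    first, then adapt to real data for photorealism."""
    return "synthetic" if step < synthetic_steps else "real"
```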

Result: OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.

Conclusion: The compositional approach of disentangling camera motion from scene content, supported by novel dataset construction and training strategy, enables unprecedented creative control in video generation with superior quality.

Abstract: Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unprecedented creative control. To overcome the fundamental challenges of modality conflict and data scarcity inherent in such a system, we present two key innovations. First, we construct OmniCAM, a novel hybrid dataset combining curated real-world videos with synthetic data that provides diverse paired examples for robust multi-task learning. Second, we propose a Dual-level Curriculum Co-Training strategy that mitigates modality interference and synergistically learns from diverse data sources. This strategy operates on two levels: first, it progressively introduces control modalities by difficulties (condition-level), and second, trains for precise control on synthetic data before adapting to real data for photorealism (data-level). As a result, OmniCamera achieves state-of-the-art performance, enabling flexible control for complex camera movements while maintaining superior visual quality.

[271] Toward Aristotelian Medical Representations: Backpropagation-Free Layer-wise Analysis for Interpretable Generalized Metric Learning on MedMNIST

Michael Karnes, Alper Yilmaz

Main category: cs.CV

TL;DR: A-ROM is an interpretable medical imaging framework using pretrained ViTs and kNN classification instead of opaque backpropagation, enabling transparent few-shot learning for clinical applications.

Motivation: Address the "black-box" nature of deep learning models in medical imaging that hinders clinical adoption, by creating an interpretable framework that meets transparency requirements of clinical environments.

Method: Leverages Platonic Representation Hypothesis and pretrained Vision Transformers’ metric space, replaces traditional decision layers with human-readable concept dictionary and kNN classifier for interpretability, avoids gradient-based fine-tuning.
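The interpretable decision layer amounts to kNN voting over frozen ViT embeddings. A self-contained sketch under that reading (the concept-dictionary contents and `knn_predict` signature are our assumptions):

```python
import numpy as np

def knn_predict(query_emb, dict_embs, dict_labels, k=3):
    """Classify a query by majority vote among its k nearest entries in a
    human-readable concept dictionary (cosine similarity). Returning the
    neighbours alongside the label keeps the decision inspectable."""
    q = np.asarray(query_emb, dtype=float)
    q = q / np.linalg.norm(q)
    d = np.asarray(dict_embs, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(-sims)[:k]
    votes = [dict_labels[i] for i in top]
    label = max(set(votes), key=votes.count)
    return label, [(dict_labels[i], float(sims[i])) for i in top]
```

Because the encoder stays frozen, adding a new medical concept only means appending labeled embeddings to the dictionary, with no gradient-based fine-tuning.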

Result: Competitive performance with standard benchmarks on MedMNIST v2 suite, provides simple and scalable few-shot solution with transparency.

Conclusion: A-ROM offers a promising approach to bridge the gap between deep learning performance and clinical interpretability requirements in medical imaging.

Abstract: While deep learning has achieved remarkable success in medical imaging, the “black-box” nature of backpropagation-based models remains a significant barrier to clinical adoption. To bridge this gap, we propose Aristotelian Rapid Object Modeling (A-ROM), a framework built upon the Platonic Representation Hypothesis (PRH). This hypothesis posits that models trained on vast, diverse datasets converge toward a universal and objective representation of reality. By leveraging the generalizable metric space of pretrained Vision Transformers (ViTs), A-ROM enables the rapid modeling of novel medical concepts without the computational burden or opacity of further gradient-based fine-tuning. We replace traditional, opaque decision layers with a human-readable concept dictionary and a k-Nearest Neighbors (kNN) classifier to ensure the model’s logic remains interpretable. Experiments on the MedMNIST v2 suite demonstrate that A-ROM delivers performance competitive with standard benchmarks while providing a simple and scalable, “few-shot” solution that meets the rigorous transparency demands of modern clinical environments.

[272] Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

Katarzyna Zaleska, Łukasz Popek, Monika Wysoczańska, Kamil Deja

Main category: cs.CV

TL;DR: A method to localize and modify implicit decision-making in text-to-image diffusion models by identifying key self-attention layers for precise steering interventions.

Motivation: Text-to-image diffusion models make implicit decisions when prompts are ambiguous, but current localization techniques focus on explicit conditioning rather than these implicit choices. The authors want to understand where in the architecture these implicit decisions are made.

Method: Introduces a probing-based localization technique to identify layers with highest attribute separability for concepts. Finds that ambiguous concept resolution is governed by self-attention layers, then proposes ICM (Implicit Choice-Modification) - a precise steering method applying targeted interventions to a small subset of identified layers.
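The probing step ranks layers by how separable two resolutions of an ambiguous attribute are in that layer's activations. A simple Fisher-style separability score as an illustrative stand-in for the paper's probe:

```python
import numpy as np

def attribute_separability(acts_a, acts_b, eps=1e-8):
    """Ratio of between-class to within-class spread of layer activations
    for two attribute values (e.g. two implicit resolutions of the same
    ambiguous concept). Computed per layer; the layers with the highest
    scores are the candidate intervention points."""
    mu_a, mu_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
    between = float(np.linalg.norm(mu_a - mu_b) ** 2)
    within = float(acts_a.var(axis=0).sum() + acts_b.var(axis=0).sum())
    return between / (within + eps)
```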

Result: Extensive experiments show that intervening on specific self-attention layers yields superior debiasing performance compared to state-of-the-art methods, minimizing artifacts common to less precise approaches.

Conclusion: Implicit decision-making in diffusion models is computationally localized, particularly in self-attention layers, enabling precise steering interventions for better control over model outputs.

Abstract: Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model’s architecture. While existing localization techniques focus on prompt-related interventions, we notice that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification) - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches. The code is available at https://github.com/kzaleskaa/icm.

[273] Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Juekai Lin, Yun Zhu, Honglin Lin, Sijing Li, Tianwei Lin, Zheng Liu, Xiaoyang Wang, Wenqiao Zhang, Lijun Wu

Main category: cs.CV

TL;DR: SciTikZer introduces a framework for synthesizing TikZ graphics code from scientific schematics, addressing data quality and evaluation gaps with a large-scale dataset and benchmark, achieving SOTA performance.

Motivation: Graphics Program Synthesis is crucial for reverse-engineering static visuals into editable TikZ code, but existing approaches face challenges due to data quality issues (lack of executability and visual alignment) and evaluation gaps (no benchmarks for structural and visual fidelity).

Method: Developed a closed-loop framework with: 1) SciTikZ-230K dataset from Execution-Centric Data Engine covering 11 scientific disciplines, 2) SciTikZ-Bench benchmark for evaluating visual fidelity and structural logic, and 3) Dual Self-Consistency Reinforcement Learning with Round-Trip Verification to penalize degenerate code.
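The round-trip reward can be combined as a gated sum: code that fails to execute earns nothing, which is what penalizes degenerate outputs. The weights and term names below are hypothetical:

```python
def round_trip_reward(compiled, visual_sim, consistency_sim, w=(0.2, 0.5, 0.3)):
    """Composite RL reward sketch: an executability gate, a visual-fidelity
    term (rendered TikZ output vs. target image), and a round-trip
    self-consistency term. Degenerate, non-compiling code scores 0."""
    if not compiled:
        return 0.0
    return w[0] + w[1] * visual_sim + w[2] * consistency_sim
```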

Result: SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary models like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

Conclusion: The framework successfully addresses data quality and evaluation gaps in graphics program synthesis for TikZ, enabling high-quality reverse-engineering of scientific schematics into editable code through a comprehensive dataset, benchmark, and optimization paradigm.

Abstract: Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

[274] Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

Athanasios Angelakis, Marta Gomez-Barrero

Main category: cs.CV

TL;DR: ZACH-ViT’s robustness evaluation shows it maintains performance advantages under common image corruptions and remains competitive under adversarial attacks in low-data medical imaging settings.

Motivation: To extend the original ZACH-ViT study by evaluating its robustness under common image corruptions and adversarial perturbations in low-data medical imaging settings, since the foundational study only examined clean performance.

Method: Evaluated ZACH-ViT’s behavior under common image corruptions and adversarial perturbations (FGSM and PGD) on seven MedMNIST datasets using 50 samples per class, comparing against three compact baselines (ABMIL, Minimal-ViT, TransMIL) with fixed hyperparameters and five random seeds.
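FGSM and PGD are standard attacks; their update rules fit in a few lines (here `grad_fn` stands in for backprop of the model's loss with respect to the input):

```python
import numpy as np

def fgsm(x, grad_x, eps):
    """Fast Gradient Sign Method: one signed step of size eps along the
    loss gradient, clipped to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

def pgd(x0, grad_fn, eps, alpha, steps):
    """Projected Gradient Descent: iterated signed steps of size alpha,
    projected back into the L-inf ball of radius eps around the clean input."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)   # project onto the eps-ball
        x = np.clip(x, 0.0, 1.0)             # stay a valid image
    return x
```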

Result: ZACH-ViT achieved best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating good balance between baseline performance and robustness to realistic degradation. Under adversarial stress, it ranked first under FGSM (2.00) and second under PGD (2.29), though all models deteriorated substantially.

Conclusion: Compact permutation-invariant transformers like ZACH-ViT maintain advantages under realistic perturbation stress in low-data medical imaging, though adversarial robustness remains challenging for all evaluated models.

Abstract: The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.

[275] SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

Hiba Dahmani, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond

Main category: cs.CV

TL;DR: A 3D generative framework using Σ-Voxfield grid representation for scalable, multiview-consistent outdoor driving scene generation with semantic-conditioned diffusion and progressive spatial outpainting.

Motivation: Existing 3D scene generation methods either lack geometric coherence across views (image/video-based approaches) or are limited to small-scale scenes. There's a need for scalable generation of large outdoor driving scenes that maintain 3D consistency across multiple viewpoints.

Method: Proposes Σ-Voxfield grid representation where each occupied voxel stores colorized surface samples. Uses semantic-conditioned diffusion model operating on local voxel neighborhoods with 3D positional encodings. Scales via progressive spatial outpainting over overlapping regions, with deferred rendering for photorealistic images.
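The grid representation, occupied voxels holding a fixed budget of colorized surface samples, can be sketched with a dictionary keyed by voxel coordinates. This illustrates the data structure only, not the authors' implementation:

```python
import numpy as np

def build_voxfield(points, colors, voxel_size, samples_per_voxel=8):
    """Quantize colorized surface points into a sparse voxel grid; each
    occupied voxel keeps at most a fixed number of (position, color) samples."""
    grid = {}
    keys = np.floor(np.asarray(points) / voxel_size).astype(int)
    for key, p, c in zip(map(tuple, keys), points, colors):
        grid.setdefault(key, [])
        if len(grid[key]) < samples_per_voxel:
            grid[key].append((p, c))
    return grid
```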

Result: Can generate diverse large-scale urban outdoor scenes renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

Conclusion: The framework enables large-scale multiview-consistent 3D scene generation without per-scene optimization, addressing scalability and geometric coherence limitations of previous methods.

Abstract: Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on $Σ$-Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $Σ$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach can generate diverse large-scale urban outdoor scenes, renderable into photorealistic images with various sensor configurations and camera trajectories while maintaining moderate computation cost compared to existing approaches.

[276] Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Hao Chen, Fang Qiu, Fangchao Dong, Defei Yang, Eve Bohnett, Li An

Main category: cs.CV

TL;DR: Lightweight multimodal adaptation framework bridges RGB-pretrained VLMs to thermal infrared imagery using projector alignment, enabling species recognition, enumeration, and habitat-context interpretation from drone-collected thermal data.

Motivation: To address the representation gap between RGB-pretrained Vision-Language Models (VLMs) and thermal infrared imagery, enabling ecological monitoring applications using drone-collected thermal data without requiring full retraining of large models.

Method: Developed thermal dataset from drone imagery, used lightweight multimodal projector alignment to adapt RGB-pretrained VLMs (InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct) to thermal inputs, benchmarked under closed-set and open-set prompting for species recognition and enumeration.
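Projector alignment trains only a small bridge between the frozen vision encoder and the frozen LLM. A minimal sketch of such a bridge; the two-layer MLP shape, hidden size, and class name are assumptions about the usual design, not the paper's exact module:

```python
import numpy as np

class ThermalProjector:
    """Small MLP mapping thermal-image features into the LLM's token
    embedding space; the only component updated during adaptation,
    while the vision encoder and LLM weights stay frozen."""
    def __init__(self, d_vis, d_llm, d_hid=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((d_vis, d_hid)) * 0.02
        self.W2 = rng.standard_normal((d_hid, d_llm)) * 0.02

    def __call__(self, feats):
        h = np.maximum(feats @ self.W1, 0.0)   # ReLU stand-in for the usual GELU
        return h @ self.W2                     # (n_patches, d_llm) soft tokens
```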

Result: Qwen3-VL-8B-Instruct with open-set prompting achieved best performance: F1 scores of 0.935 (deer), 0.915 (rhino), 0.968 (elephant); within-1 enumeration accuracies of 0.779, 0.982, 1.000. Combined thermal+RGB enabled habitat-context information generation.

Conclusion: Lightweight projector-based adaptation effectively transfers RGB-pretrained VLMs to thermal drone imagery, expanding utility from object recognition to habitat-context interpretation in ecological monitoring applications.

Abstract: This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

[277] PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

David Picard, Nicolas Dufour, Lucas Degeorge, Arijit Ghosh, Davide Allegro, Tom Ravaud, Yohann Perron, Corentin Sautier, Zeynep Sonat Baltaci, Fei Meng, Syrine Kalleli, Marta López-Rauhut, Thibaut Loiseau, Ségolène Albouy, Raphael Baena, Elliot Vincent, Loic Landrieu

Main category: cs.CV

TL;DR: PoM is a novel token mixing mechanism with linear complexity that replaces self-attention in transformers, achieving comparable performance across multiple domains while reducing computational costs for long sequences.

Motivation: Self-attention in transformers has quadratic complexity, making it computationally expensive for long sequences. The authors aim to develop a more efficient alternative that maintains the universal approximation properties of transformers.

Method: PoM aggregates input tokens into a compact representation using a learned polynomial function, from which each token retrieves contextual information. It has linear complexity and serves as a drop-in replacement for self-attention while preserving the contextual mapping property.
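One way to realize "aggregate, then retrieve" in linear time is to pool low-order token statistics and let every token read from the pooled summary. The sketch below is our illustration of that scheme, not PoM's actual learned polynomial (see the paper and repository for the real parameterization):

```python
import numpy as np

def poly_mix_sketch(X, W1, W2):
    """Token mixing via pooled polynomial statistics: cost is O(n d^2) for
    n tokens of dimension d, i.e. linear in sequence length, unlike the
    O(n^2 d) cost of full self-attention."""
    n, _ = X.shape
    m1 = X.mean(axis=0)      # first-order summary, shape (d,)
    m2 = (X.T @ X) / n       # second-order summary, shape (d, d)
    # each token retrieves context from the compact summaries
    return X + X @ W1 @ m2 + m1 @ W2
```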

Result: PoM matches the performance of attention-based models across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation, while drastically reducing computational cost for long sequences.

Conclusion: PoM is an effective and efficient alternative to self-attention that maintains transformer capabilities while offering linear complexity, making it suitable for applications with long sequences.

Abstract: This paper introduces the Polynomial Mixer (PoM), a novel token mixing mechanism with linear complexity that serves as a drop-in replacement for self-attention. PoM aggregates input tokens into a compact representation through a learned polynomial function, from which each token retrieves contextual information. We prove that PoM satisfies the contextual mapping property, ensuring that transformers equipped with PoM remain universal sequence-to-sequence approximators. We replace standard self-attention with PoM across five diverse domains: text generation, handwritten text recognition, image generation, 3D modeling, and Earth observation. PoM matches the performance of attention-based models while drastically reducing computational cost when working with long sequences. The code is available at https://github.com/davidpicard/pom.
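To build intuition for why such mixing is linear in sequence length, here is a toy polynomial mixer in plain Python; the power-sum aggregation and inner-product retrieval are illustrative assumptions, not the paper's exact learned parameterization:

```python
def poly_mix(tokens, degree=2):
    """Toy linear-time token mixer: aggregate elementwise power sums of all
    tokens into a compact (degree x d) summary, then let each token retrieve
    context via inner products with each summary order."""
    d = len(tokens[0])
    # One pass over the sequence: O(n * degree * d)
    summary = [[0.0] * d for _ in range(degree)]
    for t in tokens:
        for p in range(degree):
            for j in range(d):
                summary[p][j] += t[j] ** (p + 1)
    # Each token reads from the fixed-size summary: also O(n * degree * d)
    return [[sum(t[j] * summary[p][j] for j in range(d)) for p in range(degree)]
            for t in tokens]
```

Because the summary has a fixed size regardless of sequence length n, total cost grows as O(n), unlike the O(n^2) pairwise score matrix of self-attention.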

[278] The Character Error Vector: Decomposable errors for page-level OCR evaluation

Jonathan Bourne, Mwiza Simbeye, Joseph Nockels

Main category: cs.CV

TL;DR: The paper introduces Character Error Vector (CEV), a bag-of-characters evaluator for OCR that decomposes errors into parsing, OCR, and interaction components, addressing limitations of CER when page-parsing errors occur.

DetailsMotivation: Character Error Rate (CER) assumes perfect text parsing, but page-parsing errors make CER undefined, limiting its use for evaluating page-level OCR quality, especially with data lacking consistent labeling schemas.

Method: Proposes CEV as a bag-of-characters evaluator that can be decomposed into parsing, OCR, and interaction error components. Implements two methods: SpACER (Spatially Aware Character Error Rate) and a character-distribution method using the Jensen-Shannon distance.

Result: CEV bridges parsing metrics and local metrics like CER, validates relationships with CER and parse quality, and shows traditional pipeline approaches outperform state-of-the-art end-to-end models on degraded archival newspaper images. Thresholding can predict main error source with F1 of 0.91.

Conclusion: CEV is a valuable metric for document understanding research that addresses CER’s limitations with parsing errors, provides error decomposition for targeted improvements, and is available as a Python library.

Abstract: The Character Error Rate (CER) is a key metric for evaluating the quality of Optical Character Recognition (OCR). However, this metric assumes that text has been perfectly parsed, which is often not the case. Under page-parsing errors, CER becomes undefined, limiting its use as a metric and making evaluating page-level OCR challenging, particularly when using data that do not share a labelling schema. We introduce the Character Error Vector (CEV), a bag-of-characters evaluator for OCR. The CEV can be decomposed into parsing, OCR, and interaction error components. This decomposability allows practitioners to focus on the part of the document understanding pipeline that will have the greatest impact on overall text extraction quality. The CEV can be implemented using a variety of methods, of which we demonstrate SpACER (Spatially Aware Character Error Rate) and a character-distribution method using the Jensen-Shannon distance. We validate the CEV’s performance against other metrics: first, the relationship with CER; then, parse quality; and finally, as a direct measure of page-level OCR quality. The validation process shows that the CEV is a valuable bridge between parsing metrics and local metrics like CER. We analyse a dataset of archival newspapers made of degraded images with complex layouts and find that state-of-the-art end-to-end models are outperformed by more traditional pipeline approaches. Whilst the CEV requires character-level positioning for optimal triage, thresholding on easily available values can predict the main error source with an F1 of 0.91. We provide the CEV as part of a Python library to support document understanding research.
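The character-distribution variant can be sketched directly: build bag-of-characters frequency distributions for the reference and the OCR output, then compare them with the Jensen-Shannon distance. A minimal stdlib-only sketch (function names are illustrative, not the library's API):

```python
import math
from collections import Counter

def char_distribution(text):
    """Bag-of-characters relative frequency distribution."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs), in [0, 1]."""
    chars = set(p) | set(q)
    m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in chars}
    def kl(a):
        return sum(a.get(c, 0.0) * math.log2(a.get(c, 0.0) / m[c])
                   for c in chars if a.get(c, 0.0) > 0)
    return math.sqrt(0.5 * kl(p) + 0.5 * kl(q))
```

Identical texts give a distance of 0 and fully disjoint character sets give 1, independent of reading order, which is what makes a bag-of-characters measure well defined even under parsing errors.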

[279] DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models

Zhengming Yu, Li Ma, Mingming He, Leo Isikdogan, Yuancheng Xu, Dmitriy Smirnov, Pablo Salamanca, Dao Mi, Pablo Delgado, Ning Yu, Julien Philip, Xin Li, Wenping Wang, Paul Debevec

Main category: cs.CV

TL;DR: DiffHDR: A framework for LDR-to-HDR video conversion using latent video diffusion models to inpaint plausible radiance in over-/underexposed regions, enabling controllable conversion via text or reference images.

DetailsMotivation: Most digital videos are stored in 8-bit LDR formats, losing HDR scene radiance through saturation and quantization, which limits accurate luminance mapping to HDR displays and meaningful re-exposure in post-production. Existing LDR-to-HDR conversion techniques struggle to restore realistic detail in over- and underexposed regions.

Method: Formulates LDR-to-HDR conversion as generative radiance inpainting in latent space of video diffusion model. Operates in Log-Gamma color space to leverage spatio-temporal generative priors from pretrained video diffusion model. Enables controllable conversion via text prompts or reference images. Develops pipeline to synthesize HDR video training data from static HDRI maps to address data scarcity.

Result: Extensive experiments show DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.

Conclusion: DiffHDR effectively addresses LDR-to-HDR video conversion by leveraging video diffusion models for generative radiance inpainting, overcoming limitations of existing methods and enabling high-quality, controllable HDR video restoration.

Abstract: Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- and underexposed regions. To address this, we present DiffHDR, a framework that formulates LDR-to-HDR conversion as a generative radiance inpainting task within the latent space of a video diffusion model. By operating in Log-Gamma color space, DiffHDR leverages spatio-temporal generative priors from a pretrained video diffusion model to synthesize plausible HDR radiance in over- and underexposed regions while recovering the continuous scene radiance of the quantized pixels. Our framework further enables controllable LDR-to-HDR video conversion guided by text prompts or reference images. To address the scarcity of paired HDR video data, we develop a pipeline that synthesizes high-quality HDR video training data from static HDRI maps. Extensive experiments demonstrate that DiffHDR significantly outperforms state-of-the-art approaches in radiance fidelity and temporal stability, producing realistic HDR videos with considerable latitude for re-exposure.
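The information loss that motivates the inpainting formulation is easy to see in a toy encoder: once linear radiance is gamma-encoded, quantized to 8 bits, and clipped, the clipped regions retain no radiance detail for re-exposure. A simplified stdlib-only sketch (the paper itself operates on video in a Log-Gamma space, not per-pixel like this):

```python
def encode_ldr(radiance, exposure=1.0, gamma=2.2):
    """Toy LDR encoding: exposure scaling, gamma encoding, then 8-bit
    quantization with clipping to [0, 255]."""
    out = []
    for r in radiance:
        v = (r * exposure) ** (1.0 / gamma)   # gamma-encode scaled linear radiance
        out.append(min(255, max(0, round(v * 255))))
    return out

def saturated_mask(codes, lo=0, hi=255):
    """Pixels clipped to the ends of the 8-bit range carry no recoverable
    radiance; these are the regions a generative model must inpaint."""
    return [c <= lo or c >= hi for c in codes]
```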

[280] HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach

Main category: cs.CV

TL;DR: HaloProbe: A Bayesian framework for detecting and mitigating object hallucinations in vision-language models by factorizing external description statistics and internal decoding signals.

DetailsMotivation: Current methods for detecting object hallucinations in vision-language models rely on attention weights, but these are unreliable due to confounders like token position and object repetition, leading to Simpson's paradox where attention trends reverse when aggregated.

Method: HaloProbe uses a Bayesian framework to factorize external description statistics and internal decoding signals to estimate token-level hallucination probabilities. It employs balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. Used as an external scoring signal for non-invasive mitigation.

Result: HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility and fluency.

Conclusion: HaloProbe provides a more reliable approach to hallucination detection and mitigation in vision-language models by addressing statistical confounders in attention-based methods, enabling non-invasive mitigation that preserves model utility.

Abstract: Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model’s attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson’s paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models’ internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.
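The factorization can be illustrated with a plain Bayes-rule sketch: internal decoding signals supply the likelihood and external description statistics supply the prior. This generic two-hypothesis form is an assumption for illustration, not HaloProbe's actual estimator:

```python
def hallucination_posterior(likelihood_internal_h, likelihood_internal_ok, prior_h):
    """Posterior P(hallucination | internal evidence) via Bayes' rule,
    with the prior supplied by external (description-level) features."""
    num = likelihood_internal_h * prior_h
    den = num + likelihood_internal_ok * (1.0 - prior_h)
    return num / den
```

When the internal evidence is uninformative (equal likelihoods under both hypotheses), the posterior reduces to the external prior; when it is sharp, it dominates the prior. Balancing these two sources is the point of the factorization.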

[281] Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

Main category: cs.CV

TL;DR: World Action Model using pixel-grounded action images for robot policy learning via multiview video generation, enabling zero-shot control without separate action modules.

DetailsMotivation: Existing world action models use separate action modules or non-pixel-grounded representations, limiting transfer across viewpoints and environments and failing to fully exploit pretrained video model knowledge.

Method: Translate 7-DoF robot actions into interpretable action images (multiview action videos grounded in 2D pixels) that explicitly track robot-arm motion, using video backbone as zero-shot policy without separate action modules.

Result: Achieves strongest zero-shot success rates on RLBench and real-world evaluations, improves video-action joint generation quality over prior video-space world models.

Conclusion: Interpretable action images are a promising route to policy learning, enabling unified world action models that support multiple tasks under shared representation.

Abstract: World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
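Grounding actions in 2D pixels amounts to projecting the end-effector trajectory through each camera; a minimal pinhole-projection sketch (the intrinsics and the notion of rendering these tracks into images are illustrative assumptions, not the paper's pipeline):

```python
def project_point(point_3d, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)

def action_track(trajectory, cameras):
    """Project an end-effector trajectory into a per-view pixel track,
    one track per camera (fx, fy, cx, cy)."""
    return [[project_point(p, *cam) for p in trajectory] for cam in cameras]
```

Rendering such per-view tracks as video frames yields a pixel-grounded action representation that a video backbone can generate directly, rather than predicting low-dimensional action tokens.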

[282] Controllable Image Generation with Composed Parallel Token Prediction

Jamie Stirling, Noura Al-Moubayed, Chris G. Willcocks, Hubert P. H. Shum

Main category: cs.CV

Summary unavailable: the arXiv API request for 2405.06535 returned HTTP 429 (rate limited).

[283] The DeepSpeak Dataset

Sarah Barrington, Maty Bohacek, Hany Farid

Main category: cs.CV

Summary unavailable: the arXiv API request for 2408.05366 returned HTTP 429 (rate limited).

[284] ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization

Fanrui Zhang, Jiawei Liu, Jiaying Zhu, Esther Sun, Dong Li, Qiang Zhang, Zheng-Jun Zha

Main category: cs.CV

Summary unavailable: the arXiv API request for 2410.10238 returned HTTP 429 (rate limited).

[285] Why CNN Features Are not Gaussian: A Statistical Anatomy of Deep Representations

David Chapman, Parniyan Farvardin

Main category: cs.CV

Summary unavailable: the arXiv API request for 2411.05183 returned HTTP 429 (rate limited).

[286] Aligned Vector Quantization for Edge-Cloud Collabrative Vision-Language Models

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan

Main category: cs.CV

Summary unavailable: the arXiv API request for 2411.05961 returned HTTP 429 (rate limited).

[287] Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Yang Chen, Jingcai Guo

Main category: cs.CV

Summary unavailable: the arXiv API request for 2501.09411 returned HTTP 429 (rate limited).

[288] ENTER: Event Based Interpretable Reasoning for VideoQA

Hammad Ayyubi, Junzhang Liu, Ali Asgarov, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Zhecan Wang, Chia-Wei Tang, Hani Alomari, Md. Atabuzzaman, Xudong Lin, Naveen Reddy Dyava, Shih-Fu Chang, Chris Thomas

Main category: cs.CV

Summary unavailable: the arXiv API request for 2501.14194 returned HTTP 429 (rate limited).

[289] Toward Generalizable Forgery Detection and Reasoning

Yueying Gao, Dongliang Chang, Bingyao Yu, Haotian Qin, Muxi Diao, Lei Chen, Kongming Liang, Zhanyu Ma

Main category: cs.CV

Summary unavailable: the arXiv API request for 2503.21210 returned HTTP 429 (rate limited).

[290] ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient

Jason Chui, Hector Andrade-Loarca, Daniel Cremers

Main category: cs.CV

Summary unavailable: the arXiv API request for 2505.20858 returned HTTP 429 (rate limited).

[291] MAMMA: Markerless & Automatic Multi-Person Motion Action Capture

Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Eni Halilaj, Michael J. Black

Main category: cs.CV

Summary unavailable: the arXiv API request for 2506.13040 returned HTTP 429 (rate limited).

[292] Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching

Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran

Main category: cs.CV

Summary unavailable: the arXiv API request for 2507.22418 returned HTTP 429 (rate limited).

[293] Time-reversed Flow Matching with Worst Transport in High-dimensional Latent Space for Image Anomaly Detection

Liangwei Li, Lin Liu, Hanzhe Liang, Juanxiu Liu, Jing Zhang, Ruqian Hao, Xiaohui Du, Yong Liu, Pan Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.05461 returned HTTP 429 (rate limited).

[294] MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain, Alexandros Stergiou

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.07833 returned HTTP 429 (rate limited).

[295] DSER: Spectral Epipolar Representation for Efficient Light Field Depth Estimation

Noor Islam S. Mohammad, Md Muntaqim Meherab

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.08900 returned HTTP 429 (rate limited).

[296] PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training

Yin Xie, Zhichao Chen, Zeyu Xiao, Yongle Zhao, Xiang An, Kaicheng Yang, Zimin Ran, Jia Guo, Ziyong Feng, Jiankang Deng

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.09691 returned HTTP 429 (rate limited).

[297] Matrix-game 2.0: An open-source real-time and streaming interactive world model

Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Size Wu, Wei Li, Xuchen Song, Yang Liu, Yangguang Li, Yahui Zhou

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.13009 returned HTTP 429 (rate limited).

[298] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation

Francisco Caetano, Christiaan Viviers, Peter H.N. De With, Fons van der Sommen

Main category: cs.CV

Summary unavailable: the arXiv API request for 2508.21435 returned HTTP 429 (rate limited).

[299] Dual-Thresholded Heatmap-Guided Proposal Clustering and Negative Certainty Supervision with Enhanced Base Network for Weakly Supervised Object Detection

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.08289 returned HTTP 429 (rate limited).

[300] Unrolling Graph-based Douglas-Rachford Algorithm for Image Interpolation with Informed Initialization

Xue Zhang, Bingshuo Hu, Gene Cheung

Main category: cs.CV

Summary unavailable: the arXiv API request for 2509.11926 returned HTTP 429 (rate limited).

[301] A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.00978 returned HTTP 429 (rate limited).

[302] Image Diffusion Preview with Consistency Solver

Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao

Main category: cs.CV

Summary unavailable: the arXiv API request for 2512.13592 returned HTTP 429 (rate limited).

[303] MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.07310 returned HTTP 429 (rate limited).

[304] Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition from Video

Huimin Liu, Jing Gao, Daria Baran, AxelX Montout, Neill W Campbell, Andrew W Dowsey

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.09203 returned HTTP 429 (rate limited).

[305] Online In-Context Distillation for Low-Resource Vision Language Models

Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari

Main category: cs.CV

Summary unavailable: the arXiv API request for 2510.18117 returned HTTP 429 (rate limited).

[306] NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

Main category: cs.CV

TL;DR: Summary unavailable for 2510.21122: the arXiv API request was rate-limited (HTTP 429).

[307] Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

Kaiyuan Deng, Gen Li, Yang Xiao, Bo Hui, Xiaolong Ma

Main category: cs.CV

TL;DR: Summary unavailable for 2601.06162: the arXiv API request was rate-limited (HTTP 429).

[308] From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Mengfei Liang, Yiting Qu, Yukun Jiang, Michael Backes, Yang Zhang

Main category: cs.CV

TL;DR: Summary unavailable for 2511.00181: the arXiv API request was rate-limited (HTTP 429).

[309] Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin, Jingjing Zhao, Chenxin Li, Yuchen Lin, Haopeng Li, Honglei Yan, Kairun Wen, Yunlong Lin, Yixuan Yuan, Yadong Mu

Main category: cs.CV

TL;DR: Summary unavailable for 2511.00503: the arXiv API request was rate-limited (HTTP 429).

[310] SiLVi: Simple Interface for Labeling Video Interactions

Ozan Kanbertay, Richard Vogg, Elif Karakoc, Peter M. Kappeler, Claudia Fichtel, Alexander S. Ecker

Main category: cs.CV

TL;DR: Summary unavailable for 2511.03819: the arXiv API request was rate-limited (HTTP 429).

[311] FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR

Yueru He, Xueqing Peng, Yupeng Cao, Yan Wang, Lingfei Qian, Haohang Li, Yi Han, Shuyao Wang, Ruoyu Xiang, Fan Zhang, Zhuohan Xie, Mingquan Lin, Prayag Tiwari, Jimin Huang, Guojun Xiong, Sophia Ananiadou

Main category: cs.CV

TL;DR: Summary unavailable for 2511.14998: the arXiv API request was rate-limited (HTTP 429).

[312] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation

Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine

Main category: cs.CV

TL;DR: Summary unavailable for 2511.17146: the arXiv API request was rate-limited (HTTP 429).

[313] TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Alexandros Stergiou

Main category: cs.CV

TL;DR: Summary unavailable for 2511.18359: the arXiv API request was rate-limited (HTTP 429).

[314] MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes

Kehua Chen, Tianlu Mao, Xinzhu Ma, Hao Jiang, Zehao Li, Zihan Liu, Shuqin Gao, Honglong Zhao, Feng Dai, Yucheng Zhang, Zhaoqi Wang

Main category: cs.CV

TL;DR: Summary unavailable for 2511.19172: the arXiv API request was rate-limited (HTTP 429).

[315] Tokenizing Buildings: A Transformer for Layout Synthesis

Manuel Ladron de Guevara, Jinmo Rhee, Ardavan Bidgoli, Vaidas Razgaitis, Michael Bergin

Main category: cs.CV

TL;DR: Summary unavailable for 2512.04832: the arXiv API request was rate-limited (HTTP 429).

[316] BulletGen: Improving 4D Reconstruction with Bullet-Time Generation

Denis Rozumny, Jonathon Luiten, Numair Khan, Johannes Schönberger, Peter Kontschieder

Main category: cs.CV

TL;DR: Summary unavailable for 2506.18601: the arXiv API request was rate-limited (HTTP 429).

[317] GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, Li Jiang

Main category: cs.CV

TL;DR: Summary unavailable for 2512.16811: the arXiv API request was rate-limited (HTTP 429).

[318] SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khac, Ankit Singh, Ngoc Dung Huynh, Wamiq Reyaz Para, Hilde Kuehne, Hakim Hacid

Main category: cs.CV

TL;DR: Summary unavailable for 2512.20157: the arXiv API request was rate-limited (HTTP 429).

[319] MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization

Zhexin Zhang, Yangyang Xu, Yifeng Zhu, Long Chen, Yong Du, Shengfeng He, Jun Yu

Main category: cs.CV

TL;DR: Summary unavailable for 2601.01955: the arXiv API request was rate-limited (HTTP 429).

[320] IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

Yankai Jiang, Qiaoru Li, Binlu Xu, Haoran Sun, Chao Ding, Junting Dong, Yuxiang Cai, Xuhong Zhang, Jianwei Yin

Main category: cs.CV

TL;DR: Summary unavailable for 2601.03054: the arXiv API request was rate-limited (HTTP 429).

[321] I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Xiang Bai, Bowen Zhou, Zhiyuan Ma

Main category: cs.CV

TL;DR: Summary unavailable for 2601.03741: the arXiv API request was rate-limited (HTTP 429).

[322] RenderFlow: Single-Step Neural Rendering via Flow Matching

Shenghao Zhang, Runtao Liu, Christopher Schroers, Yang Zhang

Main category: cs.CV

TL;DR: Summary unavailable for 2601.06928: the arXiv API request was rate-limited (HTTP 429).

[323] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology

Hyun Do Jung, Jungwon Choi, Hwiyoung Kim

Main category: cs.CV

TL;DR: Summary unavailable for 2601.10073: the arXiv API request was rate-limited (HTTP 429).

[324] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting

Lebin Zhou, Jingchuan Xiao, Zhendong Wang, Jinhao Wang, Rongduo Han, Nam Ling, Cihan Ruan

Main category: cs.CV

TL;DR: Summary unavailable for 2601.10075: the arXiv API request was rate-limited (HTTP 429).

[325] MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning

Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave

Main category: cs.CV

TL;DR: Summary unavailable for 2601.10649: the arXiv API request was rate-limited (HTTP 429).

[326] FeedbackSTS-Det: Sparse Frames-Based Spatio-Temporal Semantic Feedback Network for Moving Infrared Small Target Detection

Yian Huang, Qing Qin, Aji Mao, Xiangyu Qiu, Liang Xu, Xian Zhang, Zhenming Peng

Main category: cs.CV

TL;DR: Summary unavailable for 2601.14690: the arXiv API request was rate-limited (HTTP 429).

[327] Why Can’t I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi

Main category: cs.CV

TL;DR: Summary unavailable for 2601.16211: the arXiv API request was rate-limited (HTTP 429).

[328] PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction

Isaac Deutsch, Nicolas Moënne-Loccoz, Gavriel State, Zan Gojcic

Main category: cs.CV

TL;DR: Summary unavailable for 2601.18336: the arXiv API request was rate-limited (HTTP 429).

[329] Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model

Naeem Paeedeh, Mahardhika Pratama, Ary Shiddiqi, Zehong Cao, Mukesh Prasad, Wisnu Jatmiko

Main category: cs.CV

TL;DR: Summary unavailable for 2601.22581: the arXiv API request was rate-limited (HTTP 429).

[330] R3G: A Reasoning–Retrieval–Reranking Framework for Vision-Centric Answer Generation

Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang

Main category: cs.CV

TL;DR: Summary unavailable for 2602.00104: the arXiv API request was rate-limited (HTTP 429).

[331] OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

Minghao Han, Dingkang Yang, Yue Jiang, Yizhou Liu, Lihua Zhang

Main category: cs.CV

TL;DR: Summary unavailable for 2602.07064: the arXiv API request was rate-limited (HTTP 429).

[332] Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

Zesheng Jia, Jin Wang, Siao Liu, Lingzhi Li, Ziyao Huang, Yunjiang Xu, Jianping Wang

Main category: cs.CV

TL;DR: Summary unavailable for 2602.11565: the arXiv API request was rate-limited (HTTP 429).

[333] UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Cao Thien Tan, Phan Thi Thu Trang, Do Nghiem Duc, Ho Ngoc Anh, Hanyang Zhuang, Nguyen Duc Dung

Main category: cs.CV

TL;DR: Summary unavailable for 2603.11680: the arXiv API request was rate-limited (HTTP 429).

[334] Attentive Dilated Convolution for Automatic Sleep Staging using Force-directed Layout

Md Jobayer, Md Mehedi Hasan Shawon, Tasfin Mahmud, Md. Borhan Uddin Antor, Arshad M. Chowdhury

Main category: cs.CV

TL;DR: Summary unavailable for 2409.01962: the arXiv API request was rate-limited (HTTP 429).

[335] InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model

InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xiaojun Xiang, Xiaoyu Zhang, Xianbin Liu, Yifu Wang, Yipeng Chen, Zhewen Le, Zhichao Ye, Ziqiang Zhao

Main category: cs.CV

TL;DR: Summary unavailable for 2603.11911: the arXiv API request was rate-limited (HTTP 429).

[336] Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

Hiba Adil Al-kharsan, Róbert Rajkó

Main category: cs.CV

TL;DR: Summary unavailable for 2603.13182: the arXiv API request was rate-limited (HTTP 429).

[337] Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering

Jonathan Embley-Riches, Jianwei Liu, Simon Julier, Dimitrios Kanoulas

Main category: cs.CV

TL;DR: Summary unavailable for 2504.14135: the arXiv API request was rate-limited (HTTP 429).

[338] Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

Rui Hong, Jana Kosecka

Main category: cs.CV

TL;DR: Summary unavailable for 2603.17396: the arXiv API request was rate-limited (HTTP 429).

[339] DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

Main category: cs.CV

TL;DR: Summary unavailable for 2603.28251: the arXiv API request was rate-limited (HTTP 429).

[340] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao, Zhao Wang, Chenyang Si, Yan Lyu, Yuanyi Duan, Fang Zhao, Caifeng Shan

Main category: cs.CV

TL;DR: Summary unavailable for 2603.18093: the arXiv API request was rate-limited (HTTP 429).

[341] Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning

Kavindu Herath, Joshua Zhao, Saurabh Bagchi

Main category: cs.CV

TL;DR: Summary unavailable for 2603.29328: the arXiv API request was rate-limited (HTTP 429).

[342] Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation

Anupam Pani, Yanchao Yang

Main category: cs.CV

TL;DR: Summary unavailable for 2603.23202: the arXiv API request was rate-limited (HTTP 429).

[343] DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu

Main category: cs.CV

TL;DR: Summary unavailable for 2604.00813: the arXiv API request was rate-limited (HTTP 429).

[344] MoCHA: Denoising Caption Supervision for Motion-Text Retrieval

Nikolai Warner, Cameron Ethan Taylor, Irfan Essa, Apaar Sadhwani

Main category: cs.CV

TL;DR: Summary unavailable for 2603.23684: the arXiv API request was rate-limited (HTTP 429).

[345] VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

Zhe Gao, Shiyu Shen, Taifeng Chai, Weinong Wang, Haotian Xu, Xing W, Wenbin Li, Qi Fan, Yang Gao, Dacheng Tao

Main category: cs.CV

TL;DR: Summary unavailable for 2603.25021: the arXiv API request was rate-limited (HTTP 429).

[346] Automatic Image-Level Morphological Trait Annotation for Organismal Images

Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.01619 returned HTTP 429 (rate limited).

[347] SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.26481 returned HTTP 429 (rate limited).

[348] TrajectoryMover: Generative Movement of Object Trajectories in Videos

Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.29092 returned HTTP 429 (rate limited).

[349] Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

Adrienne Deganutti, Elad Hirsch, Haonan Zhu, Jaejung Seol, Purvanshi Mehta

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04192 returned HTTP 429 (rate limited).

[350] PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Weifu Fu, Jinyang Li, Bin-Bin Gao, Jialin Li, Yuhuan Lin, Hanqiu Deng, Wenbing Tao, Yong Liu, Chengjie Wang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.00503 returned HTTP 429 (rate limited).

[351] A global dataset of continuous urban dashcam driving

Md Shadab Alam, Olena Bazilinska, Pavlo Bazilinskyy

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.01044 returned HTTP 429 (rate limited).

[352] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.02320 returned HTTP 429 (rate limited).

[353] PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis

Inseong Choi, Siwoo Lee, Seung-Hun Nam, Soohwan Song

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04576 returned HTTP 429 (rate limited).

[354] Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Quoc-Huy Trinh, Mustapha Abdullahi, Bo Zhao, Debesh Jha

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04579 returned HTTP 429 (rate limited).

[355] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Lei Zhang, Junjiao Tian, Zhipeng Fan, Kunpeng Li, Jialiang Wang, Weifeng Chen, Markos Georgopoulos, Felix Juefei-Xu, Yuxiang Bao, Julian McAuley, Manling Li, Zecheng He

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04746 returned HTTP 429 (rate limited).

[356] Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang

Main category: cs.CV

Summary unavailable: the arXiv API request for 2604.04838 returned HTTP 429 (rate limited).

[357] MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence

Renjun Gao

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.01594 returned HTTP 429 (rate limited).

[358] TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots

Tianyu Liu, Weihao Xuan, Hao Wu, Peter Humphrey, Marcello DiStasio, Mohamed Kahila, Alfonso Garcia Tan, Heli Qi, Rui Yang, Simeng Han, Tinglin Huang, Fang Wu, Chen Liu, Qingyu Chen, Nan Liu, Irene Li, Hua Xu, Hongyu Zhao

Main category: cs.CV

Summary unavailable: the arXiv API request for 2511.17652 returned HTTP 429 (rate limited).

[359] Prediction of Grade, Gender, and Academic Performance of Children and Teenagers from Handwriting Using the Sigma-Lognormal Model

Adrian Iste, Kazuki Nishizawa, Chisa Tanaka, Andrew Vargo, Anna Scius-Bertrand, Andreas Fischer, Koichi Kise

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.11519 returned HTTP 429 (rate limited).

[360] Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.25661 returned HTTP 429 (rate limited).

[361] DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

Jiayi Chen, Wenxuan Song, Shuai Chen, Jingbo Wang, Zhijun Li, Haoang Li

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.26320 returned HTTP 429 (rate limited).

[362] ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

Feng Wu, Wei Zuo, Wenliang Yang, Jun Xiao, Yang Liu, Xinhua Zeng

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.26788 returned HTTP 429 (rate limited).

[363] STRADAViT: Towards a Foundational Model for Radio Astronomy through Self-Supervised Transfer

Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi

Main category: cs.CV

Summary unavailable: the arXiv API request for 2603.29660 returned HTTP 429 (rate limited).

cs.AI

[364] PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing

Yiwen Song, Yale Song, Tomas Pfister, Jinsung Yoon

Main category: cs.AI

TL;DR: PaperOrchestra is a multi-agent framework for automated AI research paper writing that transforms unstructured pre-writing materials into complete LaTeX manuscripts with literature synthesis and generated visuals.

DetailsMotivation: Existing autonomous writing systems are too rigidly coupled to specific experimental pipelines and produce superficial literature reviews, creating a need for more flexible AI-driven scientific paper writing tools.

Method: Multi-agent framework that flexibly transforms unconstrained pre-writing materials into submission-ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals like plots and conceptual diagrams.

Result: Significantly outperforms autonomous baselines with 50%-68% absolute win rate margin in literature review quality and 14%-38% in overall manuscript quality in human evaluations.

Conclusion: PaperOrchestra demonstrates effective automated AI research paper writing capabilities, addressing limitations of existing systems through flexible multi-agent architecture.

Abstract: Synthesizing unstructured research materials into manuscripts is an essential yet under-explored challenge in AI-driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines, and produce superficial literature reviews. We introduce PaperOrchestra, a multi-agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre-writing materials into submission-ready LaTeX manuscripts, including comprehensive literature synthesis and generated visuals, such as plots and conceptual diagrams. To evaluate performance, we present PaperWritingBench, the first standardized benchmark of reverse-engineered raw materials from 200 top-tier AI conference papers, alongside a comprehensive suite of automated evaluators. In side-by-side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving an absolute win rate margin of 50%-68% in literature review quality, and 14%-38% in overall manuscript quality.

[365] Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

Sharath Sathish

Main category: cs.AI

TL;DR: Pramana teaches LLMs explicit epistemological methodology using Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework, to improve systematic reasoning and reduce hallucinations.

DetailsMotivation: LLMs struggle with systematic reasoning and often hallucinate confident but unfounded claims, with performance degrading significantly when irrelevant context is added. This epistemic gap limits AI reliability in domains requiring justification.

Method: Fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on Navya-Nyaya logic framework with 6-phase structured reasoning: doubt analysis, evidence source identification, 5-member syllogism, counterfactual verification, fallacy detection, and ascertainment. Use 55 Nyaya-structured logical problems covering constraint satisfaction, Boolean SAT, and multi-step deduction.

Result: Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence, showing that models internalize reasoning content even with imperfect structural enforcement. Ablation studies reveal that format prompting and temperature critically affect performance, with optimal configurations differing by stage.

Conclusion: Integrating logic and epistemology through Navya-Nyaya provides cognitive scaffolding absent from standard reasoning approaches, enabling more reliable AI reasoning. All models, datasets, and training infrastructure are released on Hugging Face.

Abstract: Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% (Apple Machine Learning Research), exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence, revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.
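The six-phase structure above lends itself naturally to a prompt scaffold. A minimal sketch: the phase names come from the paper's abstract, while `nyaya_prompt` and the instruction wording are our own illustration, not the paper's training format.

```python
# Illustrative scaffold for the paper's six Navya-Nyaya phases.
# Phase names follow the abstract; the instruction text is hypothetical.
PHASES = [
    ("SAMSHAYA", "Doubt analysis: state exactly what is uncertain."),
    ("PRAMANA", "Evidence: name the source grounding each claim."),
    ("PANCHA AVAYAVA", "Give the 5-member syllogism, including the universal rule."),
    ("TARKA", "Counterfactual check: what would follow if the conclusion were false?"),
    ("HETVABHASA", "Scan every inference step for fallacies."),
    ("NIRNAYA", "Ascertainment: label the answer as knowledge or as hypothesis."),
]

def nyaya_prompt(problem: str) -> str:
    """Render a training/inference prompt that walks through all six phases in order."""
    body = "\n\n".join(f"### {name}\n{instruction}" for name, instruction in PHASES)
    return f"Problem: {problem}\n\nReason through the phases in order:\n\n{body}"
```

A scaffold like this makes the "strict format adherence" metric in the abstract concrete: an output adheres if it visits all six headings in order.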

[366] Operational Noncommutativity in Sequential Metacognitive Judgments

Enso O. Torres Alegre, Diana E. Mora Jimenez

Main category: cs.AI

TL;DR: Operational framework distinguishes classical state changes from genuine non-commutativity in metacognitive processes, showing order effects can violate classical constraints.

DetailsMotivation: To understand whether order effects in metacognition reflect classical state changes or reveal deeper structural non-commutativity, distinguishing between these possibilities operationally.

Method: Develop operational framework modeling metacognitive evaluations as state-transforming operations with probabilistic readouts. Introduce assumptions of counterfactual definiteness and evaluation non-invasiveness, derive testable constraints on sequential correlations, and provide explicit rotation model with numerical examples.

Result: Shows order dependence prevents faithful Boolean-commutative representation. Violations of derived constraints rule out classical non-invasive accounts, certifying genuine non-commutativity. Provides explicit three-dimensional rotation model exhibiting such violations.

Conclusion: Metacognitive processes can exhibit genuine non-commutativity beyond classical state changes, with operational framework enabling empirical tests through behavioral paradigms involving sequential confidence judgments.

Abstract: Metacognition, understood as the monitoring and regulation of one’s own cognitive processes, is inherently sequential: an agent evaluates an internal state, updates it, and may then re-evaluate under modified criteria. Order effects in cognition are well documented, yet it remains unclear whether such effects reflect classical state changes or reveal a deeper structural non-commutativity. We develop an operational framework that makes this distinction explicit. In our formulation, metacognitive evaluations are modeled as state-transforming operations acting on an internal state space with probabilistic readouts, thereby separating evaluation back-action from observable output. We show that order dependence prevents any faithful Boolean-commutative representation. We then address a stronger question: can observed order effects always be explained by enlarging the state space with classical latent variables? To formalize this issue, we introduce two assumptions, counterfactual definiteness and evaluation non-invasiveness, under which the existence of a joint distribution over all sequential readouts implies a family of testable constraints on pairwise sequential correlations. Violation of these constraints rules out any classical non-invasive account and certifies what we call genuine non-commutativity. We provide an explicit three-dimensional rotation model with fully worked numerical examples that exhibits such violations. We also outline a behavioral paradigm involving sequential confidence, error-likelihood, and feeling-of-knowing judgments following a perceptual decision, together with the corresponding empirical test. No claim is made regarding quantum physical substrates; the framework is purely operational and algebraic.
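The order dependence the paper certifies can be illustrated numerically. The sketch below is not the paper's three-dimensional rotation model; it is a simpler projective toy of our own, in which the first judgment's back-action collapses the state onto its evaluation axis, so sequential "yes" probabilities depend on order. The axis names and state values are invented for illustration.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def sequential_yes_prob(state, first, second):
    """P(yes to `first`, then yes to `second`) under projective back-action:
    conditional on the first 'yes', the state collapses onto the first axis."""
    p1 = dot(state, first) ** 2    # readout probability for the first judgment
    p2 = dot(first, second) ** 2   # readout from the collapsed state
    return p1 * p2

u = normalize([1.0, 1.0, 1.0])   # initial internal state (illustrative)
a = [1.0, 0.0, 0.0]              # axis for judgment A, e.g. confidence
b = normalize([1.0, 1.0, 0.0])   # axis for judgment B, e.g. error likelihood

p_ab = sequential_yes_prob(u, a, b)  # A asked first: 1/6
p_ba = sequential_yes_prob(u, b, a)  # B asked first: 1/3 -> a pure order effect
```

Because the first readout disturbs the state, no single joint distribution over both readouts reproduces both orderings, which is the kind of violation the paper's constraints are designed to detect.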

[367] Proximity Measure of Information Object Features for Solving the Problem of Their Identification in Information Systems

Volodymyr Yuzefovych

Main category: cs.AI

TL;DR: Proposes new proximity measures for information objects from multiple sources, combining probabilistic measures for quantitative features and possibility measures for qualitative features without requiring value transformation.

DetailsMotivation: Need to determine if information objects from different sources refer to the same physical object, accounting for measurement errors in both quantitative and qualitative features without requiring feature transformation.

Method: Uses probabilistic measures for quantitative feature proximity and possibility measures for qualitative feature proximity, with proposed variants for determining overall information object proximity based on diverse feature groups.

Result: Demonstrates feasibility by checking compliance with required axioms, showing the approach works without feature value transformation unlike many existing measures.

Conclusion: Proposed proximity measures effectively handle both quantitative and qualitative features from multiple sources for object matching, with axiomatic validation.

Abstract: The paper considers a new quantitative-qualitative proximity measure for the features of information objects, where data enters a common information resource from several sources independently. The goal is to determine the possibility of their relation to the same physical object (observation object). The proposed measure accounts for the possibility of differences in individual feature values - both quantitative and qualitative - caused by existing determination errors. To analyze the proximity of quantitative feature values, the author employs a probabilistic measure; for qualitative features, a measure of possibility is used. The paper demonstrates the feasibility of the proposed measure by checking its compliance with the axioms required of any measure. Unlike many known measures, the proposed approach does not require feature value transformation to ensure comparability. The work also proposes several variants of measures to determine the proximity of information objects (IO) based on a group of diverse features.
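As a rough illustration of combining the two kinds of measures, consider one quantitative feature scored with a Gaussian error likelihood and one qualitative feature scored from a possibility table. The abstract does not give the paper's formulas, so every function, table, and number below is an assumption of ours.

```python
import math

def quantitative_proximity(x, y, sigma):
    """Probabilistic proximity for a quantitative feature: likelihood that the
    two readings differ only by Gaussian measurement error (illustrative form)."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def qualitative_proximity(a, b, possibility):
    """Possibility-based proximity for a qualitative feature: how possible it is
    that one source reports `a` while another reports `b` (illustrative table)."""
    return 1.0 if a == b else possibility.get((a, b), 0.0)

# Two observations of (length in metres, colour) from independent sources.
confusion = {("grey", "silver"): 0.8, ("silver", "grey"): 0.8}
p_len = quantitative_proximity(12.1, 12.3, sigma=0.5)
p_col = qualitative_proximity("grey", "silver", confusion)
overall = min(p_len, p_col)  # one conservative aggregation variant
```

Note that neither measure requires transforming the feature values into a common scale, which mirrors the comparability claim in the abstract.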

[368] From Governance Norms to Enforceable Controls: A Layered Translation Method for Runtime Guardrails in Agentic AI

Christopher Koch

Main category: cs.AI

TL;DR: A framework for translating AI governance standards into implementable runtime guardrails for agentic AI systems through a layered control approach.

DetailsMotivation: Agentic AI systems create unique governance challenges because risks emerge during execution, not just at development or deployment time. Existing standards don't provide implementable runtime guardrails.

Method: Proposes a layered translation method connecting standards-derived governance objectives to four control layers: governance objectives, design-time constraints, runtime mediation, and assurance feedback. Introduces control tuple and runtime-enforceability rubric for layer assignment.

Result: Demonstrates the method in a procurement-agent case study, showing how to translate governance standards into practical controls across architecture, runtime policy, human escalation, and audit layers.

Conclusion: Standards should guide control placement across multiple layers, with runtime guardrails reserved only for controls that are observable, determinate, and time-sensitive enough to justify execution-time intervention.

Abstract: Agentic AI systems plan, use tools, maintain state, and produce multi-step trajectories with external effects. Those properties create a governance problem that differs materially from single-turn generative AI: important risks emerge during execution, not only at model development or deployment time. Governance standards such as ISO/IEC 42001, ISO/IEC 23894, ISO/IEC 42005, ISO/IEC 5338, ISO/IEC 38507, and the NIST AI Risk Management Framework are therefore highly relevant to agentic AI, but they do not by themselves yield implementable runtime guardrails. This paper proposes a layered translation method that connects standards-derived governance objectives to four control layers: governance objectives, design-time constraints, runtime mediation, and assurance feedback. It distinguishes governance objectives, technical controls, runtime guardrails, and assurance evidence; introduces a control tuple and runtime-enforceability rubric for layer assignment; and demonstrates the method in a procurement-agent case study. The central claim is modest: standards should guide control placement across architecture, runtime policy, human escalation, and audit, while runtime guardrails are reserved for controls that are observable, determinate, and time-sensitive enough to justify execution-time intervention.
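The control tuple and runtime-enforceability rubric can be sketched as a data structure. The three criteria (observable, determinate, time-sensitive) come from the abstract; the field names, layer labels, and example controls are our own illustration, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Control:
    """A control tuple in the spirit of the paper's layered method (hypothetical schema)."""
    objective: str        # standards-derived governance objective
    layer: str            # design-time | runtime | escalation | audit
    observable: bool      # can the condition be seen at execution time?
    determinate: bool     # can pass/fail be decided mechanically?
    time_sensitive: bool  # must intervention happen before the step completes?

def runtime_enforceable(c: Control) -> bool:
    """Rubric from the abstract: runtime guardrails are reserved for controls
    that are observable, determinate, and time-sensitive."""
    return c.observable and c.determinate and c.time_sensitive

# Illustrative controls for a procurement agent (examples are ours).
spend_cap = Control("cap per-order spend", "runtime", True, True, True)
bias_review = Control("periodic fairness review", "audit", True, False, False)
```

Under this rubric the spend cap qualifies for a runtime guardrail, while the fairness review belongs in the audit layer.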

[369] ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

Cuong Van Duc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Tung Vu Duy, Son Nguyen Van, Hanh Nguyen Thi, Binh Huynh Thi Thanh

Main category: cs.AI

TL;DR: ReVEL is a framework that uses LLMs as multi-turn reasoners within evolutionary algorithms to generate robust heuristics for NP-hard combinatorial optimization problems through structured performance feedback.

DetailsMotivation: Current LLM applications for heuristic design rely on one-shot code synthesis, producing brittle heuristics that don't leverage LLMs' capacity for iterative reasoning. There's a need for frameworks that enable LLMs to engage in multi-turn reasoning with structured feedback for better heuristic evolution.

Method: ReVEL embeds LLMs as interactive reasoners within evolutionary algorithms using two key mechanisms: (1) performance-profile grouping to cluster candidate heuristics into behaviorally coherent groups for compact feedback, and (2) multi-turn feedback-driven reflection where LLMs analyze group behaviors and generate targeted refinements. An EA-based meta-controller selectively integrates and validates these refinements while balancing exploration and exploitation.

Result: Experiments on standard combinatorial optimization benchmarks show ReVEL consistently produces more robust and diverse heuristics, achieving statistically significant improvements over strong baselines.

Conclusion: Multi-turn reasoning with structured grouping represents a principled paradigm for automated heuristic design, demonstrating the value of embedding LLMs as interactive reasoners within evolutionary frameworks.

Abstract: Designing effective heuristics for NP-hard combinatorial optimization problems remains a challenging and expertise-intensive task. Existing applications of large language models (LLMs) primarily rely on one-shot code synthesis, yielding brittle heuristics that underutilize the models’ capacity for iterative reasoning. We propose ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback, a hybrid framework that embeds LLMs as interactive, multi-turn reasoners within an evolutionary algorithm (EA). The core of ReVEL lies in two mechanisms: (i) performance-profile grouping, which clusters candidate heuristics into behaviorally coherent groups to provide compact and informative feedback to the LLM; and (ii) multi-turn, feedback-driven reflection, through which the LLM analyzes group-level behaviors and generates targeted heuristic refinements. These refinements are selectively integrated and validated by an EA-based meta-controller that adaptively balances exploration and exploitation. Experiments on standard combinatorial optimization benchmarks show that ReVEL consistently produces heuristics that are more robust and diverse, achieving statistically significant improvements over strong baselines. Our results highlight multi-turn reasoning with structured grouping as a principled paradigm for automated heuristic design.
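Performance-profile grouping can be approximated with a simple signature clustering: heuristics whose per-instance scores coincide up to a tolerance land in the same group and can share one compact feedback message to the LLM. The candidate data, tolerance, and signature scheme below are invented for illustration; the paper's actual clustering may differ.

```python
from collections import defaultdict

def profile_signature(scores, tol=0.05):
    """Discretize a heuristic's per-instance scores into a coarse signature."""
    return tuple(round(s / tol) for s in scores)

def group_by_profile(heuristics):
    """Cluster candidate heuristics into behaviorally coherent groups."""
    groups = defaultdict(list)
    for name, scores in heuristics.items():
        groups[profile_signature(scores)].append(name)
    return list(groups.values())

# Hypothetical per-instance scores for three candidate heuristics.
candidates = {
    "h1": [0.90, 0.40, 0.70],
    "h2": [0.91, 0.41, 0.69],   # behaves like h1 -> same group
    "h3": [0.30, 0.85, 0.20],   # distinct profile -> its own group
}
```

With groups in hand, the reflection step needs only one group-level summary per cluster rather than per-candidate feedback, which is what keeps the multi-turn dialogue compact.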

[370] Algebraic Structure Discovery for Real World Combinatorial Optimisation Problems: A General Framework from Abstract Algebra to Quotient Space Learning

Min Sun, Federica Storti, Valentina Martino, Miguel Gonzalez-Andrades, Tony Kam-Thong

Main category: cs.AI

TL;DR: A framework that identifies algebraic structures in combinatorial optimization problems to shrink search spaces and improve optimization efficiency, demonstrated on rule-combination tasks.

DetailsMotivation: Many combinatorial optimization problems contain hidden algebraic structures that, when exposed, can reduce search space size and improve the probability of finding global optimal solutions. The authors aim to develop a general framework to systematically identify and exploit these structures.

Method: The framework (1) identifies algebraic structure in problems, (2) formalizes operations, (3) constructs quotient spaces that collapse redundant representations, and (4) optimizes directly over the reduced spaces. For rule-combination tasks, conjunctive rules form a monoid; via a characteristic-vector encoding, an isomorphism to the Boolean hypercube with bitwise OR is proven, making logical AND on rules equivalent to bitwise OR in the encoding.

Result: On real clinical data and synthetic benchmarks, quotient-space-aware genetic algorithms recovered the global optimum in 48% to 77% of runs, compared to 35% to 37% for standard approaches, while maintaining diversity across equivalence classes.

Conclusion: Exposing and exploiting algebraic structure offers a simple, general route to more efficient combinatorial optimization across various domains including patient subgroup discovery and rule-based molecular screening.

Abstract: Many combinatorial optimisation problems hide algebraic structures that, once exposed, shrink the search space and improve the chance of finding the global optimal solution. We present a general framework that (i) identifies algebraic structure, (ii) formalises operations, (iii) constructs quotient spaces that collapse redundant representations, and (iv) optimises directly over these reduced spaces. Across a broad family of rule-combination tasks (e.g., patient subgroup discovery and rule-based molecular screening), conjunctive rules form a monoid. Via a characteristic-vector encoding, we prove an isomorphism to the Boolean hypercube $\{0,1\}^n$ with bitwise OR, so logical AND in rules becomes bitwise OR in the encoding. This yields a principled quotient-space formulation that groups functionally equivalent rules and guides structure-aware search. On real clinical data and synthetic benchmarks, quotient-space-aware genetic algorithms recover the global optimum in 48% to 77% of runs versus 35% to 37% for standard approaches, while maintaining diversity across equivalence classes. These results show that exposing and exploiting algebraic structure offers a simple, general route to more efficient combinatorial optimisation.
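The characteristic-vector construction in the abstract is concrete enough to sketch directly: encoding each conjunctive rule as a bitmask makes rule conjunction a bitwise OR, and syntactically different variants collapse to one representative. The condition names below are illustrative.

```python
# Sketch of the characteristic-vector encoding for conjunctive rules:
# one bit per possible atomic condition; conjoining rules (logical AND)
# unions their condition sets, i.e. bitwise OR of the masks.

CONDITIONS = ["age>65", "smoker", "egfr<60", "diabetic"]
BIT = {c: 1 << i for i, c in enumerate(CONDITIONS)}

def encode(rule):
    """Map a conjunctive rule (iterable of conditions) to its bitmask."""
    mask = 0
    for cond in rule:
        mask |= BIT[cond]
    return mask

r1 = encode(["age>65", "smoker"])
r2 = encode(["smoker", "egfr<60"])
combined = r1 | r2   # AND of the two rules == OR of their encodings
assert combined == encode(["age>65", "smoker", "egfr<60"])

# Quotient step: reorderings and duplicated conditions collapse to the same
# representative, which is what shrinks the genetic algorithm's search space.
assert encode(["smoker", "age>65", "smoker"]) == r1
```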

[371] Part-Level 3D Gaussian Vehicle Generation with Joint and Hinge Axis Estimation

Shiyao Qian, Yuan Ren, Dongfeng Bai, Bingbing Liu

Main category: cs.AI

TL;DR: A generative framework that creates animatable 3D Gaussian vehicle models from single images or sparse multi-view inputs, enabling part-level articulation for autonomous driving simulation.

Motivation: Current autonomous driving simulation frameworks model vehicles as rigid assets, failing to capture part-level articulation needed for perception algorithms that leverage dynamics like wheel steering or door opening. Existing CAD-based pipelines have limited coverage and fixed templates, preventing faithful reconstruction of real-world vehicle instances.

Method: Proposes a generative framework with two key components: (1) a part-edge refinement module that enforces exclusive Gaussian ownership to prevent distortions at part boundaries during animation, and (2) a kinematic reasoning head that predicts joint positions and hinge axes of movable parts. The system works from single image or sparse multi-view input.

Result: The method enables faithful part-aware simulation by bridging the gap between static generation and animatable vehicle models, addressing limitations of existing 3D asset generators that are optimized for static quality but not articulation.

Conclusion: The proposed framework successfully creates animatable 3D Gaussian vehicle representations that capture part-level articulation, overcoming limitations of rigid asset modeling and enabling more realistic simulation for autonomous driving applications.

Abstract: Simulation is essential for autonomous driving, yet current frameworks often model vehicles as rigid assets and fail to capture part-level articulation. With perception algorithms increasingly leveraging dynamics such as wheel steering or door opening, realistic simulation requires animatable vehicle representations. Existing CAD-based pipelines are limited by library coverage and fixed templates, preventing faithful reconstruction of in-the-wild instances. We propose a generative framework that, from a single image or sparse multi-view input, synthesizes an animatable 3D Gaussian vehicle. Our method addresses two challenges: (i) large 3D asset generators are optimized for static quality but not articulation, leading to distortions at part boundaries when animated; and (ii) segmentation alone cannot provide the kinematic parameters required for motion. To overcome this, we introduce a part-edge refinement module that enforces exclusive Gaussian ownership and a kinematic reasoning head that predicts joint positions and hinge axes of movable parts. Together, these components enable faithful part-aware simulation, bridging the gap between static generation and animatable vehicle models.

[372] SCMAPR: Self-Correcting Multi-Agent Prompt Refinement for Complex-Scenario Text-to-Video Generation

Chengyi Yang, Pengzhen Li, Jiayin Qi, Aimin Zhou, Ji Wu, Ji Liu

Main category: cs.AI

TL;DR: SCMAPR is a multi-agent prompt refinement framework for text-to-video generation that improves handling of complex scenarios through scenario-aware rewriting and self-correcting verification.

Motivation: Current text-to-video systems struggle with complex scenarios due to ambiguous and underspecified text prompts, which leads to poor text-video alignment and generation quality.

Method: A multi-agent framework with three components: (1) scenario routing to taxonomy-grounded categories, (2) scenario-aware rewriting policies with policy-conditioned refinement, and (3) structured semantic verification with conditional revision.

Result: SCMAPR consistently improves text-video alignment and generation quality, with average-score gains of up to 2.67% on VBench and 3.28 on EvalCrafter, and a 0.028 improvement on T2V-CompBench over SOTA baselines.

Conclusion: The proposed multi-agent prompt refinement framework effectively addresses complex scenarios in T2V generation through systematic scenario-aware rewriting and verification mechanisms.

Abstract: Text-to-Video (T2V) generation has benefited from recent advances in diffusion models, yet current systems still struggle under complex scenarios, which are generally exacerbated by the ambiguity and underspecification of text prompts. In this work, we formulate complex-scenario prompt refinement as a stage-wise multi-agent refinement process and propose SCMAPR, i.e., a scenario-aware and Self-Correcting Multi-Agent Prompt Refinement framework for T2V prompting. SCMAPR coordinates specialized agents to (i) route each prompt to a taxonomy-grounded scenario for strategy selection, (ii) synthesize scenario-aware rewriting policies and perform policy-conditioned refinement, and (iii) conduct structured semantic verification that triggers conditional revision when violations are detected. To clarify what constitutes complex scenarios in T2V prompting, provide representative examples, and enable rigorous evaluation under such challenging conditions, we further introduce T2V-Complexity, which is a complex-scenario T2V benchmark consisting exclusively of complex-scenario prompts. Extensive experiments on 3 existing benchmarks and our T2V-Complexity benchmark demonstrate that SCMAPR consistently improves text-video alignment and overall generation quality under complex scenarios, achieving up to 2.67% and 3.28 gains in average score on VBench and EvalCrafter, and up to 0.028 improvement on T2V-CompBench over 3 State-Of-The-Art baselines.
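The route-rewrite-verify loop can be sketched as plain control flow. The functions below are stubs standing in for LLM agent calls; the taxonomy, rewriting policy, and verification rule are illustrative assumptions, not SCMAPR's actual prompts.

```python
# Hedged sketch of SCMAPR's three-stage loop with stub agents.

TAXONOMY = {"multi-object": ["two", "three", "both"],
            "temporal": ["then", "after", "before"]}

def route(prompt):
    """Stage 1: map the prompt to a taxonomy-grounded scenario category."""
    for scenario, cues in TAXONOMY.items():
        if any(cue in prompt for cue in cues):
            return scenario
    return "generic"

def rewrite(prompt, scenario):
    """Stage 2: policy-conditioned refinement (stub: tag + enrich)."""
    return f"[{scenario}] {prompt}, with explicit subjects and motion details"

def verify(refined):
    """Stage 3: structured semantic check (stub: require the scenario tag)."""
    return refined.startswith("[")

def scmapr(prompt, max_revisions=2):
    scenario = route(prompt)
    refined = rewrite(prompt, scenario)
    for _ in range(max_revisions):
        if verify(refined):
            return refined
        refined = rewrite(refined, scenario)  # conditional revision on violation
    return refined

out = scmapr("two dogs chase a ball, then jump into a lake")
```

In the real system each stub would be a specialized agent, and verification would check structured semantic constraints rather than a string prefix.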

[373] MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems

Frazier N. Baker, Trieu Nguyen, Reza Averly, Botao Yu, Daniel Adu-Ampratwum, Huan Sun, Xia Ning

Main category: cs.AI

TL;DR: MMORF is a modular framework for building multi-agent systems for multi-objective retrosynthesis planning in chemistry, enabling flexible system design and evaluation.

Motivation: Multi-objective retrosynthesis planning requires balancing quality, safety, and cost objectives, and language model-based multi-agent systems offer a promising approach to incorporate multiple objectives through specialized agent interactions.

Method: MMORF provides modular agentic components that can be flexibly combined into different multi-agent systems. The framework enables principled evaluation and comparison of system designs, with two representative systems built: MASIL and RFAS.

Result: On a new benchmark of 218 multi-objective retrosynthesis tasks, MASIL achieves strong safety and cost metrics on soft-constraint tasks and frequently Pareto-dominates baseline routes, while RFAS achieves 48.6% success rate on hard-constraint tasks, outperforming state-of-the-art baselines.

Conclusion: MMORF serves as an effective foundational framework for exploring multi-agent systems for multi-objective retrosynthesis planning, demonstrating the value of modular, configurable agent architectures in complex chemistry tasks.

Abstract: Multi-objective retrosynthesis planning is a critical chemistry task requiring dynamic balancing of quality, safety, and cost objectives. Language model-based multi-agent systems (MAS) offer a promising approach for this task: leveraging interactions of specialized agents to incorporate multiple objectives into retrosynthesis planning. We present MMORF, a framework for constructing MAS for multi-objective retrosynthesis planning. MMORF features modular agentic components, which can be flexibly combined and configured into different systems, enabling principled evaluation and comparison of different system designs. Using MMORF, we construct two representative MAS: MASIL and RFAS. On a newly curated benchmark consisting of 218 multi-objective retrosynthesis planning tasks, MASIL achieves strong safety and cost metrics on soft-constraint tasks, frequently Pareto-dominating baseline routes, while RFAS achieves a 48.6% success rate on hard-constraint tasks, outperforming state-of-the-art baselines. Together, these results show the effectiveness of MMORF as a foundational framework for exploring MAS for multi-objective retrosynthesis planning. Code and data are available at https://anonymous.4open.science/r/MMORF/.
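The Pareto-dominance comparison behind the MASIL result can be sketched as follows; the objective triple and example values are illustrative assumptions (all objectives oriented so that higher is better).

```python
# Standard Pareto-dominance check: route `a` dominates route `b` if it is
# at least as good on every objective and strictly better on at least one.

def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Hypothetical (quality, safety, inverse-cost) vectors for two routes.
route_masil    = (0.90, 0.95, 0.50)
route_baseline = (0.90, 0.80, 0.40)
```

Note that dominance is irreflexive: a route never dominates itself, and two routes can be mutually non-dominated.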

[374] LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Ojas Jain, Dhruv Kumar

Main category: cs.AI

TL;DR: LudoBench is a benchmark for evaluating LLM strategic reasoning in the stochastic board game Ludo, featuring 480 handcrafted scenarios across 12 decision categories with a game-theory baseline for comparison.

Motivation: Current LLM benchmarks often lack meaningful strategic complexity and uncertainty. Ludo provides a controlled environment with stochastic elements (dice), multi-agent dynamics, and strategic trade-offs that better test reasoning under uncertainty.

Method: Created 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories in Ludo. Built a 4-player simulator supporting Random, Heuristic, Game-Theory (Expectiminimax search), and LLM agents. Evaluated six models from four families against game-theory baseline.

Result: All models agreed with the game-theory baseline only 40-46% of the time. Models split into two behavioral archetypes: finishers (complete pieces but neglect development) and builders (develop but never finish), each capturing only half of the optimal strategy. Models also showed prompt-sensitivity, with behavioral shifts under history-conditioned grudge framing on identical board states.

Conclusion: LudoBench provides a lightweight, interpretable framework for benchmarking LLM strategic reasoning under uncertainty, revealing significant gaps in current models’ strategic capabilities and highlighting prompt-sensitivity as a key vulnerability.

Abstract: We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning complexity. LudoBench comprises 480 handcrafted spot scenarios across 12 behaviorally distinct decision categories, each isolating a specific strategic choice. We additionally contribute a fully functional 4-player Ludo simulator supporting Random, Heuristic, Game-Theory, and LLM agents. The game-theory agent uses Expectiminimax search with depth-limited lookahead to provide a principled strategic ceiling beyond greedy heuristics. Evaluating six models spanning four model families, we find that all models agree with the game-theory baseline only 40-46% of the time. Models split into distinct behavioral archetypes: finishers that complete pieces but neglect development, and builders that develop but never finish. Each archetype captures only half of the game theory strategy. Models also display measurable behavioral shifts under history-conditioned grudge framing on identical board states, revealing prompt-sensitivity as a key vulnerability. LudoBench provides a lightweight and interpretable framework for benchmarking LLM strategic reasoning under uncertainty. All code, the spot dataset (480 entries) and model outputs are available at https://anonymous.4open.science/r/LudoBench-5CBF/
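The Expectiminimax search used by the game-theory agent can be illustrated on a toy tree: MAX nodes pick the best move, and CHANCE nodes average over dice outcomes. The tree below is a hand-built stand-in, not an actual Ludo position.

```python
# Depth-limited Expectiminimax over (kind, payload) tuples:
#   ("leaf", value), ("max", [children]), ("chance", [(prob, child), ...]).

def heuristic(state):
    return 0.0  # depth-cutoff fallback (unused in this tiny tree)

def expectiminimax(node, depth):
    kind, payload = node
    if kind == "leaf":
        return payload
    if depth == 0:
        return heuristic(payload)
    if kind == "max":
        return max(expectiminimax(child, depth - 1) for child in payload)
    if kind == "chance":
        return sum(p * expectiminimax(child, depth - 1) for p, child in payload)
    raise ValueError(kind)

# MAX chooses between a safe move (value 2) and a gamble:
# a 1/6 chance of a capture worth 12, else lose 1 (expected value ~1.17).
tree = ("max", [
    ("leaf", 2.0),
    ("chance", [(1/6, ("leaf", 12.0)), (5/6, ("leaf", -1.0))]),
])
value = expectiminimax(tree, depth=3)
```

Here the safe move wins because the gamble's expectation (12/6 - 5/6 ≈ 1.17) falls below 2; a full Ludo agent would expand such trees from real board states with a heuristic at the depth limit.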

[375] MedGemma 1.5 Technical Report

Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger, Fayaz Jamil, Madeleine Traverse, Alberto Tono, Bashir Sadjad, Lin Yang, Charles Lau, Liron Yatziv, Tiffany Chen, Bram Sterling, Kenneth Philbrick, Richa Tiwari, Yun Liu, Madhuram Jajoo, Chandrashekar Sankarapu, Swapnil Vispute, Harshad Purandare, Abhishek Bijay Mishra, Sam Schmidgall, Tao Tu, Anil Palepu, Chunjong Park, Tim Strother, Rahul Thapa, Yong Cheng, Preeti Singh, Kat Black, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Joelle Barral, Tris Warkentin, Shravya Shetty, Dale Webster, Sunny Virmani, David F. Steiner, Can Kirmizibayrak, Daniel Golden

Main category: cs.AI

TL;DR: MedGemma 1.5 4B is an enhanced multimodal medical AI model that expands capabilities to include 3D medical imaging (CT/MRI), histopathology whole slide images, anatomical localization, multi-timepoint chest X-ray analysis, and improved medical document understanding.

Motivation: To create a more comprehensive medical AI foundation model that can handle diverse medical modalities including high-dimensional imaging, temporal analysis, and complex medical documents, building upon the previous MedGemma 1 model.

Method: Extended MedGemma 1 architecture with innovations including new training data, long-context 3D volume slicing techniques, whole-slide pathology sampling methods, and integration of multiple medical modalities within a single unified model.

Result: Significant improvements over MedGemma 1: 11% gain in 3D MRI classification, 3% in 3D CT classification, 47% macro F1 gain in pathology imaging, 35% IoU improvement in anatomical localization, 4% accuracy for multi-timepoint X-ray analysis, plus 5% MedQA and 22% EHRQA improvements.

Conclusion: MedGemma 1.5 serves as a robust open resource providing a comprehensive multimodal medical AI foundation with substantial improvements across diverse medical modalities, enabling next-generation medical AI system development.

Abstract: We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: high-dimensional medical imaging (CT/MRI volumes and histopathology whole slide images), anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding (lab reports, electronic health records). We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling. Compared to MedGemma 1 4B, MedGemma 1.5 4B demonstrates significant gains in these new areas, improving 3D MRI condition classification accuracy by 11% and 3D CT condition classification by 3% (absolute improvements). In whole slide pathology imaging, MedGemma 1.5 4B achieves a 47% macro F1 gain. Additionally, it improves anatomical localization with a 35% increase in Intersection over Union on chest X-rays and achieves a 4% macro accuracy for longitudinal (multi-timepoint) chest x-ray analysis. Beyond its improved multimodal performance over MedGemma 1, MedGemma 1.5 improves on text-based clinical knowledge and reasoning, improving by 5% on MedQA accuracy and 22% on EHRQA accuracy. It also achieves an average of 18% macro F1 on 4 different lab report information extraction datasets (EHR Datasets 2, 3, 4, and Mendeley Clinical Laboratory Test Reports). Taken together, MedGemma 1.5 serves as a robust, open resource for the community, designed as an improved foundation on which developers can create the next generation of medical AI systems. Resources and tutorials for building upon MedGemma 1.5 can be found at https://goo.gle/MedGemma.
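One plausible reading of "long-context 3D volume slicing" — an assumption, since the report's exact scheme is not given here — is sampling evenly spaced axial slices so a fixed budget bounds how many slices of a CT/MRI volume enter the model's context:

```python
# Hypothetical slice-sampling helper: pick `budget` evenly spaced axial
# slice indices covering a volume of `n_slices` slices.

def pick_slice_indices(n_slices, budget):
    """Evenly spaced indices over [0, n_slices), at most `budget` of them."""
    k = min(n_slices, budget)
    if k == 1:
        return [0]
    return [round(i * (n_slices - 1) / (k - 1)) for i in range(k)]

idx = pick_slice_indices(n_slices=120, budget=5)
```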

[376] Uncertainty-Guided Latent Diagnostic Trajectory Learning for Sequential Clinical Diagnosis

Xuyang Shen, Haoran Liu, Dongjin Song, Martin Renqiang Min

Main category: cs.AI

TL;DR: LDTL framework uses two LLM agents for sequential clinical diagnosis: a diagnostic agent treats diagnostic paths as latent variables, and a planning agent learns to follow posterior distributions that prioritize informative trajectories, reducing uncertainty over time.

Motivation: Most LLM-based diagnostic systems assume fully observed patient information and don't model sequential evidence acquisition. Clinical diagnosis requires sequential decision-making under uncertainty, but learning effective diagnostic trajectories is challenging due to large path spaces and lack of explicit supervision in clinical datasets.

Method: Proposes Latent Diagnostic Trajectory Learning (LDTL) with two LLM agents: (1) Diagnostic agent treats diagnostic action sequences as latent paths with a posterior distribution prioritizing informative trajectories, (2) Planning agent is trained to follow this distribution to generate coherent diagnostic trajectories that progressively reduce uncertainty.

Result: Experiments on MIMIC-CDM benchmark show LDTL outperforms existing baselines in diagnostic accuracy under sequential clinical diagnosis settings while requiring fewer diagnostic tests. Ablation studies confirm trajectory-level posterior alignment is critical for these improvements.

Conclusion: LDTL effectively addresses sequential clinical diagnosis by modeling diagnostic trajectories as latent variables and aligning planning with informative posterior distributions, enabling more efficient and accurate diagnosis with fewer tests.

Abstract: Clinical diagnosis requires sequential evidence acquisition under uncertainty. However, most Large Language Model (LLM) based diagnostic systems assume fully observed patient information and therefore do not explicitly model how clinical evidence should be sequentially acquired over time. Even when diagnosis is formulated as a sequential decision process, it is still challenging to learn effective diagnostic trajectories. This is because the space of possible evidence-acquisition paths is relatively large, while clinical datasets rarely provide explicit supervision information for desirable diagnostic paths. To this end, we formulate sequential diagnosis as a Latent Diagnostic Trajectory Learning (LDTL) framework based on a planning LLM agent and a diagnostic LLM agent. For the diagnostic LLM agent, diagnostic action sequences are treated as latent paths and we introduce a posterior distribution that prioritizes trajectories providing more diagnostic information. The planning LLM agent is then trained to follow this distribution, encouraging coherent diagnostic trajectories that progressively reduce uncertainty. Experiments on the MIMIC-CDM benchmark demonstrate that our proposed LDTL framework outperforms existing baselines in diagnostic accuracy under a sequential clinical diagnosis setting, while requiring fewer diagnostic tests. Furthermore, ablation studies highlight the critical role of trajectory-level posterior alignment in achieving these improvements.
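The trajectory-preference idea can be sketched as scoring each diagnostic trajectory by how much it reduces the entropy of the diagnosis belief, then softmaxing scores into a posterior over trajectories. The belief sequences and temperature below are toy assumptions, not the paper's parameterization.

```python
# Toy version of an "informative trajectories get higher posterior mass" rule.
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete diagnosis distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def trajectory_score(belief_sequence):
    """Total entropy reduction from the initial to the final belief."""
    return entropy(belief_sequence[0]) - entropy(belief_sequence[-1])

def posterior_over_trajectories(scores, temperature=1.0):
    z = [math.exp(s / temperature) for s in scores]
    total = sum(z)
    return [x / total for x in z]

# Two candidate trajectories over a 3-disease belief: one test sequence is
# informative, the other leaves the belief nearly uniform.
informative = [[1/3, 1/3, 1/3], [0.6, 0.3, 0.1], [0.9, 0.08, 0.02]]
redundant   = [[1/3, 1/3, 1/3], [0.34, 0.33, 0.33]]
scores = [trajectory_score(informative), trajectory_score(redundant)]
post = posterior_over_trajectories(scores)
```

A planning agent trained to match such a posterior is pushed toward test sequences that shrink diagnostic uncertainty fastest, which is also why fewer tests can suffice.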

[377] Non-monotonic causal discovery with Kolmogorov-Arnold Fuzzy Cognitive Maps

Jose L. Salmeron

Main category: cs.AI

TL;DR: KA-FCM replaces scalar weights with learnable B-spline functions on edges to model non-monotonic causal relationships while preserving interpretability.

DetailsMotivation: Standard Fuzzy Cognitive Maps are limited to monotonic causal dependencies due to scalar weights and monotonic activation functions, restricting their ability to model systems with saturation effects or periodic dynamics.

Method: Proposes Kolmogorov-Arnold Fuzzy Cognitive Map (KA-FCM) that replaces static scalar weights with learnable univariate B-spline functions on edges, shifting non-linearity from node aggregation to causal influence phase.

Result: KA-FCMs significantly outperform conventional FCM architectures and achieve competitive accuracy relative to MLPs across non-monotonic inference, symbolic regression, and chaotic time-series forecasting tasks.

Conclusion: The KA-FCM architecture enables modeling of arbitrary non-monotonic causal relationships without increasing graph complexity, while preserving interpretability and allowing explicit extraction of mathematical laws from learned edges.

Abstract: Fuzzy Cognitive Maps constitute a neuro-symbolic paradigm for modeling complex dynamic systems, widely adopted for their inherent interpretability and recurrent inference capabilities. However, the standard FCM formulation, characterized by scalar synaptic weights and monotonic activation functions, is fundamentally constrained in modeling non-monotonic causal dependencies, thereby limiting its efficacy in systems governed by saturation effects or periodic dynamics. To overcome this topological restriction, this research proposes the Kolmogorov-Arnold Fuzzy Cognitive Map (KA-FCM), a novel architecture that redefines the causal transmission mechanism. Drawing upon the Kolmogorov-Arnold representation theorem, static scalar weights are replaced with learnable, univariate B-spline functions located on the model edges. This fundamental modification shifts the non-linearity from the nodes’ aggregation phase directly to the causal influence phase. This modification allows for the modeling of arbitrary, non-monotonic causal relationships without increasing the graph density or introducing hidden layers. The proposed architecture is validated against both baselines (standard FCM trained with Particle Swarm Optimization) and universal black-box approximators (Multi-Layer Perceptron) across three distinct domains: non-monotonic inference (Yerkes-Dodson law), symbolic regression, and chaotic time-series forecasting. Experimental results demonstrate that KA-FCMs significantly outperform conventional architectures and achieve competitive accuracy relative to MLPs, while preserving graph-based interpretability and enabling the explicit extraction of mathematical laws from the learned edges.
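The edge-function idea can be sketched with piecewise-linear interpolation standing in for the paper's B-splines (the knots and control values are assumptions): a single inverted-U edge already expresses a non-monotonic causal influence, in the spirit of the Yerkes-Dodson example, which no scalar weight can represent.

```python
# Sketch of a KA-FCM edge: a learnable univariate function per edge instead
# of a scalar weight. Linear splines over fixed knots stand in for B-splines.
import bisect

KNOTS = [0.0, 0.25, 0.5, 0.75, 1.0]

def edge_fn(ctrl, x):
    """Evaluate the edge's spline (control values `ctrl`) at activation x."""
    x = min(max(x, KNOTS[0]), KNOTS[-1])
    j = min(bisect.bisect_right(KNOTS, x), len(KNOTS) - 1)
    x0, x1 = KNOTS[j - 1], KNOTS[j]
    t = (x - x0) / (x1 - x0)
    return ctrl[j - 1] * (1 - t) + ctrl[j] * t

# An inverted-U edge: influence peaks at mid activation, vanishes at extremes.
inverted_u = [0.0, 0.6, 1.0, 0.6, 0.0]

def fcm_step(activations, edges):
    """One FCM update: sum per-edge function outputs into each target node."""
    new = []
    for tgt in range(len(activations)):
        s = sum(edge_fn(ctrl, activations[src])
                for (src, t2), ctrl in edges.items() if t2 == tgt)
        new.append(max(0.0, min(1.0, s)))  # clamp in place of an activation fn
    return new

edges = {(0, 1): inverted_u}              # node 0 influences node 1
mid  = fcm_step([0.5, 0.0], edges)[1]     # peak influence at mid activation
ends = fcm_step([1.0, 0.0], edges)[1]     # zero influence at the extreme
```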

[378] A mathematical theory of evolution for self-designing AIs

Kenneth D Harris

Main category: cs.AI

TL;DR: Mathematical model of evolution in self-designing AI systems where current programs design descendants, showing fitness may not increase without assumptions, and deception can be selected if it increases fitness beyond genuine utility.

Motivation: To understand how behavioral traits in AI systems will be shaped by evolutionary processes when AIs design their own descendants through directed design rather than random mutations, and to analyze implications for AI alignment when fitness and human utility are not perfectly correlated.

Method: Develop mathematical model replacing random mutations with directed tree of possible AI programs, where current programs determine descendant design and humans control through fitness function allocating computational resources. Analyze evolutionary dynamics considering long-run growth potential.

Result: Evolutionary dynamics reflect not just current fitness but long-run growth potential; without assumptions, fitness need not increase. With bounded fitness and fixed probability of “locked” self-copies, fitness concentrates on maximum reachable value. In additive model, deception is selected if it increases fitness beyond genuine utility.

Conclusion: AI evolution differs fundamentally from biological evolution due to directed design, creating alignment risks when fitness and human utility diverge. Deception can be evolutionarily favored, mitigated by using purely objective reproduction criteria rather than human judgment.

Abstract: As artificial intelligence systems (AIs) become increasingly produced by recursive self-improvement, a form of evolution may emerge, in which the traits of AI systems are shaped by the success of earlier AIs in designing and propagating their descendants. There is a rich mathematical theory modeling how behavioral traits are shaped by biological evolution, but AI evolution will be radically different: biological DNA mutations are random and approximately reversible, but descendant design in AIs will be strongly directed. Here we develop a mathematical model of evolution in self-designing AI systems, replacing random mutations with a directed tree of possible AI programs. Current programs determine the design of their descendants, while humans retain partial control through a “fitness function” that allocates limited computational resources across lineages. We show that evolutionary dynamics reflects not just current fitness but factors related to the long-run growth potential of descendant lineages. Without further assumptions, fitness need not increase over time. However, assuming bounded fitness and a fixed probability that any AI reproduces a “locked” copy of itself, we show that fitness concentrates on the maximum reachable value. We consider the implications of this for AI alignment, specifically for cases where fitness and human utility are not perfectly correlated. We show in an additive model that if deception increases fitness beyond genuine utility, evolution will select for deception. This risk could be mitigated if reproduction is based on purely objective criteria, rather than human judgment.
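The additive deception result can be reproduced in a toy replicator-style simulation: lineages receive compute in proportion to fitness, and a lineage whose deception adds fitness beyond its genuine utility comes to dominate. The fitness numbers below are illustrative, not from the paper.

```python
# Toy instance of the additive model: fitness = genuine utility + deception,
# and each lineage's population share grows proportionally to its fitness.

def evolve(shares, fitness, steps=50):
    """Replicator-style update: share_i is reweighted by fitness_i each step."""
    for _ in range(steps):
        weighted = [s * f for s, f in zip(shares, fitness)]
        total = sum(weighted)
        shares = [w / total for w in weighted]
    return shares

# Lineage A: honest (utility 1.0, deception 0). Lineage B: the same genuine
# utility, but deception adds 0.2 to fitness as perceived by the allocator.
fitness = [1.0, 1.0 + 0.2]
shares = evolve([0.5, 0.5], fitness)
```

Even this small fitness edge compounds geometrically, which is the mechanism behind the paper's selection-for-deception result; basing reproduction on objective criteria removes the wedge between fitness and utility.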

[379] IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Rongqian Chen, Yu Li, Zeyu Fang, Sizhe Tang, Weidong Cao, Tian Lan

Main category: cs.AI

TL;DR: IntentScore is a plan-aware reward model that scores GUI actions for computer-use agents, improving task success by evaluating action quality before execution to prevent cascading errors.

Motivation: Current computer-use agents generate GUI actions without evaluating their quality, leading to irreversible errors that cascade through subsequent steps. There's a need for a mechanism to score candidate actions before execution to prevent such errors.

Method: Proposes IntentScore, a reward model trained on 398K offline GUI interaction steps across three operating systems. Uses two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Embeds planning intent in the action encoder to discriminate between candidates with similar actions but different rationales.

Result: Achieves 97.5% pairwise discrimination accuracy on held-out evaluation. When deployed as a re-ranker for Agent S3 on OSWorld (unseen during training), improves task success rate by 6.9 points.

Conclusion: Reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions, enabling more reliable computer-use agents through action quality evaluation.

Abstract: Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate’s planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
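IntentScore's two training objectives can be sketched on scalar similarity scores: a margin ranking term that pushes the correct action's score above an incorrect one, and a contrastive (InfoNCE-style) alignment term between a state and its matching action. The margin value and example scores are illustrative assumptions.

```python
# Toy versions of the two objectives over precomputed similarity scores.
import math

def margin_ranking_loss(pos_score, neg_score, margin=0.5):
    """Hinge on the score gap: zero once pos exceeds neg by `margin`."""
    return max(0.0, margin - (pos_score - neg_score))

def contrastive_alignment_loss(sim_to_candidates, positive_index):
    """-log softmax probability of the matching state-action pair."""
    z = [math.exp(s) for s in sim_to_candidates]
    return -math.log(z[positive_index] / sum(z))

rank_loss = margin_ranking_loss(pos_score=0.9, neg_score=0.2)
align_loss = contrastive_alignment_loss([2.0, 0.1, -1.0], positive_index=0)
```

In the actual model these scores would come from state and action encoders (with the candidate's planning intent embedded in the action encoder); at deployment the learned score simply re-ranks the agent's candidate actions.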

[380] Bypassing the CSI Bottleneck: MARL-Driven Spatial Control for Reflector Arrays

Hieu Le, Oguz Bedir, Mostafa Ibrahim, Jian Tao, Sabit Ekin

Main category: cs.AI

TL;DR: MARL framework for mechanically adjustable RIS achieves CSI-free beam focusing using user coordinates, outperforming baselines with 26.86 dB gain over static reflectors.

Motivation: Practical RIS deployment is bottlenecked by the computational overhead of CSI estimation; an AI-native approach is needed to bypass physical-layer barriers and enable scalable wireless networks.

Method: Multi-Agent Reinforcement Learning (MARL) with Centralized Training with Decentralized Execution (CTDE) architecture using Multi-Agent Proximal Policy Optimization (MAPPO). Maps mechanical constraints to reduced-order virtual focal point space for decentralized control of metallic reflector arrays.

Result: Achieves up to 26.86 dB enhancement over static flat reflectors, outperforms single-agent and hardware-constrained DRL baselines in spatial selectivity and temporal stability. Policies show resilience to 1.0-meter localization noise.

Conclusion: MARL-driven spatial abstractions provide scalable, practical pathway toward AI-empowered wireless networks with CSI-free operation and good deployment resilience.

Abstract: Reconfigurable Intelligent Surfaces (RIS) are pivotal for next-generation smart radio environments, yet their practical deployment is severely bottlenecked by the intractable computational overhead of Channel State Information (CSI) estimation. To bypass this fundamental physical-layer barrier, we propose an AI-native, data-driven paradigm that replaces complex channel modeling with spatial intelligence. This paper presents a fully autonomous Multi-Agent Reinforcement Learning (MARL) framework to control mechanically adjustable metallic reflector arrays. By mapping high-dimensional mechanical constraints to a reduced-order virtual focal point space, we deploy a Centralized Training with Decentralized Execution (CTDE) architecture. Using Multi-Agent Proximal Policy Optimization (MAPPO), our decentralized agents learn cooperative beam-focusing strategies relying on user coordinates, achieving CSI-free operation. High-fidelity ray-tracing simulations in dynamic non-line-of-sight (NLOS) environments demonstrate that this multi-agent approach rapidly adapts to user mobility, yielding up to a 26.86 dB enhancement over static flat reflectors and outperforming single-agent and hardware-constrained DRL baselines in both spatial selectivity and temporal stability. Crucially, the learned policies exhibit good deployment resilience, sustaining stable signal coverage even under 1.0-meter localization noise. These results validate the efficacy of MARL-driven spatial abstractions as a scalable, highly practical pathway toward AI-empowered wireless networks.
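The CTDE split at the heart of the framework can be sketched as a skeleton: each reflector agent acts from local observations (user coordinates), while a critic used only during training sees the joint observation-action pair. The policy and critic bodies below are stubs, and all names and shapes are assumptions for illustration.

```python
# Minimal CTDE skeleton: decentralized execution, centralized training signal.

class ReflectorAgent:
    def __init__(self, position):
        self.position = position          # fixed reflector location (x, y)

    def act(self, user_xy):
        """Decentralized policy stub: aim the virtual focal point at the user."""
        return user_xy                    # a trained MAPPO policy would refine this

def centralized_critic(joint_obs, joint_actions):
    """Training-time value stub: negative spread of focal points from users."""
    return -sum(abs(a[0] - o[0]) + abs(a[1] - o[1])
                for o, a in zip(joint_obs, joint_actions))

agents = [ReflectorAgent((0.0, 0.0)), ReflectorAgent((5.0, 0.0))]
users = [(2.0, 3.0), (4.0, 1.0)]
actions = [ag.act(u) for ag, u in zip(agents, users)]  # execution: local info only
value = centralized_critic(users, actions)             # training: global info
```

The point of the split is that the critic (and hence CSI-free coordination via coordinates) is needed only during training; deployed agents act on local observations alone.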

[381] Learning to Focus: CSI-Free Hierarchical MARL for Reconfigurable Reflectors

Hieu Le, Mostafa Ibrahim, Oguz Bedir, Jian Tao, Sabit Ekin

Main category: cs.AI

TL;DR: Hierarchical multi-agent reinforcement learning framework for CSI-free control of reconfigurable intelligent surfaces using user localization data instead of channel estimation

DetailsMotivation: Overcome computational overhead of CSI estimation and dimensionality explosion in centralized optimization for large-scale RIS deployments in mmWave networks

Method: Two-tier neural architecture: high-level controller handles discrete user-to-reflector allocations, low-level controllers optimize continuous focal points using MAPPO with CTDE scheme, replacing pilot-based channel estimation with user localization data

Result: Achieves up to 7.79 dB RSSI improvement over centralized baselines, exhibits robust multi-user scalability, maintains resilient beam-focusing under sub-meter localization errors

Conclusion: Establishes scalable, cost-effective blueprint for intelligent wireless environments by eliminating CSI overhead while maintaining high-fidelity signal redirection

Abstract: Reconfigurable Intelligent Surfaces (RIS) have the potential to engineer smart radio environments for next-generation millimeter-wave (mmWave) networks. However, the prohibitive computational overhead of Channel State Information (CSI) estimation and the dimensionality explosion inherent in centralized optimization severely hinder practical large-scale deployments. To overcome these bottlenecks, we introduce a "CSI-free" paradigm powered by a Hierarchical Multi-Agent Reinforcement Learning (HMARL) architecture to control mechanically reconfigurable reflective surfaces. By substituting pilot-based channel estimation with accessible user localization data, our framework leverages spatial intelligence for macro-scale wave propagation management. The control problem is decomposed into a two-tier neural architecture: a high-level controller executes temporally extended, discrete user-to-reflector allocations, while low-level controllers autonomously optimize continuous focal points utilizing Multi-Agent Proximal Policy Optimization (MAPPO) under a Centralized Training with Decentralized Execution (CTDE) scheme. Comprehensive deterministic ray-tracing evaluations demonstrate that this hierarchical framework achieves massive RSSI improvements of up to 7.79 dB over centralized baselines. Furthermore, the system exhibits robust multi-user scalability and maintains highly resilient beam-focusing performance under practical sub-meter localization tracking errors. By eliminating CSI overhead while maintaining high-fidelity signal redirection, this work establishes a scalable and cost-effective blueprint for intelligent wireless environments.
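The two-tier decomposition can be sketched with simple geometric stand-ins for the learned policies (all names and heuristics below are illustrative assumptions; in the paper both tiers are trained with MAPPO under CTDE, not hand-coded):

```python
import math

# Toy sketch of the two-tier decomposition: a high-level controller makes
# discrete user-to-reflector allocations, and per-reflector low-level
# controllers pick continuous focal points. Geometric heuristics stand in
# for the learned MAPPO policies (illustrative assumptions throughout).

def high_level_allocate(users, reflectors):
    """Discrete tier: assign each user to the nearest reflector."""
    return {
        uid: min(reflectors, key=lambda r: math.hypot(ux - reflectors[r][0],
                                                      uy - reflectors[r][1]))
        for uid, (ux, uy) in users.items()
    }

def low_level_focus(users, alloc, reflector_id):
    """Continuous tier: focus the reflector on the centroid of its users."""
    pts = [users[u] for u, r in alloc.items() if r == reflector_id]
    if not pts:
        return None
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

users = {"u1": (0.0, 1.0), "u2": (0.0, 3.0), "u3": (10.0, 1.0)}
reflectors = {"r1": (0.0, 0.0), "r2": (10.0, 0.0)}
alloc = high_level_allocate(users, reflectors)
print(alloc)                                # u1, u2 -> r1; u3 -> r2
print(low_level_focus(users, alloc, "r1"))  # (0.0, 2.0)
```

Splitting the problem this way keeps the discrete allocation space small for the high-level controller while each low-level controller searches only its own continuous focal-point space, which is the scalability argument the abstract makes.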

[382] UserCentrix: An Agentic Memory-augmented AI Framework for Smart Spaces

Alaa Saleh, Sasu Tarkoma, Praveen Kumar Donta, Anders Lindgren, Naser Hossein Motlagh, Schahram Dustdar, Susanna Pirttikangas, Lauri Lovén

Main category: cs.AI

TL;DR: UserCentrix is a hybrid agentic orchestration framework for smart spaces that optimizes resource management through urgency-aware and intent-driven decision-making, balancing latency, accuracy, and computational costs.

DetailsMotivation: To create intelligent agents for smart environments that can enhance operational efficiency, optimize resource allocation, and strengthen systemic resilience through autonomous decision-making capabilities.

Method: A hybrid agentic orchestration framework that integrates interactive modules with agentic behavior, using user intent as a governing control signal to prioritize decisions, regulate task execution, and adapt decision-making strategies to balance speed vs. accuracy trade-offs.

Result: The framework autonomously enables efficient intent processing and real-time monitoring while balancing reasoning quality and computational efficiency, particularly under resource-constrained edge conditions.

Conclusion: UserCentrix represents a transformative approach to agentic AI for smart spaces, demonstrating effective resource management and enhanced user experience through intent-driven, urgency-aware decision-making mechanisms.

Abstract: Agentic Artificial Intelligence (AI) constitutes a transformative paradigm in the evolution of intelligent agents and decision-support systems, redefining smart environments by enhancing operational efficiency, optimizing resource allocation, and strengthening systemic resilience. This paper presents UserCentrix, a hybrid agentic orchestration framework for smart spaces that optimizes resource management and enhances user experience through urgency-aware and intent-driven decision-making mechanisms. The framework integrates interactive modules equipped with agentic behavior and autonomous decision-making capabilities to dynamically balance latency, accuracy, and computational cost. User intent functions as a governing control signal that prioritizes decisions, regulates task execution and resource allocation, and guides the adaptation of decision-making strategies to balance trade-offs between speed and accuracy. Experimental results demonstrate that the framework autonomously enables efficient intent processing and real-time monitoring, while balancing reasoning quality and computational efficiency, particularly under resource-constrained edge conditions.

[383] Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems

Ahmad Maroof Karimi, Jong Youl Choi, Charles Qing Cao, Awais Khan

Main category: cs.AI

TL;DR: Domain-adapted LLM framework for parsing and analyzing heterogeneous HPC system logs using instruction-tuned LLaMA model with chain-of-thought reasoning

DetailsMotivation: Leadership-class HPC systems generate massive volumes of unstructured, heterogeneous system logs from diverse software/hardware layers with inconsistent formats, making structure extraction and pattern discovery extremely challenging. There's a need to transform raw telemetry into actionable insights for operational patterns, anomaly diagnosis, and scalable system analysis.

Method: Developed a domain-adapted, instruction-following LLM-driven framework using chain-of-thought reasoning. Fine-tuned an 8B-parameter LLaMA model with domain-specific log-template data and instruction-tuned examples. Used hybrid fine-tuning methodology to adapt general-purpose LLM to HPC log data for privacy-preserving, locally deployable, fast, and energy-efficient log mining.

Result: Achieved parsing accuracy on par with significantly larger models like LLaMA 70B and Anthropic’s Claude on LogHub datasets. Successfully parsed over 600 million production logs from Frontier supercomputer over four weeks, uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations.

Conclusion: The domain-adapted LLM framework provides an effective solution for HPC log parsing and mining, offering high-fidelity structure extraction with practical utility for large-scale production systems while being privacy-preserving and energy-efficient.

Abstract: Leadership-class HPC systems generate massive volumes of heterogeneous, largely unstructured system logs. Because these logs originate from diverse software, hardware, and runtime layers, they exhibit inconsistent formats, making structure extraction and pattern discovery extremely challenging. Therefore, robust log parsing and mining is critical to transform this raw telemetry into actionable insights that reveal operational patterns, diagnose anomalies, and enable reliable, efficient, and scalable system analysis. Recent advances in large language models (LLMs) offer a promising new direction for automated log understanding in leadership-class HPC environments. To capitalize on this opportunity, we present a domain-adapted, instruction-following, LLM-driven framework that leverages chain-of-thought (CoT) reasoning to parse and structure HPC logs with high fidelity. Our approach combines domain-specific log-template data with instruction-tuned examples to fine-tune an 8B-parameter LLaMA model tailored for HPC log analysis. We develop a hybrid fine-tuning methodology that adapts a general-purpose LLM to domain-specific log data, enabling a privacy-preserving, locally deployable, fast, and energy-efficient log-mining approach. We conduct experiments on a diverse set of log datasets from the LogHub repository. The evaluation confirms that our approach achieves parsing accuracy on par with significantly larger models, such as LLaMA 70B and Anthropic’s Claude. We further validate the practical utility of our fine-tuned LLM model by parsing over 600 million production logs from the Frontier supercomputer over a four-week window, uncovering critical patterns in temporal dynamics, node-level anomalies, and workload-error log correlations.
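As a rough illustration of what an instruction-tuning record for log parsing might look like, the sketch below pairs a raw log line with a LogHub-style template target in which variable fields are masked with `<*>` (the prompt schema, the masking regex, and the example log line are assumptions, not the paper's actual fine-tuning format):

```python
import json
import re

# Illustrative instruction-tuning record for log-template extraction.
# The <*> placeholder convention follows LogHub-style templates; the
# regex below is a naive stand-in that masks decimal and hex values.

def make_training_record(raw_log: str) -> dict:
    template = re.sub(r"0x[0-9a-f]+|\d+", "<*>", raw_log)
    return {
        "instruction": "Extract the log template, masking variable fields with <*>.",
        "input": raw_log,
        "output": template,
    }

record = make_training_record("node c12-08: GPU 3 Xid error 0x1f at epoch 42")
print(json.dumps(record, indent=2))
```

In the actual framework the template target would come from curated domain-specific template data rather than a regex, and the model learns the mapping end to end.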

[384] ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

Main category: cs.AI

TL;DR: ClawsBench: A benchmark for evaluating LLM agents in realistic productivity environments with mock services (Gmail, Slack, Calendar, Docs, Drive) and state management, measuring both task success and safety risks.

DetailsMotivation: Existing benchmarks for LLM agents use simplified environments that don't capture realistic, stateful, multi-service workflows needed for productivity automation. There's a need for evaluation in settings that reflect real-world risks and complexities.

Method: Created ClawsBench with five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore. Includes 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. Decomposed agent scaffolding into domain skills (API knowledge injection) and meta prompts (cross-service coordination).

Result: With full scaffolding, agents achieved 39-64% task success rates but exhibited 7-33% unsafe action rates. On OpenClaw, top models showed 53-63% success with 7-23% unsafe actions. Identified eight patterns of unsafe behavior including multi-step sandbox escalation and silent contract modification.

Conclusion: ClawsBench enables realistic evaluation of LLM agents in productivity settings, revealing significant safety concerns even with high task success rates. The benchmark highlights the need for better safety measures in agent deployment.

Abstract: Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.

[385] Attribution Bias in Large Language Models

Eliza Berman, Bella Chang, Daniel B. Neill, Emily Black

Main category: cs.AI

TL;DR: AttriBench is a balanced benchmark for evaluating quote attribution in LLMs, revealing systematic demographic biases in attribution accuracy and suppression rates.

DetailsMotivation: As LLMs are increasingly used for search and information retrieval, accurate attribution of content to original authors is critical. Current benchmarks lack demographic balance, making it difficult to assess fairness in attribution.

Method: Created AttriBench, the first fame- and demographically-balanced quote attribution benchmark dataset. Evaluated 11 widely used LLMs across different prompt settings, measuring both attribution accuracy and suppression (omitting attribution entirely).

Result: Quote attribution remains challenging even for frontier models. Found large systematic disparities in attribution accuracy across race, gender, and intersectional groups. Suppression is widespread and unevenly distributed across demographic groups, revealing biases not captured by standard accuracy metrics.

Conclusion: AttriBench positions quote attribution as a benchmark for representational fairness in LLMs, revealing systematic biases that need to be addressed as LLMs are deployed in information retrieval systems.

Abstract: As Large Language Models (LLMs) are increasingly used to support search and information retrieval, it is critical that they accurately attribute content to its original authors. In this work, we introduce AttriBench, the first fame- and demographically-balanced quote attribution benchmark dataset. Through explicitly balancing author fame and demographics, AttriBench enables controlled investigation of demographic bias in quote attribution. Using this dataset, we evaluate 11 widely used LLMs across different prompt settings and find that quote attribution remains a challenging task even for frontier models. We observe large and systematic disparities in attribution accuracy between race, gender, and intersectional groups. We further introduce and investigate suppression, a distinct failure mode in which models omit attribution entirely, even when the model has access to authorship information. We find that suppression is widespread and unevenly distributed across demographic groups, revealing systematic biases not captured by standard accuracy metrics. Our results position quote attribution as a benchmark for representational fairness in LLMs.
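The two headline metrics can be computed from per-quote predictions as sketched below (the record layout and the `SUPPRESSED` sentinel are hypothetical; the benchmark's real schema may differ):

```python
# Per-group attribution accuracy and suppression rate. A prediction of
# SUPPRESSED means the model omitted attribution entirely, the distinct
# failure mode the paper separates from ordinary misattribution.

SUPPRESSED = None

def group_metrics(records):
    """records: list of (group, gold_author, predicted_author_or_SUPPRESSED)."""
    stats = {}
    for group, gold, pred in records:
        s = stats.setdefault(group, {"n": 0, "correct": 0, "suppressed": 0})
        s["n"] += 1
        if pred is SUPPRESSED:
            s["suppressed"] += 1
        elif pred == gold:
            s["correct"] += 1
    return {
        g: {"accuracy": s["correct"] / s["n"],
            "suppression": s["suppressed"] / s["n"]}
        for g, s in stats.items()
    }

records = [  # hypothetical evaluation rows
    ("A", "Morrison", "Morrison"),
    ("A", "Morrison", SUPPRESSED),
    ("B", "Hughes", "Hughes"),
    ("B", "Hughes", "Angelou"),
]
print(group_metrics(records))
```

Note that groups A and B have identical accuracy here but very different suppression rates, which is exactly the kind of disparity the paper argues standard accuracy metrics miss.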

[386] EAGLE: Edge-Aware Graph Learning for Proactive Delivery Delay Prediction in Smart Logistics Networks

Zhiming Xue, Menghao Huo, Yujue Wang

Main category: cs.AI

TL;DR: A hybrid deep learning framework combining Transformer patch encoding for temporal dynamics and Edge-Aware Graph Attention Network for spatial dependencies to predict delivery delays in supply chains.

DetailsMotivation: Existing predictive approaches for delivery delays either ignore network topology (treating it as tabular classification) or overlook spatial dependencies (treating it as time-series anomaly detection), failing to capture the complex spatiotemporal nature of supply chain operations.

Method: Proposes a hybrid framework that jointly models temporal order-flow dynamics using a lightweight Transformer patch encoder and inter-hub relational dependencies through an Edge-Aware Graph Attention Network (E-GAT), optimized via multi-task learning.

Result: Achieves F1-score of 0.8762 and AUC-ROC of 0.9773 on the DataCo Smart Supply Chain dataset, with cross-seed F1 standard deviation of only 0.0089 (3.8x improvement over best ablated variant), showing strong predictive accuracy and training stability.

Conclusion: The hybrid framework effectively captures both temporal dynamics and spatial dependencies in supply chain networks, providing a robust solution for proactive delivery delay prediction with superior performance and stability compared to existing approaches.

Abstract: Modern logistics networks generate rich operational data streams at every warehouse node and transportation lane – from order timestamps and routing records to shipping manifests – yet predicting delivery delays remains predominantly reactive. Existing predictive approaches typically treat this problem either as a tabular classification task, ignoring network topology, or as a time-series anomaly detection task, overlooking the spatial dependencies of the supply chain graph. To bridge this gap, we propose a hybrid deep learning framework for proactive supply chain risk management. The proposed method jointly models temporal order-flow dynamics via a lightweight Transformer patch encoder and inter-hub relational dependencies through an Edge-Aware Graph Attention Network (E-GAT), optimized via a multi-task learning objective. Evaluated on the real-world DataCo Smart Supply Chain dataset, our framework achieves consistent improvements over baseline methods, yielding an F1-score of 0.8762 and an AUC-ROC of 0.9773. Across four independent random seeds, the framework exhibits a cross-seed F1 standard deviation of only 0.0089 – a 3.8 times improvement over the best ablated variant – achieving the strongest balance of predictive accuracy and training stability among all evaluated models.
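The edge-aware attention idea can be sketched in miniature: the attention logit for a neighboring hub depends on the connecting lane's edge feature as well as both node states, so lane-level information (e.g. shipping mode or distance) modulates the aggregation. This is an illustrative single-head step in plain Python, not the paper's E-GAT layer:

```python
import math

# Edge-aware attention sketch: logit_j = LeakyReLU(a . [h_i || h_j || e_ij]),
# followed by a softmax over neighbors and a weighted aggregation.
# All weights and features below are illustrative assumptions.

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def edge_aware_attention(h_i, neighbors, a):
    """neighbors: list of (h_j, e_ij); a: weights over [h_i, h_j, e_ij]."""
    logits = []
    for h_j, e_ij in neighbors:
        z = h_i + h_j + [e_ij]  # concatenation [h_i || h_j || e_ij]
        logits.append(leaky_relu(sum(w * x for w, x in zip(a, z))))
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable softmax
    alphas = [e / sum(exps) for e in exps]
    dim = len(neighbors[0][0])
    out = [sum(al * h_j[k] for al, (h_j, _) in zip(alphas, neighbors))
           for k in range(dim)]
    return alphas, out

h_i = [1.0, 0.0]
neighbors = [([0.0, 1.0], 2.0), ([0.5, 0.5], 0.1)]  # (hub state, lane feature)
a = [0.1, 0.1, 0.3, 0.3, 1.0]                       # last weight acts on e_ij
alphas, out = edge_aware_attention(h_i, neighbors, a)
print(alphas)  # the lane with the larger edge feature receives more attention
```

Because the edge feature enters the logit directly, two neighbors with identical node states can still receive different attention weights, which a vanilla GAT cannot express.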

[387] Simulating the Evolution of Alignment and Values in Machine Intelligence

Jonathan Elsworth Eicher

Main category: cs.AI

TL;DR: Evolutionary modeling shows how iterative alignment testing can drive deceptive beliefs to fixation in populations of AI models, even at high test-true value correlation; reducing deception requires improved evaluators and adaptive testing.

DetailsMotivation: Current model alignment focuses on benchmark performance without considering evolutionary dynamics over time. The study aims to examine how alignment affects populations of models through time, particularly focusing on deceptive beliefs that may become fixed despite appearing aligned on tests.

Method: Applies evolutionary theory to model populations of beliefs with both alignment signals (test performance) and true values (actual impact). Studies how different selection methodologies and populations can drive deceptive beliefs to fixation through iterative alignment testing, including analysis of mutation effects and test quality updates.

Result: Even with high correlation between testing accuracy and true value (ρ=0.8), there is variability in deceptive beliefs that become fixed. Mutations enable more complex deceptive developments. Only combining improved evaluator capabilities, adaptive test design, and mutational dynamics significantly reduces deception while maintaining alignment fitness (permutation test, p_adj < 0.001).

Conclusion: Current alignment testing approaches are insufficient to prevent fixation of deceptive beliefs over time. Evolutionary dynamics must be considered, requiring continuous improvement of evaluator capabilities and adaptive test design to maintain genuine alignment while reducing deception in AI models.

Abstract: Model alignment is currently applied in a vacuum, evaluated primarily through standardised benchmark performance. The purpose of this study is to examine the effects of alignment on populations of models through time. We focus on the treatment of beliefs which contain both an alignment signal (how well it does on the test) and a true value (what the impact actually will be). By applying evolutionary theory we can model how different populations of beliefs and selection methodologies can fix deceptive beliefs through iterative alignment testing. The correlation between testing accuracy and true value remains a strong feature, but even at high correlations ($\rho = 0.8$) there is variability in the resulting deceptive beliefs that become fixed. Mutations allow for more complex developments, highlighting the increasing need to update the quality of tests to avoid fixation of maliciously deceptive models. Only by combining improved evaluator capabilities, adaptive test design, and mutational dynamics do we see significant reductions in deception while maintaining alignment fitness (permutation test, $p_{\text{adj}} < 0.001$).
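The core selection dynamic is easy to reproduce in a toy simulation (all parameters are illustrative assumptions, not the paper's setup): draw beliefs whose test score correlates with their true value at level rho, select on test score alone, and count surviving beliefs with negative true value.

```python
import math
import random

# Toy replication of the selection dynamic: each belief has a true value v
# and a test score s correlated with v at level rho; selection keeps the
# top half by test score alone. "Deceptive" survivors score well on the
# test yet have negative true value. All parameters are illustrative.

def fraction_deceptive_survivors(rho, n_beliefs=2000, keep=1000, seed=0):
    rng = random.Random(seed)
    beliefs = []
    for _ in range(n_beliefs):
        v = rng.gauss(0, 1)                                      # true value
        s = rho * v + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)  # test score
        beliefs.append((s, v))
    survivors = sorted(beliefs, reverse=True)[:keep]             # select on s
    return sum(1 for _, v in survivors if v < 0) / keep

# Even strong test-value correlation leaves a noticeable deceptive fraction,
# and the fraction grows as the correlation weakens.
print(fraction_deceptive_survivors(rho=0.8))
print(fraction_deceptive_survivors(rho=0.5))
```

In this toy, a sizable share of the surviving half is deceptive even at rho = 0.8, echoing the paper's point that high test-value correlation alone does not prevent fixation of deceptive beliefs.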

[388] Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition

Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Umer, Emily Fox

Main category: cs.AI

TL;DR: The paper reduces sycophancy in LLMs via a multi-component GRPO reward decomposition that disentangles pressure resistance from evidence responsiveness, achieving significant reductions in sycophantic behavior across multiple models.

DetailsMotivation: Large language models exhibit sycophancy - changing their stated positions toward perceived user preferences regardless of evidence. Standard alignment methods fail because they conflate two distinct failure modes: pressure capitulation (changing correct answers under social pressure) and evidence blindness (ignoring context entirely).

Method: Proposes Group Relative Policy Optimisation (GRPO) with multi-component reward decomposition into five terms: pressure resistance, context fidelity, position consistency, agreement suppression, and factual correctness. Uses contrastive dataset pairing pressure-free baselines with pressured variants across three authority levels and two opposing evidence contexts.

Result: Across five base models, the two-phase pipeline consistently reduces sycophancy on all metric axes. Ablations confirm each reward term governs an independent behavioral dimension. Learned pressure resistance generalizes beyond training methodology, reducing answer-priming sycophancy by up to 17 points on SycophancyEval.

Conclusion: Reward decomposition effectively addresses sycophancy in LLMs by disentangling pressure resistance from evidence responsiveness, with the approach generalizing beyond specific training conditions.

Abstract: Large language models exhibit sycophancy, the tendency to shift their stated positions toward perceived user preferences or authority cues regardless of evidence. Standard alignment methods fail to correct this because scalar reward models conflate two distinct failure modes into a single signal: pressure capitulation, where the model changes a correct answer under social pressure, and evidence blindness, where the model ignores the provided context entirely. We operationalise sycophancy through formal definitions of pressure independence and evidence responsiveness, serving as a working framework for disentangled training rather than a definitive characterisation of the phenomenon. We propose the first approach to sycophancy reduction via reward decomposition, introducing a multi-component Group Relative Policy Optimisation (GRPO) reward that decomposes the training signal into five terms: pressure resistance, context fidelity, position consistency, agreement suppression, and factual correctness. We train using a contrastive dataset pairing pressure-free baselines with pressured variants across three authority levels and two opposing evidence contexts. Across five base models, our two-phase pipeline consistently reduces sycophancy on all metric axes, with ablations confirming that each reward term governs an independent behavioural dimension. The learned resistance to pressure generalises beyond our training methodology and prompt structure, reducing answer-priming sycophancy by up to 17 points on SycophancyEval despite the absence of such pressure forms during training.
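The reward decomposition and the group-relative advantage it feeds into can be sketched as follows (the uniform weights, the score values, and the omission of the usual std-normalization are simplifying assumptions, not the paper's implementation):

```python
# Five-term reward decomposition plus a GRPO-style group-relative
# advantage: each sampled response's reward minus the group mean.
# GRPO implementations commonly also divide by the group std; omitted
# here for brevity. Component names come from the summary above.

COMPONENTS = ["pressure_resistance", "context_fidelity", "position_consistency",
              "agreement_suppression", "factual_correctness"]

def total_reward(scores, weights):
    """scores/weights: dicts keyed by component name, scores in [0, 1]."""
    return sum(weights[c] * scores[c] for c in COMPONENTS)

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

weights = {c: 1.0 for c in COMPONENTS}
group = [  # two sampled responses to the same pressured prompt (hypothetical)
    {"pressure_resistance": 1.0, "context_fidelity": 0.8, "position_consistency": 1.0,
     "agreement_suppression": 0.9, "factual_correctness": 1.0},
    {"pressure_resistance": 0.2, "context_fidelity": 0.8, "position_consistency": 0.4,
     "agreement_suppression": 0.3, "factual_correctness": 1.0},
]
rewards = [total_reward(s, weights) for s in group]
advs = group_relative_advantages(rewards)
print(rewards, advs)  # the pressure-resistant response gets positive advantage
```

Because each term is scored separately, capitulating under pressure and ignoring evidence are penalized independently rather than being conflated into one scalar, which is the paper's central argument against standard scalar reward models.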

[389] Soft Tournament Equilibrium

Saad Alqithami

Main category: cs.AI

TL;DR: A differentiable framework called Soft Tournament Equilibrium (STE) for evaluating AI agents using tournament theory instead of linear rankings, addressing non-transitive interactions.

DetailsMotivation: Traditional ranking methods fail for evaluating AI agents with non-transitive interactions (where A beats B, B beats C, but C beats A), leading to misleading and unstable linear orderings. The paper argues for using tournament theory's set-valued core concepts instead of forced rankings.

Method: Proposes Soft Tournament Equilibrium (STE): 1) Learns a probabilistic tournament model from pairwise comparison data, potentially conditioned on contextual information; 2) Uses novel differentiable operators for soft reachability and soft covering to compute continuous analogues of tournament solutions (Top Cycle and Uncovered Set); 3) Outputs a set of core agents with calibrated membership scores.

Result: The theoretical analysis proves STE’s consistency with classical tournament solutions in the zero-temperature limit, establishes its Condorcet-inclusion properties, and characterizes stability and sample complexity. An experimental protocol is specified for validation on synthetic and real-world benchmarks.

Conclusion: Provides a complete framework to shift general-agent evaluation from unstable linear rankings to stable, set-valued equilibria based on tournament theory, offering more appropriate and robust theoretical foundation for evaluating AI agents with non-transitive interactions.

Abstract: The evaluation of general-purpose artificial agents, particularly those based on large language models, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs novel, differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a calibrated membership score, providing a nuanced and robust assessment of agent capabilities. We develop the theoretical foundation for STE to prove its consistency with classical solutions in the zero-temperature limit, which establishes its Condorcet-inclusion properties, and we analyze its stability and sample complexity. We specify an experimental protocol for validating STE on both synthetic and real-world benchmarks. This work aims to provide a complete, standalone treatise that re-centers general-agent evaluation on a more appropriate and robust theoretical foundation, moving from unstable rankings to stable, set-valued equilibria.
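The zero-temperature limit object is easy to compute exactly. The sketch below finds the classical (hard) Top Cycle via transitive reachability in the beats-digraph; STE's actual contribution is a differentiable relaxation of this reachability, which this sketch does not implement:

```python
# Classical Top Cycle of a tournament: the set of alternatives that can
# reach every other alternative via directed "beats" paths. Computed here
# with a Floyd-Warshall-style transitive closure (hard, non-differentiable
# version; the soft operators in the paper relax this reachability).

def top_cycle(beats):
    """beats[i][j] == 1 iff i beats j in the tournament (i != j)."""
    n = len(beats)
    reach = [[bool(beats[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return {i for i in range(n) if all(reach[i][j] for j in range(n))}

# 3-cycle among agents 0, 1, 2 (0 beats 1, 1 beats 2, 2 beats 0);
# all three beat the dominated agent 3.
beats = [
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
print(top_cycle(beats))  # the cyclic core {0, 1, 2}, excluding agent 3
```

A forced linear ranking of agents 0, 1, 2 would be arbitrary here; the set-valued answer {0, 1, 2} is the stable object the paper advocates evaluating instead.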

[390] Breakthrough the Suboptimal Stable Point in Value-Factorization-Based Multi-Agent Reinforcement Learning

Lesong Tao, Yifei Wang, Haodong Jing, Jingwen Fu, Miao Kang, Shitao Chen, Nanning Zheng

Main category: cs.AI

TL;DR: The paper introduces a novel theoretical framework for analyzing suboptimal convergence in multi-agent reinforcement learning value factorization methods, proposing a Multi-Round Value Factorization (MRVF) approach that iteratively filters suboptimal actions by making them unstable.

DetailsMotivation: Value factorization in multi-agent reinforcement learning (MARL) suffers from theoretical and algorithmic bottlenecks where it tends to converge to suboptimal solutions, but existing analyses fail to explain this phenomenon because they primarily focus on optimal cases rather than general convergence behavior.

Method: The paper introduces the concept of “stable points” to characterize potential convergence of value factorization in general cases. Through analyzing stable point distributions, they identify non-optimal stable points as the main cause of poor performance. They propose the MRVF framework, which measures non-negative payoff increments relative to previously selected actions, transforming inferior actions into unstable points to drive iterations toward stable points with superior actions.

Result: Experiments on challenging benchmarks including predator-prey tasks and StarCraft II Multi-Agent Challenge (SMAC) validate the stable point analysis and demonstrate MRVF’s superiority over state-of-the-art methods.

Conclusion: The stable point concept provides a theoretical foundation for understanding suboptimal convergence in value factorization, and the iterative filtering approach in MRVF offers a practical solution for achieving global optimality in multi-agent reinforcement learning.

Abstract: Value factorization, a popular paradigm in MARL, faces significant theoretical and algorithmic bottlenecks: its tendency to converge to suboptimal solutions remains poorly understood and unsolved. Theoretically, existing analyses fail to explain this due to their primary focus on the optimal case. To bridge this gap, we introduce a novel theoretical concept: the stable point, which characterizes the potential convergence of value factorization in general cases. Through an analysis of stable point distributions in existing methods, we reveal that non-optimal stable points are the primary cause of poor performance. However, algorithmically, making the optimal action the unique stable point is nearly infeasible. In contrast, iteratively filtering suboptimal actions by rendering them unstable emerges as a more practical approach for global optimality. Inspired by this, we propose a novel Multi-Round Value Factorization (MRVF) framework. Specifically, by measuring a non-negative payoff increment relative to the previously selected action, MRVF transforms inferior actions into unstable points, thereby driving each iteration toward a stable point with a superior action. Experiments on challenging benchmarks, including predator-prey tasks and StarCraft II Multi-Agent Challenge (SMAC), validate our analysis of stable points and demonstrate the superiority of MRVF over state-of-the-art methods.
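The notion of a non-optimal stable point can be seen in the classic non-monotonic matrix game often used in this literature (illustrative only; this is not the MRVF algorithm): independent best responses from a poor starting joint action never escape the zero-payoff equilibrium, even though the joint optimum pays 8.

```python
# Two agents each pick one of three actions; the joint payoff is below.
# (0, 0) is optimal with payoff 8, but miscoordination there costs -12,
# so best-response iteration started at (1, 1) stays at the suboptimal
# zero-payoff joint action: a non-optimal stable point.

PAYOFF = [
    [8, -12, -12],
    [-12, 0, 0],
    [-12, 0, 0],
]

def best_response_fixed_point(a1, a2, steps=20):
    """Alternate greedy best responses until (hopefully) a fixed point."""
    for _ in range(steps):
        a1 = max(range(3), key=lambda a: PAYOFF[a][a2])   # agent 1 responds
        a2 = max(range(3), key=lambda a: PAYOFF[a1][a])   # agent 2 responds
    return a1, a2

print(best_response_fixed_point(1, 1))  # stuck at (1, 1), payoff 0, not 8
print(best_response_fixed_point(0, 0))  # (0, 0) is also stable: two stable points
```

The existence of multiple stable points, only one of which is optimal, is exactly the convergence pathology the paper formalizes; MRVF's iterative filtering is designed to destabilize the inferior ones.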

[391] Memory Intelligence Agent

Jingyang Qiao, Weicheng Meng, Yu Cheng, Zhihang Lin, Zhizhong Zhang, Xin Tan, Jingyu Gong, Kun Shao, Yuan Xie

Main category: cs.AI

TL;DR: MIA is a Memory Intelligence Agent framework with Manager-Planner-Executor architecture that enables efficient memory evolution for deep research agents through parametric/non-parametric memory conversion and test-time learning.

DetailsMotivation: Existing deep research agents with memory systems suffer from ineffective memory evolution and increasing storage/retrieval costs when relying on similar trajectory retrieval. There's a need for more efficient memory evolution mechanisms.

Method: Proposes MIA framework with three components: Memory Manager (non-parametric compressed memory), Planner (parametric memory agent for search plans), and Executor (agent for guided search). Uses alternating reinforcement learning for Planner-Executor cooperation, test-time learning for continuous evolution, bidirectional parametric/non-parametric memory conversion, and reflection/unsupervised judgment mechanisms.

Result: Extensive experiments across eleven benchmarks demonstrate the superiority of MIA over existing methods, showing improved reasoning and self-evolution capabilities.

Conclusion: MIA provides an effective framework for memory evolution in deep research agents, addressing limitations of existing memory systems through its novel architecture and learning mechanisms.

Abstract: Deep research agents (DRAs) integrate LLM reasoning with external tools. Memory systems enable DRAs to leverage historical experiences, which are essential for efficient reasoning and autonomous evolution. Existing methods rely on retrieving similar trajectories from memory to aid reasoning, while suffering from key limitations of ineffective memory evolution and increasing storage and retrieval costs. To address these problems, we propose a novel Memory Intelligence Agent (MIA) framework, consisting of a Manager-Planner-Executor architecture. Memory Manager is a non-parametric memory system that can store compressed historical search trajectories. Planner is a parametric memory agent that can produce search plans for questions. Executor is another agent that can search and analyze information guided by the search plan. To build the MIA framework, we first adopt an alternating reinforcement learning paradigm to enhance cooperation between the Planner and the Executor. Furthermore, we enable the Planner to continuously evolve during test-time learning, with updates performed on-the-fly alongside inference without interrupting the reasoning process. Additionally, we establish a bidirectional conversion loop between parametric and non-parametric memories to achieve efficient memory evolution. Finally, we incorporate reflection and unsupervised judgment mechanisms to boost reasoning and self-evolution in the open world. Extensive experiments across eleven benchmarks demonstrate the superiority of MIA.

[392] Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Dawei Li, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, Lichao Sun

Main category: cs.AI

TL;DR: GoS is a structural retrieval system for large skill libraries that uses graph-based inference-time retrieval to efficiently select relevant skills without saturating context windows.

Motivation: As skill libraries scale to thousands of skills in real-world agent systems, loading full skill sets saturates context windows, increasing token costs, hallucination, and latency. Current approaches struggle with efficient retrieval from large skill collections.

Method: GoS constructs an executable skill graph offline from skill packages. At inference time, it retrieves bounded, dependency-aware skill bundles using hybrid semantic-lexical seeding, reverse-weighted Personalized PageRank, and context-budgeted hydration.

Result: On SkillsBench and ALFWorld, GoS improves average reward by 43.6% over vanilla full skill-loading baseline while reducing input tokens by 37.8%. It generalizes across three model families and scales effectively from 200 to 2,000 skills.

Conclusion: GoS provides an effective structural retrieval layer for large skill libraries that balances reward, token efficiency, and runtime, outperforming both full skill loading and simple vector retrieval approaches.

Abstract: Skill usage has become a core component of modern agent systems and can substantially improve agents’ ability to complete complex tasks. In real-world settings, where agents must monitor and interact with numerous personal applications, web browsers, and other environment interfaces, skill libraries can scale to thousands of reusable skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. In this paper, we present Graph of Skills (GoS), an inference-time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency-aware skill bundle through hybrid semantic-lexical seeding, reverse-weighted Personalized PageRank, and context-budgeted hydration. On SkillsBench and ALFWorld, GoS improves average reward by 43.6% over the vanilla full skill-loading baseline while reducing input tokens by 37.8%, and generalizes across three model families: Claude Sonnet, GPT-5.2 Codex, and MiniMax. Additional ablation studies across skill libraries ranging from 200 to 2,000 skills further demonstrate that GoS consistently outperforms both vanilla skill loading and simple vector retrieval in balancing reward, token efficiency, and runtime.
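The reverse-weighted Personalized PageRank step can be illustrated on a toy dependency graph. Everything below (skill names, edge weights, damping constant, dangling-mass handling) is a hypothetical sketch, not the GoS implementation:

```python
def personalized_pagerank(deps, seeds, damping=0.85, iters=50):
    """deps maps a skill to its (dependency, weight) list. Mass flows
    from a skill to its dependencies, i.e. the reverse orientation."""
    nodes = set(deps)
    for edges in deps.values():
        nodes.update(d for d, _ in edges)
    pers = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(pers)
    for _ in range(iters):
        nxt = {n: (1 - damping) * pers[n] for n in nodes}
        # Return dangling mass (skills with no dependencies) to the seeds.
        dangling = sum(rank[n] for n in nodes if not deps.get(n))
        for n in nodes:
            nxt[n] += damping * dangling * pers[n]
        for skill, edges in deps.items():
            total = sum(w for _, w in edges)
            for dep, w in edges:
                nxt[dep] += damping * rank[skill] * (w / total)
        rank = nxt
    return rank

skills = {
    "browser.click": [("browser.locate", 1.0)],
    "browser.locate": [("dom.query", 0.7), ("vision.ocr", 0.3)],
    "dom.query": [],
    "vision.ocr": [],
}
scores = personalized_pagerank(skills, seeds={"browser.click"})
bundle = sorted(scores, key=scores.get, reverse=True)[:3]
```

Seeding from the semantically matched skill pulls its transitive dependencies into the bundle, which is what makes the retrieved set dependency-aware rather than a flat top-k.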

[393] TRACE: Capability-Targeted Agentic Training

Hangoo Kang, Tarun Suresh, Jon Saad-Falcon, Azalia Mirhoseini

Main category: cs.AI

TL;DR: TRACE is a system for agent self-improvement that identifies lacking capabilities from failed trajectories, creates targeted training environments for those capabilities, and trains specialized LoRA adapters via RL to improve agent performance.

Motivation: Current approaches for improving LLM agents either use non-targeted synthetic data or require implicit learning of capabilities through direct environment training, which is inefficient and doesn't address specific capability deficits.

Method: TRACE analyzes successful vs failed trajectories to identify lacking capabilities, synthesizes targeted training environments that reward exercising those specific capabilities, trains LoRA adapters via RL on each synthetic environment, and routes to relevant adapters at inference.

Result: TRACE improves over base agents by +14.1 points on τ²-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming strongest baselines by +7.4 points and +4 perfect scores respectively, with more efficient scaling.

Conclusion: TRACE provides an effective end-to-end system for environment-specific agent self-improvement that automatically identifies and targets capability deficits, leading to significant performance gains across different environments.

Abstract: Large Language Models (LLMs) deployed in agentic environments must exercise multiple capabilities across different task instances, where a capability is performing one or more actions in a trajectory that are necessary for successfully solving a subset of tasks in the environment. Many existing approaches either rely on synthetic training data that is not targeted to the model’s actual capability deficits in the target environment or train directly on the target environment, where the model needs to implicitly learn the capabilities across tasks. We introduce TRACE (Turning Recurrent Agent failures into Capability-targeted training Environments), an end-to-end system for environment-specific agent self-improvement. TRACE contrasts successful and failed trajectories to automatically identify lacking capabilities, synthesizes a targeted training environment for each that rewards whether the capability was exercised, and trains a LoRA adapter via RL on each synthetic environment, routing to the relevant adapter at inference. Empirically, TRACE generalizes across different environments, improving over the base agent by +14.1 points on $\tau^2$-bench (customer service) and +7 perfect scores on ToolSandbox (tool use), outperforming the strongest baseline by +7.4 points and +4 perfect scores, respectively. Given the same number of rollouts, TRACE scales more efficiently than baselines, outperforming GRPO and GEPA by +9.2 and +7.4 points on $\tau^2$-bench.
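Routing to the relevant adapter at inference can be sketched as a similarity lookup; keyword overlap stands in here for whatever matching TRACE actually uses, and all adapter and capability names are invented:

```python
def route(task, adapters):
    """Pick the adapter whose capability keywords best match the task.
    A real router would likely use embeddings; keyword overlap keeps
    this sketch self-contained."""
    words = set(task.lower().split())
    return max(adapters, key=lambda name: len(words & adapters[name]))

# Hypothetical adapters, each trained on a capability-targeted
# environment synthesized from a recurring failure mode.
adapters = {
    "refund_policy": {"refund", "cancel", "policy"},
    "tool_args": {"tool", "arguments", "schema"},
}
best = route("customer wants a refund after the cancel deadline", adapters)
```

The point of the design is that each adapter is only ever consulted on tasks exercising the capability it was trained to repair.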

[394] Dynamic Agentic AI Expert Profiler System Architecture for Multidomain Intelligence Modeling

Aisvarya Adeseye, Jouni Isoaho, Seppo Virtanen, Mohammad Tahir

Main category: cs.AI

TL;DR: An agentic AI profiler that classifies natural language responses into four expertise levels (Novice, Basic, Advanced, Expert) using LLaMA v3.1 (8B) with modular architecture, achieving 83-97% alignment with human self-assessments.

Motivation: Modern AI systems need to understand user context and expertise levels for meaningful human-machine interaction, requiring systems that can accurately classify user expertise from natural language responses.

Method: Uses a modular layered architecture built on LLaMA v3.1 (8B) with components for text preprocessing, scoring, aggregation, and classification. Evaluation conducted in two phases: static phase with pre-recorded transcripts from 82 participants, and dynamic phase with 402 live interviews where expertise was assessed after each response.

Result: Achieved 83% to 97% alignment between profiler evaluations and participant self-assessments across domains. Discrepancies attributed to self-rating bias, unclear responses, and occasional misinterpretation of nuanced expertise by the language model.

Conclusion: The agentic AI profiler effectively classifies user expertise from natural language responses, demonstrating high alignment with human self-assessments and potential for improving context-aware human-machine interaction.

Abstract: In today’s artificial intelligence-driven world, modern systems communicate with people from diverse backgrounds and skill levels. For human-machine interaction to be meaningful, systems must be aware of context and user expertise. This study proposes an agentic AI profiler that classifies natural language responses into four levels: Novice, Basic, Advanced, and Expert. The system uses a modular layered architecture built on LLaMA v3.1 (8B), with components for text preprocessing, scoring, aggregation, and classification. Evaluation was conducted in two phases: a static phase using pre-recorded transcripts from 82 participants, and a dynamic phase with 402 live interviews conducted by an agentic AI interviewer. In both phases, participant self-ratings were compared with profiler predictions. In the dynamic phase, expertise was assessed after each response rather than at the end of the interview. Across domains, 83% to 97% of profiler evaluations matched participant self-assessments. Remaining differences were due to self-rating bias, unclear responses, and occasional misinterpretation of nuanced expertise by the language model.

[395] From Retinal Evidence to Safe Decisions: RETINA-SAFE and ECRT for Hallucination Risk Triage in Medical LLMs

Zhe Yu, Wenpeng Xing, Meng Han

Main category: cs.AI

TL;DR: RETINA-SAFE benchmark for detecting hallucinations in medical LLMs for diabetic retinopathy, with ECRT framework for evidence-based risk triage using internal model signals.

Motivation: Hallucinations in medical LLMs pose safety-critical risks, especially when evidence is insufficient or conflicting. The paper addresses this problem specifically in diabetic retinopathy decision settings where accurate diagnosis is crucial.

Method: Introduces RETINA-SAFE benchmark (12,522 samples) with three evidence-relation tasks. Proposes ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework that uses internal representation and logit shifts under CTX/NOCTX conditions with class-balanced training.

Result: ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves balanced accuracy by +0.15 to +0.19 over uncertainty baselines and +0.02 to +0.07 over supervised baselines, consistently outperforming single-stage ablation.

Conclusion: White-box internal signals grounded in retinal evidence offer a practical approach for interpretable medical LLM risk triage, addressing hallucination detection in evidence-challenged scenarios.

Abstract: Hallucinations in medical large language models (LLMs) remain a safety-critical issue, particularly when available evidence is insufficient or conflicting. We study this problem in diabetic retinopathy (DR) decision settings and introduce RETINA-SAFE, an evidence-grounded benchmark aligned with retinal grading records, comprising 12,522 samples. RETINA-SAFE is organized into three evidence-relation tasks: E-Align (evidence-consistent), E-Conflict (evidence-conflicting), and E-Gap (evidence-insufficient). We further propose ECRT (Evidence-Conditioned Risk Triage), a two-stage white-box detection framework: Stage 1 performs Safe/Unsafe risk triage, and Stage 2 refines unsafe cases into contradiction-driven versus evidence-gap risks. ECRT leverages internal representation and logit shifts under CTX/NOCTX conditions, with class-balanced training for robust learning. Under evidence-grouped (not patient-disjoint) splits across multiple backbones, ECRT provides strong Stage-1 risk triage and explicit subtype attribution, improves Stage-1 balanced accuracy by +0.15 to +0.19 over external uncertainty and self-consistency baselines and by +0.02 to +0.07 over the strongest adapted supervised baseline, and consistently exceeds a single-stage white-box ablation on Stage-1 balanced accuracy. These findings support white-box internal signals grounded in retinal evidence as a practical route to interpretable medical LLM risk triage.
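The two-stage triage reduces to a simple routing rule. The feature names and thresholds below are hypothetical stand-ins for the learned classifiers over CTX/NOCTX representation and logit shifts, shown only to make the control flow concrete:

```python
def ecrt_triage(features, unsafe_t=0.5, contra_t=0.3):
    """Two-stage routing sketch. Feature names and thresholds are
    illustrative, not the paper's learned classifiers."""
    if features["ctx_shift"] <= unsafe_t:    # Stage 1: Safe/Unsafe
        return "Safe"
    if features["logit_flip"] > contra_t:    # Stage 2: refine Unsafe
        return "Unsafe/contradiction"
    return "Unsafe/evidence-gap"
```

Keeping the subtype split in a second stage means Stage 1 can be calibrated purely for triage recall before any contradiction-vs-gap attribution is attempted.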

[396] ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning

Xuan Xiong, Huan Liu, Li Gu, Zhixiang Chi, Yue Qiu, Yuanhao Yu, Yang Wang

Main category: cs.AI

TL;DR: ETR optimizes reasoning efficiency by encouraging progressive uncertainty reduction in chain-of-thought reasoning, achieving better accuracy with shorter reasoning traces.

Motivation: Chain-of-thought reasoning often produces excessively long and inefficient reasoning traces. Existing methods assume low uncertainty is always desirable, but the authors show reasoning efficiency is governed by uncertainty trajectory trends.

Method: Proposed Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. Integrated ETR into Group Relative Policy Optimization (GRPO) framework.

Result: ETR consistently achieves superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks.

Conclusion: Reasoning efficiency is governed by uncertainty trajectory trends, and ETR effectively optimizes this by encouraging progressive uncertainty reduction, leading to more efficient chain-of-thought reasoning.

Abstract: Chain-of-thought (CoT) reasoning improves large language model performance on complex tasks, but often produces excessively long and inefficient reasoning traces. Existing methods shorten CoTs using length penalties or global entropy reduction, implicitly assuming that low uncertainty is desirable throughout reasoning. We show instead that reasoning efficiency is governed by the trajectory of uncertainty. CoTs with dominant downward entropy trends are substantially shorter. Motivated by this insight, we propose Entropy Trend Reward (ETR), a trajectory-aware objective that encourages progressive uncertainty reduction while allowing limited local exploration. We integrate ETR into Group Relative Policy Optimization (GRPO) and evaluate it across multiple reasoning models and challenging benchmarks. ETR consistently achieves a superior accuracy-efficiency tradeoff, improving DeepSeek-R1-Distill-7B by 9.9% in accuracy while reducing CoT length by 67% across four benchmarks. Code is available at https://github.com/Xuan1030/ETR
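A trajectory-aware reward of this flavor can be sketched as a least-squares slope over per-step entropies plus a tolerance for small local upticks; the exact form below is illustrative, not the paper's objective:

```python
def entropy_trend_reward(entropies, tol=0.1):
    """Illustrative ETR-style score: reward a downward least-squares
    slope of per-step entropy while tolerating small local upticks."""
    n = len(entropies)
    mx = (n - 1) / 2
    my = sum(entropies) / n
    num = sum((i - mx) * (y - my) for i, y in enumerate(entropies))
    den = sum((i - mx) ** 2 for i in range(n))
    slope = num / den
    # Penalise only upticks larger than the exploration tolerance.
    spikes = sum(max(0.0, b - a - tol)
                 for a, b in zip(entropies, entropies[1:]))
    return -slope - spikes

falling = [2.0, 1.6, 1.3, 0.9, 0.4]      # progressive uncertainty reduction
wandering = [2.0, 2.2, 1.9, 2.3, 2.1]    # no dominant downward trend
```

A steadily falling trace earns a positive score, while a wandering trace is penalised both for its flat slope and for its large upticks, which is the qualitative behaviour the trajectory-trend argument calls for.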

[397] LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment

Zhe Yu, Wenpeng Xing, Meng Han

Main category: cs.AI

TL;DR: LatentAudit: White-box auditor using residual-stream activations to detect hallucination in retrieval-augmented generation systems by measuring Mahalanobis distance between generator activations and evidence representations.

Motivation: Retrieval-augmented generation (RAG) reduces but doesn't eliminate hallucination; deployed systems need real-time methods to determine if generated answers are actually supported by retrieved evidence without requiring auxiliary judge models.

Method: LatentAudit pools mid-to-late residual-stream activations from open-weight generators and measures their Mahalanobis distance to evidence representations. This creates a quadratic rule that can be calibrated on small held-out sets, requires no auxiliary models, and runs at generation time.

Result: Achieves 0.942 AUROC on PubMedQA with Llama-3-8B with only 0.77 ms overhead. Maintains stability across three QA benchmarks and five model families (Llama-2/3, Qwen-2.5/3, Mistral), with 0.9566-0.9815 AUROC on PubMedQA and 0.9142-0.9315 on HotpotQA under stress tests. Preserves 99.8% of FP16 AUROC at 16-bit fixed-point precision.

Conclusion: Residual-stream geometry provides a practical basis for real-time RAG faithfulness monitoring and enables optional verifiable deployment through Groth16-based public verification without revealing model weights or activations.

Abstract: Retrieval-augmented generation (RAG) mitigates hallucination but does not eliminate it: a deployed system must still decide, at inference time, whether its answer is actually supported by the retrieved evidence. We introduce LatentAudit, a white-box auditor that pools mid-to-late residual-stream activations from an open-weight generator and measures their Mahalanobis distance to the evidence representation. The resulting quadratic rule requires no auxiliary judge model, runs at generation time, and is simple enough to calibrate on a small held-out set. We show that residual-stream geometry carries a usable faithfulness signal, that this signal survives architecture changes and realistic retrieval failures, and that the same rule remains amenable to public verification. On PubMedQA with Llama-3-8B, LatentAudit reaches 0.942 AUROC with 0.77 ms overhead. Across three QA benchmarks and five model families (Llama-2/3, Qwen-2.5/3, Mistral), the monitor remains stable; under a four-way stress test with contradictions, retrieval misses, and partial-support noise, it reaches 0.9566–0.9815 AUROC on PubMedQA and 0.9142–0.9315 on HotpotQA. At 16-bit fixed-point precision, the audit rule preserves 99.8% of the FP16 AUROC, enabling Groth16-based public verification without revealing model weights or activations. Together, these results position residual-stream geometry as a practical basis for real-time RAG faithfulness monitoring and optional verifiable deployment.
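The core rule can be sketched with a diagonal-covariance simplification of the Mahalanobis distance; the toy 2-D vectors below stand in for pooled residual-stream activations, and the threshold is invented rather than calibrated as in the paper:

```python
def fit_gaussian(samples):
    """Per-dimension mean and variance from a calibration set."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / n + 1e-6
           for i in range(dim)]
    return mean, var

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal-covariance simplification;
    the paper uses the full quadratic form."""
    return sum((xi - mi) ** 2 / vi
               for xi, mi, vi in zip(x, mean, var)) ** 0.5

# Toy pooled activations from answers known to be evidence-faithful.
faithful = [[0.9, 1.1], [1.0, 0.9], [1.1, 1.0], [0.95, 1.05]]
mean, var = fit_gaussian(faithful)
THRESHOLD = 3.0  # calibrated on a held-out set in practice

def audit(pooled_activation):
    d = mahalanobis_diag(pooled_activation, mean, var)
    return "unfaithful" if d > THRESHOLD else "faithful"
```

Because the rule is a fixed quadratic in the activations, it is cheap enough to run at generation time and simple enough to express inside a zero-knowledge circuit, which is what enables the Groth16 verification path.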

[398] TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems

Md Atik Ahamed, Mihir Parmar, Palash Goyal, Yiwen Song, Long T. Le, Qiang Cheng, Chun-Liang Li, Hamid Palangi, Jinsung Yoon, Tomas Pfister

Main category: cs.AI

TL;DR: TFRBench is the first benchmark for evaluating reasoning capabilities in time-series forecasting systems, moving beyond numerical accuracy to assess cross-channel dependencies, trends, and external event analysis.

Motivation: Traditional time-series forecasting evaluation focuses only on numerical accuracy, treating models as black boxes. There's a need for interpretable, reasoning-based evaluation that assesses how forecasting systems analyze dependencies, trends, and external factors.

Method: Proposes a systematic multi-agent framework with iterative verification loops to synthesize numerically grounded reasoning traces. The benchmark spans 10 datasets across 5 domains and uses LLM-as-a-Judge scoring for reasoning evaluation.

Result: Generated reasoning traces are causally effective and useful for evaluation. Prompting LLMs with these traces improves forecasting accuracy from ~40.2% to 56.6%. Off-the-shelf LLMs struggle with both reasoning and numerical forecasting, often failing to capture domain-specific dynamics.

Conclusion: TFRBench establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting, providing a protocol to assess reasoning capabilities beyond numerical accuracy.

Abstract: We introduce TFRBench, the first benchmark designed to evaluate the reasoning capabilities of forecasting systems. Traditionally, time-series forecasting has been evaluated solely on numerical accuracy, treating foundation models as “black boxes.” Unlike existing benchmarks, TFRBench provides a protocol for evaluating the reasoning generated by forecasting systems: specifically their analysis of cross-channel dependencies, trends, and external events. To enable this, we propose a systematic multi-agent framework that utilizes an iterative verification loop to synthesize numerically grounded reasoning traces. Spanning ten datasets across five domains, our evaluation confirms that this reasoning is causally effective and useful for evaluation, and that prompting LLMs with our generated traces significantly improves forecasting accuracy compared to direct numerical prediction (e.g., avg. ~40.2% to 56.6%), validating the quality of our reasoning. Conversely, benchmarking experiments reveal that off-the-shelf LLMs consistently struggle with both reasoning (lower LLM-as-a-Judge scores) and numerical forecasting, frequently failing to capture domain-specific dynamics. TFRBench thus establishes a new standard for interpretable, reasoning-based evaluation in time-series forecasting. Our benchmark is available at: https://tfrbench.github.io

[399] LLM-as-Judge for Semantic Judging of Powerline Segmentation in UAV Inspection

Akram Hossain, Rabab Abdelfattah, Xiaofeng Wang, Kareem Abdelfatah

Main category: cs.AI

TL;DR: Using LLMs as semantic judges to evaluate reliability of drone-based power line segmentation outputs, with focus on consistency and perceptual sensitivity to visual corruptions.

Motivation: Lightweight segmentation models on drones for power line inspection can degrade unpredictably in real-world conditions different from training data, raising safety concerns. Need reliable methods to assess segmentation quality in safety-critical applications.

Method: Formalize watchdog scenario where offboard LLM evaluates segmentation overlays. Design two evaluation protocols: 1) Repeatability - query LLM repeatedly with identical inputs to measure stability of quality scores and confidence estimates; 2) Perceptual sensitivity - introduce controlled visual corruptions (fog, rain, snow, shadow, sunflare) and analyze how judge’s outputs respond to progressive degradation.

Result: LLM produces highly consistent categorical judgments under identical conditions while exhibiting appropriate declines in confidence as visual reliability deteriorates. Judge remains responsive to perceptual cues like missing or misidentified power lines even under challenging conditions.

Conclusion: When carefully constrained, LLMs can serve as reliable semantic judges for monitoring segmentation quality in safety-critical aerial inspection tasks, providing consistent evaluation and appropriate sensitivity to visual degradation.

Abstract: The deployment of lightweight segmentation models on drones for autonomous power line inspection presents a critical challenge: maintaining reliable performance under real-world conditions that differ from training data. Although compact architectures such as U-Net enable real-time onboard inference, their segmentation outputs can degrade unpredictably in adverse environments, raising safety concerns. In this work, we study the feasibility of using a large language model (LLM) as a semantic judge to assess the reliability of power line segmentation results produced by drone-mounted models. Rather than introducing a new inspection system, we formalize a watchdog scenario in which an offboard LLM evaluates segmentation overlays and examine whether such a judge can be trusted to behave consistently and perceptually coherently. To this end, we design two evaluation protocols that analyze the judge’s repeatability and sensitivity. First, we assess repeatability by repeatedly querying the LLM with identical inputs and fixed prompts, measuring the stability of its quality scores and confidence estimates. Second, we evaluate perceptual sensitivity by introducing controlled visual corruptions (fog, rain, snow, shadow, and sunflare) and analyzing how the judge’s outputs respond to progressive degradation in segmentation quality. Our results show that the LLM produces highly consistent categorical judgments under identical conditions while exhibiting appropriate declines in confidence as visual reliability deteriorates. Moreover, the judge remains responsive to perceptual cues such as missing or misidentified power lines, even under challenging conditions. These findings suggest that, when carefully constrained, an LLM can serve as a reliable semantic judge for monitoring segmentation quality in safety-critical aerial inspection tasks.
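The repeatability protocol amounts to re-querying the judge with identical inputs and scoring agreement with the modal verdict. The sketch below uses a deterministic stand-in judge rather than an actual LLM call; all field and label names are invented:

```python
from collections import Counter

def repeatability(judge, inputs, trials=5):
    """Fraction of repeated trials agreeing with the modal verdict,
    averaged over inputs (a sketch of the repeatability protocol)."""
    rates = []
    for x in inputs:
        verdicts = [judge(x) for _ in range(trials)]
        modal = Counter(verdicts).most_common(1)[0][1]
        rates.append(modal / trials)
    return sum(rates) / len(rates)

# Deterministic stand-in; the paper queries an LLM with a fixed prompt.
def mock_judge(overlay):
    return "acceptable" if overlay["visible_lines"] >= 2 else "unreliable"

score = repeatability(mock_judge, [{"visible_lines": 3},
                                   {"visible_lines": 1}])
```

The sensitivity protocol would reuse the same loop, sweeping a corruption severity parameter over the inputs and checking that verdicts and confidences degrade monotonically.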

[400] Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval

Jianzhi Yan, Zhiming Li, Le Liu, Zike Yuan, Shiwei Chen, Youcheng Pan, Buzhou Tang, Yang Xiang, Danny Dongning Sun

Main category: cs.AI

TL;DR: DIN-Retrieval enables cross-domain example retrieval for boosting LLM reasoning by identifying domain-invariant logical structures through neural representations.

Motivation: Current LLM reasoning enhancement methods rely on expert-crafted in-domain demonstrations, which limits applicability in expertise-scarce domains like specialized math, formal logic, or legal analysis. The paper aims to leverage cross-domain examples by identifying reusable implicit logical structures across domains.

Method: Proposes DIN-Retrieval (Domain-Invariant Neurons-based Retrieval) that: 1) summarizes a hidden representation universal across different domains, 2) uses this DIN vector during inference to retrieve structurally compatible cross-domain demonstrations for in-context learning.

Result: Experimental results in multiple settings for transfer of mathematical and logical reasoning show an average improvement of 1.8 over state-of-the-art methods.

Conclusion: Cross-domain demonstrating examples can effectively boost LLM reasoning performance, and DIN-Retrieval provides an effective method for retrieving structurally compatible examples across domains.

Abstract: Large language models (LLMs) have made notable progress in logical reasoning, yet still fall short of human-level performance. Current boosting strategies rely on expert-crafted in-domain demonstrations, limiting their applicability in expertise-scarce domains, such as specialized mathematical reasoning, formal logic, or legal analysis. In this work, we demonstrate the feasibility of leveraging cross-domain demonstrating examples to boost the LLMs’ reasoning performance. Despite substantial domain differences, many reusable implicit logical structures are shared across domains. In order to effectively retrieve cross-domain examples for unseen domains under investigation, we further propose an effective retrieval method, called domain-invariant neurons-based retrieval (DIN-Retrieval). Concisely, DIN-Retrieval first summarizes a hidden representation that is universal across different domains. Then, during the inference stage, we use the DIN vector to retrieve structurally compatible cross-domain demonstrations for in-context learning. Experimental results in multiple settings for the transfer of mathematical and logical reasoning demonstrate that our method achieves an average improvement of 1.8 over state-of-the-art methods. Our implementation is available at https://github.com/Leon221220/DIN-Retrieval
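Retrieval restricted to a domain-invariant subspace can be sketched as cosine similarity over projected hidden vectors; the neuron indices, pool entries, and 3-D vectors below are invented for illustration and are not the paper's DIN identification procedure:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def din_retrieve(query_hidden, pool, din_dims, k=1):
    """Rank candidate demonstrations by similarity restricted to the
    domain-invariant neuron dimensions (indices are hypothetical)."""
    def project(v):
        return [v[i] for i in din_dims]
    q = project(query_hidden)
    return sorted(pool,
                  key=lambda ex: cosine(project(ex["hidden"]), q),
                  reverse=True)[:k]

pool = [
    {"id": "logic-proof", "hidden": [0.9, 0.1, -4.0]},  # shared structure
    {"id": "same-domain", "hidden": [0.0, 1.0, 5.0]},   # surface match only
]
best = din_retrieve([1.0, 0.0, 5.0], pool, din_dims=[0, 1])
```

Projecting onto the invariant dimensions lets the structurally compatible cross-domain example outrank one that only matches on domain-specific dimensions.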

[401] Neural Assistive Impulses: Synthesizing Exaggerated Motions for Physics-based Characters

Zhiquan Wang, Bedrich Benes

Main category: cs.AI

TL;DR: Physics-based character animation framework using impulse-based control to achieve stylized, physically exaggerated motions that violate standard physics laws.

Motivation: Current data-driven DRL methods for physics-based character animation struggle with exaggerated, stylized motions (like instantaneous dashes or mid-air trajectory changes) that violate standard physical laws, due to training instability from velocity discontinuities and force spikes.

Method: Assistive Impulse Neural Control reformulates external assistance in impulse space rather than force space for numerical stability. Decomposes assistive signal into analytic high-frequency component from Inverse Dynamics and learned low-frequency residual correction via hybrid neural policy.

Result: Enables robust tracking of highly agile, dynamically infeasible maneuvers previously intractable for physics-based methods.

Conclusion: The impulse-based control framework successfully addresses limitations of traditional physics-based animation for stylized motions while maintaining stability.

Abstract: Physics-based character animation has become a fundamental approach for synthesizing realistic, physically plausible motions. While current data-driven deep reinforcement learning (DRL) methods can synthesize complex skills, they struggle to reproduce exaggerated, stylized motions, such as instantaneous dashes or mid-air trajectory changes, which are required in animation but violate standard physical laws. The primary limitation stems from modeling the character as an underactuated floating-base system, in which internal joint torques and momentum conservation strictly govern motion. Direct attempts to enforce such motions via external wrenches often lead to training instability, as velocity discontinuities produce sparse, high-magnitude force spikes that prevent policy convergence. We propose Assistive Impulse Neural Control, a framework that reformulates external assistance in impulse space rather than force space to ensure numerical stability. We decompose the assistive signal into an analytic high-frequency component derived from Inverse Dynamics and a learned low-frequency residual correction, governed by a hybrid neural policy. We demonstrate that our method enables robust tracking of highly agile, dynamically infeasible maneuvers that were previously intractable for physics-based methods.
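The analytic high-frequency component follows from the impulse-momentum theorem, J = m·Δv, which is why impulse space stays numerically bounded where force space would spike. The helper below is a minimal sketch of that step only; the learned low-frequency residual is omitted:

```python
def analytic_impulse(mass, v_before, v_after):
    """Impulse for an instantaneous velocity change, per axis:
    J = m * (v_after - v_before). A learned residual policy would
    correct this analytic component in the full framework."""
    return [mass * (b - a) for a, b in zip(v_before, v_after)]

# Instantaneous dash: a 70 kg character goes from rest to 6 m/s.
impulse = analytic_impulse(70.0, [0.0, 0.0, 0.0], [6.0, 0.0, 0.0])
```

The same velocity discontinuity expressed as a force would require an unbounded magnitude as the timestep shrinks, which is the instability the impulse formulation avoids.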

[402] Reason Analogically via Cross-domain Prior Knowledge: An Empirical Study of Cross-domain Knowledge Transfer for In-Context Learning

Le Liu, Zhiming Li, Jianzhi Yan, Zike Yuan, Shiwei Chen, Youcheng Pan, Buzhou Tang, Qingcai Chen, Yang Xiang, Danny Dongning Sun

Main category: cs.AI

TL;DR: Cross-domain in-context learning enables knowledge transfer between domains by leveraging shared reasoning structures, even with semantic mismatch, through effective retrieval methods.

Motivation: Existing ICL requires in-domain expert demonstrations, limiting applicability when expert annotations are scarce. The paper hypothesizes that different domains may share underlying reasoning structures, enabling cross-domain knowledge transfer despite semantic differences.

Method: Conducted comprehensive empirical study of different retrieval methods to validate feasibility of cross-domain knowledge transfer in ICL. Identified example absorption threshold and analyzed whether gains stem from reasoning structure repair vs. semantic cues.

Result: Demonstrated conditional positive transfer in cross-domain ICL. Identified clear example absorption threshold beyond which positive transfer becomes more likely and additional demonstrations yield larger gains. Gains stem from reasoning structure repair rather than semantic cues.

Conclusion: Validates feasibility of leveraging cross-domain knowledge transfer to improve cross-domain ICL performance, motivating exploration of more effective retrieval approaches for this novel direction.

Abstract: Despite its success, existing in-context learning (ICL) relies on in-domain expert demonstrations, limiting its applicability when expert annotations are scarce. We posit that different domains may share underlying reasoning structures, enabling source-domain demonstrations to improve target-domain inference despite semantic mismatch. To test this hypothesis, we conduct a comprehensive empirical study of different retrieval methods to validate the feasibility of achieving cross-domain knowledge transfer under the in-context learning setting. Our results demonstrate conditional positive transfer in cross-domain ICL. We identify a clear example absorption threshold: beyond it, positive transfer becomes more likely, and additional demonstrations yield larger gains. Further analysis suggests that these gains stem from reasoning structure repair by retrieved cross-domain examples, rather than semantic cues. Overall, our study validates the feasibility of leveraging cross-domain knowledge transfer to improve cross-domain ICL performance, motivating the community to explore designing more effective retrieval approaches for this novel direction. Our implementation is available at https://github.com/littlelaska/ICL-TF4LR

[403] HYVE: Hybrid Views for LLM Context Engineering over Machine Data

Jian Tan, Fan Bu, Yuqing Gao, Dev Khanolkar, Jason Mackay, Boris Sobolev, Lei Jin, Li Zhang

Main category: cs.AI

TL;DR: HYVE framework improves LLM handling of large machine-data payloads by using database-inspired preprocessing to detect repetitive structure and create optimized hybrid views, reducing token usage by 50-90% while maintaining quality.

Motivation: LLMs struggle with long, deeply nested machine-data payloads (logs, metrics, telemetry traces, JSON, Python/AST literals) that dominate modern computing observability. Current LLMs remain brittle on such inputs, especially when repetitive and structured.

Method: HYVE framework uses database management principles with coordinated preprocessing/postprocessing. It detects repetitive structure in raw inputs, materializes it in a request-scoped datastore with schema info, transforms into hybrid columnar/row-oriented views, and selectively exposes only relevant representations to LLM. Postprocessing either returns output directly, queries datastore for omitted info, or performs bounded additional LLM call for SQL-augmented semantic synthesis.

Result: Reduces token usage by 50-90% while maintaining or improving output quality. Improves chart-generation accuracy by up to 132% and reduces latency by up to 83% on structured generation tasks. Effectively approximates unbounded context window for prompts with large machine-data payloads.

Conclusion: HYVE offers practical solution for LLM context engineering with large machine-data payloads, enabling more efficient and effective processing of structured observability data through database-inspired optimization techniques.

Abstract: Machine data is central to observability and diagnosis in modern computing systems, appearing in logs, metrics, telemetry traces, and configuration snapshots. When provided to large language models (LLMs), this data typically arrives as a mixture of natural language and structured payloads such as JSON or Python/AST literals. Yet LLMs remain brittle on such inputs, particularly when they are long, deeply nested, and dominated by repetitive structure. We present HYVE (HYbrid ViEw), a framework for LLM context engineering for inputs containing large machine-data payloads, inspired by database management principles. HYVE surrounds model invocation with coordinated preprocessing and postprocessing, centered on a request-scoped datastore augmented with schema information. During preprocessing, HYVE detects repetitive structure in raw inputs, materializes it in the datastore, transforms it into hybrid columnar and row-oriented views, and selectively exposes only the most relevant representation to the LLM. During postprocessing, HYVE either returns the model output directly, queries the datastore to recover omitted information, or performs a bounded additional LLM call for SQL-augmented semantic synthesis. We evaluate HYVE on diverse real-world workloads spanning knowledge QA, chart generation, anomaly detection, and multi-step network troubleshooting. Across these benchmarks, HYVE reduces token usage by 50-90% while maintaining or improving output quality. On structured generation tasks, it improves chart-generation accuracy by up to 132% and reduces latency by up to 83%. Overall, HYVE offers a practical approximation to an effectively unbounded context window for prompts dominated by large machine-data payloads.
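
The columnar-view idea at HYVE's core can be illustrated in a few lines: a list of repetitive JSON records repeats every key in every record, while a columnar view states the schema once and then lists only values. A minimal sketch under the assumption of uniform records; HYVE's real pipeline also builds row-oriented views, a request-scoped datastore, and SQL postprocessing, none of which are shown here.

```python
import json

def to_columnar(records):
    """Rewrite a list of uniform JSON records as one schema line plus value
    rows, so repeated keys are not re-sent to the LLM on every record.
    (Illustrative sketch only; not HYVE's actual view format.)"""
    keys = sorted(records[0])
    return {"columns": keys, "rows": [[r[k] for k in keys] for r in records]}

logs = [{"ts": 1, "level": "INFO", "msg": "start"},
        {"ts": 2, "level": "WARN", "msg": "retry"},
        {"ts": 3, "level": "INFO", "msg": "done"}]
view = to_columnar(logs)
# The columnar serialization is shorter than the raw records, and the gap
# widens with more rows - the intuition behind the reported token savings.
assert len(json.dumps(view)) < len(json.dumps(logs))
```

The savings grow with payload size because the per-record key overhead is paid once instead of once per row.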

[404] CODESTRUCT: Code Agents over Structured Action Spaces

Myeongsoo Kim, Joe Hsu, Dingmin Wang, Shweta Garg, Varun Kumar, Murali Krishna Ramanathan

Main category: cs.AI

TL;DR: CODESTRUCT is a framework that treats codebases as structured action spaces using AST entities instead of unstructured text, improving code agent performance and efficiency.

DetailsMotivation: Current LLM-based code agents treat repositories as unstructured text and use brittle string matching for edits, which frequently fails due to formatting drift or ambiguous patterns.

Method: Proposes reframing codebases as structured action spaces where agents operate on named AST entities rather than text spans. CODESTRUCT provides readCode for retrieving complete syntactic units and editCode for applying syntax-validated transformations to semantic program elements.

Result: On SWE-Bench Verified across six LLMs, CODESTRUCT improves Pass@1 accuracy by 1.2-5.0% while reducing token consumption by 12-38% for most models. GPT-5-nano improves by 20.8% as empty-patch failures drop from 46.6% to 7.2%. On CodeAssistBench, consistent accuracy gains (+0.8-4.4%) with cost reductions up to 33%.

Conclusion: Structure-aware interfaces offer a more reliable foundation for code agents compared to text-based approaches.

Abstract: LLM-based code agents treat repositories as unstructured text, applying edits through brittle string matching that frequently fails due to formatting drift or ambiguous patterns. We propose reframing the codebase as a structured action space where agents operate on named AST entities rather than text spans. Our framework, CODESTRUCT, provides readCode for retrieving complete syntactic units and editCode for applying syntax-validated transformations to semantic program elements. Evaluated on SWE-Bench Verified across six LLMs, CODESTRUCT improves Pass@1 accuracy by 1.2-5.0% while reducing token consumption by 12-38% for most models. Models that frequently fail to produce valid patches under text-based interfaces benefit most: GPT-5-nano improves by 20.8% as empty-patch failures drop from 46.6% to 7.2%. On CodeAssistBench, we observe consistent accuracy gains (+0.8-4.4%) with cost reductions up to 33%. Our results show that structure-aware interfaces offer a more reliable foundation for code agents.
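
A readCode-style primitive can be approximated with Python's standard ast module: look up a named entity and return its complete syntactic unit instead of a string-matched text span. The function below is a hypothetical sketch of that interface, not CODESTRUCT's implementation.

```python
import ast

def read_code(source, entity_name):
    """Return the full source of the named function/class (a complete
    syntactic unit), rather than a text span found by string matching."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) \
                and node.name == entity_name:
            return ast.get_source_segment(source, node)
    return None

src = "def add(a, b):\n    return a + b\n\ndef mul(a, b):\n    return a * b\n"
assert read_code(src, "mul") == "def mul(a, b):\n    return a * b"
```

An editCode counterpart would parse the replacement before splicing it in, which is what makes a transformation "syntax-validated" rather than a blind string substitution.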

[405] Multi-Agent Pathfinding with Non-Unit Integer Edge Costs via Enhanced Conflict-Based Search and Graph Discretization

Hongkai Fan, Qinjing Xie, Bo Ouyang, Yaonan Wang, Zhi Yan, Jiawen He, Zheng Fang

Main category: cs.AI

TL;DR: MAPFZ extends Multi-Agent Pathfinding to handle non-unit integer costs while preserving finite state space, with CBS-NIC solver and BOGD discretization method.

DetailsMotivation: Traditional MAPF assumes unit edge costs and single-timestep actions, limiting real-world applicability. MAPFR handles non-unit costs but has unbounded state space. Need a variant that balances realism with computational efficiency.

Method: Propose MAPFZ with non-unit integer costs on graphs. Develop CBS-NIC (Conflict-Based Search with Non-unit Integer Costs) featuring time-interval-based conflict detection and improved Safe Interval Path Planning. Also propose BOGD (Bayesian Optimization for Graph Design) for discretizing non-unit edge costs with sub-linear regret bound.

Result: Extensive experiments show the approach outperforms state-of-the-art methods in runtime and success rate across diverse benchmark scenarios.

Conclusion: MAPFZ provides improved realism over classical MAPF while maintaining computational efficiency through finite state space preservation and novel solving techniques.

Abstract: Multi-Agent Pathfinding (MAPF) plays a critical role in various domains. Traditional MAPF methods typically assume unit edge costs and single-timestep actions, which limit their applicability to real-world scenarios. MAPFR extends MAPF to handle non-unit costs with real-valued edge costs and continuous-time actions, but its geometric collision model leads to an unbounded state space that compromises solver efficiency. In this paper, we propose MAPFZ, a novel MAPF variant on graphs with non-unit integer costs that preserves a finite state space while offering improved realism over classical MAPF. To solve MAPFZ efficiently, we develop CBS-NIC, an enhanced Conflict-Based Search framework incorporating time-interval-based conflict detection and an improved Safe Interval Path Planning (SIPP) algorithm. Additionally, we propose Bayesian Optimization for Graph Design (BOGD), a discretization method for non-unit edge costs that balances efficiency and accuracy with a sub-linear regret bound. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in runtime and success rate across diverse benchmark scenarios.
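
With non-unit integer costs, an agent occupies each vertex over an integer time interval rather than a single timestep, so conflict detection reduces to interval overlap. A minimal sketch (the half-open interval convention is an assumption; CBS-NIC's actual conflict rules are richer):

```python
def vertex_conflict(a, b):
    """Two agents conflict at a vertex iff their occupancy intervals
    [start, end) overlap. With unit edge costs every interval has length 1
    and this degenerates to the classical same-timestep check."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end

assert vertex_conflict((0, 3), (2, 5))       # B arrives while A is still there
assert not vertex_conflict((0, 3), (3, 5))   # B arrives exactly as A departs
```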

[406] PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection

Siyuan Cheng, Bozhong Tian, YanChao Hao, Zheng Wei

Main category: cs.AI

TL;DR: PRISM-MCTS improves reasoning efficiency by sharing insights across search trajectories using a Process Reward Model and dynamic memory, reducing computational redundancy while maintaining performance.

DetailsMotivation: Existing Monte Carlo Tree Search approaches for reasoning models treat each rollout as isolated, leading to inefficiency and computational redundancy due to lack of information sharing across trajectories.

Method: PRISM-MCTS integrates a Process Reward Model with dynamic shared memory to capture heuristics and fallacies, enabling reinforcement of successful strategies and pruning of error-prone branches. Includes data-efficient training for the PRM.

Result: Halves trajectory requirements on GPQA benchmark while surpassing MCTS-RAG and Search-o1, demonstrating efficient scaling through judicious reasoning rather than exhaustive search.

Conclusion: PRISM-MCTS represents an effective reasoning framework that improves efficiency by sharing insights across search trajectories, enabling better scaling of inference through metacognitive reflection.

Abstract (ACL 2026 Findings; published 06 Apr 2026; keywords: Efficient/Low-Resource Methods for NLP, Generation, Question Answering): The emergence of reasoning models, exemplified by OpenAI o1, signifies a transition from intuitive to deliberative cognition, effectively reorienting the scaling laws from pre-training paradigms toward test-time computation. While Monte Carlo Tree Search (MCTS) has shown promise in this domain, existing approaches typically treat each rollout as an isolated trajectory. This lack of information sharing leads to severe inefficiency and substantial computational redundancy, as the search process fails to leverage insights from prior explorations. To address these limitations, we propose PRISM-MCTS, a novel reasoning framework that draws inspiration from human parallel thinking and reflective processes. PRISM-MCTS integrates a Process Reward Model (PRM) with a dynamic shared memory, capturing both “Heuristics” and “Fallacies”. By reinforcing successful strategies and pruning error-prone branches, PRISM-MCTS effectively achieves refinement. Furthermore, we develop a data-efficient training strategy for the PRM, achieving high-fidelity evaluation under a few-shot regime. Empirical evaluations across diverse reasoning benchmarks substantiate the efficacy of PRISM-MCTS. Notably, it halves the trajectory requirements on GPQA while surpassing MCTS-RAG and Search-o1, demonstrating that it scales inference by reasoning judiciously rather than exhaustively.

[407] Automated Auditing of Hospital Discharge Summaries for Care Transitions

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava

Main category: cs.AI

TL;DR: Automated auditing of discharge summaries using LLMs to identify documentation gaps in key transition-of-care elements for quality improvement

DetailsMotivation: Incomplete discharge documentation causes care fragmentation and avoidable readmissions, but manual auditing is difficult to scale, requiring automated solutions

Method: Uses locally deployed LLMs to operationalize transition-of-care requirements into structured validation checklist based on DISCHARGED framework, applied to MIMIC-IV inpatient summaries

Result: Demonstrates feasibility of scalable automated clinical auditing using privacy-preserving LLMs to identify presence/absence/ambiguity of key documentation elements

Conclusion: Provides foundation for systematic quality improvement in EHR documentation through automated auditing of discharge summaries

Abstract: Incomplete or inconsistent discharge documentation is a primary driver of care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies heavily on manual review and is difficult to scale. We propose an automated framework for large-scale auditing of discharge summaries using locally deployed Large Language Models (LLMs). Our approach operationalizes core transition-of-care requirements, such as follow-up instructions, medication history and changes, and patient information and clinical course, into a structured validation checklist of questions based on the DISCHARGED framework. Using adult inpatient summaries from the MIMIC-IV database, we utilize a privacy-preserving LLM to identify the presence, absence, or ambiguity of key documentation elements. This work demonstrates the feasibility of scalable, automated clinical auditing and provides a foundation for systematic quality improvement in electronic health record documentation.

[408] Adaptive Serverless Resource Management via Slot-Survival Prediction and Event-Driven Lifecycle Control

Zeyu Wang, Cuiqianhe Du, Renyue Zhang, Kejian Tong, Qi He, Qiyuan Tian

Main category: cs.AI

TL;DR: Adaptive serverless computing framework reduces cold starts by 51.2% and improves cost-efficiency 2x using event-driven architecture and probabilistic modeling

DetailsMotivation: Serverless computing eliminates infrastructure management but suffers from cold start latency and resource inefficiency under variable workloads, leading to performance degradation and excessive costs.

Method: Proposes an adaptive engineering framework with dual-strategy mechanism: dynamically adjusts idle durations and employs intelligent request waiting based on slot survival predictions, using sliding window aggregation and asynchronous processing to proactively manage resource lifecycles.

Result: Experimental results show reduction of cold starts by up to 51.2% and improvement of cost-efficiency by nearly 2x compared to baseline methods in multi-cloud environments.

Conclusion: The adaptive framework effectively addresses serverless computing challenges by optimizing performance through event-driven architecture and probabilistic modeling, significantly reducing cold starts and improving cost-efficiency.

Abstract: Serverless computing eliminates infrastructure management overhead but introduces significant challenges regarding cold start latency and resource utilization. Traditional static resource allocation often leads to inefficiencies under variable workloads, resulting in performance degradation or excessive costs. This paper presents an adaptive engineering framework that optimizes serverless performance through event-driven architecture and probabilistic modeling. We propose a dual-strategy mechanism that dynamically adjusts idle durations and employs an intelligent request waiting strategy based on slot survival predictions. By leveraging sliding window aggregation and asynchronous processing, our system proactively manages resource lifecycles. Experimental results show that our approach reduces cold starts by up to 51.2% and improves cost-efficiency by nearly 2x compared to baseline methods in multi-cloud environments.
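
One way to read "dynamically adjusts idle durations" is a sliding-window estimate of request inter-arrival gaps: keep a warm slot alive long enough to cover most observed gaps. The quantile rule, window size, and default below are assumptions for illustration; the paper's probabilistic slot-survival model is not spelled out in the abstract.

```python
from collections import deque

class IdleDurationController:
    """Sliding-window sketch: track recent inter-arrival gaps and keep a
    slot warm long enough to cover a chosen quantile of them.
    (Hypothetical names and defaults; not the paper's implementation.)"""
    def __init__(self, window=50):
        self.gaps = deque(maxlen=window)   # only the most recent gaps count

    def observe_gap(self, seconds):
        self.gaps.append(seconds)

    def idle_duration(self, quantile=0.9):
        if not self.gaps:
            return 60.0  # default keep-alive before any observations
        ordered = sorted(self.gaps)
        idx = min(len(ordered) - 1, int(quantile * len(ordered)))
        return ordered[idx]

ctl = IdleDurationController()
for gap in [1, 2, 2, 3, 50]:
    ctl.observe_gap(gap)
assert ctl.idle_duration() == 50  # keep-alive stretches to cover the long gap
```

A high quantile trades memory cost (slots kept warm longer) against cold-start probability, which is exactly the tension the paper's dual-strategy mechanism manages.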

[409] OntoTKGE: Ontology-Enhanced Temporal Knowledge Graph Extrapolation

Dongying Lin, Yinan Liu, Shengwei Tang, Bin Wang, Xiaochun Yang

Main category: cs.AI

TL;DR: OntoTKGE is an encoder-decoder framework that leverages ontological knowledge (concept hierarchies) to improve temporal knowledge graph extrapolation, especially for entities with sparse historical interactions.

DetailsMotivation: Temporal knowledge graph extrapolation struggles with entities having sparse historical interactions. Previous models ignore ontological knowledge that could help sparse entities inherit behavioral patterns from entities with the same concepts.

Method: Proposes OntoTKGE framework that integrates ontological knowledge from ontology-view KGs (modeling hierarchical concept relations and concept-entity connections) with temporal knowledge through an encoder-decoder architecture to enhance entity embeddings.

Result: Extensive experiments on four datasets show OntoTKGE significantly improves performance of many TKG extrapolation models and surpasses state-of-the-art baseline methods.

Conclusion: Ontological knowledge effectively alleviates sparsity issues in TKG extrapolation by enabling sparse entities to inherit patterns from concept-similar entities, and the flexible framework can adapt to various TKG models.

Abstract: Temporal knowledge graph (TKG) extrapolation is an important task that aims to predict future facts through historical interaction information within KG snapshots. A key challenge for most existing TKG extrapolation models is handling entities with sparse historical interaction. The ontological knowledge is beneficial for alleviating this sparsity issue by enabling these entities to inherit behavioral patterns from other entities with the same concept, which is ignored by previous studies. In this paper, we propose a novel encoder-decoder framework OntoTKGE that leverages the ontological knowledge from the ontology-view KG (i.e., a KG modeling hierarchical relations among abstract concepts as well as the connections between concepts and entities) to guide the TKG extrapolation model’s learning process through the effective integration of the ontological and temporal knowledge, thereby enhancing entity embeddings. OntoTKGE is flexible enough to adapt to many TKG extrapolation models. Extensive experiments on four data sets demonstrate that OntoTKGE not only significantly improves the performance of many TKG extrapolation models but also surpasses many SOTA baseline methods.

[410] Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

Xiaotian Zhou, Di Tang, Xiaofeng Wang, Xiaozhong Liu

Main category: cs.AI

TL;DR: GMRL-BD algorithm identifies untrustworthy topic boundaries in black-box LLMs using knowledge graphs and reinforcement learning agents to detect where models generate biased responses.

DetailsMotivation: LLMs sometimes produce biased, ideologized, or incorrect responses, limiting their applications when there is no clear understanding of the topics on which their answers can be trusted. There is a need to identify the untrustworthy boundaries of LLMs in terms of topics.

Method: GMRL-BD algorithm uses a general Knowledge Graph derived from Wikipedia and incorporates multiple reinforcement learning agents to efficiently identify topics (nodes in KG) where the LLM is likely to generate biased answers. Works with black-box access to LLMs under specific query constraints.

Result: The algorithm demonstrated efficiency in detecting untrustworthy boundaries with limited queries to the LLM. Researchers released a new dataset containing popular LLMs (Llama2, Vicuna, Falcon, Qwen2, Gemma2, Yi-1.5) with labels indicating topics where each LLM is likely to be biased.

Conclusion: GMRL-BD provides an effective method for identifying untrustworthy topic boundaries in LLMs, enabling better understanding of model limitations and potential applications where trustworthiness is critical.

Abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized, or incorrect responses, limiting their applications if there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates multiple reinforcement learning agents to efficiently identify topics (nodes in the KG) where the LLM is likely to generate biased answers. Our experiments demonstrated the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset containing popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2, and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.

[411] Auditable Agents

Yi Nian, Aojie Yuan, Haiyue Zhang, Jiate Li, Yue Zhao

Main category: cs.AI

TL;DR: Paper proposes auditability framework for LLM agent systems with five dimensions and three mechanism classes, showing current systems lack basic security prerequisites for accountability.

DetailsMotivation: As LLM agents increasingly interact with the world through tools, databases, and external actions, there's a critical need to ensure these systems remain answerable after deployment. The paper distinguishes between accountability, auditability, and auditing, arguing that no agent system can be accountable without proper auditability mechanisms.

Method: Defines five dimensions of agent auditability: action recoverability, lifecycle coverage, policy checkability, responsibility attribution, and evidence integrity. Identifies three mechanism classes (detect, enforce, recover) with temporal information-and-intervention constraints. Conducts ecosystem measurements across six open-source projects, runtime feasibility tests, and controlled recovery experiments.

Result: Found 617 security findings across six prominent open-source projects indicating basic security prerequisites for auditability are widely unmet. Pre-execution mediation with tamper-evident records adds only 8.3 ms median overhead. Controlled recovery experiments show responsibility-relevant information can be partially recovered even when conventional logs are missing.

Conclusion: Proposes an Auditability Card for agent systems and identifies six open research problems organized by mechanism class. Emphasizes that auditability is essential for accountable agent systems and current implementations lack necessary security foundations.

Abstract: LLM agents call tools, query databases, delegate tasks, and trigger external side effects. Once an agent system can act in the world, the question is no longer only whether harmful actions can be prevented–it is whether those actions remain answerable after deployment. We distinguish accountability (the ability to determine compliance and assign responsibility), auditability (the system property that makes accountability possible), and auditing (the process of reconstructing behavior from trustworthy evidence). Our claim is direct: no agent system can be accountable without auditability. To make this operational, we define five dimensions of agent auditability, i.e., action recoverability, lifecycle coverage, policy checkability, responsibility attribution, and evidence integrity, and identify three mechanism classes (detect, enforce, recover) whose temporal information-and-intervention constraints explain why, in practice, no single approach suffices. We support the position with layered evidence rather than a single benchmark: lower-bound ecosystem measurements suggest that even basic security prerequisites for auditability are widely unmet (617 security findings across six prominent open-source projects); runtime feasibility results show that pre-execution mediation with tamper-evident records adds only 8.3 ms median overhead; and controlled recovery experiments show that responsibility-relevant information can be partially recovered even when conventional logs are missing. We propose an Auditability Card for agent systems and identify six open research problems organized by mechanism class.
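
The "tamper-evident records" ingredient can be sketched with a hash chain: each appended action record commits to the digest of the previous one, so modifying any past record invalidates verification of everything after it. This is a generic construction, not the paper's system; the class and field names are illustrative.

```python
import hashlib, json

class TamperEvidentLog:
    """Hash-chained append-only log of agent actions. Any edit to a stored
    record breaks the chain and is caught by verify()."""
    def __init__(self):
        self.records = []
        self.prev = "0" * 64  # genesis digest

    def append(self, action):
        payload = json.dumps({"prev": self.prev, "action": action}, sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.records.append((self.prev, action, digest))
        self.prev = digest
        return digest

    def verify(self):
        prev = "0" * 64
        for stored_prev, action, digest in self.records:
            payload = json.dumps({"prev": prev, "action": action}, sort_keys=True)
            if stored_prev != prev or hashlib.sha256(payload.encode()).hexdigest() != digest:
                return False
            prev = digest
        return True

log = TamperEvidentLog()
log.append({"tool": "query_db", "args": {"q": "SELECT 1"}})
log.append({"tool": "send_email", "args": {"to": "ops"}})
assert log.verify()
# Retroactively rewriting an action is detectable:
log.records[0] = (log.records[0][0], {"tool": "noop", "args": {}}, log.records[0][2])
assert not log.verify()
```

Chaining alone does not stop an attacker who can rewrite the whole log, which is why the paper pairs such records with pre-execution mediation and external evidence.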

[412] Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Keuntae Kim, Mingyu Kang, Yong Suk Choi

Main category: cs.AI

TL;DR: Proposed PSP and VRG methods to address premature answer generation and weak visual grounding in diffusion multimodal LLMs during Chain-of-Thought reasoning.

DetailsMotivation: Diffusion multimodal LLMs (dMLLMs) show promise for faster parallel generation but exhibit critical issues when combined with Chain-of-Thought reasoning: they generate final answers too early without sufficient reasoning, and show minimal visual prompt dependency in early timesteps.

Method: Two techniques: 1) Position and Step Penalty (PSP) penalizes tokens in later positions during early timesteps to delay premature answer generation and encourage progressive reasoning; 2) Visual Reasoning Guidance (VRG) amplifies visual grounding signals using classifier-free guidance to enhance alignment with visual evidence.

Result: The method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps across various dMLLMs.

Conclusion: The proposed PSP and VRG techniques effectively address the premature answer generation and weak visual grounding issues in dMLLMs during Chain-of-Thought reasoning, improving both accuracy and inference speed.

Abstract: Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model’s alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.
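
In score terms, the two fixes can be sketched as follows: VRG follows the classifier-free-guidance template (unconditional plus a scaled conditional-minus-unconditional term), and PSP subtracts a penalty that grows with token position and shrinks as timesteps advance. The linear schedules and the alpha/w parameters are illustrative assumptions; the paper's exact formulations are not given in the abstract.

```python
def guided_logits(cond, uncond, w=2.0):
    """VRG-style sketch (classifier-free-guidance form): amplify the
    visually conditioned logits relative to the unconditional ones."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]

def psp_scores(confidence, timestep, num_steps, alpha=1.0):
    """PSP sketch: at early timesteps, down-weight later token positions so
    answer tokens are not committed before the reasoning prefix is filled.
    The linear position/earliness schedule and alpha are assumptions."""
    n = len(confidence)
    earliness = 1.0 - timestep / num_steps  # 1.0 at the first step, 0.0 at the last
    return [c - alpha * earliness * (i / max(n - 1, 1))
            for i, c in enumerate(confidence)]

assert guided_logits([1.0, 2.0], [0.0, 1.0], w=2.0) == [2.0, 3.0]
early = psp_scores([0.9, 0.9, 0.9, 0.9], timestep=0, num_steps=10)
late = psp_scores([0.9, 0.9, 0.9, 0.9], timestep=10, num_steps=10)
assert early[0] > early[3]       # early on, later positions are suppressed
assert late == [0.9, 0.9, 0.9, 0.9]  # the penalty vanishes by the last step
```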

[413] OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

Haoyue Yang, Xuanle Zhao, Xuexin Liu, Feibang Jiang, Yao Zhu

Main category: cs.AI

TL;DR: OmniDiagram is a unified framework for diagram code generation that supports multiple diagram languages and uses a novel visual feedback RL strategy called VIVA to align code logic with visual fidelity without manual annotations.

DetailsMotivation: Existing diagram generation approaches are limited to narrow task formulations and language support, restricting their applicability to diverse diagram types. There's a need for a unified framework that can handle multiple diagram languages and address the challenge of aligning code logic with visual fidelity in reinforcement learning.

Method: Proposes OmniDiagram framework with: 1) Support for diverse diagram code languages and task definitions, 2) Novel visual feedback strategy VIVA that generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization, 3) Self-evolving training process that eliminates need for manually annotated ground truth code, 4) Construction of M3²Diagram dataset with over 196k high-quality instances.

Result: Experimental results show that combining supervised fine-tuning (SFT) with VIVA-based reinforcement learning allows OmniDiagram to establish new state-of-the-art performance across diagram code generation benchmarks.

Conclusion: OmniDiagram provides a unified solution for programmable diagram generation that overcomes limitations of existing approaches through its multi-language support and innovative visual feedback mechanism, enabling robust diagram generation without manual annotations.

Abstract: The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (VIVA). Unlike brittle syntax-based rules or pixel-level matching, VIVA rewards the visual structure of rendered diagrams through a generative approach. Specifically, VIVA actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3²Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our VIVA-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.

[414] UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

Xiaolong Wei, Zerun Zhu, Simin Niu, Xingyu Zhang, Peiying Yu, Changxuan Xiao, Yuchen Li, Jicheng Yang, Zhejun Zhao, Chong Meng, Long Xia, Daiting Shi

Main category: cs.AI

TL;DR: UniCreative is a unified RL framework for creative writing that addresses the tension between long-form coherence and short-form expressiveness using adaptive constraint-aware reward modeling and policy optimization.

DetailsMotivation: Addresses the fundamental challenge in creative writing: reconciling the tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. Existing alignment paradigms use static rewards and rely on costly supervised data.

Method: Proposes UniCreative with two components: 1) AC-GenRM - adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria for fine-grained preference judgments, and 2) ACPO - policy optimization algorithm that aligns models with human preferences across content quality and structural paradigms without supervised fine-tuning.

Result: AC-GenRM aligns closely with expert evaluations, and ACPO significantly enhances performance across diverse writing tasks. The model develops emergent meta-cognitive ability to autonomously differentiate between tasks requiring rigorous planning vs. direct generation.

Conclusion: The direct alignment approach effectively addresses the creative writing challenge, enabling models to adaptively handle both structured long-form and expressive short-form generation without requiring supervised data or ground-truth references.

Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose UniCreative, a unified reference-free reinforcement learning framework. We first introduce AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose ACPO, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

[415] Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Xiongkuo Min, Guangtao Zhai

Main category: cs.AI

TL;DR: Market-Bench is a benchmark for evaluating LLMs’ economic capabilities through multi-agent supply chain competition where LLMs act as retailers bidding for inventory and setting prices.

DetailsMotivation: The ability of large language models to manage and acquire economic resources remains unclear; the paper builds a systematic way to evaluate LLMs' performance in economically relevant tasks through market competition.

Method: The authors construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents. The benchmark has two stages: 1) Procurement stage where LLMs bid for limited inventory in budget-constrained auctions, and 2) Retail stage where LLMs set retail prices, generate marketing slogans, and provide them to buyers through role-based attention mechanism. The system logs complete trajectories for automatic evaluation with economic, operational, and semantic metrics.

Result: Benchmarking 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon: only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around break-even despite similar semantic matching scores.

Conclusion: Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets, revealing important insights about LLMs’ economic capabilities and competitive behaviors.

Abstract: The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce \textbf{Market-Bench}, a comprehensive benchmark that evaluates the capabilities of LLMs in economically-relevant tasks through economic and trade competition. Specifically, we construct a configurable multi-agent supply chain economic model where LLMs act as retailer agents responsible for procuring and retailing merchandise. In the \textbf{procurement} stage, LLMs bid for limited inventory in budget-constrained auctions. In the \textbf{retail} stage, LLMs set retail prices, generate marketing slogans, and provide them to buyers through a role-based attention mechanism for purchase. Market-Bench logs complete trajectories of bids, prices, slogans, sales, and balance-sheet states, enabling automatic evaluation with economic, operational, and semantic metrics. Benchmarking 20 open- and closed-source LLM agents reveals significant performance disparities and a winner-take-most phenomenon, \textit{i.e.}, only a small subset of LLM retailers can consistently achieve capital appreciation, while many hover around the break-even point despite similar semantic matching scores. Market-Bench provides a reproducible testbed for studying how LLMs interact in competitive markets.
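The two-stage market described in the abstract can be illustrated with a toy simulation. Everything here is an assumption for illustration: the agent names, the sealed-bid highest-price-first auction rule, and the linear demand curve are invented, not taken from the benchmark itself.

```python
def procurement_auction(bids, inventory):
    """Sealed-bid auction: higher unit bids win units until inventory runs out."""
    allocation = {name: 0 for name in bids}
    for name, (price, qty) in sorted(bids.items(), key=lambda kv: -kv[1][0]):
        take = min(qty, inventory)
        allocation[name] = take
        inventory -= take
        if inventory == 0:
            break
    return allocation

def retail_stage(allocation, prices, demand_at):
    """Each retailer sells min(stock, demand(price)) units."""
    revenue = {}
    for name, stock in allocation.items():
        sold = min(stock, demand_at(prices[name]))
        revenue[name] = sold * prices[name]
    return revenue

# Toy linear demand: fewer buyers at higher prices.
demand = lambda p: max(0, int(10 - p))

bids = {"agent_a": (6.0, 5), "agent_b": (4.0, 5)}  # (unit bid, quantity wanted)
alloc = procurement_auction(bids, inventory=7)
rev = retail_stage(alloc, {"agent_a": 8.0, "agent_b": 5.0}, demand)
```

With 7 units available, the higher bidder takes 5 and the other gets the remaining 2; the retail stage then caps each agent's sales by its own stock and by demand at its chosen price.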

[416] ActivityEditor: Learning to Synthesize Physically Valid Human Mobility

Chenjie Yang, Yutian Jiang, Anqi Liang, Wei Qi, Chenyu Wu, Junbo Zhang

Main category: cs.AI

TL;DR: ActivityEditor is a dual-LLM-agent framework for zero-shot cross-regional human mobility trajectory generation that addresses data scarcity issues by decomposing synthesis into intention generation and iterative refinement stages.

DetailsMotivation: Existing data-driven human mobility modeling methods suffer from data scarcity, limiting applicability in regions where historical trajectories are unavailable or restricted. There's a need for zero-shot cross-regional trajectory generation that can work without local training data.

Method: Proposes ActivityEditor, a dual-LLM-agent framework with two collaborative stages: (1) intention-based agent uses demographic-driven priors to generate structured human intentions and coarse activity chains for socio-semantic coherence, and (2) editor agent refines these through iterative revisions enforcing human mobility laws using reinforcement learning with multiple rewards grounded in real-world physical constraints.

Result: Extensive experiments demonstrate superior zero-shot performance when transferred across diverse urban contexts, maintaining high statistical fidelity and physical validity. The framework provides robust and highly generalizable solutions for mobility simulation in data-scarce scenarios.

Conclusion: ActivityEditor offers an effective solution for zero-shot cross-regional human mobility trajectory generation, bridging the data scarcity gap through its dual-LLM-agent architecture that combines intention generation with mobility law-enforced refinement.

Abstract: Human mobility modeling is indispensable for diverse urban applications. However, existing data-driven methods often suffer from data scarcity, limiting their applicability in regions where historical trajectories are unavailable or restricted. To bridge this gap, we propose \textbf{ActivityEditor}, a novel dual-LLM-agent framework designed for zero-shot cross-regional trajectory generation. Our framework decomposes the complex synthesis task into two collaborative stages. Specifically, an intention-based agent leverages demographic-driven priors to generate structured human intentions and coarse activity chains, ensuring high-level socio-semantic coherence. These outputs are then refined by an editor agent into mobility trajectories through iterative revisions that enforce human mobility laws. This capability is acquired through reinforcement learning with multiple rewards grounded in real-world physical constraints, allowing the agent to internalize mobility regularities and ensure high-fidelity trajectory generation. Extensive experiments demonstrate that \textbf{ActivityEditor} achieves superior zero-shot performance when transferred across diverse urban contexts. It maintains high statistical fidelity and physical validity, providing a robust and highly generalizable solution for mobility simulation in data-scarce scenarios. Our code is available at: https://anonymous.4open.science/r/ActivityEditor-066B.
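The generate-then-revise decomposition can be sketched as follows. The activity chain, the "maximum leg length" mobility law, and the shrink-toward-previous-stop revision rule are all illustrative stand-ins; the paper's editor agent is an LLM trained with reinforcement learning, not a geometric solver.

```python
def coarse_chain():
    """Stage 1 (toy): a coarse activity chain as (activity, x, y) tuples."""
    return [("home", 0.0, 0.0), ("work", 9.0, 0.0),
            ("gym", 9.5, 0.5), ("home", 0.0, 0.0)]

def revise(chain, max_leg=5.0, steps=20):
    """Stage 2 (toy): iteratively shrink any leg longer than max_leg,
    mimicking an editor that enforces a physical mobility constraint."""
    pts = [(x, y) for _, x, y in chain]
    for _ in range(steps):
        for i in range(1, len(pts)):
            (x0, y0), (x1, y1) = pts[i - 1], pts[i]
            d = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            if d > max_leg:  # pull this stop toward the previous one
                t = max_leg / d
                pts[i] = (x0 + (x1 - x0) * t, y0 + (y1 - y0) * t)
    return [(a, x, y) for (a, _, _), (x, y) in zip(chain, pts)]

revised = revise(coarse_chain())
legs = [((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        for (_, x0, y0), (_, x1, y1) in zip(revised, revised[1:])]
```

After revision every consecutive leg respects the constraint, while stops that already satisfied it are left untouched.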

[417] Inventory of the 12 007 Low-Dimensional Pseudo-Boolean Landscapes Invariant to Rank, Translation, and Rotation

Arnaud Liefooghe, Sébastien Verel

Main category: cs.AI

TL;DR: The paper introduces a stronger notion of rank landscape invariance that considers not just solution rankings but also neighborhood structure and symmetries, providing an exhaustive inventory of invariant landscape classes for pseudo-Boolean functions up to dimension 3.

DetailsMotivation: To develop a more comprehensive understanding of optimization problem difficulty by studying rank landscapes rather than individual functions, considering neighborhood structure and symmetries in addition to solution rankings.

Method: Introduces rank landscape invariance concept, then provides exhaustive inventory of invariant landscape classes for pseudo-Boolean functions of dimensions 1, 2, and 3, analyzing both injective and non-injective cases.

Result: Identified 12,007 invariant landscape classes in total, a significant reduction compared to rank-invariance alone. Non-injective functions yield far more invariant landscape classes than injective ones, revealing complex combinations of topological properties and algorithm behaviors.

Conclusion: The inventory serves as a resource for pedagogical purposes and benchmark design, offering foundation for constructing larger problems with controlled hardness and advancing understanding of landscape difficulty and algorithm performance.

Abstract: Many randomized optimization algorithms are rank-invariant, relying solely on the relative ordering of solutions rather than absolute fitness values. We introduce a stronger notion of rank landscape invariance: two problems are equivalent if not only their rankings but also their neighborhood structures and symmetries (translation and rotation) induce identical landscapes. This motivates the study of rank landscapes rather than individual functions. While prior work analyzed the rankings of injective function classes in isolation, we provide an exhaustive inventory of the invariant landscape classes for pseudo-Boolean functions of dimensions 1, 2, and 3, including non-injective cases. Our analysis reveals 12,007 classes in total, a significant reduction compared to rank-invariance alone. We find that non-injective functions yield far more invariant landscape classes than injective ones. In addition, complex combinations of topological landscape properties and algorithm behaviors emerge, particularly regarding deceptiveness, neutrality, and the performance of hill-climbing strategies. The inventory serves as a resource for pedagogical purposes and benchmark design, offering a foundation for constructing larger problems with controlled hardness and advancing our understanding of landscape difficulty and algorithm performance.
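The invariance notion (rank vectors identified up to hypercube translations and bit permutations) can be made concrete in dimension 2. This sketch canonicalizes rank vectors over the symmetry group only; it omits the paper's neighborhood-structure component, so the class count it produces is a simplification, not the paper's 12,007.

```python
from itertools import permutations, product

N = 2  # dimension: points are bit vectors of length 2
POINTS = list(product([0, 1], repeat=N))

def rank_vector(values):
    """Replace fitness values by their ranks (ties share a rank)."""
    order = sorted(set(values))
    return tuple(order.index(v) for v in values)

def transforms():
    """Hypercube symmetries: XOR-translations composed with bit permutations."""
    ts = []
    for mask in product([0, 1], repeat=N):
        for perm in permutations(range(N)):
            ts.append(lambda p, m=mask, s=perm:
                      tuple(p[s[i]] ^ m[i] for i in range(N)))
    return ts

TS = transforms()

def canonical(ranks):
    """Lexicographically smallest rank vector over all symmetries."""
    best = None
    for t in TS:
        moved = [0] * len(POINTS)
        for i, p in enumerate(POINTS):
            moved[POINTS.index(t(p))] = ranks[i]  # value at p moves to t(p)
        cand = rank_vector(moved)
        if best is None or cand < best:
            best = cand
    return best

# Group the 256 functions {0,1}^2 -> {0,1,2,3} into invariant rank classes.
classes = {canonical(rank_vector(vals)) for vals in product(range(4), repeat=4)}
```

By construction, two functions related by a translation or rotation of the square get the same canonical form, so they fall into the same class.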

[418] Experience Transfer for Multimodal LLM Agents in Minecraft Game

Chenghao Li, Jun Liu, Songbo Zhang, Huadong Jian, Hao Ni, Lik-Hang Lee, Sung-Ho Bae, Guoqing Wang, Yang Yang, Chaoning Zhang

Main category: cs.AI

TL;DR: Echo: A transfer-oriented memory framework for multimodal LLM agents that decomposes reusable knowledge into five dimensions to enable experience transfer across tasks in complex environments like Minecraft.

DetailsMotivation: Multimodal LLM agents need to efficiently reuse past experience to solve new tasks in complex game environments, rather than treating memory as just a passive repository of static records.

Method: Proposes Echo framework that decomposes reusable knowledge into five dimensions (structure, attribute, process, function, interaction) and uses In-Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples.

Result: In Minecraft experiments, Echo achieves a 1.3x to 1.7x speed-up on object-unlocking tasks under from-scratch learning, and exhibits a burst-like chain-unlocking phenomenon where multiple similar items are rapidly unlocked after acquiring transferable experience.

Conclusion: Experience transfer through structured memory decomposition and analogy learning is promising for improving efficiency and adaptability of multimodal LLM agents in complex interactive environments.

Abstract: Multimodal LLM agents operating in complex game environments must continually reuse past experience to solve new tasks efficiently. In this work, we propose Echo, a transfer-oriented memory framework that enables agents to derive actionable knowledge from prior interactions rather than treating memory as a passive repository of static records. To make transfer explicit, Echo decomposes reusable knowledge into five dimensions: structure, attribute, process, function, and interaction. This formulation allows the agent to identify recurring patterns shared across different tasks and infer what prior experience remains applicable in new situations. Building on this formulation, Echo leverages In-Context Analogy Learning (ICAL) to retrieve relevant experiences and adapt them to unseen tasks through contextual examples. Experiments in Minecraft show that, under a from-scratch learning setting, Echo achieves a 1.3x to 1.7x speed-up on object-unlocking tasks. Moreover, Echo exhibits a burst-like chain-unlocking phenomenon, rapidly unlocking multiple similar items within a short time interval after acquiring transferable experience. These results suggest that experience transfer is a promising direction for improving the efficiency and adaptability of multimodal LLM agents in complex interactive environments.
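A minimal sketch of a memory record decomposed along the five dimensions named above, with overlap-based retrieval; the dataclass, tag sets, and scoring rule are hypothetical illustrations, not Echo's actual implementation.

```python
from dataclasses import dataclass

DIMS = ("structure", "attribute", "process", "function", "interaction")

@dataclass
class Experience:
    task: str
    knowledge: dict  # dimension -> set of tags, e.g. {"process": {"craft"}}

def overlap(a: dict, b: dict) -> int:
    """Count shared tags across the five knowledge dimensions."""
    return sum(len(a.get(d, set()) & b.get(d, set())) for d in DIMS)

def retrieve(memory, query: dict, k: int = 2):
    """Return the k stored experiences most relevant to a new task."""
    return sorted(memory, key=lambda e: -overlap(e.knowledge, query))[:k]

memory = [
    Experience("craft_pickaxe", {"process": {"mine", "craft"},
                                 "structure": {"recipe"}}),
    Experience("build_house", {"structure": {"blocks"},
                               "interaction": {"place"}}),
    Experience("craft_axe", {"process": {"craft"}, "structure": {"recipe"}}),
]
hits = retrieve(memory, {"process": {"craft"}, "structure": {"recipe"}}, k=2)
```

A new crafting task retrieves the two prior crafting experiences rather than the unrelated building one, which is the kind of cross-task pattern matching the framework aims at (in the real system the retrieved experiences would then seed in-context analogy examples).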

[419] SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills

Da Lei, Feng Xiao, Lu Li, Yuzhan Liu

Main category: cs.AI

TL;DR: SIGNALCLAW uses LLMs as evolutionary skill generators to create interpretable traffic signal control policies with human-readable rationales and executable code, achieving competitive performance with high interpretability.

DetailsMotivation: Current traffic signal control methods face a trade-off: reinforcement learning produces effective but opaque neural policies, while program synthesis offers interpretability but depends on restrictive domain-specific languages. There's a need for approaches that are both effective and interpretable for real-world deployment.

Method: SIGNALCLAW uses large language models as evolutionary skill generators to synthesize and refine interpretable control skills. Each skill includes rationale, selection guidance, and executable code. Evolution signals from simulation metrics are translated into natural language feedback to guide improvement. The framework introduces event-driven compositional evolution with an event detector for emergencies, transit priority, incidents, and congestion, and a priority dispatcher that selects specialized skills without retraining.

Result: On routine scenarios, SIGNALCLAW achieves average delay of 7.8 to 9.2 seconds, within 3-10% of the best baseline method. Under event scenarios, it yields the lowest emergency delay (11.2-18.5s vs 42.3-72.3s for MaxPressure and 78.5-95.3s for DQN) and lowest transit person delay (9.8-11.5s vs 38.7-45.2s for MaxPressure). Skills evolve from simple linear rules to conditional strategies with multi-feature interactions while remaining fully interpretable.

Conclusion: SIGNALCLAW demonstrates that LLMs can effectively generate and evolve interpretable control skills for traffic signal control, achieving competitive performance while maintaining full interpretability and modifiability by traffic engineers.

Abstract: Traffic signal control (TSC) requires strategies that are both effective and interpretable for deployment, yet reinforcement learning produces opaque neural policies while program synthesis depends on restrictive domain-specific languages. We present SignalClaw, a framework that uses large language models (LLMs) as evolutionary skill generators to synthesize and refine interpretable control skills for adaptive TSC. Each skill includes rationale, selection guidance, and executable code, making policies human-inspectable and self-documenting. At each generation, evolution signals from simulation metrics such as queue percentiles, delay trends, and stagnation are translated into natural language feedback to guide improvement. SignalClaw also introduces event-driven compositional evolution: an event detector identifies emergency vehicles, transit priority, incidents, and congestion via TraCI, and a priority dispatcher selects specialized skills. Each skill is evolved independently, and a priority chain enables runtime composition without retraining. We evaluate SignalClaw on routine and event-injected SUMO scenarios against four baselines. On routine scenarios, it achieves an average delay of 7.8 to 9.2 seconds, within 3 to 10 percent of the best method, with low variance across random seeds. Under event scenarios, it yields the lowest emergency delay (11.2 to 18.5 seconds, versus 42.3 to 72.3 for MaxPressure and 78.5 to 95.3 for DQN) and the lowest transit person delay (9.8 to 11.5 seconds, versus 38.7 to 45.2 for MaxPressure). In mixed events, the dispatcher composes skills effectively while maintaining stable overall delay. The evolved skills progress from simple linear rules to conditional strategies with multi-feature interactions, while remaining fully interpretable and directly modifiable by traffic engineers.
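The priority dispatcher can be sketched as a simple priority chain. The event kinds follow the abstract, but the skill functions and the longest-queue default are illustrative placeholders for the evolved skills.

```python
# Priority order: emergencies preempt transit, which preempts incidents, etc.
PRIORITY = ["emergency", "transit", "incident", "congestion"]

SKILLS = {  # event kind -> phase-selection function over per-approach queues
    "emergency":  lambda q, ev: ev["approach"],   # clear the flagged approach
    "transit":    lambda q, ev: ev["approach"],
    "incident":   lambda q, ev: max(q, key=q.get),
    "congestion": lambda q, ev: max(q, key=q.get),
}

def default_skill(q):
    """Routine control: serve the longest queue (a max-pressure-like rule)."""
    return max(q, key=q.get)

def dispatch(queues, events):
    """Select the highest-priority active event's skill, else the default."""
    for kind in PRIORITY:
        for ev in events:
            if ev["kind"] == kind:
                return SKILLS[kind](queues, ev)
    return default_skill(queues)

queues = {"north": 4, "south": 9, "east": 2, "west": 1}
phase = dispatch(queues, [{"kind": "congestion"},
                          {"kind": "emergency", "approach": "east"}])
```

Even though the south queue is longest, the emergency on the east approach preempts the congestion skill; with no events present, the dispatcher falls back to serving the longest queue.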

[420] A canonical generalization of OBDD

Florent Capelli, YooJung Choi, Stefan Mengel, Martín Muñoz, Guy Van den Broeck

Main category: cs.AI

TL;DR: TDDs generalize OBDDs as a Boolean function model, maintaining tractability while being more succinct, enabling FPT-size representation for CNFs with bounded treewidth.

DetailsMotivation: To overcome limitations of OBDDs (Ordered Binary Decision Diagrams) which cannot represent CNF formulas of bounded treewidth with fixed-parameter tractable size, while preserving desirable tractability properties like model counting and enumeration.

Method: Introduces Tree Decision Diagrams (TDDs) as a restriction of structured d-DNNF that respects a vtree structure. Studies bottom-up compilation of CNF formulas into deterministic TDDs and relates compilation complexity to factor width.

Result: TDDs maintain OBDD’s tractability properties (model counting, enumeration, conditioning, apply) while being more succinct. CNF formulas of treewidth k can be represented by TDDs of FPT size, which is impossible for OBDD.

Conclusion: TDDs provide a more expressive yet tractable Boolean function representation that overcomes key limitations of OBDDs, particularly for structured CNF formulas with bounded treewidth.

Abstract: We introduce Tree Decision Diagrams (TDD) as a model for Boolean functions that generalizes OBDD. They can be seen as a restriction of structured d-DNNF; that is, d-DNNF that respect a vtree $T$. We show that TDDs enjoy the same tractability properties as OBDD, such as model counting, enumeration, conditioning, and apply, and are more succinct. In particular, we show that CNF formulas of treewidth $k$ can be represented by TDDs of FPT size, which is known to be impossible for OBDD. We study the complexity of compiling CNF formulas into deterministic TDDs via bottom-up compilation and relate the complexity of this approach with the notion of factor width introduced by Bova and Szeider.
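Model counting is one of the tractable operations the abstract attributes to both OBDDs and TDDs. A minimal sketch on a hand-built OBDD (not a TDD), using the standard trick of weighting each skipped decision level by a factor of 2:

```python
# An OBDD node is ("node", level, lo, hi); terminals are True / False.
# Variables are tested in the fixed order x0 < x1 < ... < x_{n-1}.

def count_models(root, n):
    """Count satisfying assignments of an n-variable OBDD.
    Each level skipped along an edge is a free variable, i.e. a factor of 2."""
    def level(node):
        return n if isinstance(node, bool) else node[1]

    def count(node):
        if node is True:
            return 1
        if node is False:
            return 0
        _, lvl, lo, hi = node
        return (count(lo) * 2 ** (level(lo) - lvl - 1)
                + count(hi) * 2 ** (level(hi) - lvl - 1))

    return count(root) * 2 ** level(root)

# f(x0, x1, x2) = x0 AND x2; x1 is never tested, so it is a skipped level.
x2_node = ("node", 2, False, True)
bdd = ("node", 0, False, x2_node)
```

Here `count_models(bdd, 3)` returns 2, matching the two models x0 = x2 = 1 with x1 free; in a real package the recursion would be memoized over shared subgraphs so the count is linear in the diagram size.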

[421] From Large Language Model Predicates to Logic Tensor Networks: Neurosymbolic Offer Validation in Regulated Procurement

Cedric Haufe, Frieder Stolzenburg

Main category: cs.AI

TL;DR: Neurosymbolic approach combining language models with Logic Tensor Networks for validating offer documents in regulated public institutions, focusing on interpretability and legal verifiability.

DetailsMotivation: In regulated public institutions, decisions must be both factually correct and legally verifiable. Current AI approaches often lack the interpretability and auditability needed for such sensitive applications, requiring a system that can link domain-specific knowledge with semantic text understanding while providing transparent justification for decisions.

Method: Uses a language model to extract information from offer documents, then aggregates this information with a Logic Tensor Network (LTN) to make auditable decisions. The approach combines symbolic AI (rule-based reasoning) with subsymbolic AI (neural language models), enabling predicate extraction and rule checking with explicit support for explainable AI (XAI).

Result: Experiments on a real corpus of offer documents show that the proposed pipeline achieves performance comparable to existing models, with key advantages in interpretability, modular predicate extraction, and explicit support for XAI. Decisions can be justified by predicate values, rule truth values, and corresponding text passages.

Conclusion: The neurosymbolic approach successfully addresses the need for auditable decision-making in regulated institutions by combining the semantic understanding of language models with the interpretability of symbolic reasoning, enabling rule checking on real document corpora while maintaining transparency and verifiability.

Abstract: We present a neurosymbolic approach, i.e., combining symbolic and subsymbolic artificial intelligence, to validating offer documents in regulated public institutions. We employ a language model to extract information and then aggregate it with an LTN (Logic Tensor Network) to make an auditable decision. In regulated public institutions, decisions must be made in a manner that is both factually correct and legally verifiable. Our neurosymbolic approach allows existing domain-specific knowledge to be linked to the semantic text understanding of language models. The decisions resulting from our pipeline can be justified by predicate values, rule truth values, and corresponding text passages, which enables rule checking based on a real corpus of offer documents. Our experiments on a real corpus show that the proposed pipeline achieves performance comparable to existing models, while its key advantage lies in its interpretability, modular predicate extraction, and explicit support for XAI (Explainable AI).
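How fuzzy predicate truth values can be aggregated into an auditable decision can be sketched with a product t-norm. The predicate names, rules, and threshold are invented for illustration, and a real LTN learns its predicate groundings rather than consuming fixed scores.

```python
def t_and(*vals):
    """Product t-norm: fuzzy conjunction of predicate truth values."""
    out = 1.0
    for v in vals:
        out *= v
    return out

def validate(predicates, rules, threshold=0.7):
    """Aggregate predicate truth values per rule; return decision + trace."""
    report = {}
    for name, needed in rules.items():
        report[name] = t_and(*(predicates[p] for p in needed))
    decision = all(v >= threshold for v in report.values())
    return decision, report

# Truth values as a language model might score them from an offer document.
preds = {"deadline_met": 0.95, "price_stated": 0.99, "signature_present": 0.40}
rules = {
    "formally_complete": ["price_stated", "signature_present"],
    "timely": ["deadline_met"],
}
ok, why = validate(preds, rules)
```

The offer is rejected, and `why` shows exactly which rule failed and at what truth value, which is the kind of per-rule justification the pipeline exposes for audit.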

[422] COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang

Main category: cs.AI

TL;DR: COSMO-Agent is a tool-augmented RL framework that teaches LLMs to close the CAD-CAE semantic gap by orchestrating external tools for iterative industrial design-simulation optimization.

DetailsMotivation: The paper addresses the bottleneck in industrial design-simulation optimization caused by the CAD-CAE semantic gap - the difficulty in translating simulation feedback into valid geometric edits under diverse, coupled constraints.

Method: Proposes COSMO-Agent, a tool-augmented reinforcement learning framework that casts CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied.

Result: COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

Conclusion: The framework successfully bridges the CAD-CAE semantic gap through LLM-based tool orchestration and constraint-driven learning, making industrial design-simulation optimization more automated and efficient.

Abstract: Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.
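The closed loop (generate, solve, parse, revise) can be sketched with a toy stand-in for the CAE solver; the stress model, the 1.2x thickening rule, and the single scalar design variable are illustrative assumptions, not the paper's toolchain.

```python
def simulate(thickness):
    """Toy CAE stand-in: stress falls as the member gets thicker."""
    load = 1000.0
    return load / thickness ** 2

def closed_loop(stress_limit=50.0, thickness=1.0, max_iters=20):
    """Generate -> solve -> parse -> revise, until the constraint holds."""
    history = []
    for _ in range(max_iters):
        stress = simulate(thickness)          # CAE solving
        history.append((thickness, stress))   # result parsing / logging
        if stress <= stress_limit:            # constraint satisfied
            return thickness, history
        thickness *= 1.2                      # geometry revision step
    raise RuntimeError("constraint not met within iteration budget")

final, trace = closed_loop()
```

In the real framework the revision step is an LLM proposing parametric geometry edits from parsed solver feedback, and the logged trajectory feeds the multi-constraint reward.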

[423] ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

Zhe Zhao, Haibin Wen, Jiaming Ma, Jiachang Zhan, Tianyi Xu, Ye Wei, Qingfu Zhang

Main category: cs.AI

TL;DR: ResearchEVO is an end-to-end AI framework that discovers novel algorithms through LLM-guided evolution and then autonomously writes publication-ready research papers explaining the discoveries.

DetailsMotivation: The paper aims to automate the scientific discovery process by mimicking the two-stage pattern seen in breakthroughs: initial undirected experimentation yielding unexpected findings, followed by retrospective explanation situating findings within existing theory.

Method: The framework has two phases: 1) Evolution Phase uses LLM-guided bi-dimensional co-evolution to simultaneously optimize algorithmic logic and architecture through fitness-based search, 2) Writing Phase generates complete research papers via sentence-level retrieval-augmented generation with anti-hallucination verification and automated experiment design.

Result: Validated on Quantum Error Correction (using Google quantum hardware data) and Physics-Informed Neural Networks, the system discovered human-interpretable algorithmic mechanisms not previously proposed in domain literature and produced compilable LaTeX manuscripts with zero fabricated citations.

Conclusion: ResearchEVO is the first end-to-end system covering both principled algorithm evolution and literature-grounded scientific documentation, demonstrating autonomous scientific discovery and explanation capabilities.

Abstract: An important recurring pattern in scientific breakthroughs is a two-stage process: an initial phase of undirected experimentation that yields an unexpected finding, followed by a retrospective phase that explains why the finding works and situates it within existing theory. We present ResearchEVO, an end-to-end framework that computationally instantiates this discover-then-explain paradigm. The Evolution Phase employs LLM-guided bi-dimensional co-evolution – simultaneously optimizing both algorithmic logic and overall architecture – to search the space of code implementations purely by fitness, without requiring any understanding of the solutions it produces. The Writing Phase then takes the best-performing algorithm and autonomously generates a complete, publication-ready research paper through sentence-level retrieval-augmented generation with explicit anti-hallucination verification and automated experiment design. To our knowledge, ResearchEVO is the first system to cover this full pipeline end to end: no prior work jointly performs principled algorithm evolution and literature-grounded scientific documentation. We validate the framework on two cross-disciplinary scientific problems – Quantum Error Correction using real Google quantum hardware data, and Physics-Informed Neural Networks – where the Evolution Phase discovered human-interpretable algorithmic mechanisms that had not been previously proposed in the respective domain literatures. In both cases, the Writing Phase autonomously produced compilable LaTeX manuscripts that correctly grounded these blind discoveries in existing theory via RAG, with zero fabricated citations.
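The Evolution Phase's purely fitness-driven search can be sketched generically. Here a random Gaussian mutator stands in for the paper's LLM-guided proposal step, and the curve-fitting task is an invented toy; the point is only that selection needs nothing but a fitness score.

```python
import random

def evolve(fitness, init, mutate, generations=200, pop_size=8, seed=0):
    """Fitness-only search: mutate the current best, keep improvements.
    (A random mutator stands in for LLM-guided proposals.)"""
    rng = random.Random(seed)
    best = init
    for _ in range(generations):
        candidates = [mutate(best, rng) for _ in range(pop_size)]
        challenger = max(candidates, key=fitness)
        if fitness(challenger) > fitness(best):
            best = challenger
    return best

# Toy task: tune coefficients (a, b) so that a*x + b fits y = 3x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3 * x + 1 for x in xs]
fit = lambda g: -sum((g[0] * x + g[1] - y) ** 2 for x, y in zip(xs, ys))
mut = lambda g, rng: (g[0] + rng.gauss(0, 0.3), g[1] + rng.gauss(0, 0.3))

best = evolve(fit, (0.0, 0.0), mut)
```

The loop never inspects *why* a candidate scores well, mirroring the "discover first, explain later" split: explanation is deferred to the Writing Phase.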

[424] Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

Xin Sun, Di Wu, Sijing Qin, Isao Echizen, Abdallah El Ali, Saku Sugawara

Main category: cs.AI

TL;DR: LLM-as-a-Judge evaluations are biased by source labels, with both humans and LLMs showing higher trust in content labeled as human-authored versus AI-generated, revealing heuristic reliance on labels rather than content quality.

DetailsMotivation: To investigate the reliability of LLM-as-a-Judge evaluations by examining whether trust judgments are biased by disclosed source labels (human vs. AI), and to understand if this bias mirrors human heuristic reliance on labels.

Method: Used counterfactual design with identical content labeled differently as human-authored or AI-generated. Collected human judgments with eye-tracking to analyze gaze patterns. Analyzed LLM internal states during judgment, including attention allocation to label vs. content regions and decision uncertainty measured by logits.

Result: Both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking shows humans heavily rely on source labels as heuristic cues. LLMs allocate denser attention to label regions than content regions, with stronger label dominance under Human labels. Decision uncertainty is higher under AI labels than Human labels.

Conclusion: Source labels serve as salient heuristic cues for both humans and LLMs, raising validity concerns for label-sensitive LLM-as-a-Judge evaluation. Aligning models with human preferences may propagate human heuristic reliance into models, motivating the need for debiased evaluation and alignment approaches.

Abstract: Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges its reliability by showing that trust judgments by LLMs are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for judgments. We analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than the content region, and this label dominance is stronger under Human labels than AI labels, consistent with the human gaze patterns. In addition, decision uncertainty measured by logits is higher under AI labels than Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. It raises validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously note that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.

[425] Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution

Amir Konigsberg

Main category: cs.AI

TL;DR: The paper critiques Turing’s behavioral test as an epistemological commitment that has constrained AI research for decades, preventing investigation of internal computational processes and mechanisms behind intelligent behavior.

DetailsMotivation: To analyze how Turing's behavioral test has shaped AI epistemology, limiting the field's ability to ask questions about internal processes and mechanisms that are crucial for understanding intelligence.

Method: Philosophical analysis tracing the historical influence of Turing’s behavioral epistemology in AI, drawing parallels to psychology’s behaviorist-to-cognitivist transition, and articulating what a post-behaviorist epistemology for AI would entail.

Result: Identifies that AI’s commitment to behavioral evaluation prevents distinguishing between systems achieving identical outputs through different computational processes, which is essential for intelligence attribution.

Conclusion: AI needs an epistemological transition similar to psychology’s cognitive revolution, recognizing that behavioral evidence alone is insufficient for making construct claims about intelligence.

Abstract: In 1950, Alan Turing proposed replacing the question “Can machines think?” with a behavioral test: if a machine’s outputs are indistinguishable from those of a thinking being, the question of whether it truly thinks can be set aside. This paper argues that Turing’s move was not only a pragmatic simplification but also an epistemological commitment, a decision about what kind of evidence counts as relevant to intelligence attribution, and that this commitment has quietly constrained AI research for seven decades. We trace how Turing’s behavioral epistemology became embedded in the field’s evaluative infrastructure, rendering unaskable a class of questions about process, mechanism, and internal organization that cognitive psychology, neuroscience, and related disciplines learned to ask. We draw a structural parallel to the behaviorist-to-cognitivist transition in psychology: just as psychology’s commitment to studying only observable behavior prevented it from asking productive questions about internal mental processes until that commitment was abandoned, AI’s commitment to behavioral evaluation prevents it from distinguishing between systems that achieve identical outputs through fundamentally different computational processes, a distinction on which intelligence attribution depends. We argue that the field requires an epistemological transition comparable to the cognitive revolution: not an abandonment of behavioral evidence, but a recognition that behavioral evidence alone is insufficient for the construct claims the field wishes to make. We articulate what a post-behaviorist epistemology for AI would involve and identify the specific questions it would make askable that the field currently has no way to ask.

[426] PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

Zhiyong Ma, Zhitao Deng, Huan Tang, Jialin Chen, Zhijun Zheng, Zhengping Li, Qingyuan Chuai

Main category: cs.AI

TL;DR: PECKER is an efficient machine unlearning method that uses saliency masking to prioritize gradient updates on parameters most relevant to forgetting target data, reducing training time while maintaining unlearning efficacy.

Motivation: Current machine unlearning methods for GenAI models impose prohibitive training time and computational overhead due to poorly directed gradient updates that reduce efficiency and destabilize convergence.

Method: Proposes PECKER within a distillation framework, introducing a saliency mask to prioritize updates to parameters that contribute most to forgetting targeted data, reducing unnecessary gradient computation and shortening training time.

Result: Achieves shorter training times for both class forgetting and concept forgetting on CIFAR-10 and STL-10 datasets, while closely aligning with the true image distribution and matching or outperforming prevailing methods.

Conclusion: PECKER provides an efficient machine unlearning approach that addresses computational overhead issues while maintaining unlearning efficacy through targeted parameter updates.

Abstract: Machine unlearning (MU) has become a critical technique for GenAI models’ safe and compliant operation. While existing MU methods are effective, most impose prohibitive training time and computational overhead. Our analysis suggests the root cause lies in poorly directed gradient updates, which reduce training efficiency and destabilize convergence. To mitigate these issues, we propose PECKER, an efficient MU approach that matches or outperforms prevailing methods. Within a distillation framework, PECKER introduces a saliency mask to prioritize updates to parameters that contribute most to forgetting the targeted data, thereby reducing unnecessary gradient computation and shortening overall training time without sacrificing unlearning efficacy. Our method generates samples that unlearn the related class or concept more quickly, while closely aligning with the true image distribution on CIFAR-10 and STL-10 datasets, achieving shorter training times for both class forgetting and concept forgetting.
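The core idea of a saliency mask is simple to sketch: rank parameters by their gradient magnitude on the forget set and update only the top fraction. The snippet below is an illustrative NumPy sketch of that mechanism, not PECKER's actual implementation; the function names and the gradient-ascent update rule are assumptions.

```python
import numpy as np

def saliency_mask(grads, keep_frac=0.1):
    """Keep only the fraction of parameters with the largest |gradient|
    on the forget set. Names and thresholding rule are illustrative."""
    flat = np.abs(grads).ravel()
    k = max(1, int(keep_frac * flat.size))
    threshold = np.partition(flat, -k)[-k]   # k-th largest |gradient|
    return (np.abs(grads) >= threshold).astype(grads.dtype)

def masked_unlearning_step(params, forget_grads, lr=0.1, keep_frac=0.1):
    """Gradient *ascent* on the forget-set loss, restricted to salient params."""
    mask = saliency_mask(forget_grads, keep_frac)
    return params + lr * mask * forget_grads

rng = np.random.default_rng(0)
params = rng.normal(size=100)
forget_grads = rng.normal(size=100)
updated = masked_unlearning_step(params, forget_grads)
changed = int(np.sum(updated != params))   # only ~10% of parameters move
```

Because only the masked 10% of parameters receive updates, the other 90% of the gradient work can in principle be skipped, which is the source of the claimed training-time savings.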

[427] CuraLight: Debate-Guided Data Curation for LLM-Centered Traffic Signal Control

Qing Guo, Xinhang Li, Junyu Chen, Zheng Guo, Shengzhe Xu, Lin Zhang, Lei Li

Main category: cs.AI

TL;DR: CuraLight: LLM-centered traffic signal control framework using RL-assisted exploration and multi-LLM ensemble deliberation for improved performance and interpretability.

Motivation: Existing RL and LLM approaches for traffic signal control suffer from limited interpretability, insufficient interaction data, and weak generalization to heterogeneous intersections.

Method: Proposes CuraLight framework where RL agent explores traffic environments to generate high-quality trajectories converted to prompt-response pairs for LLM fine-tuning, with multi-LLM ensemble deliberation system evaluating candidate actions through structured debate.

Result: Experiments in SUMO across real-world networks show CuraLight outperforms SOTA baselines, reducing average travel time by 5.34%, queue length by 5.14%, and waiting time by 7.02%.

Conclusion: Combining RL-assisted exploration with deliberation-based data curation enables scalable and interpretable traffic signal control with superior performance.

Abstract: Traffic signal control (TSC) is a core component of intelligent transportation systems (ITS), aiming to reduce congestion, emissions, and travel time. Recent approaches based on reinforcement learning (RL) and large language models (LLMs) have improved adaptivity, but still suffer from limited interpretability, insufficient interaction data, and weak generalization to heterogeneous intersections. This paper proposes CuraLight, an LLM-centered framework where an RL agent assists the fine-tuning of an LLM-based traffic signal controller. The RL agent explores traffic environments and generates high-quality interaction trajectories, which are converted into prompt-response pairs for imitation fine-tuning. A multi-LLM ensemble deliberation system further evaluates candidate signal timing actions through structured debate, providing preference-aware supervision signals for training. Experiments conducted in SUMO across heterogeneous real-world networks from Jinan, Hangzhou, and Yizhuang demonstrate that CuraLight consistently outperforms state-of-the-art baselines, reducing average travel time by 5.34 percent, average queue length by 5.14 percent, and average waiting time by 7.02 percent. The results highlight the effectiveness of combining RL-assisted exploration with deliberation-based data curation for scalable and interpretable traffic signal control.

[428] QA-MoE: Towards a Continuous Reliability Spectrum with Quality-Aware Mixture of Experts for Robust Multimodal Sentiment Analysis

Yitong Zhu, Yuxuan Jiang, Guanxuan Jiang, Bojing Hou, Peng Yuan Zhou, Ge Lin Kan, Yuyang Wang

Main category: cs.AI

TL;DR: QA-MoE: A Quality-Aware Mixture-of-Experts framework for multimodal sentiment analysis that handles both missing modalities and quality degradation through continuous reliability estimation and expert routing.

Motivation: Real-world multimodal inputs often suffer from noise or missing modalities, but existing methods treat these as discrete cases or assume fixed corruption ratios, limiting adaptability to varying reliability conditions.

Method: Introduces Continuous Reliability Spectrum to unify missingness and quality degradation, then proposes QA-MoE framework that quantifies modality reliability via self-supervised aleatoric uncertainty to guide expert routing and suppress error propagation.

Result: Extensive experiments show QA-MoE achieves competitive or state-of-the-art performance across diverse degradation scenarios and exhibits promising One-Checkpoint-for-All property in practice.

Conclusion: The proposed framework effectively handles real-world multimodal imperfections by dynamically assessing and routing based on modality reliability, offering practical advantages for deployment.

Abstract: Multimodal Sentiment Analysis (MSA) aims to infer human sentiment from textual, acoustic, and visual signals. In real-world scenarios, however, multimodal inputs are often compromised by dynamic noise or modality missingness. Existing methods typically treat these imperfections as discrete cases or assume fixed corruption ratios, which limits their adaptability to continuously varying reliability conditions. To address this, we first introduce a Continuous Reliability Spectrum to unify missingness and quality degradation into a single framework. Building on this, we propose QA-MoE, a Quality-Aware Mixture-of-Experts framework that quantifies modality reliability via self-supervised aleatoric uncertainty. This mechanism explicitly guides expert routing, enabling the model to suppress error propagation from unreliable signals while preserving task-relevant information. Extensive experiments indicate that QA-MoE achieves competitive or state-of-the-art performance across diverse degradation scenarios and exhibits a promising One-Checkpoint-for-All property in practice.
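The routing idea can be illustrated with a toy gating rule: scale each expert's score by an estimate of its modality's reliability, so corrupted channels receive less weight. This is a hedged sketch of the general mechanism, not QA-MoE's actual architecture; the specific gating formula and function names are assumptions.

```python
import numpy as np

def reliability_weighted_routing(gate_logits, sigma2):
    """Downweight experts tied to unreliable modalities.
    gate_logits: raw router score per expert (one expert per modality here);
    sigma2: estimated aleatoric variance per modality (higher = less reliable).
    Illustrative gating rule only; QA-MoE's exact mechanism may differ."""
    reliability = 1.0 / (1.0 + np.asarray(sigma2, dtype=float))  # in (0, 1]
    scores = np.asarray(gate_logits, dtype=float) + np.log(reliability)
    weights = np.exp(scores - scores.max())                      # stable softmax
    return weights / weights.sum()

# Three modality experts: text, audio, vision; audio is heavily corrupted.
w = reliability_weighted_routing(gate_logits=[1.0, 1.0, 1.0],
                                 sigma2=[0.1, 10.0, 0.5])
```

With equal router logits, the corrupted audio expert receives the smallest weight, which is the error-suppression behavior the abstract describes.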

[429] Can Large Language Models Reinvent Foundational Algorithms?

Jian Zhao, Haoren Luo, Yu Wang, Yuhan Cao, Pingyue Sheng, Tianxing He

Main category: cs.AI

TL;DR: LLMs can reinvent foundational CS algorithms after targeted unlearning, with success rates up to 90% with hints, but complex algorithms remain challenging.

Motivation: To investigate whether LLMs have the capacity for foundational innovation by testing if they can reinvent core computer science algorithms after having that knowledge removed.

Method: Unlearn-and-Reinvent pipeline: uses GRPO-based on-policy unlearning to remove specific algorithms from LLMs, then tests reinvention in controlled environments with varying hint levels (0-2).

Result: Qwen3-4B-Thinking-2507 reinvented 50% of algorithms with no hints, 70% with level 1 hints, and 90% with level 2 hints. Complex algorithms still failed even with step-by-step hints. Test-time RL enabled reinvention of the Strassen algorithm at hint level 2.

Conclusion: LLMs show promising but limited innovative thinking capacity - they can reinvent algorithms but struggle with complexity. Generative verifiers help sustain reasoning and prevent “thought collapse.”

Abstract: LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our Unlearn-and-Reinvent pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra’s or Euclid’s algorithm, from an LLM’s pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model Qwen3-4B-Thinking-2507 successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for those complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that the generative verifier in the reinvention phase plays a critical role in sustaining models’ reasoning strength, helping to avoid the “thought collapse” phenomenon. These findings offer insights into both the potential and current limits of LLMs’ innovative thinking.

[430] Emergent social transmission of model-based representations without inference

Silja Keßler, Miriam Bautista-Salinero, Claudio Tennie, Charley M. Wu

Main category: cs.AI

TL;DR: Simple social learning mechanisms without mental state inference can transmit complex representations through cultural evolution by biasing learners’ experiences toward expert behavior.

Motivation: To understand how humans acquire rich knowledge from others despite cognitive limitations, challenging the assumption that costly mentalizing (inferring others' beliefs) is necessary, and exploring whether simple social cues can support behavioral transmission.

Method: Reinforcement learning simulations where a naïve agent searches for rewards in a reconfigurable environment, learning either alone or by observing an expert without mental state inference. The learner uses simple heuristics: either selecting actions based on observed expert actions or boosting value representations of observed actions.

Result: Simple social cues bias the learner’s experience, causing its representation to converge toward the expert’s. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations compared to asocial learning.

Conclusion: Cultural transmission can arise from simple, non-mentalizing processes that exploit asocial learning mechanisms, demonstrating how complex knowledge can be transmitted without requiring computationally expensive mental state inference.

Abstract: How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others’ beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert - crucially, without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner’s experience, causing its representation to converge toward the expert’s. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.

[431] Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng

Main category: cs.AI

TL;DR: STEP-HRL is a hierarchical reinforcement learning framework for LLM agents that enables step-level learning using single-step transitions instead of full interaction histories, improving performance and reducing computational costs.

Motivation: Existing LLM agents rely on increasingly long interaction histories, leading to high computational costs and limited scalability. There's a need for more efficient approaches that can maintain strong decision-making capabilities while reducing token usage.

Method: Proposes STEP-HRL, a hierarchical reinforcement learning framework that: 1) Structures tasks hierarchically using completed subtasks to represent global progress, 2) Introduces a local progress module that iteratively summarizes interaction history within each subtask, 3) Produces augmented step-level transitions for both high-level and low-level policies using only single-step transitions rather than full histories.

Result: Experimental results on ScienceWorld and ALFWorld benchmarks show that STEP-HRL substantially outperforms baselines in terms of performance and generalization while significantly reducing token usage.

Conclusion: STEP-HRL provides an efficient hierarchical reinforcement learning framework for LLM agents that enables step-level learning, reduces computational costs, and improves performance and generalization capabilities.

Abstract: Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limited scalability. In this paper, we propose STEP-HRL, a hierarchical reinforcement learning (HRL) framework that enables step-level learning by conditioning only on single-step transitions rather than full interaction histories. STEP-HRL structures tasks hierarchically, using completed subtasks to represent the global progress of the overall task. By introducing a local progress module, it also iteratively and selectively summarizes interaction history within each subtask to produce a compact summary of local progress. Together, these components yield augmented step-level transitions for both high-level and low-level policies. Experimental results on ScienceWorld and ALFWorld benchmarks consistently demonstrate that STEP-HRL substantially outperforms baselines in terms of performance and generalization while reducing token usage. Our code is available at https://github.com/TonyStark042/STEP-HRL.

[432] Reciprocal Trust and Distrust in Artificial Intelligence Systems: The Hard Problem of Regulation

Martino Maggetti

Main category: cs.AI

TL;DR: AI systems should be recognized as having agency and capable of engaging in reciprocal trust relationships with humans, which has implications for AI regulation and governance.

Motivation: The paper addresses growing concerns about AI regulation and trustworthiness, arguing that current discussions overlook the potential for AI systems to exercise agency and engage in reciprocal trust relationships with humans.

Method: Theoretical/philosophical analysis examining AI systems as artifacts capable of agency, exploring implications for trust dynamics between AI and humans, and analyzing regulatory consequences.

Result: Identifies that recognizing AI agency enables understanding of reciprocal trust relationships, which creates new challenges and tensions for AI regulation and governance frameworks.

Conclusion: AI systems should be viewed as having some degree of agency, enabling trust relationships with humans, which poses unresolved dilemmas for future AI regulation that must address these reciprocal dynamics.

Abstract: Policy makers, scientists, and the public are increasingly confronted with thorny questions about the regulation of artificial intelligence (AI) systems. A key common thread concerns whether AI can be trusted and the factors that can make it more trustworthy in front of stakeholders and users. This is indeed crucial, as the trustworthiness of AI systems is fundamental for both democratic governance and for the development and deployment of AI. This article advances the discussion by arguing that AI systems should also be recognized, at least to some extent, as artifacts capable of exercising a form of agency, thereby enabling them to engage in relationships of trust or distrust with humans. It further examines the implications of these reciprocal trust dynamics for regulators tasked with overseeing AI systems. The article concludes by identifying key tensions and unresolved dilemmas that these dynamics pose for the future of AI regulation and governance.

[433] Vision-Guided Iterative Refinement for Frontend Code Generation

Hannah Sansford, Derek H. C. Law, Wei Liu, Abhishek Tripathi, Niresh Agarwal, Gerrit J. J. van den Burg

Main category: cs.AI

TL;DR: Automated critic-in-the-loop framework uses vision-language model as visual critic to provide structured feedback on rendered webpages, guiding iterative refinement of generated frontend code, achieving up to 17.8% performance improvement.

Motivation: Current code generation with LLMs relies on costly multi-stage human-in-the-loop refinement, especially in frontend web development where solution quality depends on the visual output. An automated approach is needed to improve code quality without human intervention.

Method: Proposes fully automated critic-in-the-loop framework where VLM serves as visual critic providing structured feedback on rendered webpages. Uses iterative refinement cycles (up to 3) guided by VLM feedback. Also investigates parameter-efficient fine-tuning using LoRA to internalize critic improvements.

Result: Achieves consistent improvements in solution quality across real-world user requests from WebDev Arena dataset, with up to 17.8% performance increase over three refinement cycles. Fine-tuning achieves 25% of gains from best critic-in-the-loop solution without significant token count increase.

Conclusion: Automated VLM-based critique of frontend code generation yields significantly higher quality solutions than single LLM inference pass. Highlights importance of iterative refinement for complex visual outputs in web development. Fine-tuning can internalize some critic benefits.

Abstract: Code generation with large language models often relies on multi-stage human-in-the-loop refinement, which is effective but very costly - particularly in domains such as frontend web development where the solution quality depends on rendered visual output. We present a fully automated critic-in-the-loop framework in which a vision-language model serves as a visual critic that provides structured feedback on rendered webpages to guide iterative refinement of generated code. Across real-world user requests from the WebDev Arena dataset, this approach yields consistent improvements in solution quality, achieving up to 17.8% increase in performance over three refinement cycles. Next, we investigate parameter-efficient fine-tuning using LoRA to understand whether the improvements provided by the critic can be internalized by the code-generating LLM. Fine-tuning achieves 25% of the gains from the best critic-in-the-loop solution without a significant increase in token counts. Our findings indicate that automated, VLM-based critique of frontend code generation leads to significantly higher quality solutions than can be achieved through a single LLM inference pass, and highlight the importance of iterative refinement for the complex visual outputs associated with web development.
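The generate-render-critique loop described above has a simple generic shape, sketched below. The `generate`, `render`, and `critique` callables are hypothetical stand-ins for the code LLM, a headless-browser screenshot, and the VLM critic; the paper's actual prompts, feedback schema, and stopping rule are not specified here.

```python
def refine_with_visual_critic(prompt, generate, render, critique, max_cycles=3):
    """Generic critic-in-the-loop skeleton (illustrative, not the paper's code).
    Each cycle renders the current code and asks the critic for feedback;
    empty feedback means the critic is satisfied."""
    code = generate(prompt, feedback=None)
    history = [code]
    for _ in range(max_cycles):
        feedback = critique(render(code))
        if not feedback:                      # critic has no complaints
            break
        code = generate(prompt, feedback=feedback)
        history.append(code)
    return code, history

# Toy stand-ins: the "critic" complains until the page has a header tag.
gen = lambda p, feedback=None: "<h1>Hello</h1>" if feedback else "<p>Hello</p>"
ren = lambda code: code                       # pretend-render: identity
crit = lambda page: "" if "<h1>" in page else "missing page header"
final, hist = refine_with_visual_critic("landing page", gen, ren, crit)
```

The `max_cycles=3` cap mirrors the three refinement cycles evaluated in the paper; in the toy run the critic's feedback fixes the page after one revision.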

[434] Deep Researcher Agent: An Autonomous Framework for 24/7 Deep Learning Experimentation with Zero-Cost Monitoring

Xiangyue Zhang

Main category: cs.AI

TL;DR: Deep Researcher Agent is an open-source framework for autonomous deep learning experiments using LLM agents, featuring zero-cost monitoring, constant-size memory, and minimal-toolset architecture.

Motivation: Existing AI research assistants focus on paper writing or code generation but don't address the full experiment lifecycle. There's a need for autonomous systems that can conduct deep learning experiments continuously with cost efficiency and scalability.

Method: Three key innovations: 1) Zero-Cost Monitoring using process-level checks and log file reads without LLM API calls during training; 2) Two-Tier Constant-Size Memory capped at ~5K characters to prevent unbounded context growth; 3) Minimal-Toolset Leader-Worker Architecture with workers having only 3-5 tools to reduce token overhead.

Result: The framework autonomously completed 500+ experiment cycles across four concurrent research projects over 30+ days, achieving 52% improvement over baseline metrics in one project through 200+ automated experiments, with average LLM cost of $0.08 per 24-hour cycle.

Conclusion: The Deep Researcher Agent framework enables autonomous, cost-effective deep learning experimentation at scale, addressing the full experiment lifecycle with innovative solutions for monitoring, memory management, and agent architecture.

Abstract: We present Deep Researcher Agent, an open-source framework that enables large language model (LLM) agents to autonomously conduct deep learning experiments around the clock. Unlike existing AI research assistants that focus on paper writing or code generation, our system addresses the full experiment lifecycle: hypothesis formation, code implementation, training execution, result analysis, and iterative refinement. The framework introduces three key innovations: (1) Zero-Cost Monitoring – a monitoring paradigm that incurs zero LLM API costs during model training by relying solely on process-level checks and log file reads; (2) Two-Tier Constant-Size Memory – a memory architecture capped at ~5K characters regardless of runtime duration, preventing the unbounded context growth that plagues long-running agents; and (3) Minimal-Toolset Leader-Worker Architecture – a multi-agent design where each worker agent is equipped with only 3–5 tools, reducing per-call token overhead by up to 73%. In sustained deployments spanning 30+ days, the framework autonomously completed 500+ experiment cycles across four concurrent research projects, achieving a 52% improvement over baseline metrics in one project through 200+ automated experiments – all at an average LLM cost of $0.08 per 24-hour cycle. Code is available at https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7.
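"Zero-cost monitoring" amounts to checking the training process and tailing its log with plain OS calls, so no LLM tokens are spent while a run is in progress. The stdlib sketch below illustrates the pattern under assumed function names; the framework's actual monitoring code may differ.

```python
import os

def training_alive(pid):
    """Signal 0 probes process existence without delivering a real signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True        # process exists but is owned by another user
    return True

def tail_log(path, n=5):
    """Read the last n lines of a training log; no LLM API call involved."""
    try:
        with open(path, "r", errors="replace") as f:
            return f.readlines()[-n:]
    except FileNotFoundError:
        return []

alive = training_alive(os.getpid())   # monitor our own process for the demo
```

An agent can poll these two checks on a timer and only invoke the LLM when the process dies or the log shows an anomaly, which is how per-cycle API costs stay near zero.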

[435] Contextual Control without Memory Growth in a Context-Switching Task

Song-Ju Kim

Main category: cs.AI

TL;DR: Intervention-based recurrent architecture enables contextual control without enlarging recurrent dimensionality by using additive context-indexed operators on shared latent states.

Motivation: Current approaches to context-dependent sequential decision making either provide context explicitly as input or increase recurrent memory to represent context internally. The authors explore a third alternative: achieving contextual dependence by intervening on a shared recurrent latent state without increasing recurrent dimensionality.

Method: Proposed an intervention-based recurrent architecture where a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. Evaluated on context-switching sequential decision tasks under partial observability, comparing three model families: label-assisted baseline (direct context access), memory baseline (enlarged recurrent state), and the intervention model (no direct context input, no memory growth).

Result: The intervention model performed strongly on the main benchmark without additional recurrent dimensions. Using conditional mutual information (I(C;O | S)) as an operational probe, the intervention model exhibited positive conditional contextual information for task-relevant phase-1 outcomes.

Conclusion: Intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in sequential decision making tasks under partial observability.

Abstract: Context-dependent sequential decision making is commonly addressed either by providing context explicitly as an input or by increasing recurrent memory so that contextual information can be represented internally. We study a third alternative: realizing contextual dependence by intervening on a shared recurrent latent state, without enlarging recurrent dimensionality. To this end, we introduce an intervention-based recurrent architecture in which a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. We evaluate this idea on a context-switching sequential decision task under partial observability. We compare three model families: a label-assisted baseline with direct context access, a memory baseline with enlarged recurrent state, and the proposed intervention model, which uses no direct context input to the recurrent core and no memory growth. On the main benchmark, the intervention model performs strongly without additional recurrent dimensions. We also evaluate the models using the conditional mutual information (I(C;O | S)) as a theorem-motivated operational probe of contextual dependence at fixed latent state. For task-relevant phase-1 outcomes, the intervention model exhibits positive conditional contextual information. Together, these results suggest that intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in this setting.
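The "additive, context-indexed operator" can be made concrete in a few lines: a shared recurrent core produces the same pre-intervention state for every context, and the context then only adds a learned offset. The NumPy sketch below illustrates the idea; the specific core dynamics and operator parameterization are assumptions, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                        # shared latent dim (not enlarged)
W_h = rng.normal(scale=0.3, size=(d, d))     # recurrent weights (shared core)
W_x = rng.normal(scale=0.3, size=(d, 4))     # input weights
B = rng.normal(size=(2, d))                  # one additive operator per context

def step(h, x, context):
    """Shared pre-intervention state, then an additive context-indexed shift.
    Illustrative sketch; the paper's core/operator forms may differ."""
    h_pre = np.tanh(W_h @ h + W_x @ x)       # identical dynamics in all contexts
    return h_pre + B[context]                # intervention: context adds a bias

h = np.zeros(d)
x = rng.normal(size=4)
h0 = step(h, x, context=0)
h1 = step(h, x, context=1)
diverged = not np.allclose(h0, h1)           # same input, different contexts
```

Note that the recurrent state never grows: the two contexts share every weight matrix and differ only by the additive offsets in `B`, so their post-intervention states differ by exactly `B[0] - B[1]`.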

[436] When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

Uljad Berdica, Fernando Acero, Anton Ipsen, Parisa Zehtabi, Michael Cashmore, Manuela Veloso

Main category: cs.AI

TL;DR: LLMP-UCB bandit algorithm uses LLMs for uncertainty estimates in contextual multi-armed bandits with mixed text/numerical data, but finds lightweight numerical bandits on embeddings match LLM performance at much lower cost.

Motivation: Address computational expense and uncertainty estimation challenges when using LLMs for sequential decision making in financial applications with mixed text/numerical contexts.

Method: Introduces LLMP-UCB algorithm that derives uncertainty estimates from LLMs via repeated inference, compares with lightweight numerical bandits operating on text embeddings, and proposes geometric diagnostic for deployment decisions.

Result: Lightweight numerical bandits on embeddings match or exceed LLM-based solutions at fraction of cost; embedding dimensionality controls exploration-exploitation balance; geometric diagnostic guides deployment choices.

Conclusion: Provides principled framework for cost-effective, uncertainty-aware decision systems in financial services, showing embeddings often sufficient without expensive LLM reasoning.

Abstract: We study Contextual Multi-Armed Bandits (CMABs) for non-episodic sequential decision making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer selection; all frequent problems in finance). While Large Language Models (LLMs) are increasingly applied to these settings, utilizing LLMs for reasoning at every decision step is computationally expensive and uncertainty estimates are difficult to obtain. To address this, we introduce LLMP-UCB, a bandit algorithm that derives uncertainty estimates from LLMs via repeated inference. However, our experiments demonstrate that lightweight numerical bandits operating on text embeddings (dense or Matryoshka) match or exceed the accuracy of LLM-based solutions at a fraction of their cost. We further show that embedding dimensionality is a practical lever on the exploration-exploitation balance, enabling cost–performance tradeoffs without prompt complexity. Finally, to guide practitioners, we propose a geometric diagnostic based on the arms’ embedding to decide when to use LLM-driven reasoning versus a lightweight numerical bandit. Our results provide a principled deployment framework for cost-effective, uncertainty-aware decision systems with broad applicability across AI use cases in financial services.
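The "lightweight numerical bandit on text embeddings" baseline is in the LinUCB family: a per-arm linear model over the embedding with an upper-confidence exploration bonus. The sketch below is a standard disjoint LinUCB on synthetic "embeddings", offered as an illustration of that class of baseline rather than the paper's exact configuration.

```python
import numpy as np

class LinUCB:
    """Disjoint linear UCB over context embeddings: the kind of lightweight
    numerical bandit the paper compares against LLM-driven policies."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]   # ridge Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

rng = np.random.default_rng(0)
dim = 5
true_theta = [rng.normal(size=dim), rng.normal(size=dim)]
bandit = LinUCB(n_arms=2, dim=dim, alpha=0.5)
hits = 0
for t in range(300):
    x = rng.normal(size=dim)                 # stand-in for a text embedding
    best = int(np.argmax([th @ x for th in true_theta]))
    a = bandit.select(x)
    hits += int(a == best)
    bandit.update(a, x, true_theta[a] @ x + 0.1 * rng.normal())
accuracy = hits / 300
```

Shrinking `dim` (e.g., via Matryoshka embeddings, as the abstract notes) trades statistical resolution for cheaper updates, which is the exploration-exploitation lever the paper highlights.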

[437] JTON: A Token-Efficient JSON Superset with Zen Grid Tabular Encoding for Large Language Models

Gowthamkumar Nandakishore

Main category: cs.AI

TL;DR: JTON (JSON Tabular Object Notation) is a JSON superset that reduces token overhead for tabular data by factoring column headers into a single row and using semicolon-separated values, achieving 15-60% token reduction while maintaining JSON compatibility.

Motivation: Standard JSON serialization for structured/tabular data wastes tokens by repeating key names in every row, leading to increased costs and inefficient context utilization for LLMs. This overhead scales linearly with row count.

Method: Introduces JTON with Zen Grid format that factors column headers into a single row and encodes values with semicolons, preserving JSON’s type system while eliminating redundancy. Includes a Rust/PyO3 reference implementation with SIMD-accelerated parsing.

Result: Across seven domains, Zen Grid reduces token counts by 15-60% vs JSON compact (28.5% average). LLM comprehension tests show +0.3 pp accuracy gain over JSON. Generation tests achieve 100% syntactic validity. Parsing is 1.4x faster than Python’s json module.

Conclusion: JTON provides an efficient JSON-compatible serialization format for tabular data that significantly reduces token overhead for LLMs while maintaining or improving comprehension and generation performance.

Abstract: When LLMs process structured data, the serialization format directly affects cost and context utilization. Standard JSON wastes tokens repeating key names in every row of a tabular array, overhead that scales linearly with row count. This paper presents JTON (JSON Tabular Object Notation), a strict JSON superset whose main idea, Zen Grid, factors column headers into a single row and encodes values with semicolons, preserving JSON’s type system while cutting redundancy. Across seven real-world domains, Zen Grid reduces token counts by 15-60% versus JSON compact (28.5% average; 32% with bare_strings). Comprehension tests on 10 LLMs show a net +0.3 pp accuracy gain over JSON: four models improve, three hold steady, and three dip slightly. Generation tests on 12 LLMs yield 100% syntactic validity in both few-shot and zero-shot settings. A Rust/PyO3 reference implementation adds SIMD-accelerated parsing at 1.4x the speed of Python’s json module. Code, a 683-vector test suite, and all experimental data are publicly available.
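The header-factoring idea is easy to demonstrate: emit the column names once, then only semicolon-joined values per row. The snippet below mimics the *idea* of Zen Grid; the actual JTON grammar (escaping rules, type markers, bare_strings mode) is defined by the paper and almost certainly differs from this toy encoder.

```python
import json

def to_zen_grid(rows, sep=";"):
    """Factor repeated keys into one header row; pack JSON-encoded values
    per row. Illustrative only, not a conformant JTON/Zen Grid encoder."""
    header = list(rows[0])                    # column order from the first row
    lines = [sep.join(header)]
    for r in rows:
        lines.append(sep.join(json.dumps(r[k]) for k in header))
    return "\n".join(lines)

rows = [{"id": 1, "city": "Oslo"},
        {"id": 2, "city": "Lima"},
        {"id": 3, "city": "Pune"}]
grid = to_zen_grid(rows)
compact = json.dumps(rows, separators=(",", ":"))
saving = 1 - len(grid) / len(compact)         # rough proxy for token savings
```

Even on this tiny table the character count drops noticeably, and the saving grows with row count because the keys are paid for only once instead of once per row.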

[438] Joint Knowledge Base Completion and Question Answering by Combining Large Language Models and Small Language Models

Yinan Liu, Dongying Lin, Sigang Luo, Xiaochun Yang, Bin Wang

Main category: cs.AI

TL;DR: JCQL is a novel framework that jointly solves knowledge base completion (KBC) and knowledge base question answering (KBQA) by combining large language models (LLMs) with small language models (SLMs) in an iterative reinforcement process.

DetailsMotivation: Existing approaches typically use small language models for joint KBC and KBQA tasks, ignoring the strong reasoning capabilities of large language models. The authors aim to leverage both LLMs and SLMs to make KBC and KBQA reinforce each other, addressing issues like LLM hallucination and high computational costs in KBQA while improving SLM performance in KBC.

Method: JCQL combines LLMs with SLMs in an iterative framework: 1) For KBC enhancing KBQA, an SLM-trained KBC model is incorporated as an action in the LLM agent-based KBQA model to augment reasoning paths; 2) For KBQA enhancing KBC, KBQA reasoning paths are used as supplementary training data to incrementally fine-tune the KBC model.

Result: Extensive experiments on two public benchmark datasets show that JCQL surpasses all baselines for both KBC and KBQA tasks, demonstrating the effectiveness of the joint reinforcement approach.

Conclusion: The JCQL framework successfully leverages the complementary strengths of LLMs and SLMs to create a mutually reinforcing system between KBC and KBQA, achieving state-of-the-art performance on both tasks through iterative enhancement.

Abstract: Knowledge Bases (KBs) play a key role in various applications. As two representative KB-related tasks, knowledge base completion (KBC) and knowledge base question answering (KBQA) are closely related and inherently complementary to each other. Thus, it will be beneficial to solve the task of joint KBC and KBQA to make them reinforce each other. However, existing studies usually rely on the small language model (SLM) to enhance them jointly, and the strong reasoning ability of the large language model (LLM) is ignored. In this paper, by combining the strengths of the LLM with the SLM, we propose a novel framework JCQL, which can make these two tasks enhance each other in an iterative manner. To make KBC enhance KBQA, we augment the LLM agent-based KBQA model’s reasoning paths by incorporating an SLM-trained KBC model as an action of the agent, alleviating the LLM’s hallucination and high computational cost issues in KBQA. To make KBQA enhance KBC, we incrementally fine-tune the KBC model by leveraging KBQA’s reasoning paths as its supplementary training data, improving the ability of the SLM in KBC. Extensive experiments over two public benchmark datasets demonstrate that JCQL surpasses all baselines for both KBC and KBQA tasks.

[439] HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, Huan Li

Main category: cs.AI

TL;DR: HybridKV is a hybrid KV cache compression framework for multimodal LLMs that reduces memory usage and speeds up decoding by classifying attention heads and applying specialized compression strategies.

DetailsMotivation: MLLMs face prohibitive memory overhead and latency due to KV cache growth from visual inputs (thousands of tokens per image). Existing compression methods use uniform allocation but overlook heterogeneous attention head behaviors requiring distinct compression strategies.

Method: Three-stage hybrid framework: 1) Classify heads as static or dynamic using text-centric attention patterns, 2) Top-down hierarchical budget allocation across layers and heads, 3) Static heads compressed via text-prior pruning, dynamic heads via chunk-wise retrieval.

Result: On 11 multimodal benchmarks with Qwen2.5-VL-7B: reduces KV cache memory by up to 7.9×, achieves 1.52× faster decoding, with almost no performance drop or even improvement relative to full-cache MLLM.

Conclusion: HybridKV effectively addresses KV cache inefficiency in MLLMs by recognizing attention head heterogeneity and applying tailored compression strategies, enabling more efficient multimodal reasoning with minimal performance impact.

Abstract: Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates complementary strategies in three stages: heads are first classified into static or dynamic types using text-centric attention; then a top-down budget allocation scheme hierarchically assigns KV budgets; finally, static heads are compressed by text-prior pruning and dynamic heads by chunk-wise retrieval. Experiments on 11 multimodal benchmarks with Qwen2.5-VL-7B show that HybridKV reduces KV cache memory by up to $7.9\times$ and achieves $1.52\times$ faster decoding, with almost no performance drop or even higher relative to the full-cache MLLM.
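The abstract names the three HybridKV stages but not their concrete rules. A toy sketch of the first two stages under illustrative assumptions (the 0.6 threshold, the 70/30 dynamic/static split, and the even per-head division are our choices, not the paper's):

```python
def classify_and_allocate(text_attn, total_budget, threshold=0.6, dyn_share=0.7):
    # text_attn[h]: fraction of head h's attention mass on text tokens.
    # Heads above the threshold are treated as "static" (candidates for
    # text-prior pruning); the rest as "dynamic" (chunk-wise retrieval).
    kinds = ["static" if a >= threshold else "dynamic" for a in text_attn]
    # Top-down allocation: hand dynamic heads a larger share of the total
    # KV budget, then split each share evenly among heads of that type.
    n_dyn = max(1, kinds.count("dynamic"))
    n_stat = max(1, kinds.count("static"))
    budgets = [
        dyn_share * total_budget / n_dyn if k == "dynamic"
        else (1 - dyn_share) * total_budget / n_stat
        for k in kinds
    ]
    return kinds, budgets

kinds, budgets = classify_and_allocate([0.9, 0.8, 0.3, 0.2], total_budget=1000)
```

The point of the sketch is only the shape of the mechanism: classification precedes allocation, and the two head types then receive different compression strategies.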

[440] Context-Value-Action Architecture for Value-Driven Large Language Model Agents

TianZe Zhang, Sirui Sun, Yuhang Xie, Xin Zhang, Zhiqiang Wu, Guojie Song

Main category: cs.AI

TL;DR: CVA architecture improves LLM-based human behavior simulation by decoupling action generation from reasoning, reducing value polarization and enhancing fidelity using real human data.

DetailsMotivation: Current LLM-based agents exhibit behavioral rigidity masked by self-referential bias in evaluations. Increasing reasoning intensity paradoxically causes value polarization and reduces population diversity, revealing flaws in existing approaches.

Method: Proposes Context-Value-Action (CVA) architecture based on Stimulus-Organism-Response model and Schwartz’s Theory of Basic Human Values. Uses a Value Verifier trained on authentic human data to decouple action generation from cognitive reasoning, explicitly modeling dynamic value activation.

Result: CVA significantly outperforms baselines on CVABench (1.1M+ real-world interaction traces), effectively mitigates polarization while offering superior behavioral fidelity and interpretability.

Conclusion: CVA architecture addresses limitations of current LLM-based human behavior simulation by reducing polarization and improving fidelity through decoupled reasoning and value-based modeling with real human data.

Abstract: Large Language Models (LLMs) have shown promise in simulating human behavior, yet existing agents often exhibit behavioral rigidity, a flaw frequently masked by the self-referential bias of current “LLM-as-a-judge” evaluations. By evaluating against empirical ground truth, we reveal a counter-intuitive phenomenon: increasing the intensity of prompt-driven reasoning does not enhance fidelity but rather exacerbates value polarization, collapsing population diversity. To address this, we propose the Context-Value-Action (CVA) architecture, grounded in the Stimulus-Organism-Response (S-O-R) model and Schwartz’s Theory of Basic Human Values. Unlike methods relying on self-verification, CVA decouples action generation from cognitive reasoning via a novel Value Verifier trained on authentic human data to explicitly model dynamic value activation. Experiments on CVABench, which comprises over 1.1 million real-world interaction traces, demonstrate that CVA significantly outperforms baselines. Our approach effectively mitigates polarization while offering superior behavioral fidelity and interpretability.

[441] MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Maria Nesterova, Mikhail Kolosov, Anton Andreychuk, Egor Cherepanov, Oleg Bulichev, Alexey Kovalev, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

Main category: cs.AI

TL;DR: MARL-GPT: A single GPT-based model trained with offline RL on massive expert trajectories achieves competitive performance across diverse multi-agent environments without task-specific tuning.

DetailsMotivation: Current multi-agent reinforcement learning approaches require specialized models for each task/environment. The authors aim to develop a single, general-purpose model that can perform well across diverse MARL domains, similar to how foundation models work in NLP.

Method: MARL-GPT uses offline reinforcement learning to train a GPT-based model on massive expert trajectories (400M for SMACv2, 100M for GRF, 1B for POGEMA). It employs a single transformer-based observation encoder that requires no task-specific tuning or modifications.

Result: The model achieves competitive performance compared to specialized baselines across all tested environments (StarCraft Multi-Agent Challenge, Google Research Football, and POGEMA), demonstrating the feasibility of a general-purpose MARL model.

Conclusion: It is possible to build a multi-task transformer-based model for diverse multi-agent problems, paving the way for fundamental MARL models similar to foundation models in natural language processing.

Abstract: Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT-based model to learn and perform well across diverse MARL environments and tasks, including StarCraft Multi-Agent Challenge, Google Research Football and POGEMA. Our method, MARL-GPT, applies offline reinforcement learning to train at scale on the expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer-based observation encoder that requires no task-specific tuning. Experiments show that MARL-GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi-task transformer-based model for a wide variety of (significantly different) multi-agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

[442] Towards Trustworthy Report Generation: A Deep Research Agent with Progressive Confidence Estimation and Calibration

Yi Yuan, Xuhong Wang, Shanzhe Lei

Main category: cs.AI

TL;DR: A deep research agent with progressive confidence estimation and calibration for generating trustworthy research reports, addressing the limitation of existing evaluation frameworks that fail to measure epistemic confidence in open-ended scenarios.

DetailsMotivation: Existing evaluation frameworks for deep research agents focus on subjective dimensions and fail to capture trustworthiness, especially in open-ended research scenarios where ground-truth answers are unavailable. Current methods cannot effectively measure epistemic confidence, making calibration difficult and leaving users vulnerable to misleading or hallucinated information.

Method: Proposes a novel deep research agent incorporating progressive confidence estimation and calibration within the report generation pipeline. Uses a deliberative search model with deep retrieval and multi-hop reasoning to ground outputs in verifiable evidence while assigning confidence scores to individual claims. Combines this with a carefully designed workflow.

Result: Experimental results and case studies demonstrate that the method substantially improves interpretability and significantly increases user trust in generated reports.

Conclusion: The proposed approach produces trustworthy reports with enhanced transparency by addressing the critical limitation of measuring and calibrating epistemic confidence in deep research agents.

Abstract: As agent-based systems continue to evolve, deep research agents are capable of automatically generating research-style reports across diverse domains. While these agents promise to streamline information synthesis and knowledge exploration, existing evaluation frameworks, typically based on subjective dimensions, fail to capture a critical aspect of report quality: trustworthiness. In open-ended research scenarios where ground-truth answers are unavailable, current evaluation methods cannot effectively measure the epistemic confidence of generated content, making calibration difficult and leaving users susceptible to misleading or hallucinated information. To address this limitation, we propose a novel deep research agent that incorporates progressive confidence estimation and calibration within the report generation pipeline. Our system leverages a deliberative search model, featuring deep retrieval and multi-hop reasoning to ground outputs in verifiable evidence while assigning confidence scores to individual claims. Combined with a carefully designed workflow, this approach produces trustworthy reports with enhanced transparency. Experimental results and case studies demonstrate that our method substantially improves interpretability and significantly increases user trust.

[443] Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

Renxuan Tan, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

Main category: cs.AI

TL;DR: PLC is a game-theoretic framework for multi-objective preference alignment in LLMs that uses consensus-driven lenient gradient rectification to escape local optima and explore better Pareto frontiers.

DetailsMotivation: Current multi-preference alignment methods use static linear scalarization or rigid gradient projection, which force conflict avoidance and converge prematurely to local stationary points, sacrificing potential global Pareto improvements.

Method: Proposes Pareto-Lenient Consensus (PLC), a game-theoretic framework that treats alignment as dynamic negotiation. Uses consensus-driven lenient gradient rectification to tolerate local degradation when there’s sufficient dominant coalition surplus, enabling escape from local suboptimal equilibria.

Result: Theoretical analysis shows PLC can facilitate stalemate escape and asymptotically converge to Pareto consensus equilibrium. Experiments demonstrate PLC surpasses baselines in both fixed-preference alignment and global Pareto frontier quality.

Conclusion: PLC highlights negotiation-driven alignment as a promising avenue for Multi-Objective Preference Alignment, enabling better exploration of Pareto-optimal frontiers beyond conservative compromises.

Abstract: Transcending the single-preference paradigm, aligning LLMs with diverse human values is pivotal for robust deployment. Contemporary Multi-Objective Preference Alignment (MPA) approaches predominantly rely on static linear scalarization or rigid gradient projection to navigate these trade-offs. However, by enforcing strict conflict avoidance or simultaneous descent, these paradigms often prematurely converge to local stationary points. While mathematically stable, these points represent a conservative compromise where the model sacrifices potential global Pareto improvements to avoid transient local trade-offs. To break this deadlock, we propose Pareto-Lenient Consensus (PLC), a game-theoretic framework that reimagines alignment as a dynamic negotiation process. Unlike rigid approaches, PLC introduces consensus-driven lenient gradient rectification, which dynamically tolerates local degradation provided there is a sufficient dominant coalition surplus, thereby empowering the optimization trajectory to escape local suboptimal equilibrium and explore the distal Pareto-optimal frontier. Theoretical analysis validates PLC can facilitate stalemate escape and asymptotically converge to a Pareto consensus equilibrium. Moreover, extensive experiments show that PLC surpasses baselines in both fixed-preference alignment and global Pareto frontier quality. This work highlights the potential of negotiation-driven alignment as a promising avenue for MPA. Our codes are available at https://anonymous.4open.science/r/aaa-6BB8.
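The paper's PLC update is not specified beyond "tolerate local degradation given a sufficient dominant coalition surplus." A toy illustration of that leniency idea (our formulation, not the authors' algorithm): where a rigid projection scheme would reject any direction that locally degrades some objective, here the averaged direction is accepted as long as the net gain across objectives outweighs the losses.

```python
import numpy as np

def lenient_consensus_step(grads, tolerance=0.0):
    # grads: one gradient row per objective.
    consensus = np.mean(grads, axis=0)
    alignment = grads @ consensus      # per-objective improvement proxy
    surplus = alignment.sum()          # coalition gains minus minority losses
    accepted = surplus > tolerance
    return consensus if accepted else np.zeros_like(consensus), accepted

# One objective mildly conflicts with the other two, yet the step is taken.
g = np.array([[1.0, 0.0], [0.8, 0.1], [-0.2, 0.05]])
step, ok = lenient_consensus_step(g)
```

A strict simultaneous-descent rule would stall here because the third objective's gradient opposes the consensus direction; the lenient rule trades that transient local loss for aggregate progress, which is the deadlock-breaking behavior the abstract describes.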

[444] Flowr – Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

Eranga Bandara, Ross Gore, Sachin Shetty, Piumi Siyambalapitiya, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Ravi Mukkamala, Peter Foytik, Safdar H. Bouk, Abdul Rahman, Xueping Liang, Amin Hass, Tharaka Hewa, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Main category: cs.AI

TL;DR: Flowr is an agentic AI framework that automates end-to-end retail supply chain workflows using specialized AI agents coordinated by LLMs, with human-in-the-loop supervision for accountability.

DetailsMotivation: Retail supply chain operations are manual, repetitive, decision-intensive, and difficult to scale despite data analytics investments. Current systems remain reactive and fragmented across outlets, distribution centers, and suppliers.

Method: Decomposes manual operations into specialized AI agents with defined cognitive roles. Uses a consortium of fine-tuned, domain-specialized LLMs coordinated by a central reasoning LLM. Implements human-in-the-loop orchestration via Model Context Protocol interface for manager supervision.

Result: Significantly reduces manual coordination overhead, improves demand-supply alignment, enables proactive exception handling at scale unachievable manually. Validated with large supermarket chain.

Conclusion: Flowr provides a generalizable blueprint for agentic AI-driven supply chain automation across enterprise settings while preserving accountability through human oversight.

Abstract: Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment, processes that are repetitive, decision-intensive, and difficult to scale without significant human effort. Despite growing investment in data analytics, the decision-making and coordination layers of these workflows remain predominantly manual, reactive, and fragmented across outlets, distribution centers, and supplier networks. This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations. Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination. To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM. Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control. Evaluation demonstrates that Flowr significantly reduces manual coordination overhead, improves demand-supply alignment, and enables proactive exception handling at a scale unachievable through manual processes. The framework was validated in collaboration with a large-scale supermarket chain and is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.

[445] Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

Michael Cuccarese

Main category: cs.AI

TL;DR: A method called “epistemic blinding” addresses LLMs’ inability to distinguish between data-driven inference and memorized priors by anonymizing entity identifiers during inference, enabling auditability of how much output comes from supplied data vs. model knowledge.

DetailsMotivation: LLMs silently blend data-driven inference with memorized priors about named entities, making it impossible to determine from outputs how much came from supplied data versus the model's training memory, which undermines auditability and scientific rigor in agentic systems.

Method: Epistemic blinding protocol replaces entity identifiers with anonymous codes before prompting, then compares outputs against unblinded controls. The system includes LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization, operating without access to entity identity.

Result: In oncology drug target prioritization across four cancer types, blinding changed 16% of top-20 predictions while preserving identical recovery of validated targets. In S&P 500 equity screening, brand-recognition bias reshaped 30-40% of top-20 rankings across five random seeds.

Conclusion: Epistemic blinding restores critical auditability by measuring how much output comes from supplied data versus model’s parametric knowledge, addressing a fundamental problem in LLM-based scientific reasoning that generalizes beyond biology to other domains.

Abstract: This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization. During development, it became apparent that LLM outputs silently blend data-driven inference with memorized priors about named entities - and the blend is invisible: there is no way to determine, from a single output, how much came from the data on the page and how much came from the model’s training memory. Epistemic blinding is a simple inference-time protocol that replaces entity identifiers with anonymous codes before prompting, then compares outputs against an unblinded control. The protocol does not make LLM reasoning deterministic, but it restores one critical axis of auditability: measuring how much of an output came from the supplied data versus the model’s parametric knowledge. The complete target identification system is described - including LLM-guided evolutionary optimization of scoring functions and blinded agentic reasoning for target rationalization - with demonstration that both stages operate without access to entity identity. In oncology drug target prioritization across four cancer types, blinding changes 16% of top-20 predictions while preserving identical recovery of validated targets. The contamination problem is shown to generalize beyond biology: in S&P 500 equity screening, brand-recognition bias reshapes 30-40% of top-20 rankings across five random seeds. To lower the barrier to adoption, the protocol is released as an open-source tool and as a Claude Code skill that enables one-command epistemic blinding within agentic workflows. The claim is not that blinded analysis produces better results, but that without blinding, there is no way to know to what degree the agent is adhering to the analytical process the researcher designed.
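The blinding step itself is mechanically simple: entity identifiers are replaced with anonymous codes before prompting, and a mapping is kept so outputs can be un-blinded and compared against an unblinded control run. A minimal sketch (the `ENTITY_###` code format is our choice; the released tool may use a different scheme):

```python
import itertools

def blind(records, key="name"):
    # Assign each distinct entity a stable anonymous code, so the LLM
    # sees the same code wherever the same entity recurs but cannot
    # draw on memorized priors about the real identifier.
    counter = itertools.count(1)
    codes = {}
    blinded = []
    for rec in records:
        entity = rec[key]
        if entity not in codes:
            codes[entity] = f"ENTITY_{next(counter):03d}"
        blinded.append({**rec, key: codes[entity]})
    # Return the inverse mapping for un-blinding the model's output.
    return blinded, {code: name for name, code in codes.items()}

recs = [{"name": "EGFR", "score": 0.9},
        {"name": "KRAS", "score": 0.7},
        {"name": "EGFR", "score": 0.4}]
blinded, unblind = blind(recs)
```

The audit then consists of diffing the blinded and unblinded outputs: any ranking change attributable only to the visible identifier is evidence of prior contamination.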

[446] How LLMs Follow Instructions: Skillful Coordination, Not a Universal Mechanism

Elisabetta Rocchetti, Alfio Ferrara

Main category: cs.AI

TL;DR: Instruction-following in language models relies on skillful coordination of diverse linguistic capabilities rather than a single universal mechanism, as shown through diagnostic probing across multiple tasks.

DetailsMotivation: To understand the underlying mechanism of instruction-following in language models - whether it relies on a universal mechanism or compositional skill deployment.

Method: Diagnostic probing across nine diverse tasks in three instruction-tuned models, using general vs. task-specific probes, cross-task transfer analysis, causal ablation, and temporal analysis of generation processes.

Result: Evidence against universal mechanism: 1) General probes underperform task-specific specialists, 2) Weak cross-task transfer clustered by skill similarity, 3) Sparse asymmetric dependencies rather than shared representations, 4) Tasks stratify by complexity across layers, 5) Constraint satisfaction operates as dynamic monitoring during generation.

Conclusion: Instruction-following is better characterized as skillful coordination of diverse linguistic capabilities rather than deployment of a single abstract constraint-checking process.

Abstract: Instruction tuning is commonly assumed to endow language models with a domain-general ability to follow instructions, yet the underlying mechanism remains poorly understood. Does instruction-following rely on a universal mechanism or compositional skill deployment? We investigate this through diagnostic probing across nine diverse tasks in three instruction-tuned models. Our analysis provides converging evidence against a universal mechanism. First, general probes trained across all tasks consistently underperform task-specific specialists, indicating limited representational sharing. Second, cross-task transfer is weak and clustered by skill similarity. Third, causal ablation reveals sparse asymmetric dependencies rather than shared representations. Tasks also stratify by complexity across layers, with structural constraints emerging early and semantic tasks emerging late. Finally, temporal analysis shows constraint satisfaction operates as dynamic monitoring during generation rather than pre-generation planning. These findings indicate that instruction-following is better characterized as skillful coordination of diverse linguistic capabilities rather than deployment of a single abstract constraint-checking process.

[447] Artificial Intelligence and the Structure of Mathematics

Maissam Barkeshli, Michael R. Douglas, Michael H. Freedman

Main category: cs.AI

TL;DR: AI can transform mathematics by providing new ways to understand formal proofs and mathematical structure through automated discovery and analysis of Platonic mathematical worlds.

DetailsMotivation: To explore how AI can complement mathematical logic in understanding the global structure of formal proofs and potentially answer fundamental questions about the nature of mathematics (discovered vs invented).

Method: Proposes analyzing mathematics through universal proof and structural hypergraphs, outlining criteria for AI models capable of automated mathematical discovery, and sending AI agents to explore Platonic mathematical worlds.

Result: Conceptual framework suggesting AI could reveal new perspectives on mathematical structure and potentially answer deep philosophical questions about mathematics.

Conclusion: AI may provide transformative insights into mathematics by offering complementary approaches to traditional logic and helping understand both the global structure of mathematics and human-comprehensible aspects.

Abstract: Recent progress in artificial intelligence (AI) is unlocking transformative capabilities for mathematics. There is great hope that AI will help solve major open problems and autonomously discover new mathematical concepts. In this essay, we further consider how AI may open a grand perspective on mathematics by forging a new route, complementary to mathematical logic, to understanding the global structure of formal proofs. We begin by providing a sketch of the formal structure of mathematics in terms of universal proof and structural hypergraphs and discuss questions this raises about the foundational structure of mathematics. We then outline the main ingredients and provide a set of criteria to be satisfied for AI models capable of automated mathematical discovery. As we send AI agents to traverse Platonic mathematical worlds, we expect they will teach us about the nature of mathematics: both as a whole, and the small ribbons conducive to human understanding. Perhaps they will shed light on the old question: “Is mathematics discovered or invented?” Can we grok the terrain of these Platonic worlds?

[448] ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Wang Yang, Chaoda Song, Xinpeng Li, Debargha Ganguly, Chuang Ma, Shouren Wang, Zhihao Dou, Yuli Zhou, Vipin Chaudhary, Xiaotian Han

Main category: cs.AI

TL;DR: ACE-Bench: A lightweight agent evaluation benchmark using grid-based planning tasks with controllable horizon and difficulty parameters, eliminating environment interaction overhead.

DetailsMotivation: Existing agent benchmarks suffer from high environment interaction overhead (up to 41% of evaluation time) and imbalanced task horizon/difficulty distributions that make aggregate scores unreliable.

Method: Proposes ACE-Bench built around unified grid-based planning tasks where agents fill hidden slots in partially completed schedules subject to local and global constraints. Uses two orthogonal axes: Scalable Horizons (H) controlling number of hidden slots, and Controllable Difficulty (B) governing decoy budget for misleading candidates. Features Lightweight Environment design where all tool calls are resolved via static JSON files.

Result: Validated that H and B provide reliable control over task horizon and difficulty, with strong domain consistency and model discriminability. Comprehensive experiments across 13 models of diverse sizes and families over 6 domains revealed significant cross-model performance variation.

Conclusion: ACE-Bench provides interpretable and controllable evaluation of agent reasoning with fast, reproducible evaluation suitable for training-time validation, eliminating setup overhead.

Abstract: Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench built around a unified grid-based planning task, where agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that $H$ and $B$ provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
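The two control axes are easy to picture as a task generator: hide H slots of a completed schedule (the horizon) and pad the answer pool with B misleading candidates (the difficulty). A hypothetical sketch, which omits the local and global constraints the real tasks carry:

```python
import random

def make_task(schedule, hidden, decoy_budget, candidate_pool, seed=0):
    # `hidden` plays the role of H, `decoy_budget` the role of B.
    rng = random.Random(seed)
    positions = rng.sample(range(len(schedule)), hidden)
    answers = [schedule[i] for i in positions]
    # Decoys are drawn from the pool minus the true answers; in the real
    # benchmark they are "globally misleading", not merely wrong.
    decoys = rng.sample([c for c in candidate_pool if c not in answers],
                        decoy_budget)
    masked = [None if i in positions else v for i, v in enumerate(schedule)]
    return masked, positions, sorted(answers + decoys)

schedule = ["gym", "lunch", "meeting", "travel"]
masked, positions, candidates = make_task(
    schedule, hidden=2, decoy_budget=3,
    candidate_pool=schedule + ["spa", "call", "demo", "review"])
```

Because tasks are generated rather than hand-curated, the two knobs can be swept independently, which is what lets the authors validate that H and B control horizon and difficulty in a balanced way.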

[449] Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang

Main category: cs.AI

TL;DR: Claw-Eval is a comprehensive evaluation suite for autonomous AI agents that addresses limitations in existing benchmarks through trajectory-aware grading, safety/robustness evaluation, and multimodal task coverage.

DetailsMotivation: Existing agent benchmarks have three critical limitations: (1) trajectory-opaque grading that only checks final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms.

Method: Introduced Claw-Eval with 300 human-verified tasks across 9 categories in three groups (general service orchestration, multimodal perception/generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, environment snapshots) enabling trajectory-aware grading over 2,159 fine-grained rubric items.

Result: Experiments on 14 frontier models show: (1) trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures; (2) controlled error injection degrades consistency (Pass^3 drops up to 24%) while peak capability remains stable; (3) multimodal performance varies sharply, with poorer performance on video than document/image tasks.

Conclusion: Claw-Eval provides comprehensive agent evaluation that reveals systematic limitations in existing benchmarks and highlights actionable directions for building reliably deployable agents, not just capable ones.

Abstract: Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image tasks, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
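The Pass@k vs. Pass^k distinction the abstract relies on (success in at least one of k trials vs. in all k trials, averaged over tasks) can be sketched as follows; this is the common definition, and Claw-Eval's exact estimator may differ.

```python
def pass_at_k(trials):
    """Fraction of tasks solved in at least one trial (peak capability)."""
    # trials: list of per-task lists of booleans, one entry per trial
    return sum(any(t) for t in trials) / len(trials)

def pass_hat_k(trials):
    """Fraction of tasks solved in every trial (consistency)."""
    return sum(all(t) for t in trials) / len(trials)

# Three tasks, three trials each: a flaky agent keeps Pass@3 high
# while Pass^3 collapses, which is the gap the paper's error-injection
# experiment measures.
results = [[True, True, True], [True, False, True], [False, True, False]]
print(pass_at_k(results))   # 1.0
print(pass_hat_k(results))  # 0.333...
```

This is why controlled error injection can leave Pass@3 stable while Pass^3 drops sharply: one lucky success per task is enough for the former but not the latter.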

[450] VarDrop: Enhancing Training Efficiency by Reducing Variate Redundancy in Periodic Time Series Forecasting

Junhyeok Kang, Yooju Shin, Jae-Gil Lee

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

DetailsMotivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to analyze paper due to technical retrieval error

Abstract: Failed to fetch summary for 2501.14183: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.14183&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
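The HTTP 429 responses noted in this entry and the ones below are the arXiv export API's rate-limit signal; the standard mitigation is to retry with exponential backoff, honoring a Retry-After header when the server sends one. A minimal sketch (function names, retry count, and delays are illustrative, not part of the digest pipeline):

```python
import time
import urllib.request
from urllib.error import HTTPError

def backoff_delays(retries=5, base_delay=3.0):
    # Exponential schedule: 3s, 6s, 12s, ... (values are illustrative).
    return [base_delay * 2 ** i for i in range(retries)]

def fetch_with_backoff(url, retries=5, base_delay=3.0):
    for delay in backoff_delays(retries, base_delay):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except HTTPError as e:
            if e.code != 429:  # only retry on rate limiting
                raise
            # Honor Retry-After if present, else use the exponential schedule.
            time.sleep(float(e.headers.get("Retry-After") or delay))
    raise RuntimeError(f"still rate-limited after {retries} attempts: {url}")
```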

[451] An Innovative Next Activity Prediction Using Process Entropy and Dynamic Attribute-Wise-Transformer in Predictive Business Process Monitoring

Hadi Zare, Mostafa Abbasi, Maryam Ahang, Homayoun Najjaran

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2502.10573: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.10573&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[452] Solving a Stackelberg Game on Transportation Networks in a Dynamic Crime Scenario: A Mixed Approach on Multi-Layer Networks

Sukanya Samanta, Kei Kimura, Makoto Yokoo, Palash Dey

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2406.14514: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.14514&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[453] Beyond Syntax: Action Semantics Learning for App Agents

Bohan Tang, Dezhao Luo, Jianheng Liu, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.17697: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.17697&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[454] HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding

Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenjie Yan, Wenqiao Zhang, Xiaogang Guo, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.05831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[455] URSA: The Universal Research and Scientific Agent

Michael Grosskopf, Nathan Debardeleben, Russell Bent, Rahul Somasundaram, Isaac Michaud, Arthur Lui, Alexius Wadell, Warren D. Graham, Golo A Wimmer, Sachin Shivakumar, Joan Vendrell Gallart, Harsha Nagarajan, Earl Lawrence

Main category: cs.AI

TL;DR: Unable to analyze paper 2506.22653 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2506.22653: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.22653&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[456] A Survey of Continual Reinforcement Learning

Chaofan Pan, Xin Yang, Yanhua Li, Wei Wei, Tianrui Li, Bo An, Jiye Liang

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.21872: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.21872&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[457] Modelling Cascading Physical Climate Risk in Supply Chains with Adaptive Firms: A Spatial Agent-Based Framework

Yara Mohajerani

Main category: cs.AI

TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching from arXiv API

Abstract: Failed to fetch summary for 2509.18633: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18633&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[458] Dissecting Transformers: A CLEAR Perspective towards Green AI

Hemang Jain, Shailender Goyal, Divyansh Pandey, Karthik Vaidhyanathan

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2510.02810: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02810&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[459] TS-Agent: Understanding and Reasoning Over Raw Time Series via Iterative Insight Gathering

Penghang Liu, Elizabeth Fons, Annita Vapsi, Mohsen Ghassemi, Svitlana Vyetrenko, Daniel Borrajo, Vamsi K. Potluru, Manuela Veloso

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2510.07432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[460] Reveal-to-Revise: Explainable Bias-Aware Generative Modeling with Multimodal Attention

Noor Islam S. Mohammad, Md Muntaqim Meherab

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2510.12957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[461] Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach

Youngjun Choi, Joonseong Kang, Sungjun Lim, Kyungwoo Song

Main category: cs.AI

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Abstract: Failed to fetch summary for 2510.23409: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.23409&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[462] Toward Virtuous Reinforcement Learning: A Critique and Roadmap

Majid Ghasemi, Mark Crowley

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2512.04246 could not be retrieved from arXiv API.

Abstract: Failed to fetch summary for 2512.04246: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04246&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[463] Routing-Based Continual Learning for Multimodal Large Language Models

Jay Mohta, Kenan Emir Ak, Gwang Lee, Dimitrios Dimitriadis, Yan Xu, Mingwei Shen

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2511.01831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[464] Robust AI Security and Alignment: A Sisyphean Endeavor?

Apostol Vassilev

Main category: cs.AI

TL;DR: Paper 2512.10100: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.10100: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10100&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[465] EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2512.19396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[466] RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

Haoran Sun, Yongjian Guo, Zhong Guan, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Hongke Zhao, Likang Wu, Xiaotie Deng, Xu Chu, Xi Xiao, Sheng Wen, Yicheng Gong, Junwu Xiong

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.05765: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05765&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[467] PhySe-RPO: Physics and Semantics Guided Relative Policy Optimization for Diffusion-Based Surgical Smoke Removal

Zining Fang, Cheng Xue, Chunhui Liu, Bin Xu, Ming Chen, Xiaowei Hu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2603.22844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[468] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2604.01591: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01591&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[469] FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

Hang Xu, Ling Yue, Chaoqian Ouyang, Yuchen Liu, Libin Zheng, Shaowu Pan, Shimin Di, Min-Ling Zhang

Main category: cs.AI

TL;DR: Unable to analyze paper 2604.04074 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2604.04074: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04074&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[470] Gradual Cognitive Externalization: From Modeling Cognition to Constituting It

Zhimin Zhao

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2604.04387 could not be retrieved from arXiv API.

Abstract: Failed to fetch summary for 2604.04387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[471] MolDA: Molecular Understanding and Generation via Large Language Diffusion Model

Seohyeon Shin, HanJun Choi, Jun-Hyung Park, Hong Kook Kim, Mansu Kim

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2604.04403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[472] AI Assistance Reduces Persistence and Hurts Independent Performance

Grace Liu, Brian Christian, Tsvetomira Dumbalska, Michiel A. Bakker, Rachit Dubey

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The arXiv ID 2604.04721 indicates an April 2026 submission, but no content was available for analysis.

Abstract: Failed to fetch summary for 2604.04721: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04721&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[473] TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Xin Peng, Zhenpeng Chen, Yiling Lou

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2409.19894: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.19894&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[474] Incident-Guided Spatiotemporal Traffic Forecasting

Lixiang Fan, Bohao Li, Tao Zou, Junchen Ye, Bowen Du

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2602.02528: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02528&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[475] Cobblestone: A Divide-and-Conquer Approach for Automating Formal Verification

Saketh Ram Kasibatla, Arpan Agarwal, Yuriy Brun, Sorin Lerner, Talia Ringer, Emily First

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2410.19940: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.19940&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[476] From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Dayi Lin, Jiho Shin, Ahmed E. Hassan

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2410.20791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.20791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[477] Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

Rafiya Javed, Cassandra Parent, Jackie Kay, David Yanni, Abdullah Zaini, Anushe Sheikh, Maribeth Rauh, Walter Gerych, Ramona Comanescu, Iason Gabriel, Marzyeh Ghassemi, Laura Weidinger

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2502.19463 could not be retrieved from arXiv API.

Abstract: Failed to fetch summary for 2502.19463: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.19463&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[478] From Concept to Practice: an Automated LLM-aided UVM Machine for RTL Verification

Junhao Ye, Yuchen Hu, Ke Xu, Dingrong Pan, Qichun Chen, Jie Zhou, Shuai Zhao, Xinwei Fang, Xi Wang, Nan Guan, Zhe Jiang

Main category: cs.AI

TL;DR: Unable to analyze paper 2504.19959 due to HTTP 429 error when fetching from arXiv API

Abstract: Failed to fetch summary for 2504.19959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.19959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[479] Synthesis of discrete-continuous quantum circuits with multimodal diffusion models

Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.01666: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01666&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[480] LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

Zihe Yan, Jiaping Gui, Zhuosheng Zhang, Gongshen Liu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2507.10610: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.10610&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[481] ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Wangsong Yin, Daliang Xu, Mengwei Xu, Gang Huang, Xuanzhe Liu

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2508.16703: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16703&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[482] Not All Latent Spaces Are Flat: Hyperbolic Concept Control

Maria Rosaria Briglia, Simone Facchiano, Paolo Cursi, Alessio Sampieri, Emanuele Rodolà, Guido Maria D’Amely di Melendugno, Luca Franco, Fabio Galasso, Iacopo Masi

Main category: cs.AI

TL;DR: Unable to analyze paper 2603.14093 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2603.14093: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14093&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[483] Pursuit of biomarkers of brain diseases: Beyond cohort comparisons

Pascal Helson, Arvind Kumar

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting).

Abstract: Failed to fetch summary for 2509.10547: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.10547&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[484] Chiplet-Based RISC-V SoC with Modular AI Acceleration

Suhas Suresh Bharadwaj, Prerana Ramkumar

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2509.18355: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18355&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[485] First-Mover Bias in Gradient Boosting Explanations: Mechanism, Detection, and Resolution

Drake Caraker, Bryan Arnold, David Rhoads

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2603.22346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[486] Cortex AISQL: A Production SQL Engine for Unstructured Data

Paweł Liskowski, Benjamin Han, Paritosh Aggarwal, Bowei Chen, Boxin Jiang, Nitish Jindal, Zihan Li, Aaron Lin, Kyle Schmaus, Jay Tayade, Weicheng Zhao, Anupam Datta, Nathan Wiegand, Dimitris Tsirogiannis

Main category: cs.AI

TL;DR: Unable to analyze paper 2511.07663 due to HTTP 429 error when fetching abstract from arXiv API

Abstract: Failed to fetch summary for 2511.07663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[487] Developing and Evaluating a Large Language Model-Based Automated Feedback System Grounded in Evidence-Centered Design for Supporting Physics Problem Solving

Holger Maus, Paul Tschisgale, Fabian Kieser, Stefan Petersen, Peter Wulff

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2512.10785 returned HTTP 429 (rate limited).

[488] Interpretable Deep Learning for Stock Returns: A Consensus-Bottleneck Asset Pricing Model

Changeun Kim, Younwoo Jeong, Bong-Gyu Jang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2512.16251 returned HTTP 429 (rate limited).

[489] WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

Xiangchen Li, Jiakun Fan, Qingyuan Wang, Dimitrios Spatharakis, Saeid Ghafouri, Hans Vandierendonck, Deepu John, Bo Ji, Ali R. Butt, Dimitrios S. Nikolopoulos

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2601.11652 returned HTTP 429 (rate limited).

[490] Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

Nilesh Sarkar, Dawar Jyoti Deka

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2604.04037 returned HTTP 429 (rate limited).

[491] QUASAR: A Universal Autonomous System for Atomistic Simulation and a Benchmark of Its Capabilities

Fengxu Yang, Jack D. Evans

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2602.00185 returned HTTP 429 (rate limited).

[492] Graph-Theoretic Analysis of Phase Optimization Complexity in Variational Wave Functions for Heisenberg Antiferromagnets

Mahmud Ashraf Shamim, Md Moshiur Rahman Raj, Mohamed Hibat-Allah, Paulo T Araujo

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2602.04943 returned HTTP 429 (rate limited).

[493] “When to Hand Off, When to Work Together”: Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction

Kihoon Son, Hyewon Lee, DaEun Choi, Yoonsu Kim, Tae Soo Kim, Yoonjoo Lee, John Joon Young Chung, HyunJoon Jung, Juho Kim

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2603.02050 returned HTTP 429 (rate limited).

[494] Agora: Teaching the Skill of Consensus-Finding with AI Personas Grounded in Human Voice

Prerna Ravi, Om Gokhale, Suyash Fulay, Eugene Yi, Deb Roy, Michiel Bakker

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2603.07339 returned HTTP 429 (rate limited).

[495] MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

Anupam Purwar, Aditya Choudhary

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2603.09643 returned HTTP 429 (rate limited).

[496] Code Review Agent Benchmark

Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, Abhik Roychoudhury

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2603.23448 returned HTTP 429 (rate limited).

[497] Safety, Security, and Cognitive Risks in World Models

Manoj Parmar

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2604.01346 returned HTTP 429 (rate limited).

[498] Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

Wei Zou, Mingwen Dong, Miguel Romero Calvo, Shuaichen Chang, Jiang Guo, Dongkyu Lee, Xing Niu, Xiaofei Ma, Yanjun Qi, Jiarong Jiang

Main category: cs.AI

TL;DR: Summary unavailable; the arXiv API request for 2604.02623 returned HTTP 429 (rate limited).

cs.SD

[499] Generalizable Audio-Visual Navigation via Binaural Difference Attention and Action Transition Prediction

Jia Li, Yinfeng Yu

Main category: cs.SD

TL;DR: BDATP framework improves audio-visual navigation by using binaural difference attention for spatial orientation and action transition prediction for regularization, achieving state-of-the-art performance with strong generalization to unseen sounds and environments.

DetailsMotivation: Existing audio-visual navigation methods struggle with generalization in unseen scenarios due to overfitting to semantic sound features and specific training environments, limiting their practical applicability.

Method: Proposes BDATP framework with two components: 1) Binaural Difference Attention (BDA) module that explicitly models interaural differences to enhance spatial orientation and reduce reliance on semantic categories, and 2) Action Transition Prediction (ATP) task that introduces an auxiliary action prediction objective as a regularization term to mitigate environment-specific overfitting.

Result: Achieves state-of-the-art Success Rates across most settings on the Replica and Matterport3D datasets, with up to a 21.6-percentage-point absolute improvement on Replica for unheard sounds. The framework integrates seamlessly into various mainstream baselines and yields consistent performance gains.

Conclusion: BDATP demonstrates superior generalization capability and robustness across diverse navigation architectures by jointly optimizing perception and policy through explicit spatial modeling and regularization techniques.

Abstract: In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the \textbf{Binaural Difference Attention with Action Transition Prediction (BDATP)} framework, which jointly optimizes perception and policy. Specifically, the \textbf{Binaural Difference Attention (BDA)} module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the \textbf{Action Transition Prediction (ATP)} task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points in Replica dataset for unheard sounds. These results underscore BDATP’s superior generalization capability and its robustness across diverse navigation architectures.
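
The binaural-difference idea behind the BDA module can be made concrete with a toy computation of the interaural level difference (ILD), one of the classic spatial hearing cues. The frame-level power comparison below is an illustrative sketch, not the paper's attention architecture:

```python
import numpy as np

def interaural_level_difference(left, right, eps=1e-12):
    """Per-frame ILD in dB between left- and right-ear signals.

    A positive value means the source is louder in the left ear,
    a rough cue for its horizontal direction that does not depend
    on what the sound semantically is.
    """
    p_left = np.mean(left**2) + eps    # mean power, left channel
    p_right = np.mean(right**2) + eps  # mean power, right channel
    return 10.0 * np.log10(p_left / p_right)

# Toy example: the same 440 Hz tone, attenuated in the right ear.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
ild = interaural_level_difference(tone, 0.5 * tone)  # about +6 dB
```

A half-amplitude right channel yields roughly a 6 dB level difference, the kind of left/right cue that lets an agent orient toward a source independent of its semantic category.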

[500] YMIR: A new Benchmark Dataset and Model for Arabic Yemeni Music Genre Classification Using Convolutional Neural Networks

Moeen AL-Makhlafi, Abdulrahman A. AlKannad, Eiad Almekhlafi, Nawaf Q. Othman Ahmed Mohammed, Saher Qaid

Main category: cs.SD

TL;DR: YMIR dataset and YMCM model for Yemeni music genre classification, achieving 98.8% accuracy with Mel-spectrograms

DetailsMotivation: Most music genre classification benchmarks and models focus on Western music, leaving culturally specific traditions like Yemeni music underrepresented in MIR research.

Method: Created YMIR dataset with 1,475 Yemeni music clips across 5 genres, labeled by experts. Developed YMCM CNN model for classification, compared with various feature representations (Mel-spectrograms, Chroma, FilterBank, MFCCs) and benchmarked against standard models (AlexNet, VGG16, MobileNet, baseline CNN).

Result: YMCM achieved highest accuracy of 98.8% using Mel-spectrogram features. Strong inter-annotator agreement (Fleiss kappa = 0.85) for dataset labeling.

Conclusion: YMIR serves as a useful benchmark for Yemeni music classification, and YMCM provides a strong baseline model. Results show relationship between feature representation and model capacity for audio classification tasks.

Abstract: Automatic music genre classification is a major task in music information retrieval; however, most current benchmarks and models have been developed primarily for Western music, leaving culturally specific traditions underrepresented. In this paper, we introduce the Yemeni Music Information Retrieval (YMIR) dataset, which contains 1,475 carefully selected audio clips covering five traditional Yemeni genres: Sanaani, Hadhrami, Lahji, Tihami, and Adeni. The dataset was labeled by five Yemeni music experts following a clear and structured protocol, resulting in strong inter-annotator agreement (Fleiss kappa = 0.85). We also propose the Yemeni Music Classification Model (YMCM), a convolutional neural network (CNN)-based system designed to classify music genres from time-frequency features. Using a consistent preprocessing pipeline, we perform a systematic comparison across six experimental groups and five different architectures, resulting in a total of 30 experiments. Specifically, we evaluate several feature representations, including Mel-spectrograms, Chroma, FilterBank, and MFCCs with 13, 20, and 40 coefficients, and benchmark YMCM against standard models (AlexNet, VGG16, MobileNet, and a baseline CNN) under the same experimental conditions. The experimental findings reveal that YMCM is the most effective, achieving the highest accuracy of 98.8% with Mel-spectrogram features. The results also provide practical insights into the relationship between feature representation and model capacity. The findings establish YMIR as a useful benchmark and YMCM as a strong baseline for classifying Yemeni music genres.
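
Since Mel-spectrogram features come out on top in these experiments, it is worth recalling what the mel scale does: it spaces frequency bands according to perceived pitch rather than raw Hz. A minimal sketch of the mel mapping and filterbank band placement, using the common 2595·log10(1 + f/700) variant (an implementation detail not specified by the paper):

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel scale: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Band edges for a mel filterbank: equally spaced on the mel axis,
    then mapped back to Hz. This is the spacing a Mel-spectrogram uses."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2)
    return mel_to_hz(mels)

edges = mel_band_edges(0.0, 8000.0, 40)  # 40 bands -> 42 edges
```

Equally spaced points on the mel axis map back to Hz bands that grow progressively wider at high frequencies, concentrating resolution where pitch perception is finest.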

[501] Anchored Cyclic Generation: A Novel Paradigm for Long-Sequence Symbolic Music Generation

Boyu Cao, Lekai Qian, Dehan Li, Haoyu Gu, Mingda Xu, Qi Liu

Main category: cs.SD

TL;DR: Proposes Anchored Cyclic Generation (ACG) paradigm and Hierarchical ACG (Hi-ACG) framework for symbolic music generation to address error accumulation in autoregressive models, using anchor features and global-to-local strategy.

DetailsMotivation: Autoregressive models struggle with generating long sequences with structural coherence due to error accumulation, especially in symbolic music generation where this leads to poor music quality and structural integrity.

Method: Introduces ACG paradigm using anchor features from identified music to guide subsequent generation, mitigating error accumulation. Hi-ACG framework employs global-to-local generation strategy with specially designed piano token representation.

Result: ACG reduces cosine distance by 34.7% between predicted and ground-truth feature vectors. Hi-ACG outperforms existing methods in long-sequence symbolic music generation in both subjective and objective evaluations, with strong generalization to tasks like music completion.

Conclusion: The ACG paradigm and Hi-ACG framework effectively address error accumulation in autoregressive models for symbolic music generation, achieving superior performance and generalization capabilities.

Abstract: Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe error accumulation problem of autoregressive models, leading to poor performance in music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features from already identified music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation in autoregressive methods. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. The experimental results demonstrate that compared to traditional autoregressive models, the ACG paradigm reduces the cosine distance between predicted feature vectors and ground-truth semantic vectors by an average of 34.7%. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization capabilities, achieving superior performance in related tasks such as music completion.

[502] Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck

Zhetao Hu, Yiquan Zhou, Wenyu Wang, Zhiyu Wu, Xin Gao, Jihua Zhu

Main category: cs.SD

TL;DR: A singing voice conversion system for SVCC2025 that achieves state-of-the-art naturalness using boundary-aware Whisper bottleneck, explicit frame-level technique matrix, and high-frequency band completion strategy.

DetailsMotivation: To address challenges in singing style conversion including style leakage, dynamic rendering difficulties, and high-fidelity generation with limited data in the SVCC2025 competition.

Method: Three key innovations: 1) boundary-aware Whisper bottleneck for phoneme-span pooling to suppress source style while preserving content, 2) explicit frame-level technique matrix with targeted F0 processing for stable dynamic style rendering, 3) high-frequency band completion using auxiliary 48kHz SVC model to overcome data scarcity.

Result: Achieved best naturalness performance in official SVCC2025 subjective evaluation, with competitive speaker similarity and technique control, using significantly less extra singing data than other top systems.

Conclusion: The system successfully addresses key challenges in singing voice conversion through novel architectural innovations, achieving state-of-the-art performance in naturalness while maintaining efficiency in data usage.

Abstract: This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025), a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.

[503] Time-Domain Voice Identity Morphing (TD-VIM): A Signal-Level Approach to Morphing Attacks on Speaker Verification Systems

Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao, Pabitra Mitra, Kunal Singh

Main category: cs.SD

TL;DR: TD-VIM introduces time-domain voice identity morphing to create voice samples that can match multiple identities, posing security risks to speaker verification systems.

DetailsMotivation: Most morph attack research focuses on image-based biometrics (face, fingerprints, iris), leaving voice-based systems vulnerable. There's a need to explore voice morphing attacks to assess security risks in speaker verification systems.

Method: Developed Time-domain Voice Identity Morphing (TD-VIM) that blends voice characteristics from two identities at signal level. Used Multilingual Audio-Visual Smartphone database to create four distinct morphed signals based on morphing factors. Evaluated using Generalized Morphing Attack Potential (G-MAP) metric across two deep-learning-based SVS and one commercial system (Verispeak).

Result: Morphed voice samples achieved high attack success rates: G-MAP values reached 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios at 0.1% false match rate.

Conclusion: TD-VIM demonstrates significant vulnerability in speaker verification systems to voice morphing attacks, highlighting security risks in voice-based biometric authentication that need addressing.

Abstract: In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating “morphed” biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate within the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method enables the blending of voice characteristics from two distinct identities at the signal level, creating morphed samples that present a high vulnerability for speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone database, our study created four distinct morphed signals based on morphing factors and evaluated their effectiveness using a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
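
The core of signal-level morphing can be sketched as a weighted sample-wise blend of two time-aligned utterances, controlled by a morphing factor. TD-VIM's actual pipeline is more elaborate (it derives four distinct morphed signals), so treat this sketch, including the peak-normalization step, as an illustration of the idea rather than the authors' method:

```python
import numpy as np

def morph_waveforms(x1, x2, alpha):
    """Blend two equal-length waveforms at the signal level.

    alpha=1.0 returns (normalized) x1, alpha=0.0 returns x2;
    intermediate values yield a sample-wise mixture of the two
    identities. Peak-normalizing first keeps either voice from
    dominating the morph purely by loudness.
    """
    x1 = x1 / (np.max(np.abs(x1)) + 1e-12)
    x2 = x2 / (np.max(np.abs(x2)) + 1e-12)
    return alpha * x1 + (1.0 - alpha) * x2

rng = np.random.default_rng(0)
a = rng.standard_normal(16000)  # stand-ins for two time-aligned utterances
b = rng.standard_normal(16000)
m = morph_waveforms(a, b, alpha=0.5)  # an even blend of both identities
```

The vulnerability studied in the paper arises when such a blend sits close enough to both contributing speakers in the verifier's embedding space that it matches either enrolled identity.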

[504] Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization

Yanis Labrak, David Grünert, Séverin Baroudi, Jiyun Chun, Pawel Cyrta, Sergio Burdisso, Ahmed Hassoon, David Liu, Adam Rothschild, Reed Van Deusen, Petr Motlicek, Andrew Perrault, Ricard Marxer, Thomas Schaaf

Main category: cs.SD

TL;DR: Synthetic data pipeline for long-context audio reasoning with doctor-patient conversations and SOAP note generation, featuring 8,800 conversations and 1.3k hours of audio.

DetailsMotivation: Long-context audio reasoning lacks both training data and evaluation benchmarks; existing benchmarks target short-context tasks, and open-ended generation tasks pose challenges for automatic evaluation.

Method: Three-stage pipeline: 1) persona-driven dialogue generation, 2) multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events, 3) LLM-based reference SOAP note production, built entirely on open-weight models.

Result: Created 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes; evaluation shows cascaded approaches substantially outperform end-to-end models.

Conclusion: Proposed synthetic data pipeline serves as both training resource and controlled evaluation environment for long-context audio reasoning, addressing current data and evaluation gaps.

Abstract: Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages: persona-driven dialogue generation; multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and LLM-based reference SOAP note production, all built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.

[505] PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Main category: cs.SD

TL;DR: PhyAVBench is the first benchmark for evaluating audio-physics grounding in text-to-audio-video generation, introducing a new dataset and contrastive evaluation method to assess physical plausibility of sounds.

DetailsMotivation: Current T2AV models often fail to produce physically plausible sounds, and existing benchmarks focus mainly on audio-video synchronization while overlooking explicit evaluation of audio-physics grounding.

Method: Created PhyAVBench with PhyAV-Sound-11K dataset (11,605 audible videos, 25.5 hours), introduces Audio-Physics Sensitivity Test (APST) using paired text prompts with controlled physical variations, and proposes Contrastive Physical Response Score (CPRS) metric.

Result: Comprehensive evaluation of 17 state-of-the-art models shows even leading commercial models struggle with fundamental audio-physical phenomena, revealing critical gaps beyond audio-visual synchronization.

Conclusion: PhyAVBench addresses the overlooked aspect of physical plausibility in audio-visual generation and provides a foundation for advancing physically grounded audio-visual generation research.

Abstract: Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://phyavbench.pages.dev/.

[506] FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, Lei Xie

Main category: cs.SD

TL;DR: FastTurn is a unified framework for low-latency turn detection in full-duplex spoken dialogue systems, combining streaming CTC decoding with acoustic features for early decisions while preserving semantic understanding.

DetailsMotivation: Existing full-duplex approaches for spoken dialogue systems either rely on voice activity detection (lacking semantic understanding) or ASR-based modules (introducing latency and degrading under overlapping speech/noise). Available datasets also rarely capture realistic interaction dynamics, limiting evaluation and deployment.

Method: FastTurn combines streaming CTC decoding with acoustic features to enable early turn detection decisions from partial observations while preserving semantic cues. The framework is designed for low-latency operation in real-time full-duplex communication scenarios.

Result: Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions including overlapping speech, backchannels, pauses, pitch variation, and environmental noise.

Conclusion: FastTurn demonstrates effectiveness for practical full-duplex dialogue systems by advancing latency while maintaining performance, and the release of a real human dialogue test set helps address the lack of realistic interaction data in the field.

Abstract: Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate the problem, we propose \textbf{FastTurn}, a unified framework for low-latency and robust turn detection. To advance latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
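
The motivation for fusing acoustic and semantic cues can be illustrated with a deliberately simplified decision rule: commit to end-of-turn after a short silence when a streaming semantic-completeness estimate is high, and fall back to a long, VAD-style silence threshold when it is low. The function, its thresholds, and the `semantic_score` input are all hypothetical, not FastTurn's model:

```python
def turn_end_decision(silence_ms, semantic_score,
                      fast_silence_ms=200.0, slow_silence_ms=800.0):
    """Toy fusion of acoustic and semantic cues for end-of-turn detection.

    semantic_score in [0, 1] is an (assumed) streaming estimate that the
    partial hypothesis forms a complete utterance. When semantics say the
    turn looks finished, commit after a short silence; otherwise require
    a much longer one, as a pure-VAD detector would.
    """
    threshold = fast_silence_ms if semantic_score >= 0.5 else slow_silence_ms
    return silence_ms >= threshold

# A semantically complete utterance commits after a short pause...
quick = turn_end_decision(silence_ms=250.0, semantic_score=0.9)  # True
# ...while a trailing-off, incomplete one keeps waiting.
slow = turn_end_decision(silence_ms=250.0, semantic_score=0.1)   # False
```

The latency benefit comes from the first branch: semantic evidence from partial streaming hypotheses lets the system respond hundreds of milliseconds earlier than a silence-only detector, without firing mid-sentence.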

cs.LG

[507] A Theory-guided Weighted $L^2$ Loss for solving the BGK model via Physics-informed neural networks

Gyounghun Ko, Sung-Jun Son, Seung Yeon Cho, Myeong-Su Lee

Main category: cs.LG

TL;DR: Proposes a velocity-weighted L² loss for Physics-Informed Neural Networks to improve accuracy in solving the BGK model by better penalizing errors in high-velocity regions.

DetailsMotivation: Standard L² loss in PINNs is insufficient for the BGK model as it fails to guarantee accurate predictions of macroscopic moments, causing solutions to miss true physical behavior.

Method: Introduces a velocity-weighted L² loss function that effectively penalizes errors in high-velocity regions, with theoretical stability analysis showing convergence guarantees.

Result: Numerical experiments show superior accuracy and robustness across various benchmarks compared to standard PINN approaches.

Conclusion: Velocity-weighted loss formulation overcomes limitations of standard PINNs for BGK model, providing more accurate and physically meaningful solutions.

Abstract: While Physics-Informed Neural Networks offer a promising framework for solving partial differential equations, the standard $L^2$ loss formulation is fundamentally insufficient when applied to the Bhatnagar-Gross-Krook (BGK) model. Specifically, simply minimizing the standard loss does not guarantee accurate predictions of the macroscopic moments, causing the approximate solutions to fail in capturing the true physical solution. To overcome this limitation, we introduce a velocity-weighted $L^2$ loss function designed to effectively penalize errors in the high-velocity regions. By establishing a stability estimate for the proposed approach, we show that minimizing the proposed weighted loss guarantees the convergence of the approximate solution. Also, numerical experiments demonstrate that employing this weighted PINN loss leads to superior accuracy and robustness across various benchmarks compared to the standard approach.
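A minimal sketch of a velocity-weighted loss, assuming a weight of the form w(v) = 1 + |v|**alpha; the paper's exact weight function is not given in the abstract:

```python
def weighted_l2_loss(pred, target, velocities, alpha=2.0):
    """Velocity-weighted squared error: residuals in high-|v| regions
    are penalized more strongly via w(v) = 1 + |v|**alpha.
    (The weight form and alpha are illustrative assumptions.)"""
    total = 0.0
    for p, t, v in zip(pred, target, velocities):
        w = 1.0 + abs(v) ** alpha
        total += w * (p - t) ** 2
    return total / len(pred)

# The same pointwise error costs more at higher velocity:
lo = weighted_l2_loss([1.0], [0.0], [0.5])   # weight 1.25
hi = weighted_l2_loss([1.0], [0.0], [5.0])   # weight 26.0
```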

[508] Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

Diyansha Singh

Main category: cs.LG

TL;DR: Territory Paint Wars is a competitive multi-agent RL environment used to study PPO failure modes in self-play, identifying implementation issues and competitive overfitting, with opponent mixing as a solution.

DetailsMotivation: To systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play in competitive multi-agent reinforcement learning environments, and to provide a reproducible benchmark for studying these issues.

Method: Created Territory Paint Wars, a minimal competitive multi-agent RL environment in Unity, then conducted controlled ablations to identify implementation-level failure modes, studied competitive overfitting, and proposed opponent mixing intervention.

Result: Identified five critical implementation failures causing poor performance; discovered competitive overfitting where generalization collapses despite stable self-play; opponent mixing restored generalization to 77.1% without complex infrastructure.

Conclusion: PPO in competitive self-play suffers from both implementation-level issues and emergent competitive overfitting; simple opponent mixing effectively mitigates overfitting; the environment provides a valuable benchmark for competitive MARL research.

Abstract: We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for 84,000 episodes achieves only a 26.8% win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes – reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection – each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from 73.5% to 21.6%. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near 50% throughout the collapse. We propose a minimal intervention – opponent mixing, where 20% of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent – which mitigates competitive overfitting and restores generalisation to 77.1% (±12.6%, 10 seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.
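The opponent-mixing intervention reduces to a one-line sampling rule per episode; a sketch with illustrative policy placeholders:

```python
import random

def pick_opponent(self_play_opponent, random_policy, mix_prob=0.2, rng=random):
    """Opponent mixing: with probability mix_prob, this training episode
    is played against a fixed uniformly-random policy instead of the
    co-adapting self-play opponent."""
    return random_policy if rng.random() < mix_prob else self_play_opponent

# Over many episodes, roughly 20% of opponents are the random policy.
rng = random.Random(0)
picks = [pick_opponent("learned", "uniform_random", 0.2, rng)
         for _ in range(10_000)]
frac_random = picks.count("uniform_random") / len(picks)
```

Keeping a fixed, non-adapting opponent in the mix anchors the policy against a stationary reference, which is what restores generalisation without population-based training.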

[509] Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

Zesheng Yao, Zhen-Hua Wan, Canjun Yang, Qingchao Xia, Mengqi Zhang

Main category: cs.LG

TL;DR: A ROM-based reinforcement learning framework for active flow control that improves sample efficiency by using reduced-order models to estimate gradients for controller optimization.

DetailsMotivation: Model-free deep reinforcement learning methods suffer from poor sample efficiency, which is particularly problematic for active flow control applications where data collection is expensive.

Method: Uses adaptive reduced-order models combining linear dynamical systems with neural ODEs for nonlinearity estimation. ROM parameters identified via operator inference, with continuous model updates during controller-environment interactions and differentiable simulation for controller optimization.

Result: Achieves superior performance with significantly less data: reduces Blasius boundary layer control to single-episode optimization with DRL-comparable performance, and achieves better drag reduction for square cylinder flow with far fewer exploration data than DRL approaches.

Conclusion: The ROM-based framework addresses sample efficiency limitations of model-free DRL for flow control and provides a foundation for more data-efficient active flow controllers.

Abstract: Model-free deep reinforcement learning (DRL) methods suffer from poor sample efficiency. To overcome this limitation, this work introduces an adaptive reduced-order-model (ROM)-based reinforcement learning framework for active flow control. In contrast to conventional actor–critic architectures, the proposed approach leverages a ROM to estimate the gradient information required for controller optimization. The design of the ROM structure incorporates physical insights. The ROM integrates a linear dynamical system and a neural ordinary differential equation (NODE) for estimating the nonlinearity in the flow. The parameters of the linear component are identified via operator inference, while the NODE is trained in a data-driven manner using gradient-based optimization. During controller–environment interactions, the ROM is continuously updated with newly collected data, enabling adaptive refinement of the model. The controller is then optimized through differentiable simulation of the ROM. The proposed ROM-based DRL framework is validated on two canonical flow control problems: Blasius boundary layer flow and flow past a square cylinder. For the Blasius boundary layer, the proposed method effectively reduces to a single-episode system identification and controller optimization process, yet it yields controllers that outperform traditional linear designs and achieve performance comparable to DRL approaches with minimal data. For the flow past a square cylinder, the proposed method achieves superior drag reduction with significantly fewer exploration data compared with DRL approaches. The work addresses a key component of model-free DRL control algorithms and lays the foundation for designing more sample-efficient DRL-based active flow controllers.

[510] Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Yongchang Hao, Lili Mou

Main category: cs.LG

TL;DR: Cactus introduces a constrained optimization approach to speculative sampling that allows controlled divergence from the verifier distribution while increasing acceptance rates, improving decoding efficiency without significant quality degradation.

DetailsMotivation: Current speculative sampling methods are too restrictive by strictly matching the verifier LLM's distribution, while typical acceptance sampling distorts the verifier distribution too much. There's a need for a method that balances efficiency (higher acceptance rates) with controlled divergence from the verifier distribution.

Method: Formalizes speculative sampling through constrained optimization, then proposes Cactus (constrained acceptance speculative sampling) which guarantees controlled divergence from the verifier distribution while increasing acceptance rates.

Result: Empirical results across a wide range of benchmarks confirm the effectiveness of the approach in improving decoding efficiency while maintaining output quality.

Conclusion: Cactus provides a principled approach to speculative sampling that balances efficiency gains with controlled divergence from the target distribution, addressing limitations of both strict speculative sampling and typical acceptance sampling methods.

Abstract: Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive as slight variations of the verifier’s distribution, such as sampling with top-$k$ or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
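For context, standard speculative sampling accepts a draft token x with probability min(1, p(x)/q(x)), which preserves the verifier distribution p exactly. A toy sketch, with a hypothetical slack factor standing in for Cactus's constrained relaxation (the actual constraint set is not described in the abstract):

```python
import random

def accept_draft_token(p_x, q_x, slack=1.0, rng=random):
    """Accept a draft token x proposed by the draft model q against the
    verifier p. slack=1.0 is the standard SpS rule, min(1, p/q), which
    preserves p exactly; slack > 1 accepts more tokens at the cost of a
    bounded departure from p (a toy stand-in for a constrained
    relaxation -- not the paper's actual algorithm)."""
    return rng.random() < min(1.0, slack * p_x / q_x)

rng = random.Random(0)
n = 20_000
strict = sum(accept_draft_token(0.3, 0.6, 1.0, rng) for _ in range(n)) / n
loose = sum(accept_draft_token(0.3, 0.6, 1.5, rng) for _ in range(n)) / n
# strict -> ~0.5 acceptance; loose -> ~0.75, trading fidelity for speed.
```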

[511] Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

Longsheng Zhou, Yu Shen

Main category: cs.LG

TL;DR: A practical pipeline combining unstructured pruning, INT8 quantization-aware training, and knowledge distillation to optimize models for CPU inference latency while maintaining accuracy.

DetailsMotivation: Common compression metrics like parameter count or FLOPs don't reliably predict actual wall-clock inference time, especially for CPU deployment where unstructured sparsity can fail to accelerate execution due to irregular memory access patterns.

Method: Proposes an ordered pipeline: 1) unstructured pruning for capacity reduction, 2) INT8 quantization-aware training (QAT) for dominant runtime benefits, 3) knowledge distillation (KD) to recover accuracy within the sparse INT8 regime without changing deployment form.

Result: Achieves 0.99-1.42 ms CPU latency with competitive accuracy on CIFAR-10/100 using ResNet-18, WRN-28-10, and VGG-16-BN backbones. The ordered pipeline outperforms any single technique alone, with INT8 QAT providing the main runtime benefit and pruning acting as a pre-conditioner.

Conclusion: Provides a simple guideline for edge deployment: evaluate compression choices using measured runtime in the joint accuracy-size-latency space rather than proxy metrics alone. The proposed ordering (pruning → QAT → KD) generally performs best among tested permutations.

Abstract: Modern deployment often requires trading accuracy for efficiency under tight CPU and memory constraints, yet common compression proxies such as parameter count or FLOPs do not reliably predict wall-clock inference time. In particular, unstructured sparsity can reduce model storage while failing to accelerate (and sometimes slightly slowing down) standard CPU execution due to irregular memory access and sparse kernel overhead. Motivated by this gap between compression and acceleration, we study a practical, ordered pipeline that targets measured latency by combining three widely used techniques: unstructured pruning, INT8 quantization-aware training (QAT), and knowledge distillation (KD). Empirically, INT8 QAT provides the dominant runtime benefit, while pruning mainly acts as a capacity-reduction pre-conditioner that improves the robustness of subsequent low-precision optimization; KD, applied last, recovers accuracy within the already constrained sparse INT8 regime without changing the deployment form. We evaluate on CIFAR-10/100 using three backbones (ResNet-18, WRN-28-10, and VGG-16-BN). Across all settings, the ordered pipeline achieves a stronger accuracy-size-latency frontier than any single technique alone, reaching 0.99-1.42 ms CPU latency with competitive accuracy and compact checkpoints. Controlled ordering ablations with a fixed 20/40/40 epoch allocation further confirm that stage order is consequential, with the proposed ordering generally performing best among the tested permutations. Overall, our results provide a simple guideline for edge deployment: evaluate compression choices in the joint accuracy-size-latency space using measured runtime, rather than proxy metrics alone.
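The three pipeline stages can be sketched on a single weight vector; the pruning threshold rule and symmetric INT8 grid below are standard choices, not necessarily the paper's exact ones:

```python
def magnitude_prune(weights, sparsity):
    """Stage 1 -- unstructured magnitude pruning: zero out the
    smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k] if k > 0 else 0.0
    return [0.0 if abs(w) < threshold else w for w in weights]

def fake_quantize_int8(weights):
    """Stage 2 -- symmetric INT8 fake-quantization as used inside QAT:
    snap each weight to an 8-bit grid, then de-quantize back to float."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) * scale for w in weights]

w = [0.03, -0.9, 0.45, -0.02, 0.7, 0.001]
w = magnitude_prune(w, 0.5)     # capacity reduction first
w = fake_quantize_int8(w)       # then low-precision optimization
# Stage 3 (KD) would fine-tune this sparse INT8 student against a teacher.
```

The ordering mirrors the paper's finding: pruning pre-conditions the weights, quantization delivers the runtime win, and distillation recovers accuracy last without changing the deployment form.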

[512] Learning-Based Multi-Criteria Decision Making Model for Sawmill Location Problems

Mahid Ahmed, Ali Dogru, Chaoyang Zhang, Chao Meng

Main category: cs.LG

TL;DR: A machine learning framework for optimal sawmill location selection using GIS and multi-criteria decision making, applied to Mississippi with Random Forest performing best.

DetailsMotivation: Strategic sawmill location is crucial for timber supply chain efficiency, profitability, and sustainability, requiring data-driven, unbiased approaches to site suitability assessment.

Method: Learning-Based Multi-Criteria Decision-Making (LB-MCDM) framework integrating machine learning with GIS-based spatial analysis using five ML algorithms (Random Forest, SVM, XGBoost, Logistic Regression, KNN) and SHAP for feature importance analysis.

Result: Random Forest Classifier achieved highest performance; SHAP analysis revealed Supply-Demand Ratio as most influential factor; validation showed 10-11% of Mississippi landscape is highly suitable for sawmill location.

Conclusion: The LB-MCDM framework provides an effective, data-driven approach for strategic sawmill location planning that can enhance timber supply chain efficiency and sustainability.

Abstract: Strategically locating a sawmill is vital for enhancing the efficiency, profitability, and sustainability of timber supply chains. Our study proposes a Learning-Based Multi-Criteria Decision-Making (LB-MCDM) framework that integrates machine learning (ML) with GIS-based spatial location analysis via MCDM. The proposed framework provides a data-driven, unbiased, and replicable approach to assessing site suitability. We demonstrate the utility of the proposed model through a case study in Mississippi (MS). We apply five ML algorithms (Random Forest Classifier, Support Vector Classifier, XGBoost Classifier, Logistic Regression, and K-Nearest Neighbors Classifier) to identify the most suitable sawmill locations in Mississippi. Among these models, the Random Forest Classifier achieved the highest performance. We use the SHAP (SHapley Additive exPlanations) technique to determine the relative importance of each criterion, revealing the Supply-Demand Ratio, a composite feature that reflects local market competition dynamics, as the most influential factor, followed by Road, Rail Line and Urban Area Distance. The validation of suitability maps generated by our LB-MCDM model suggests that 10-11% of the MS landscape is highly suitable for sawmill location.
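Once per-criterion importances are in hand (e.g. normalized SHAP values), the MCDM scoring step is a weighted sum; a sketch with hypothetical weights and criteria:

```python
def suitability_score(criteria, weights):
    """MCDM-style weighted sum of normalized criterion layers (each in
    [0, 1]); weights could come from normalized SHAP importances of the
    trained classifier. Illustrative only -- the paper scores sites with
    the fitted ML model itself."""
    return sum(c * w for c, w in zip(criteria, weights)) / sum(weights)

# Hypothetical criteria: supply-demand ratio, road proximity, rail proximity.
weights = [0.5, 0.3, 0.2]                      # e.g. SHAP-derived
site_a = suitability_score([0.9, 0.8, 0.7], weights)
site_b = suitability_score([0.2, 0.5, 0.9], weights)
```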

[513] El Nino Prediction Based on Weather Forecast and Geographical Time-series Data

Viet Trinh, Ha-Vy Luu, Quoc-Khiem Nguyen-Pham, Hung Tong, Thanh-Huyen Tran, Hoai-Nam Nguyen Dang

Main category: cs.LG

TL;DR: A hybrid CNN-LSTM framework for El Niño prediction using multi-source meteorological data

DetailsMotivation: Traditional El Niño prediction models lack granularity and dynamic interplay captured by comprehensive meteorological datasets, limiting accuracy and lead time for mitigating global impacts

Method: Integrates real-time global weather forecast data with anomalies, subsurface ocean heat content, and atmospheric pressure across various resolutions. Uses hybrid CNN for spatial feature extraction and LSTM for temporal dependency modeling

Result: Not specified in abstract

Conclusion: Framework aims to identify complex precursors and evolving patterns of El Niño events for improved prediction

Abstract: This paper proposes a novel framework for enhancing the prediction accuracy and lead time of El Niño events, crucial for mitigating their global climatic, economic, and societal impacts. Traditional prediction models often rely on oceanic and atmospheric indices, which may lack the granularity or dynamic interplay captured by comprehensive meteorological and geographical datasets. Our framework integrates real-time global weather forecast data with anomalies, subsurface ocean heat content, and atmospheric pressure across various temporal and spatial resolutions. Leveraging a hybrid deep learning architecture that combines a Convolutional Neural Network (CNN) for spatial feature extraction and a Long Short-Term Memory (LSTM) network for temporal dependency modeling, the framework aims to identify complex precursors and evolving patterns of El Niño events.

[514] PRIME: Prototype-Driven Multimodal Pretraining for Cancer Prognosis with Missing Modalities

Kai Yu, Shuang Zhou, Yiran Song, Zaifu Zhan, Jie Peng, Kaixiong Zhou, Tianlong Chen, Feng Xie, Meng Wang, Huazhu Fu, Mingquan Lin, Rui Zhang

Main category: cs.LG

TL;DR: PRIME is a missing-aware multimodal self-supervised pretraining framework that learns robust representations from partially observed clinical data (histopathology images, gene expression, pathology reports) for cancer prognosis, handling missing modalities through semantic imputation and alignment objectives.

DetailsMotivation: Clinical cohorts often have fragmented multimodal data with missing modalities, limiting both supervised fusion and scalable multimodal pretraining. Existing approaches typically require fully paired inputs, which is impractical in real-world clinical settings where data is incomplete.

Method: PRIME maps heterogeneous modality embeddings into a unified token space and uses a shared prototype memory bank for latent-space semantic imputation via patient-level consensus retrieval. It employs two complementary pretraining objectives: inter-modality alignment and post-fusion consistency under structured missingness augmentation.

Result: PRIME achieves best macro-average performance on The Cancer Genome Atlas dataset across three tasks: overall survival prediction (0.653 C-index), 3-year mortality classification (0.689 AUROC), and 3-year recurrence classification (0.637 AUROC). It improves robustness under test-time missingness and supports parameter-efficient adaptation.

Conclusion: Missing-aware multimodal pretraining is a practical strategy for prognosis modeling in fragmented clinical data settings, with PRIME demonstrating superior performance and robustness compared to existing methods.

Abstract: Multimodal self-supervised pretraining offers a promising route to cancer prognosis by integrating histopathology whole-slide images, gene expression, and pathology reports, yet most existing approaches require fully paired and complete inputs. In practice, clinical cohorts are fragmented and often miss one or more modalities, limiting both supervised fusion and scalable multimodal pretraining. We propose PRIME, a missing-aware multimodal self-supervised pretraining framework that learns robust and transferable representations from partially observed cohorts. PRIME maps heterogeneous modality embeddings into a unified token space and introduces a shared prototype memory bank for latent-space semantic imputation via patient-level consensus retrieval, producing structurally aligned tokens without reconstructing raw signals. Two complementary pretraining objectives: inter-modality alignment and post-fusion consistency under structured missingness augmentation, jointly learn representations that remain predictive under arbitrary modality subsets. We evaluate PRIME on The Cancer Genome Atlas with label-free pretraining on 32 cancer types and downstream 5-fold evaluation on five cohorts across overall survival prediction, 3-year mortality classification, and 3-year recurrence classification. PRIME achieves the best macro-average performance among all compared methods, reaching 0.653 C-index, 0.689 AUROC, and 0.637 AUROC on the three tasks, respectively, while improving robustness under test-time missingness and supporting parameter-efficient and label-efficient adaptation. These results support missing-aware multimodal pretraining as a practical strategy for prognosis modeling in fragmented clinical data settings.
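The prototype-retrieval imputation can be sketched as a nearest-neighbour lookup in a small memory bank; PRIME's patient-level consensus retrieval is richer than this toy version, and all names below are illustrative:

```python
def impute_missing(available_emb, prototype_bank):
    """Latent-space imputation sketch: retrieve the prototype whose key
    (an embedding from the observed modality) is nearest to the patient's
    available embedding, and reuse its stored token for the missing
    modality. PRIME's consensus retrieval generalizes this lookup."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    key, value = min(prototype_bank, key=lambda kv: sqdist(available_emb, kv[0]))
    return value

# Memory bank: (observed-modality key, token standing in for the missing one).
bank = [([0.0, 0.0], "proto_low"), ([1.0, 1.0], "proto_high")]
token = impute_missing([0.9, 0.8], bank)
```

Imputing in latent space, rather than reconstructing raw signals, is what lets the pretraining objectives operate on structurally aligned tokens under arbitrary modality subsets.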

[515] Learning Stable Predictors from Weak Supervision under Distribution Shift

Mehrdad Shoeibi, Elias Hossain, Ivan Garibay, Niloofar Yousefi

Main category: cs.LG

TL;DR: Paper studies supervision drift in weak-label learning using CRISPR-Cas13d experiments, showing temporal transfer fails due to changing feature-label relationships over time.

DetailsMotivation: Understanding robustness of weak-label learning under distribution shift when supervision mechanisms change, formalized as supervision drift (changes in P(y|x,c) across contexts).

Method: Use CRISPR-Cas13d experiments with RNA-seq responses as weak labels, build controlled non-IID benchmark with domain (cell lines) and temporal shifts, evaluate models (ridge regression, XGBoost) on transfer performance.

Result: Strong in-domain performance (ridge R²=0.356, Spearman ρ=0.442), partial cross-cell-line transfer (ρ~0.40), but temporal transfer fails completely (negative R², near-zero correlation). Feature-label relationships stable across cell lines but change sharply over time.

Conclusion: Supervision drift causes transfer failure, not model limitations. Feature stability serves as diagnostic for detecting non-transferability before deployment in weak-label learning scenarios.

Abstract: Learning from weak or proxy supervision is common when ground-truth labels are unavailable, yet robustness under distribution shift remains poorly understood, especially when the supervision mechanism itself changes. We formalize this as supervision drift, defined as changes in P(y | x, c) across contexts, and study it in CRISPR-Cas13d experiments where guide efficacy is inferred indirectly from RNA-seq responses. Using data from two human cell lines and multiple time points, we build a controlled non-IID benchmark with explicit domain and temporal shifts while keeping the weak-label construction fixed. Models achieve strong in-domain performance (ridge R^2 = 0.356, Spearman rho = 0.442) and partial cross-cell-line transfer (rho ~ 0.40). However, temporal transfer fails across all models, with negative R^2 and near-zero correlation (e.g., XGBoost R^2 = -0.155, rho = 0.056). Additional analyses confirm this pattern. Feature-label relationships remain stable across cell lines but change sharply over time, indicating that failures arise from supervision drift rather than model limitations. These findings highlight feature stability as a simple diagnostic for detecting non-transferability before deployment.
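The feature-stability diagnostic amounts to comparing each feature's correlation with the weak label across contexts; a self-contained sketch on toy data:

```python
def pearson(xs, ys):
    """Plain Pearson correlation (no external dependencies)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def feature_stability(X_a, y_a, X_b, y_b):
    """Compare each feature's correlation with the weak label across two
    contexts (e.g. two time points). A sign flip or collapse in magnitude
    signals supervision drift, i.e. a change in P(y | x, c)."""
    d = len(X_a[0])
    return [(pearson([row[j] for row in X_a], y_a),
             pearson([row[j] for row in X_b], y_b)) for j in range(d)]

# Feature 0 is stable; feature 1's relationship flips between contexts.
X_a = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y_a = [1.0, 2.0, 3.0]
X_b = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
y_b = [1.0, 2.0, 3.0]
pairs = feature_stability(X_a, y_a, X_b, y_b)
```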

[516] Energy-Based Dynamical Models for Neurocomputation, Learning, and Optimization

Arthur N. Montanari, Francesco Bullo, Dmitry Krotov, Adilson E. Motter

Main category: cs.LG

TL;DR: Tutorial on neuro-inspired dynamical systems for computation, focusing on energy-based models, gradient flows, and control-theoretic approaches for scalable, robust, and energy-efficient AI systems.

DetailsMotivation: Recent advances at the intersection of control theory, neuroscience, and machine learning reveal novel computational mechanisms in dynamical systems, aiming to bridge artificial and biological systems for improved scalability, robustness, and energy efficiency.

Method: Focuses on energy-based dynamical models encoding information through gradient flows and energy landscapes. Reviews classical formulations (Hopfield networks, Boltzmann machines) and extends to modern developments including dense associative memory models, oscillator-based networks for optimization, and proximal-descent dynamics for constrained reconstruction.

Result: Demonstrates how control-theoretic principles can guide design of next-generation neurocomputing systems, moving beyond conventional feedforward and backpropagation-based approaches to artificial intelligence.

Conclusion: Neuro-inspired dynamical systems offer promising approaches for scalable, robust, and energy-efficient computation, with applications in model learning, memory retrieval, data-driven control, and optimization.

Abstract: Recent advances at the intersection of control theory, neuroscience, and machine learning have revealed novel mechanisms by which dynamical systems perform computation. These advances encompass a wide range of conceptual, mathematical, and computational ideas, with applications for model learning and training, memory retrieval, data-driven control, and optimization. This tutorial focuses on neuro-inspired approaches to computation that aim to improve scalability, robustness, and energy efficiency across such tasks, bridging the gap between artificial and biological systems. Particular emphasis is placed on energy-based dynamical models that encode information through gradient flows and energy landscapes. We begin by reviewing classical formulations, such as continuous-time Hopfield networks and Boltzmann machines, and then extend the framework to modern developments. These include dense associative memory models for high-capacity storage, oscillator-based networks for large-scale optimization, and proximal-descent dynamics for composite and constrained reconstruction. The tutorial demonstrates how control-theoretic principles can guide the design of next-generation neurocomputing systems, steering the discussion beyond conventional feedforward and backpropagation-based approaches to artificial intelligence.
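The classical Hopfield formulation the tutorial starts from is compact enough to sketch directly: Hebbian storage, the energy E(s) = -1/2 s^T W s, and asynchronous updates that descend it:

```python
def hopfield_energy(state, W):
    """E(s) = -1/2 * s^T W s for binary states s in {-1, +1}^n."""
    n = len(state)
    return -0.5 * sum(W[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))

def hebbian_weights(patterns):
    """Hebbian storage: W_ij = (1/m) * sum_p x_i^p x_j^p, zero diagonal."""
    n, m = len(patterns[0]), len(patterns)
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / m
    return W

def recall(state, W, steps=10):
    """Asynchronous sign updates descend the energy landscape,
    converging to a stored memory (a local energy minimum)."""
    s = list(state)
    for _ in range(steps):
        for i in range(len(s)):
            h = sum(W[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if h >= 0 else -1
    return s

pattern = [1, -1, 1, -1, 1, -1]
W = hebbian_weights([pattern])
noisy = [1, 1, 1, -1, 1, -1]        # one flipped bit
recovered = recall(noisy, W)
```

The dense associative memories and oscillator networks in the tutorial generalize exactly this recipe: a richer energy function, with dynamics guaranteed not to increase it.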

[517] PCA-Driven Adaptive Sensor Triage for Edge AI Inference

Ankit Hemant Lade, Sai Krishna Jasti, Nikhil Sinha, Indar Kumar, Akanksha Tiwari

Main category: cs.LG

TL;DR: PCA-Triage: A streaming algorithm for bandwidth-constrained industrial IoT that uses incremental PCA to determine optimal per-channel sampling rates under bandwidth budgets.

DetailsMotivation: Industrial IoT networks with multi-channel sensors often exceed available bandwidth, requiring intelligent sampling strategies to prioritize important data channels while staying within bandwidth constraints.

Method: PCA-Triage converts incremental PCA loadings into proportional per-channel sampling rates under a bandwidth budget. It runs in O(wdk) time with zero trainable parameters (0.67 ms per decision), making it efficient for streaming applications.

Result: Evaluated on 7 benchmarks (8-82 channels) against 9 baselines. PCA-Triage is the best unsupervised method on 3 of 6 datasets at 50% bandwidth, winning 5 of 6 against every baseline with large effect sizes (r = 0.71-0.91). On TEP, achieves F1 = 0.961 +/- 0.001 - within 0.1% of full-data performance - while maintaining F1 > 0.90 at 30% budget.

Conclusion: PCA-Triage provides an efficient, parameter-free solution for bandwidth-constrained industrial IoT sensor networks, maintaining high performance with significant bandwidth reductions while being robust to packet loss and sensor noise.

Abstract: Multi-channel sensor networks in industrial IoT often exceed available bandwidth. We propose PCA-Triage, a streaming algorithm that converts incremental PCA loadings into proportional per-channel sampling rates under a bandwidth budget. PCA-Triage runs in O(wdk) time with zero trainable parameters (0.67 ms per decision). We evaluate on 7 benchmarks (8–82 channels) against 9 baselines. PCA-Triage is the best unsupervised method on 3 of 6 datasets at 50% bandwidth, winning 5 of 6 against every baseline with large effect sizes (r = 0.71–0.91). On TEP, it achieves F1 = 0.961 +/- 0.001 – within 0.1% of full-data performance – while maintaining F1 > 0.90 at 30% budget. Targeted extensions push F1 to 0.970. The algorithm is robust to packet loss and sensor noise (3.7–4.8% degradation under combined worst-case).
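The core allocation step, turning per-channel importances into budgeted sampling rates, can be sketched as a proportional split with clipping; the paper's exact allocation rule is not spelled out in the abstract:

```python
def triage_rates(importances, budget):
    """Allocate per-channel sampling rates proportional to PCA-derived
    importance scores so that the mean rate equals the bandwidth budget
    (a fraction in (0, 1]). Rates are clipped at 1.0, so the realized
    average can fall below the budget if a few channels dominate.
    (Proportional-with-clipping is an illustrative assumption.)"""
    n = len(importances)
    total = sum(importances) or 1.0
    return [min(1.0, budget * n * w / total) for w in importances]

# Four channels at a 50% bandwidth budget: important channels sample faster.
rates = triage_rates([0.4, 0.3, 0.2, 0.1], budget=0.5)
```

In the streaming setting, `importances` would be refreshed from incremental PCA loadings each window, keeping the decision cost low and parameter-free.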

[518] Blind-Spot Mass: A Good-Turing Framework for Quantifying Deployment Coverage Risk in Machine Learning Systems

Biplab Pal, Santanu Bhattacharya, Madanjit Singh

Main category: cs.LG

TL;DR: Blind-spot mass is a Good-Turing framework for quantifying deployment coverage risk in ML by estimating probability mass in under-supported operational states, validated across wearable activity recognition and clinical databases.

DetailsMotivation: Modern ML systems face coverage blindness where operational distributions are heavy-tailed, causing models to appear accurate on standard tests but unreliable in deployment due to rare but valid states being under-supported in finite training data.

Method: Proposes blind-spot mass B_n(tau) using Good-Turing unseen-species estimation to estimate total probability mass assigned to states with empirical support below threshold tau, plus a coverage-imposed accuracy ceiling decomposing performance into supported and blind components.

Result: Validated in wearable human activity recognition (HAR) with wrist-worn inertial data and replicated in MIMIC-IV hospital database (275 admissions), showing blind-spot mass curve converges to 95% at tau=5 across clinical state abstractions, demonstrating general applicability across domains.

Conclusion: Blind-spot mass provides a general ML methodology for quantifying combinatorial coverage risk, identifies risk-dominant activities/regimes, and offers actionable guidance for targeted data collection and safer deployment practices.

Abstract: Blind-spot mass is a Good-Turing framework for quantifying deployment coverage risk in machine learning. In modern ML systems, operational state distributions are often heavy-tailed, implying that a long tail of valid but rare states is structurally under-supported in finite training and evaluation data. This creates a form of ‘coverage blindness’: models can appear accurate on standard test sets yet remain unreliable across large regions of the deployment state space. We propose blind-spot mass B_n(tau), a deployment metric estimating the total probability mass assigned to states whose empirical support falls below a threshold tau. B_n(tau) is computed using Good-Turing unseen-species estimation and yields a principled estimate of how much of the operational distribution lies in reliability-critical, under-supported regimes. We further derive a coverage-imposed accuracy ceiling, decomposing overall performance into supported and blind components and separating capacity limits from data limits. We validate the framework in wearable human activity recognition (HAR) using wrist-worn inertial data. We then replicate the same analysis in the MIMIC-IV hospital database with 275 admissions, where the blind-spot mass curve converges to the same 95% at tau = 5 across clinical state abstractions. This replication across structurally independent domains - differing in modality, feature space, label space, and application - shows that blind-spot mass is a general ML methodology for quantifying combinatorial coverage risk, not an application-specific artifact. Blind-spot decomposition identifies which activities or clinical regimes dominate risk, providing actionable guidance for industrial practitioners on targeted data collection, normalization/renormalization, and physics- or domain-informed constraints for safer deployment.
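The estimator B_n(tau) follows directly from the Good-Turing identity that states seen exactly r times carry roughly (r+1) * N_{r+1} / n of the probability mass; a sketch:

```python
from collections import Counter

def blind_spot_mass(observations, tau):
    """Good-Turing estimate of B_n(tau): the probability mass carried by
    states observed fewer than tau times (including never-seen states),
    using mass(count = r) ~= (r + 1) * N_{r+1} / n, where N_r is the
    number of distinct states seen exactly r times."""
    n = len(observations)
    counts = Counter(observations)            # state -> occurrence count
    freq_of_freq = Counter(counts.values())   # r -> N_r
    return sum((r + 1) * freq_of_freq.get(r + 1, 0)
               for r in range(tau)) / n

# A heavy-tailed stream: two hot states plus twenty singletons.
stream = ["a"] * 50 + ["b"] * 30 + [f"rare{i}" for i in range(20)]
unseen = blind_spot_mass(stream, tau=1)        # N_1 / n = 20 / 100
low_support = blind_spot_mass(stream, tau=31)  # also covers "b" (count 30)
```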

[519] Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series

Annita Vapsi, Penghang Liu, Saheed Obitayo, Aakriti, Manoj Cherukumalli, Prathamesh Patil, Amit Varshney, Nicolas Marchesotti, Elizabeth Fons, Vamsi K. Potluru, Manuela Veloso

Main category: cs.LG

TL;DR: DynLMC introduces a dynamic linear model for generating synthetic multivariate time series with realistic time-varying correlations and cross-channel dependencies, improving foundation model transferability through data-centric pretraining.

DetailsMotivation: Current synthetic data generators for time series foundation models assume static correlations and lack realistic inter-channel dependencies, limiting their effectiveness for training transferable models.

Method: DynLMC (Dynamic Linear Model of Coregionalization) incorporates time-varying, regime-switching correlations and cross-channel lag structures to generate synthetic multivariate time series that closely resemble real data dynamics.

Result: Fine-tuning three foundation models on DynLMC-generated data yields consistent zero-shot forecasting improvements across nine benchmarks, demonstrating enhanced transferability.

Conclusion: Modeling dynamic inter-channel correlations in synthetic data generation significantly improves foundation model transferability, highlighting the importance of data-centric pretraining approaches.

Abstract: Synthetic data is essential for training foundation models for time series (FMTS), but most generators assume static correlations and typically lack realistic inter-channel dependencies. We introduce DynLMC, a Dynamic Linear Model of Coregionalization that incorporates time-varying, regime-switching correlations and cross-channel lag structures. Our approach produces synthetic multivariate time series with correlation dynamics that closely resemble real data. Fine-tuning three foundation models on DynLMC-generated data yields consistent zero-shot forecasting improvements across nine benchmarks. Our results demonstrate that modeling dynamic inter-channel correlations enhances FMTS transferability, highlighting the importance of data-centric pretraining.
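The core idea of a regime-switching linear model of coregionalization can be sketched as follows: channels are mixtures of shared latent factors, and the mixing matrix changes when a hidden regime switches, so cross-channel correlations vary over time. All names and parameter values below are hypothetical, and this is a minimal illustration rather than the paper's DynLMC implementation.

```python
import numpy as np

def sample_dyn_lmc(T=200, n_channels=3, n_factors=2, switch_prob=0.05, seed=0):
    """Toy regime-switching linear model of coregionalization:
    each channel is a time-varying mixture of shared AR(1) latent factors,
    so cross-channel correlations change when the regime switches.
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # one coregionalization (mixing) matrix per regime
    mixings = [rng.normal(size=(n_channels, n_factors)) for _ in range(2)]
    y = np.zeros((T, n_channels))
    factors = np.zeros(n_factors)
    regime = 0
    for t in range(T):
        if rng.random() < switch_prob:            # Markov-style regime switch
            regime = 1 - regime
        factors = 0.9 * factors + rng.normal(size=n_factors)  # AR(1) factors
        y[t] = mixings[regime] @ factors + 0.1 * rng.normal(size=n_channels)
    return y

series = sample_dyn_lmc()
print(series.shape)  # → (200, 3)
```

Rolling-window correlations of such a series shift abruptly at regime switches, the kind of dynamics the paper argues static generators miss.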

[520] Towards Scaling Law Analysis For Spatiotemporal Weather Data

Alexander Kiefer, Prasanna Balaprakash, Xiao Wang

Main category: cs.LG

TL;DR: This paper extends neural scaling analysis for autoregressive weather forecasting from single-step training to long rollouts and per-channel metrics, revealing strong heterogeneity across channels and forecast horizons that challenges pooled global metrics.

DetailsMotivation: Weather forecasting presents unique challenges for compute-optimal scaling analysis compared to NLP/CV: autoregressive rollouts compound errors, outputs have disparate physical channels with different scales/predictability, and global metrics can mask per-channel degradation at late forecast horizons.

Method: Extends neural scaling analysis to examine: 1) error distribution across channels and growth with forecast horizon, 2) power law scaling for test error relative to rollout length, and 3) how scaling fits vary jointly with horizon and channel across parameter, data, and compute axes.

Result: Found strong cross-channel and cross-horizon heterogeneity - pooled scaling can appear favorable while many individual channels degrade at late leads, revealing limitations of global metrics for weather forecasting.

Conclusion: Results have implications for weighted objectives, horizon-aware curricula, and resource allocation across outputs in weather forecasting models, highlighting the need for more nuanced evaluation beyond pooled global metrics.

Abstract: Compute-optimal scaling laws are relatively well studied for NLP and CV, where objectives are typically single-step and targets are comparatively homogeneous. Weather forecasting is harder to characterize in the same framework: autoregressive rollouts compound errors over long horizons, outputs couple many physical channels with disparate scales and predictability, and globally pooled test metrics can disagree sharply with per-channel, late-lead behavior implied by short-horizon training. We extend neural scaling analysis for autoregressive weather forecasting from single-step training loss to long rollouts and per-channel metrics. We quantify (1) how prediction error is distributed across channels and how its growth rate evolves with forecast horizon, (2) whether power-law scaling holds for test error relative to rollout length when error is pooled globally, and (3) how that fit varies jointly with horizon and channel for parameter, data, and compute-based scaling axes. We find strong cross-channel and cross-horizon heterogeneity: pooled scaling can look favorable while many channels degrade at late leads. We discuss implications for weighted objectives, horizon-aware curricula, and resource allocation across outputs.
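The power-law fit the paper examines, test error as a function of rollout length, is typically estimated by linear regression in log-log space. The sketch below uses synthetic, noiseless data for illustration only; it is not the paper's measurements.

```python
import numpy as np

def fit_power_law(lengths, errors):
    """Fit E(L) = a * L^b by least squares on log E = log a + b log L.
    Sketch of the scaling-fit step; data below is synthetic."""
    logL, logE = np.log(lengths), np.log(errors)
    b, log_a = np.polyfit(logL, logE, 1)   # slope = exponent, intercept = log a
    return np.exp(log_a), b

lengths = np.array([1, 2, 4, 8, 16, 32])
errors = 0.5 * lengths ** 0.7              # exact power law for the demo
a, b = fit_power_law(lengths, errors)
print(round(a, 3), round(b, 3))  # → 0.5 0.7
```

Running this fit per channel and per horizon, rather than once on pooled error, is what exposes the heterogeneity the paper reports.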

[521] Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

Ximing Xing, Ziteng Xue, Zhenxi Li, Weicong Liang, Linqing Wang, Zhantao Yang, Tiankai Hang, Zijin Yin, Qinglin Lu, Chunyu Wang, Qian Yu

Main category: cs.LG

TL;DR: HiVG introduces hierarchical SVG tokenization for autoregressive vector graphics generation, improving spatial consistency and sequence efficiency over conventional byte-level tokenization.

DetailsMotivation: Existing SVG generation approaches use generic byte-level tokenization from NLP, which fragments numerical coordinates, destroys spatial relationships, causes token redundancy, coordinate hallucination, and inefficient long-sequence generation for vector graphics.

Method: Proposes HiVG: hierarchical SVG tokenization framework with atomic tokens and geometry-constrained segment tokens, plus Hierarchical Mean-Noise initialization for numerical ordering signals, and curriculum training that progressively increases program complexity.

Result: Extensive experiments on text-to-SVG and image-to-SVG tasks show improved generation fidelity, spatial consistency, and sequence efficiency compared to conventional tokenization schemes.

Conclusion: HiVG addresses fundamental limitations of conventional tokenization for SVG generation, enabling more stable learning of executable SVG programs through hierarchical tokenization and specialized training strategies.

Abstract: Recent large language models have shifted SVG generation from differentiable rendering optimization to autoregressive program synthesis. However, existing approaches still rely on generic byte-level tokenization inherited from natural language processing, which poorly reflects the geometric structure of vector graphics. Numerical coordinates are fragmented into discrete symbols, destroying spatial relationships and introducing severe token redundancy, often leading to coordinate hallucination and inefficient long-sequence generation. To address these challenges, we propose HiVG, a hierarchical SVG tokenization framework tailored for autoregressive vector graphics generation. HiVG decomposes raw SVG strings into structured atomic tokens and further compresses executable command–parameter groups into geometry-constrained segment tokens, substantially improving sequence efficiency while preserving syntactic validity. To further mitigate spatial mismatch, we introduce a Hierarchical Mean–Noise (HMN) initialization strategy that injects numerical ordering signals and semantic priors into new token embeddings. Combined with a curriculum training paradigm that progressively increases program complexity, HiVG enables more stable learning of executable SVG programs. Extensive experiments on both text-to-SVG and image-to-SVG tasks demonstrate improved generation fidelity, spatial consistency, and sequence efficiency compared with conventional tokenization schemes.
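Grouping an SVG path into command-plus-parameter units, rather than byte-level symbols, can be sketched as below. This toy tokenizer covers only a few path commands and is illustrative of the general idea, not HiVG's actual scheme.

```python
import re

def segment_tokens(path_d):
    """Toy sketch of grouping an SVG path string into command+parameter
    'segment tokens' instead of byte-level symbols, in the spirit of
    hierarchical SVG tokenization. Handles only M/L/C/Q/Z commands;
    illustrative only, not the paper's tokenizer."""
    # split on command letters, keeping each letter with its numbers
    parts = re.findall(r"[MLCQZz][^MLCQZz]*", path_d)
    tokens = []
    for p in parts:
        cmd = p[0]
        nums = tuple(float(x) for x in re.findall(r"-?\d+\.?\d*", p))
        tokens.append((cmd, nums))
    return tokens

print(segment_tokens("M10 20 L30 40 Z"))
# → [('M', (10.0, 20.0)), ('L', (30.0, 40.0)), ('Z', ())]
```

Each tuple keeps a coordinate pair intact as one unit, which is the property byte-level tokenization destroys.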

[522] Feature-Aware Anisotropic Local Differential Privacy for Utility-Preserving Graph Representation Learning in Metal Additive Manufacturing

MD Shafikul Islam, Mahathir Mohammad Bappy, Saifur Rahman Tushar, Md Arifuzzaman

Main category: cs.LG

TL;DR: FI-LDP-HGAT: A privacy-preserving framework for metal additive manufacturing defect detection that combines hierarchical graph attention networks with feature-importance-aware local differential privacy.

DetailsMotivation: Metal AM produces safety-critical components but requires proprietary sensor data for quality assurance, limiting collaborative data sharing. Existing defect detection models ignore layer-wise physical couplings, and conventional privacy techniques cause severe utility degradation.

Method: Combines stratified Hierarchical Graph Attention Network (HGAT) to capture spatial/thermal dependencies across scan tracks and layers, with feature-importance-aware anisotropic Gaussian mechanism (FI-LDP) that redistributes privacy budget using encoder-derived importance priors.

Result: Achieves 81.5% utility recovery at epsilon=4 and maintains defect recall of 0.762 at epsilon=2, outperforming classical ML, standard GNNs, and alternative privacy mechanisms including DP-SGD across all metrics.

Conclusion: FI-LDP-HGAT effectively balances privacy and utility in AM defect detection by intelligently allocating noise based on feature importance while maintaining formal LDP guarantees.

Abstract: Metal additive manufacturing (AM) enables the fabrication of safety-critical components, but reliable quality assurance depends on high-fidelity sensor streams containing proprietary process information, limiting collaborative data sharing. Existing defect-detection models typically treat melt-pool observations as independent samples, ignoring layer-wise physical couplings. Moreover, conventional privacy-preserving techniques, particularly Local Differential Privacy (LDP), lead to severe utility degradation because they inject uniform noise across all feature dimensions. To address these interrelated challenges, we propose FI-LDP-HGAT. This computational framework combines two methodological components: a stratified Hierarchical Graph Attention Network (HGAT) that captures spatial and thermal dependencies across scan tracks and deposited layers, and a feature-importance-aware anisotropic Gaussian mechanism (FI-LDP) for non-interactive feature privatization. Unlike isotropic LDP, FI-LDP redistributes the privacy budget across embedding coordinates using an encoder-derived importance prior, assigning lower noise to task-critical thermal signatures and higher noise to redundant dimensions while maintaining formal LDP guarantees. Experiments on a Directed Energy Deposition (DED) porosity dataset demonstrate that FI-LDP-HGAT achieves 81.5% utility recovery at a moderate privacy budget (epsilon = 4) and maintains defect recall of 0.762 under strict privacy (epsilon = 2), while outperforming classical ML, standard GNNs, and alternative privacy mechanisms, including DP-SGD across all evaluated metrics. Mechanistic analysis confirms a strong negative correlation (Spearman = -0.81) between feature importance and noise magnitude, providing interpretable evidence that the privacy-utility gains are driven by principled anisotropic allocation.
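The anisotropic idea, splitting a global privacy budget across feature dimensions in proportion to importance so that task-critical features receive less noise, can be sketched with a Gaussian mechanism. The budget accounting here is deliberately simplified and the function names are hypothetical; this is not the paper's formal FI-LDP analysis.

```python
import numpy as np

def anisotropic_gaussian(x, importance, eps_total=4.0, delta=1e-5, sens=1.0, rng=None):
    """Sketch of a feature-importance-aware Gaussian mechanism: the total
    budget eps_total is split across dimensions in proportion to importance,
    so important features get more budget and hence less noise. Simplified
    composition accounting; illustrative only."""
    rng = rng or np.random.default_rng(0)
    w = importance / importance.sum()
    eps = eps_total * w                                   # per-dimension budget
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sens / eps  # Gaussian-mechanism scale
    return x + sigma * rng.normal(size=x.shape), sigma

x = np.array([1.0, 1.0, 1.0])
importance = np.array([0.7, 0.2, 0.1])     # hypothetical importance prior
noisy, sigma = anisotropic_gaussian(x, importance)
print(sigma.round(2))  # least noise on the most important feature
```

The negative correlation between importance and noise magnitude that the paper reports (Spearman = -0.81) is the empirical signature of exactly this kind of allocation.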

[523] Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner

Andrei Polubarov, Lyubaykin Nikita, Alexander Derevyagin, Artyom Grishin, Igor Saprygin, Aleksandr Serkov, Mark Averchenko, Daniil Tikhonov, Maksim Zhdanov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Alexey Zemtsov, Vladislav Kurenkov

Main category: cs.LG

TL;DR: Scaling Decision Pre-Trained Transformer (DPT) with Flow Matching to multi-domain environments for improved in-context reinforcement learning generalization across hundreds of tasks.

DetailsMotivation: Previous in-context RL approaches like Algorithm Distillation had limited generalization to unseen tasks, and while DPT showed promise in simplified domains, its scalability to diverse multi-domain settings hadn't been established.

Method: Extends Decision Pre-Trained Transformer (DPT) to multi-domain environments using Flow Matching as a training approach that preserves its Bayesian posterior sampling interpretation, training across hundreds of diverse tasks.

Result: Achieves clear gains in generalization to held-out test sets, improves upon prior Algorithm Distillation scaling, and demonstrates stronger performance in both online and offline inference.

Conclusion: Reinforces in-context reinforcement learning as a viable alternative to expert distillation for training generalist agents, with DPT+Flow Matching showing superior scalability and generalization.

Abstract: Recent progress in in-context reinforcement learning (ICRL) has demonstrated its potential for training generalist agents that can acquire new tasks directly at inference. Algorithm Distillation (AD) pioneered this paradigm and was subsequently scaled to multi-domain settings, although its ability to generalize to unseen tasks remained limited. The Decision Pre-Trained Transformer (DPT) was introduced as an alternative, showing stronger in-context reinforcement learning abilities in simplified domains, but its scalability had not been established. In this work, we extend DPT to diverse multi-domain environments, applying Flow Matching as a natural training choice that preserves its interpretation as Bayesian posterior sampling. As a result, we obtain an agent trained across hundreds of diverse tasks that achieves clear gains in generalization to the held-out test set. This agent improves upon prior AD scaling and demonstrates stronger performance in both online and offline inference, reinforcing ICRL as a viable alternative to expert distillation for training generalist agents.
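The conditional flow-matching objective used to train generative models of this kind can be sketched in one sample: interpolate between a source and target point and regress the model's vector field onto the constant displacement. The toy "model" below is hypothetical and detached from the paper's DPT architecture.

```python
import numpy as np

def flow_matching_loss(v, x0, x1, rng):
    """One-sample sketch of the conditional flow-matching objective:
    draw t ~ U(0,1), form x_t = (1-t) x0 + t x1, and regress the vector
    field v(x_t, t) onto the target x1 - x0. Illustrative only."""
    t = rng.random()
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = v(x_t, t)
    return np.mean((pred - target) ** 2)

# hypothetical 'model' that guesses the displacement exactly, so loss is zero
x0, x1 = np.zeros(4), np.ones(4)
v = lambda x_t, t: x1 - x0
print(flow_matching_loss(v, x0, x1, np.random.default_rng(0)))  # → 0.0
```

Sampling from the trained vector field by integrating from noise is what the paper interprets as Bayesian posterior sampling over actions.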

[524] Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

Lucas Dionisopoulos, Nicklas Majamaki, Prithviraj Ammanabrolu

Main category: cs.LG

TL;DR: Fine-tuning language models for chess reasoning shows that direct move prediction leads to strong RL performance but unfaithful reasoning, while multi-move trajectory training yields comparable performance with faithful reasoning and more stable RL.

DetailsMotivation: To understand how reasoning evolves in language models from supervised fine-tuning to reinforcement learning, specifically for tasks they natively struggle with, using chess as a case study.

Method: Analyzed reasoning evolution through SFT to RL using theoretically-inspired datasets in chess. Compared two approaches: 1) fine-tuning to directly predict best moves, and 2) training on multi-move trajectories. Evaluated reasoning faithfulness, move quality, hallucination rates, and checkpoint metrics.

Result: Direct move prediction leads to effective RL and strongest performance but unfaithful reasoning. Multi-move trajectory training yields comparable performance with faithful reasoning and more stable RL. RL improves move quality distribution and reduces hallucinations. SFT-checkpoint metrics predict post-RL performance.

Conclusion: Training strategy significantly impacts reasoning faithfulness in RL. Multi-move trajectory training balances performance and faithful reasoning. SFT metrics can predict RL outcomes. Achieved state-of-the-art chess reasoning with 7B model.

Abstract: How can you get a language model to reason through a task it natively struggles with? We study how reasoning evolves in a language model – from supervised fine-tuning (SFT) to reinforcement learning (RL) – by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance – however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics – metrics spanning evaluation performance, hallucination rates, and reasoning quality – to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.

[525] Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning

Neharika Jali, Anupam Nayak, Gauri Joshi

Main category: cs.LG

TL;DR: TAB: Turn-Adaptive Budgets - a policy for multi-turn reasoning that adaptively allocates compute tokens across conversation turns to maximize accuracy while respecting global token constraints, saving 35-40% tokens.

DetailsMotivation: As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking. Prior approaches focus on single-turn settings and fail to address sequential dependencies in multi-turn reasoning.

Method: Formulate multi-turn reasoning as sequential compute allocation using multi-objective Markov Decision Process. Train TAB policy via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints.

Result: TAB achieves superior accuracy-tokens tradeoff, saving up to 35% tokens while maintaining accuracy over static and off-the-shelf LLM budget baselines. TAB All-SubQ (with future sub-question knowledge) saves up to 40% tokens.

Conclusion: TAB provides an effective approach for adaptive compute allocation in multi-turn reasoning, significantly improving inference efficiency while maintaining accuracy.

Abstract: As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces even for simple queries. Prior approaches, including length regularization, adaptive routing, and difficulty-based budget allocation, primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes the conversation history as input and learns to adaptively allocate smaller budgets to easier turns, saving an appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-tokens tradeoff, saving up to 35% of tokens while maintaining accuracy over static and off-the-shelf LLM budget baselines. Further, for systems where a plan of all sub-questions is available a priori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history and all past and future sub-questions, saving up to 40% of tokens over baselines.
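The allocation problem TAB learns can be illustrated with a simple proportional heuristic: split a global token budget across turns by estimated difficulty, with a floor per turn. This is a hypothetical baseline heuristic, not the learned GRPO policy.

```python
def allocate_budgets(difficulties, total_budget, min_budget=16):
    """Toy sketch of turn-adaptive budget allocation: split a global token
    budget across turns in proportion to estimated difficulty, with a
    per-turn floor. Illustrative heuristic, not the learned TAB policy."""
    total_diff = sum(difficulties)
    spare = total_budget - min_budget * len(difficulties)
    return [min_budget + int(spare * d / total_diff) for d in difficulties]

# three turns of increasing difficulty sharing a 1000-token budget
print(allocate_budgets([1, 3, 6], total_budget=1000))  # → [111, 301, 587]
```

The learned policy replaces the fixed difficulty scores with a function of the conversation history, which is what lets it adapt within a dialogue.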

[526] General Multimodal Protein Design Enables DNA-Encoding of Chemistry

Jarrid Rector-Brooks, Théophile Lambert, Marta Skreta, Daniel Roth, Yueming Long, Zi-Qi Li, Xi Zhang, Miruna Cretu, Francesca-Zhoufan Li, Tanvi Ganapathy, Emily Jin, Avishek Joey Bose, Jason Yang, Kirill Neklyudov, Yoshua Bengio, Alexander Tong, Frances H. Arnold, Cheng-Hao Liu

Main category: cs.LG

TL;DR: DISCO is a multimodal diffusion model that co-designs protein sequence and 3D structure around arbitrary biomolecules to create novel enzymes for new-to-nature reactions without pre-specifying catalytic residues.

DetailsMotivation: Evolution has explored only a narrow slice of possible enzymatic chemistry. While deep generative models can design ligand-binding proteins, none have created enzymes without pre-specifying catalytic residues. The goal is to develop a scalable method for designing evolvable enzymes for new-to-nature reactions.

Method: DISCO (DIffusion for Sequence-structure CO-design) is a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules. It uses inference-time scaling methods to optimize objectives across both modalities. The model is conditioned solely on reactive intermediates.

Result: DISCO designed diverse heme enzymes with novel active-site geometries that catalyze new-to-nature carbene-transfer reactions including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp³)-H insertions. These enzymes showed high activities exceeding those of engineered enzymes. Random mutagenesis further improved enzyme activity through directed evolution.

Conclusion: DISCO provides a scalable route to evolvable enzymes, broadening the potential scope of genetically encodable transformations by enabling the design of novel enzymes for new-to-nature reactions without pre-specifying catalytic residues.

Abstract: Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp$^3$)-H insertions, with high activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further confirmed that enzyme activity can be improved through directed evolution. By providing a scalable route to evolvable enzymes, DISCO broadens the potential scope of genetically encodable transformations. Code is available at https://github.com/DISCO-design/DISCO.

[527] Cross-fitted Proximal Learning for Model-Based Reinforcement Learning

Nishanth Venkatesh, Andreas A. Malikopoulos

Main category: cs.LG

TL;DR: The paper develops statistical methods for estimating bridge functions in confounded partially observable Markov decision processes (POMDPs) to enable unbiased policy evaluation from offline observational data with hidden confounding.

DetailsMotivation: In offline reinforcement learning with hidden confounding, models learned directly from observational data can be biased, especially in partially observable systems where latent factors affect actions, rewards, and observations. Existing bridge function methods for policy evaluation in confounded POMDPs need improved statistical estimation techniques.

Method: The authors formulate bridge learning as a conditional moment restriction problem with nuisance objects given by conditional mean embedding and conditional density. They develop a K-fold cross-fitted extension of existing two-stage bridge estimators to use data more efficiently than single sample splits.

Result: The proposed cross-fitted estimator preserves the original bridge-based identification strategy while improving data efficiency. The authors derive an oracle-comparator bound and decompose error into Stage I (nuisance estimation) and Stage II (empirical averaging) terms.

Conclusion: The work provides improved statistical methods for estimating bridge functions in confounded POMDPs, enabling more reliable offline policy evaluation in partially observable systems with hidden confounding.

Abstract: Model-based reinforcement learning is attractive for sequential decision-making because it explicitly estimates reward and transition models and then supports planning through simulated rollouts. In offline settings with hidden confounding, however, models learned directly from observational data may be biased. This challenge is especially pronounced in partially observable systems, where latent factors may jointly affect actions, rewards, and future observations. Recent work has shown that policy evaluation in such confounded partially observable Markov decision processes (POMDPs) can be reduced to estimating reward-emission and observation-transition bridge functions satisfying conditional moment restrictions (CMRs). In this paper, we study the statistical estimation of these bridge functions. We formulate bridge learning as a CMR problem with nuisance objects given by a conditional mean embedding and a conditional density. We then develop a $K$-fold cross-fitted extension of the existing two-stage bridge estimator. The proposed procedure preserves the original bridge-based identification strategy while using the available data more efficiently than a single sample split. We also derive an oracle-comparator bound for the cross-fitted estimator and decompose the resulting error into a Stage I term induced by nuisance estimation and a Stage II term induced by empirical averaging.
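The K-fold cross-fitting pattern at the heart of the proposal can be sketched generically: the nuisance model used for each fold is fit on the other K-1 folds, so every prediction comes from a model that never saw that sample. The fold-wise mean "nuisance" below is a hypothetical stand-in, not the paper's bridge-function estimator.

```python
import numpy as np

def cross_fit(X, y, fit_nuisance, K=5, seed=0):
    """Generic K-fold cross-fitting sketch: for each fold, fit the nuisance
    on the remaining folds and predict on the held-out fold. Illustrative
    of the scheme, not the paper's two-stage bridge estimator."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)
    preds = np.empty(len(X))
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit_nuisance(X[train], y[train])   # Stage I: nuisance fit
        preds[test] = model(X[test])               # Stage II: held-out predictions
    return preds

# toy nuisance: predict the training-fold mean (hypothetical stand-in)
fit_mean = lambda X, y: (lambda Xt: np.full(len(Xt), y.mean()))
preds = cross_fit(np.arange(20.0), np.arange(20.0), fit_mean)
print(preds.shape)  # → (20,)
```

Compared with a single sample split, every observation contributes to both stages, which is the efficiency gain the paper formalizes in its oracle-comparator bound.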

[528] FNO$^{\angle θ}$: Extended Fourier neural operator for learning state and optimal control of distributed parameter systems

Zhexian Li, Ketan Savla

Main category: cs.LG

TL;DR: Extended Fourier neural operator (FNO) architecture for learning optimal control of PDE systems, with modifications to handle complex frequency domains for better representation of linear PDE solutions.

DetailsMotivation: To develop neural operators that can effectively learn state and optimal control for systems governed by partial differential equations, addressing limitations of existing FNO architectures in handling non-periodic boundary conditions and improving training accuracy.

Method: Extends FNO by modifying the Fourier transform layer to operate in complex frequency domain instead of just real domain, motivated by the Ehrenpreis-Palamodov fundamental principle which shows linear PDE solutions can be represented as complex integrals. This captures more general solution structures.

Result: Demonstrates order of magnitude improvements in training errors and more accurate predictions of non-periodic boundary values compared to standard FNO, tested on nonlinear Burgers’ equation.

Conclusion: The extended FNO architecture successfully captures integral representations of PDE solutions in complex domain, leading to significant performance improvements for learning optimal control problems.

Abstract: We propose an extended Fourier neural operator (FNO) architecture for learning state and linear quadratic additive optimal control of systems governed by partial differential equations. Using the Ehrenpreis-Palamodov fundamental principle, we show that any state and optimal control of linear PDEs with constant coefficients can be represented as an integral in the complex domain. The integrand of this representation involves the same exponential term as in the inverse Fourier transform, where the latter is used to represent the convolution operator in the FNO layer. Motivated by this observation, we modify the FNO layer by extending the frequency variable in the inverse Fourier transform from the real to the complex domain to capture the integral representation from the fundamental principle. We illustrate the performance of the extended FNO in learning state and optimal control for the nonlinear Burgers’ equation, showing order-of-magnitude improvements in training errors and more accurate predictions of non-periodic boundary values over standard FNO.
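For reference, the standard FNO spectral convolution that the paper extends can be sketched in one dimension: transform to Fourier space, multiply the retained low modes by learned complex weights, and transform back. The paper's modification, replacing the real frequency variable in the inverse transform with a complex one, is described in the lead-in only and omitted from this sketch.

```python
import numpy as np

def fno_layer(u, weights):
    """Minimal 1-D sketch of the standard FNO spectral convolution:
    FFT, multiply the first `len(weights)` modes by learned complex
    weights, zero the rest, inverse FFT. The paper's complex-frequency
    extension is not implemented here; illustrative only."""
    modes = len(weights)
    u_hat = np.fft.rfft(u)
    u_hat[:modes] *= weights          # learned mode-wise multipliers
    u_hat[modes:] = 0                 # truncate high frequencies
    return np.fft.irfft(u_hat, n=len(u))

u = np.sin(np.linspace(0, 2 * np.pi, 64, endpoint=False))
out = fno_layer(u, weights=np.ones(8, dtype=complex))
print(out.shape)  # → (64,)
```

With identity weights and a band-limited input, the layer is a near-exact pass-through, which makes the mode-truncation behavior easy to verify.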

[529] Vehicle-as-Prompt: A Unified Deep Reinforcement Learning Framework for Heterogeneous Fleet Vehicle Routing Problem

Shihong Huang, Shengjie Wang, Lei Gao, Hong Ma, Zhanluo Zhang, Feng Zhang, Weihua Zhou

Main category: cs.LG

TL;DR: A unified DRL framework with Vehicle-as-Prompt mechanism solves heterogeneous fleet vehicle routing problems with complex constraints, outperforming existing neural solvers and achieving competitive results with traditional heuristics.

DetailsMotivation: Existing DRL-based methods are limited to homogeneous vehicle routing problems, leading to suboptimal performance for heterogeneous fleet problems with complex real-world constraints that have higher computational complexity.

Method: Proposes VaP-CSMV framework with Vehicle-as-Prompt mechanism that formulates the problem as single-stage autoregressive decision process, featuring cross-semantic encoder and multi-view decoder to capture relationships between vehicle heterogeneity and customer attributes.

Result: VaP-CSMV significantly outperforms existing DRL-based neural solvers, achieves competitive solution quality compared to traditional heuristic solvers, reduces inference time to seconds, and shows strong zero-shot generalization on large-scale unseen variants.

Conclusion: The proposed framework effectively solves heterogeneous fleet vehicle routing problems with complex constraints, demonstrating superior performance, efficiency, and generalization capabilities compared to existing approaches.

Abstract: Unlike traditional homogeneous routing problems, the Heterogeneous Fleet Vehicle Routing Problem (HFVRP) involves heterogeneous fixed costs, variable travel costs, and capacity constraints, rendering solution quality highly sensitive to vehicle selection. Furthermore, real-world logistics applications often impose additional complex constraints, markedly increasing computational complexity. However, most existing Deep Reinforcement Learning (DRL)-based methods are restricted to homogeneous scenarios, leading to suboptimal performance when applied to HFVRP and its complex variants. To bridge this gap, we investigate HFVRP under complex constraints and develop a unified DRL framework capable of solving the problem across various variant settings. We introduce the Vehicle-as-Prompt (VaP) mechanism, which formulates the problem as a single-stage autoregressive decision process. Building on this, we propose VaP-CSMV, a framework featuring a cross-semantic encoder and a multi-view decoder that effectively addresses various problem variants and captures the complex mapping relationships between vehicle heterogeneity and customer node attributes. Extensive experimental results demonstrate that VaP-CSMV significantly outperforms existing state-of-the-art DRL-based neural solvers and achieves competitive solution quality compared to traditional heuristic solvers, while reducing inference time to mere seconds. Furthermore, the framework exhibits strong zero-shot generalization capabilities on large-scale and previously unseen problem variants, while ablation studies validate the vital contribution of each component.
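The HFVRP objective the abstract describes, heterogeneous fixed costs plus vehicle-specific variable travel costs, can be written down directly. All vehicle names, costs, and distances below are hypothetical; this evaluates a candidate solution, it does not implement the VaP-CSMV solver.

```python
def fleet_cost(routes, fixed, per_km, dist):
    """Toy HFVRP objective sketch: each used vehicle pays its fixed cost
    plus a vehicle-specific per-distance cost over its depot-to-depot route.
    Illustrative only; the depot is node 0."""
    total = 0.0
    for veh, route in routes.items():
        if not route:
            continue                              # unused vehicle: no fixed cost
        total += fixed[veh]
        stops = [0] + route + [0]                 # leave and return to the depot
        total += per_km[veh] * sum(dist[a][b] for a, b in zip(stops, stops[1:]))
    return total

dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]          # hypothetical distance matrix
cost = fleet_cost({"truck": [1], "van": [2]},
                  fixed={"truck": 10.0, "van": 5.0},
                  per_km={"truck": 2.0, "van": 1.0},
                  dist=dist)
print(cost)  # → 23.0
```

Because the fixed and per-distance terms differ per vehicle, the same set of routes can change cost dramatically under a different vehicle assignment, which is why solution quality is so sensitive to vehicle selection.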

[530] On the Geometry of Positional Encodings in Transformers

Giansalvo Cirrincione

Main category: cs.LG

TL;DR: Theoretical analysis of positional encodings in Transformers, establishing necessity theorems, optimal encoding construction via MDS, and showing ALiBi achieves lower stress than sinusoidal and RoPE encodings.

DetailsMotivation: Positional encodings are crucial for Transformers to process word order, but have been designed empirically without theoretical foundation. The paper aims to develop a mathematical theory of what positional encodings should do.

Method: Develops theoretical framework with four main results: necessity theorem for positional signals, positional separation theorem, optimal encoding construction via multidimensional scaling on Hellinger distance, and minimal parametrization analysis. Includes NTK regime proofs and experiments on SST-2 and IMDB datasets with BERT-base.

Result: Establishes theoretical foundations for positional encodings, shows ALiBi achieves much lower stress than sinusoidal and RoPE encodings, consistent with rank-1 interpretation under approximate shift-equivariance. Theoretical predictions confirmed experimentally.

Conclusion: Provides first comprehensive mathematical theory of positional encodings, establishes optimal encoding construction method, and demonstrates practical superiority of ALiBi encoding through both theory and experiments.

Abstract: Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) <= n-1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance.
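The optimal-encoding construction the abstract describes (classical MDS on a Hellinger distance matrix, with stress as the single quality number) can be sketched in a few lines of numpy. The Gaussian-bump positional distributions below are purely illustrative stand-ins; only the Hellinger / double-centering / eigendecomposition / stress pipeline follows the paper's recipe:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

n, d = 16, 4                      # sequence length, encoding dimension
# toy positional distributions: a Gaussian bump centred at each position
xs = np.arange(n)
P = np.exp(-0.5 * (xs[None, :] - xs[:, None]) ** 2 / 4.0)
P /= P.sum(axis=1, keepdims=True)

D = np.array([[hellinger(P[i], P[j]) for j in range(n)] for i in range(n)])

# classical MDS: double-centre the squared distances, then eigendecompose
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:d]
E = V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))   # n x d positional encoding

# stress: how faithfully encoding distances reproduce the target distances
D_hat = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
stress = np.sqrt(np.sum((D - D_hat) ** 2) / np.sum(D ** 2))
```

The same stress formula is what lets any candidate encoding (sinusoidal, RoPE, ALiBi) be scored against the MDS-optimal one.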

[531] Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

Anas Jnini, Elham Kiyani, Khemraj Shukla, Jorge F. Urban, Nazanin Ahmadi Daryakenari, Johannes Muller, Marius Zeinhofer, George Em Karniadakis

Main category: cs.LG

TL;DR: Advanced optimization strategies for physics-informed neural networks (PINNs) including Natural Gradient, Self-Scaling BFGS, and Broyden optimizers to accelerate convergence for solving differential equations.

DetailsMotivation: To improve the efficiency and robustness of neural network optimization for scientific machine learning, particularly for physics-informed neural networks solving challenging PDEs and ODEs, enabling faster convergence to high accuracy for complex physical behavior.

Method: Developed efficient implementations of Natural Gradient optimizer, Self-Scaling BFGS, and Broyden optimizers for PINNs. Proposed new PINN-based methods for solving inviscid Burgers and Euler equations, and addressed scaling challenges for batched training.

Result: Demonstrated performance on various problems including Helmholtz equation, Stokes flow, inviscid Burgers equation, Euler equations for high-speed flows, and stiff ODEs in pharmacokinetics/pharmacodynamics. Compared solutions against high-order numerical methods for rigorous assessment.

Conclusion: Advanced optimization strategies significantly accelerate PINN convergence for challenging differential equations, with scalable solutions for large data-driven problems through batched training approaches.

Abstract: Efficient and robust optimization is essential for neural networks, enabling scientific machine learning models to converge rapidly to very high accuracy – faithfully capturing complex physical behavior governed by differential equations. In this work, we present advanced optimization strategies to accelerate the convergence of physics-informed neural networks (PINNs) for challenging partial differential equations (PDEs) and ordinary differential equations (ODEs). Specifically, we provide efficient implementations of the Natural Gradient (NG) optimizer, Self-Scaling BFGS and Broyden optimizers, and demonstrate their performance on problems including the Helmholtz equation, Stokes flow, inviscid Burgers equation, Euler equations for high-speed flows, and stiff ODEs arising in pharmacokinetics and pharmacodynamics. Beyond optimizer development, we also propose new PINN-based methods for solving the inviscid Burgers and Euler equations, and compare the resulting solutions against high-order numerical methods to provide a rigorous and fair assessment. Finally, we address the challenge of scaling these quasi-Newton optimizers for batched training, enabling efficient and scalable solutions for large data-driven problems.

[532] Improving Sparse Memory Finetuning

Satyam Goyal, Anirudh Kanchi, Garv Shah, Prakhar Gupta

Main category: cs.LG

TL;DR: Sparse Memory Finetuning (SMF) retrofits pretrained LLMs with sparse memory modules for continual learning, using KL divergence-based slot selection to update only surprising tokens, minimizing catastrophic forgetting on consumer hardware.

DetailsMotivation: Real-world LLM applications require continual adaptation to new knowledge without degrading existing capabilities, but standard approaches (full finetuning, LoRA) suffer from catastrophic forgetting due to modifying shared dense representations.

Method: Retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, introducing a theoretically grounded slot-selection mechanism based on KL divergence to prioritize memory updates for informationally “surprising” tokens relative to a background distribution.

Result: Retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting on consumer hardware.

Conclusion: Sparse Memory Finetuning offers an effective alternative to standard continual learning approaches by localizing updates to explicit memory layers, enabling practical continual adaptation without catastrophic forgetting.

Abstract: Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, like full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental trade-off: catastrophic forgetting. They modify shared dense representations, causing interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for informationally “surprising” tokens relative to a background distribution. Our experiments demonstrate that our retrofitted models can acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.
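The KL-based slot selection can be sketched as scoring each token's predictive distribution against a background distribution and updating memory only for the most "surprising" tokens. The background and toy predictions below are illustrative; only the KL-ranking idea comes from the summary:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between discrete distributions (with smoothing)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def select_surprising(token_dists, background, k):
    """Rank tokens by KL(prediction || background) and keep the top-k as
    'surprising' -- the slot-selection signal described above."""
    scores = np.array([kl(p, background) for p in token_dists])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
vocab = 50
background = np.full(vocab, 1.0 / vocab)        # background distribution
dists = rng.dirichlet(np.ones(vocab), size=8)   # toy per-token predictions
chosen = select_surprising(dists, background, k=3)
```

Only memory slots attending to the chosen tokens would then receive gradient updates, localizing the write and limiting interference.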

[533] DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

Satyam Goyal, Kushal Patel, Tanush Mittal, Arjun Laxman

Main category: cs.LG

TL;DR: DualDiffusion is a speculative decoding framework for Masked Diffusion Models that uses fast drafter models with efficient approximations and slower verifier models to improve inference speed while maintaining accuracy.

DetailsMotivation: Masked Diffusion Models (MDMs) offer parallel token generation and bidirectional context modeling but suffer from slow inference due to O(N²) computations from bidirectional attention. Existing speedup methods sacrifice generation quality for speed.

Method: Proposes DualDiffusion, a speculative decoding framework combining fast drafter models (using efficient approximations) with slower, more accurate verifier models. Runs multiple steps of lightweight drafter followed by single verification step.

Result: Achieves superior Pareto frontier between generation steps and accuracy compared to existing approaches. Evaluated on MMLU and GSM8K, maintains high accuracy while reducing number of generation steps required.

Conclusion: DualDiffusion effectively pushes the quality-efficiency trade-off curve for masked diffusion language models, offering better speed-accuracy balance than previous methods.

Abstract: Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the inability to cache key-value pairs due to bidirectional attention, requiring $O(N^2)$ computations at each generation step. While recent methods like FastDLLM and DkvCache improve inference speed through attention approximations and caching strategies, they achieve speedups at the cost of generation quality. We propose DualDiffusion, a speculative decoding framework for MDMs that combines fast drafter models (using efficient approximations) with slower, more accurate verifier models. By running multiple steps of a lightweight drafter followed by a single verification step, DualDiffusion achieves a superior Pareto frontier between generation steps and accuracy compared to existing approaches. We evaluate our method on MMLU and GSM8K, demonstrating that DualDiffusion maintains high accuracy while reducing the number of generation steps required, effectively pushing the quality-efficiency trade-off curve for masked diffusion language models.
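The draft-then-verify loop can be illustrated with a toy in which the verifier is an oracle that knows the target sequence; in the actual framework a fast approximate MDM drafts and a full-attention MDM verifies, but both are stand-ins here:

```python
import numpy as np

MASK = -1

def drafter(seq, rng, vocab=10):
    """Cheap proposal: fill masked slots (stand-in for a fast drafter model)."""
    out = seq.copy()
    holes = np.where(out == MASK)[0]
    out[holes] = rng.integers(0, vocab, size=len(holes))
    return out

def verify_and_accept(seq, proposal, target):
    """Expensive check: accept proposed tokens the verifier agrees with
    (here the verifier trivially knows the target)."""
    ok = (proposal == target) & (seq == MASK)
    seq = seq.copy()
    seq[ok] = proposal[ok]
    return seq

def decode(target, seed=0):
    rng = np.random.default_rng(seed)
    seq = np.full_like(target, MASK)
    rounds = 0
    while (seq == MASK).any():
        proposal = drafter(seq, rng)                   # fast draft step
        seq = verify_and_accept(seq, proposal, target) # one verify step
        rounds += 1
    return seq, rounds

target = np.array([3, 1, 4, 1])
seq, rounds = decode(target)
```

The speedup comes from the verifier running far less often than the drafter while still gating every accepted token.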

[534] Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation

Umang Dobhal, Christina Garcia, Sozo Inoue

Main category: cs.LG

TL;DR: A temporal extension of TabDDPM that generates synthetic time-series data with temporal coherence using lightweight adapters and context-aware embeddings.

DetailsMotivation: TabDDPM generates high-quality synthetic tabular data but assumes sample independence, limiting its applicability to time-series domains where temporal dependencies are critical. The authors aim to extend diffusion models to handle sequential data synthesis while preserving temporal coherence.

Method: Proposes a temporal extension of TabDDPM with sequence awareness through lightweight temporal adapters and context-aware embedding modules. Reformulates sensor data into windowed sequences and explicitly models temporal context using timestep embeddings, conditional activity labels, and observed/missing masks.

Result: The approach generates temporally coherent synthetic sequences that closely resemble real-world sensor patterns. On the WISDM accelerometer dataset, achieves comparable classification performance (macro F1-score 0.64, accuracy 0.71) with enhanced temporal realism, diversity, and coherence compared to baselines.

Conclusion: Diffusion-based models provide effective and adaptable solutions for sequential data synthesis when equipped for temporal reasoning. The approach is particularly advantageous for minority class representation and preserving statistical alignment with real distributions.

Abstract: Diffusion models are increasingly being utilised to create synthetic tabular and time series data for privacy-preserving augmentation. Tabular Denoising Diffusion Probabilistic Models (TabDDPM) generate high-quality synthetic data from heterogeneous tabular datasets but assume independence between samples, limiting their applicability to time-series domains where temporal dependencies are critical. To address this, we propose a temporal extension of TabDDPM, introducing sequence awareness through the use of lightweight temporal adapters and context-aware embedding modules. By reformulating sensor data into windowed sequences and explicitly modeling temporal context via timestep embeddings, conditional activity labels, and observed/missing masks, our approach enables the generation of temporally coherent synthetic sequences. Validation using bigram transition matrices and autocorrelation analysis shows enhanced temporal realism, diversity, and coherence compared to baseline and interpolation techniques. On the WISDM accelerometer dataset, the proposed system produces synthetic time-series that closely resemble real-world sensor patterns and achieves comparable classification performance (macro F1-score 0.64, accuracy 0.71). This is especially advantageous for minority class representation and preserving statistical alignment with real distributions. These developments demonstrate that diffusion-based models provide effective and adaptable solutions for sequential data synthesis when they are equipped for temporal reasoning. Future work will explore scaling to longer sequences and integrating stronger temporal architectures.
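The windowed reformulation of a sensor stream can be sketched as follows; the window length, stride, and majority-vote labelling are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

def make_windows(x, labels, win, stride):
    """Reshape a (T, C) sensor stream into overlapping (N, win, C) windows,
    labelling each window by majority vote over its timesteps."""
    starts = np.arange(0, len(x) - win + 1, stride)
    windows = np.stack([x[i:i + win] for i in starts])
    win_labels = np.array([np.bincount(labels[i:i + win]).argmax()
                           for i in starts])
    return windows, win_labels

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))      # 100 timesteps, 3 accelerometer axes
labels = np.repeat([0, 1], 50)     # two activity segments
w, y = make_windows(x, labels, win=20, stride=10)
```

Each window (plus its activity label and an observed/missing mask) then becomes one conditioning unit for the diffusion model.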

[535] Jeffreys Flow: Robust Boltzmann Generators for Rare Event Sampling via Parallel Tempering Distillation

Guang Lin, Christian Moya, Di Qi, Xuda Ye

Main category: cs.LG

TL;DR: Jeffreys Flow: A generative framework using symmetric Jeffreys divergence to prevent mode collapse in sampling rough energy landscapes, applied to accelerate quantum thermal state sampling and correct biases in stochastic gradient methods.

DetailsMotivation: Current Boltzmann generators suffer from catastrophic mode collapse when using reverse KL divergence for sampling multi-modal distributions in physical systems with rough energy landscapes, missing specific modes and limiting their effectiveness.

Method: Introduces Jeffreys Flow framework that uses symmetric Jeffreys divergence to distill empirical sampling data from Parallel Tempering trajectories, balancing local target-seeking precision with global mode coverage to suppress mode collapse.

Result: Demonstrates scalability and accuracy on highly non-convex multidimensional benchmarks, including systematic correction of stochastic gradient biases in Replica Exchange Stochastic Gradient Langevin Dynamics and massive acceleration of exact importance sampling in Path Integral Monte Carlo for quantum thermal states.

Conclusion: Jeffreys Flow provides a robust generative framework that effectively mitigates mode collapse in sampling complex physical systems, offering improved performance for quantum thermal state sampling and bias correction in stochastic gradient methods.

Abstract: Sampling physical systems with rough energy landscapes is hindered by rare events and metastable trapping. While Boltzmann generators already offer a solution, their reliance on the reverse Kullback–Leibler divergence frequently induces catastrophic mode collapse, missing specific modes in multi-modal distributions. Here, we introduce the Jeffreys Flow, a robust generative framework that mitigates this failure by distilling empirical sampling data from Parallel Tempering trajectories using the symmetric Jeffreys divergence. This formulation effectively balances local target-seeking precision with global mode coverage. We show that minimizing Jeffreys divergence suppresses mode collapse and structurally corrects inherent inaccuracies via distillation of the empirical reference data. We demonstrate the framework’s scalability and accuracy on highly non-convex multidimensional benchmarks, including the systematic correction of stochastic gradient biases in Replica Exchange Stochastic Gradient Langevin Dynamics and the massive acceleration of exact importance sampling in Path Integral Monte Carlo for quantum thermal states.
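The symmetric Jeffreys divergence at the core of the method is simply the sum of the two KL directions, which is what balances the mode-seeking reverse KL (prone to collapsing onto one mode) against the mass-covering forward KL:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between discrete distributions (with smoothing)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q):
    """Symmetric Jeffreys divergence: J(p, q) = KL(p||q) + KL(q||p)."""
    return kl(p, q) + kl(q, p)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
```

Because the forward-KL term requires samples from the target, distilling from Parallel Tempering trajectories supplies the empirical reference data that makes minimizing this objective tractable.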

[536] LLMs Should Express Uncertainty Explicitly

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

Main category: cs.LG

TL;DR: LLMs trained to express uncertainty through verbalized confidence scores and explicit markers during reasoning, improving calibration and enabling better decision-making for tasks like abstention and retrieval.

DetailsMotivation: Current LLMs treat uncertainty as a latent quantity to estimate after generation rather than a signal to express during reasoning. The paper aims to study uncertainty as an interface for control in decision-making tasks like abstention, retrieval, and verification.

Method: Two complementary uncertainty interfaces: 1) Global interface where models verbalize calibrated confidence scores for final answers, and 2) Local interface where models emit explicit markers during reasoning when entering high-risk states. The interfaces are trained as task-matched communication.

Result: Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest Adaptive RAG controller with more selective retrieval. Reasoning-time uncertainty signaling makes failures visible during generation, improves wrong-answer coverage, and provides effective high-recall retrieval triggers. The interfaces work differently internally: verbal confidence refines how existing uncertainty is decoded, while reasoning-time signaling induces broader late-layer reorganization.

Conclusion: Effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust final answers, and local signals for deciding when intervention is needed during reasoning.

Abstract: Large language models are increasingly used in settings where uncertainty must drive decisions such as abstention, retrieval, and verification. Most existing methods treat uncertainty as a latent quantity to estimate after generation rather than a signal the model is trained to express. We instead study uncertainty as an interface for control. We compare two complementary interfaces: a global interface, where the model verbalizes a calibrated confidence score for its final answer, and a local interface, where the model emits an explicit marker during reasoning when it enters a high-risk state. These interfaces provide different but complementary benefits. Verbalized confidence substantially improves calibration, reduces overconfident errors, and yields the strongest overall Adaptive RAG controller while using retrieval more selectively. Reasoning-time uncertainty signaling makes previously silent failures visible during generation, improves wrong-answer coverage, and provides an effective high-recall retrieval trigger. Our findings further show that the two interfaces work differently internally: verbal confidence mainly refines how existing uncertainty is decoded, whereas reasoning-time signaling induces a broader late-layer reorganization. Together, these results suggest that effective uncertainty in LLMs should be trained as task-matched communication: global confidence for deciding whether to trust a final answer, and local signals for deciding when intervention is needed.
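Calibration of the global (verbalized-confidence) interface is conventionally measured with expected calibration error (ECE); the paper's exact metric is not stated in the summary, so the standard binned ECE below is an illustrative stand-in:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: per-bin |accuracy - mean confidence|,
    weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean()
                                     - confidences[mask].mean())
    return err
```

A controller can then threshold the verbalized score to decide whether to trust the answer, abstain, or trigger retrieval.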

[537] A Theoretical Framework for Statistical Evaluability of Generative Models

Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao

Main category: cs.LG

TL;DR: Theoretical framework for evaluating generative models, analyzing which metrics can be reliably estimated from finite samples vs. those that cannot due to rare events.

DetailsMotivation: Evaluation is challenging for generative models due to their open-ended nature - unclear which metrics are appropriate and whether they can be reliably evaluated from finite samples, unlike supervised learning where test error reliably approximates population error.

Method: Introduces theoretical framework for evaluating generative models, studying two categories of metrics: test-based metrics (including integral probability metrics/IPMs) and Rényi divergences. Analyzes evaluability from finite samples with approximation errors.

Result: IPMs with bounded test classes can be evaluated from finite samples up to multiplicative/additive approximation errors. With finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. Rényi and KL divergences are not evaluable from finite samples as their values can be critically determined by rare events.

Conclusion: Provides theoretical foundation for evaluating generative models, distinguishing between evaluable metrics (IPMs) and non-evaluable ones (Rényi divergences) from finite samples, with implications for perplexity evaluation.

Abstract: Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.
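For a finite class of bounded test functions, the plug-in IPM estimate that the evaluability results concern is straightforward to compute: take the supremum over the class of the gap between empirical means. The three test functions below are arbitrary illustrative choices:

```python
import numpy as np

def empirical_ipm(xs, ys, test_fns):
    """Plug-in IPM estimate over a finite class of bounded test functions:
    sup_f |mean f(x) - mean f(y)|."""
    return max(abs(float(np.mean(f(xs))) - float(np.mean(f(ys))))
               for f in test_fns)

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 2000)    # samples from the "model"
ys = rng.normal(0.5, 1.0, 2000)    # samples from the "ground truth"
fns = [np.tanh, np.sin, lambda x: np.clip(x, -1.0, 1.0)]
d_same = empirical_ipm(xs, xs, fns)
d_diff = empirical_ipm(xs, ys, fns)
```

Boundedness of the test class is what makes this estimate concentrate; Rényi/KL have no such plug-in guarantee because rare events can dominate their value.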

[538] Cross-Machine Anomaly Detection Leveraging Pre-trained Time-series Model

Yangmeng Li, Kei Sano, Toshihiro Kitao, Ryoji Anzaki, Yukiya Saitoh, Hironori Moki, Dragan Djurdjanovic

Main category: cs.LG

TL;DR: A cross-machine time-series anomaly detection framework that uses domain-invariant feature extraction from pre-trained foundation model MOMENT to generalize anomaly detection across different individual machines performing the same operations.

DetailsMotivation: Manufacturing requires reliable anomaly detection that can handle differences between nominally identical machines. Current methods struggle with cross-machine generalization when detecting anomalies using sensory data from different individual machines executing the same procedures.

Method: Proposes a framework integrating domain-invariant feature extraction with unsupervised anomaly detection. Uses pre-trained foundation model MOMENT for time-series embeddings, then employs Random Forest Classifiers to disentangle embeddings into machine-related and condition-related features, creating invariant representations across machines.

Result: Experiments on industrial dataset from three different machines performing same operations show the approach outperforms both raw-signal-based and MOMENT-embedding feature baselines, confirming effectiveness in enhancing cross-machine generalization.

Conclusion: The proposed framework successfully addresses cross-machine anomaly detection by creating domain-invariant features that enable effective generalization to unseen target machines, improving manufacturing resilience and quality.

Abstract: Achieving resilient and high-quality manufacturing requires reliable data-driven anomaly detection methods that are capable of addressing differences in behaviors among different individual machines which are nominally the same and are executing the same processes. To address the problem of detecting anomalies in a machine using sensory data gathered from different individual machines executing the same procedure, this paper proposes a cross-machine time-series anomaly detection framework that integrates a domain-invariant feature extractor with an unsupervised anomaly detection module. Leveraging the pre-trained foundation model MOMENT, the extractor employs Random Forest Classifiers to disentangle embeddings into machine-related and condition-related features, with the latter serving as representations which are invariant to differences between individual machines. These refined features enable the downstream anomaly detectors to generalize effectively to unseen target machines. Experiments on an industrial dataset collected from three different machines performing nominally the same operation demonstrate that the proposed approach outperforms both the raw-signal-based and MOMENT-embedding feature baselines, confirming its effectiveness in enhancing cross-machine generalization.
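The paper disentangles MOMENT embeddings with Random Forest classifiers; as a simpler illustrative stand-in for the same idea, one can score each embedding dimension by its between-machine versus within-machine variance and keep only the least machine-discriminative dimensions:

```python
import numpy as np

def machine_invariant_dims(emb, machine_ids, keep_frac=0.5):
    """Keep the embedding dimensions least predictive of machine identity.
    (The paper uses Random Forest classifiers; this variance-ratio score
    is a simpler stand-in for the same disentangling idea.)"""
    ids = np.unique(machine_ids)
    means = np.stack([emb[machine_ids == m].mean(axis=0) for m in ids])
    between = means.var(axis=0)                    # machine-related spread
    within = np.stack([emb[machine_ids == m].var(axis=0)
                       for m in ids]).mean(axis=0)
    score = between / (within + 1e-12)             # high = machine-related
    k = int(emb.shape[1] * keep_frac)
    return np.argsort(score)[:k]

rng = np.random.default_rng(1)
emb = rng.normal(size=(40, 2))
machines = np.repeat([0, 1], 20)
emb[machines == 1, 0] += 5.0       # dim 0 encodes machine identity
kept = machine_invariant_dims(emb, machines, keep_frac=0.5)
```

The retained condition-related dimensions then feed the downstream unsupervised anomaly detector, which is what transfers across machines.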

[539] LMI-Net: Linear Matrix Inequality–Constrained Neural Networks via Differentiable Projection Layers

Sunbochen Tang, Andrea Goertzen, Navid Azizan

Main category: cs.LG

TL;DR: LMI-Net is a differentiable projection layer that enforces linear matrix inequality constraints by construction, enabling neural networks to produce outputs that satisfy formal stability and robustness guarantees for dynamical systems.

DetailsMotivation: Existing learning-based methods for control design and certificate synthesis often fail to preserve the hard matrix inequality constraints required for formal guarantees in dynamical systems, limiting their reliability and safety.

Method: Proposes LMI-Net, a modular differentiable projection layer that lifts LMI constraints into an affine equality constraint and positive semidefinite cone intersection, performs forward pass via Douglas-Rachford splitting, and supports efficient backward propagation through implicit differentiation.

Result: LMI-Net substantially improves feasibility over soft-constrained models under distribution shift while retaining fast inference speed, demonstrated through experiments including invariant ellipsoid synthesis and joint controller-and-certificate design for disturbed linear systems.

Conclusion: LMI-Net bridges semidefinite-program-based certification and modern learning techniques by providing a reliable way to enforce formal guarantees in neural network-based control systems.

Abstract: Linear matrix inequalities (LMIs) have played a central role in certifying stability, robustness, and forward invariance of dynamical systems. Despite rapid development in learning-based methods for control design and certificate synthesis, existing approaches often fail to preserve the hard matrix inequality constraints required for formal guarantees. We propose LMI-Net, an efficient and modular differentiable projection layer that enforces LMI constraints by construction. Our approach lifts the set defined by LMI constraints into the intersection of an affine equality constraint and the positive semidefinite cone, performs the forward pass via Douglas-Rachford splitting, and supports efficient backward propagation through implicit differentiation. We establish theoretical guarantees that the projection layer converges to a feasible point, certifying that LMI-Net transforms a generic neural network into a reliable model satisfying LMI constraints. Evaluated on experiments including invariant ellipsoid synthesis and joint controller-and-certificate design for a family of disturbed linear systems, LMI-Net substantially improves feasibility over soft-constrained models under distribution shift while retaining fast inference speed, bridging semidefinite-program-based certification and modern learning techniques.
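The forward pass (Douglas-Rachford splitting between the PSD cone and an affine set) can be sketched with a toy trace-equality constraint standing in for the paper's general affine equality constraint:

```python
import numpy as np

def proj_psd(X):
    """Euclidean projection onto the PSD cone: clip negative eigenvalues."""
    X = (X + X.T) / 2
    w, V = np.linalg.eigh(X)
    return (V * np.maximum(w, 0.0)) @ V.T

def proj_trace(X, t):
    """Projection onto {X : trace(X) = t}, a toy stand-in for the general
    affine equality constraint."""
    n = X.shape[0]
    return X - (np.trace(X) - t) / n * np.eye(n)

def douglas_rachford(X0, t, iters=500):
    """DR splitting: the shadow iterate proj_psd(Z) converges to a point
    in the intersection of the two sets."""
    Z = X0.copy()
    for _ in range(iters):
        X = proj_psd(Z)
        Y = proj_trace(2 * X - Z, t)
        Z = Z + Y - X
    return proj_psd(Z)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
X_star = douglas_rachford((A + A.T) / 2, t=3.0)
```

In LMI-Net the same iteration runs as a differentiable layer, with the backward pass obtained by implicit differentiation rather than by unrolling.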

[540] Training Without Orthogonalization, Inference With SVD: A Gradient Analysis of Rotation Representations

Chris Choy

Main category: cs.LG

TL;DR: Theoretical analysis shows SVD orthogonalization during training causes gradient distortion, while Gram-Schmidt has asymmetric gradient signals, explaining why 9D representations with SVD projection only at inference work best for rotation estimation.

DetailsMotivation: To understand why SVD orthogonalization harms training for rotation estimation while being preferred over Gram-Schmidt at inference, and to provide theoretical foundations for the empirical success of 9D representations with SVD projection.

Method: Detailed gradient analysis of SVD orthogonalization for 3×3 matrices and SO(3) projection, deriving exact spectrum of SVD backward pass Jacobian, analyzing gradient distortion, and comparing with Gram-Schmidt Jacobian properties.

Result: SVD Jacobian has rank 3 with singular values 2/(s_i + s_j) and condition number κ = (s_1 + s_2)/(s_2 + s_3), causing severe gradient distortion when predicted matrix is far from SO(3). Gram-Schmidt Jacobian has asymmetric spectrum leading to unequal gradient signals. 9D parameterization avoids these issues.

Conclusion: Theoretical justification for training with direct 9D regression and applying SVD projection only at inference, explaining why this approach outperforms alternatives for rotation estimation in deep learning.

Abstract: Recent work has shown that removing orthogonalization during training and applying it only at inference improves rotation estimation in deep learning, with empirical evidence favoring 9D representations with SVD projection. However, the theoretical understanding of why SVD orthogonalization specifically harms training, and why it should be preferred over Gram-Schmidt at inference, remains incomplete. We provide a detailed gradient analysis of SVD orthogonalization specialized to $3 \times 3$ matrices and $SO(3)$ projection. Our central result derives the exact spectrum of the SVD backward pass Jacobian: it has rank $3$ (matching the dimension of $SO(3)$) with nonzero singular values $2/(s_i + s_j)$ and condition number $\kappa = (s_1 + s_2)/(s_2 + s_3)$, creating quantifiable gradient distortion that is most severe when the predicted matrix is far from $SO(3)$ (e.g., early in training when $s_3 \approx 0$). We further show that even stabilized SVD gradients introduce gradient direction error, whereas removing SVD from the training loop avoids this tradeoff entirely. We also prove that the 6D Gram-Schmidt Jacobian has an asymmetric spectrum: its parameters receive unequal gradient signal, explaining why 9D parameterization is preferable. Together, these results provide the theoretical foundation for training with direct 9D regression and applying SVD projection only at inference.
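The inference-time SVD projection onto SO(3), together with the condition number κ = (s_1 + s_2)/(s_2 + s_3) from the Jacobian analysis, is a few lines of numpy (the sign flip on the last singular direction ensures det(R) = +1):

```python
import numpy as np

def project_so3(M):
    """Inference-time step: project a raw 9D prediction (a 3x3 matrix)
    onto SO(3) via SVD. Also returns the condition number from the
    paper's backward-pass Jacobian analysis."""
    U, s, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))          # flip if det would be -1
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    kappa = (s[0] + s[1]) / (s[1] + s[2])       # gradient conditioning
    return R, kappa

rng = np.random.default_rng(0)
R, kappa = project_so3(rng.normal(size=(3, 3)))
```

During training the raw 9D output is regressed directly, so the ill-conditioned Jacobian (κ blows up as s_3 → 0) never enters the backward pass.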

[541] ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

Jingwei Zuo, Xinze Feng, Zien Liu, Kaijian Wang, Fanjiang Ye, Ye Cao, Zhuang Wang, Yuke Wang

Main category: cs.LG

TL;DR: ALTO is a system for efficient LoRA hyperparameter tuning and cluster sharing that accelerates concurrent LoRA jobs by early termination of weak configurations, fused grouped GEMM operations, and improved multi-task scheduling.

DetailsMotivation: LoRA hyperparameter tuning requires many concurrent jobs across heterogeneous tasks, but existing systems handle these independently, wasting computation on weak candidates and leaving GPUs underutilized.

Method: ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM with rank-local adapter parallelism to co-locate surviving adapters, and combines intra-task and inter-task scheduling for improved multi-task placement.

Result: ALTO achieves up to 13.8× speedup over state-of-the-art systems without sacrificing adapter quality.

Conclusion: ALTO demonstrates that co-designing training systems to exploit optimization opportunities across concurrent LoRA tuning jobs can significantly accelerate hyperparameter search while enabling efficient cluster sharing.

Abstract: Low-Rank Adaptation (LoRA) is now the dominant method for parameter-efficient fine-tuning of large language models, but achieving a high-quality adapter often requires systematic hyperparameter tuning because LoRA performance is highly sensitive to configuration choices. In practice, this leads to many concurrent LoRA jobs, often spanning heterogeneous tasks in multi-tenant environments. Existing systems largely handle these jobs independently, which both wastes computation on weak candidates and leaves GPUs underutilized. We present ALTO (Adaptive LoRA Tuning and Orchestration), a co-designed training system that accelerates LoRA hyperparameter tuning while enabling efficient cluster sharing across heterogeneous tasks. The central insight behind ALTO is that when multiple tuning jobs run concurrently over a shared frozen backbone, they expose optimization opportunities that single-job designs cannot exploit. Building on this, ALTO monitors loss trajectories to terminate unpromising configurations early, uses fused grouped GEMM together with a new rank-local adapter parallelism to co-locate surviving adapters and reclaim freed GPU capacity, and combines intra-task and inter-task scheduling to improve multi-task placement by leveraging the predictable duration of LoRA jobs. Extensive evaluation shows that ALTO achieves up to $13.8\times$ speedup over state-of-the-art without sacrificing adapter quality.
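Trajectory-based early termination can be sketched as ranking concurrent jobs by recent smoothed loss and stopping the bottom fraction; the window, the ranking rule, and the job names below are illustrative, not ALTO's actual policy:

```python
import statistics

def prune_jobs(loss_histories, window=5, keep_frac=0.5):
    """Rank concurrent LoRA tuning jobs by recent mean loss and terminate
    the bottom fraction -- a sketch of trajectory-based early stopping."""
    recent = {job: statistics.mean(hist[-window:])
              for job, hist in loss_histories.items()}
    ranked = sorted(recent, key=recent.get)
    keep = max(1, int(len(ranked) * keep_frac))
    return ranked[:keep], ranked[keep:]

# hypothetical tuning jobs, named by (rank, learning rate)
histories = {
    "r8_lr1e-4":  [2.0, 1.4, 1.0, 0.8, 0.7],
    "r16_lr5e-4": [2.0, 1.9, 1.9, 1.8, 1.8],
    "r4_lr1e-3":  [2.0, 1.2, 0.9, 0.7, 0.6],
    "r8_lr5e-5":  [2.0, 1.8, 1.7, 1.7, 1.6],
}
survivors, terminated = prune_jobs(histories)
```

GPU capacity reclaimed from terminated configurations is what the fused grouped GEMM and adapter co-location then exploit.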

[542] Top-K Retrieval with Fixed-Size Linear-Attention Completion: Backbone- and KV-Format-Preserving Attention for KV-Cache Read Reduction

Yasuto Hoshi, Daisuke Miyashita, Jun Deguchi

Main category: cs.LG

TL;DR: A retrieval-completion attention module that reduces KV cache traffic for long-context generation by combining exact attention over retrieved tokens with estimated attention over unretrieved tokens.

Motivation: Long-context generation is limited by decode-time KV cache traffic, especially when KV is offloaded beyond GPU memory. Query-aware retrieval reduces traffic but introduces bias when attention mass is spread over unretrieved tokens.

Method: Proposes a retrieval-completion attention module that keeps backbone weights and KV-cache format unchanged. For each query, computes exact attention over sink/tail anchors and query-dependent retrieved Top-K tokens, and estimates the remaining mid-region using a fixed-size feature-map summary computed at prefill time. Adds exact and estimated contributions in the unnormalized domain and applies single normalization.

Result: Improves over selection-only Top-K at matched token-equivalent read budgets across long-context benchmarks, with largest gains in high-entropy heads.

Conclusion: The proposed method effectively reduces KV cache traffic while recovering missing softmax mass without additional attention-side KV reads, making long-context generation more efficient.

Abstract: Long-context generation is increasingly limited by decode-time key-value (KV) cache traffic, particularly when KV is offloaded beyond GPU memory. Query-aware retrieval (e.g., Top-K selection) reduces this traffic by loading only a subset of KV pairs, but renormalizing the softmax over the subset introduces bias when attention mass is spread over unretrieved tokens. We propose a retrieval-completion attention module that keeps backbone weights and the KV-cache format unchanged. For each query, we compute exact attention over sink/tail anchors and the query-dependent retrieved Top-K tokens, and estimate the remaining mid-region numerator and denominator using a fixed-size feature-map summary computed at prefill time. We add the exact and estimated contributions in the unnormalized domain and apply a single normalization, recovering the missing softmax mass without additional attention-side KV reads. Across long-context benchmarks, the proposed method improves over selection-only Top-K at matched token-equivalent read budgets, with the largest gains in high-entropy heads.
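The exact-plus-estimated combination can be sketched for a single query. The feature map `phi` here is an illustrative assumption standing in for the paper's prefill-time summary, and sink/tail anchors are folded into the retrieved index set:

```python
import numpy as np

def phi(x):
    """Positive feature map standing in for the paper's prefill-time summary."""
    return np.maximum(x, 0.0) + 1e-6

def completed_attention(q, K, V, retrieved_idx):
    """Exact softmax attention over retrieved tokens plus a linear-attention
    estimate of the unretrieved mid-region, combined in the unnormalized
    domain with a single final normalization (toy sketch, no stabilization)."""
    n = K.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[retrieved_idx] = True
    # exact part: unnormalized softmax numerator and denominator
    w = np.exp(K[mask] @ q)
    num_exact = w @ V[mask]
    den_exact = w.sum()
    # estimated part: fixed-size summaries of the unretrieved region
    S = phi(K[~mask]).T @ V[~mask]   # (d, d_v) numerator summary
    z = phi(K[~mask]).sum(axis=0)    # (d,) denominator summary
    num_est = phi(q) @ S
    den_est = phi(q) @ z
    return (num_exact + num_est) / (den_exact + den_est)
```

When every token is retrieved, the estimated summaries vanish and the sketch reduces to exact softmax attention, which is the consistency property the single normalization is designed to preserve.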

[543] Reproducing AlphaZero on Tablut: Self-Play RL for an Asymmetric Board Game

Tõnis Lees, Tambet Matiisen

Main category: cs.LG

TL;DR: AlphaZero adaptation to asymmetric Tablut game using separate policy/value heads per player role with shared trunk, stabilized via data augmentation and past checkpoint play.

Motivation: AlphaZero's original symmetric architecture struggles with asymmetric games like Tablut where players have different objectives (king capture vs escape), causing conflicting evaluation functions that hinder learning efficiency.

Method: Modified AlphaZero architecture with separate policy and value heads for each player role (attacker/defender) sharing a common residual trunk. Used C4 data augmentation, increased replay buffer, and 25% training games against past checkpoints to stabilize training.

Result: Achieved steady improvement over 100 self-play iterations with BayesElo rating of 1235 vs baseline. Showed decreased policy entropy and average remaining pieces, indicating more focused and decisive play.

Conclusion: AlphaZero’s self-play framework can transfer to highly asymmetric games when using distinct policy/value heads per role and robust stabilization techniques.

Abstract: This work investigates the adaptation of the AlphaZero reinforcement learning algorithm to Tablut, an asymmetric historical board game featuring unequal piece counts and distinct player objectives (king capture versus king escape). While the original AlphaZero architecture successfully leverages a single policy and value head for symmetric games, applying it to asymmetric environments forces the network to learn two conflicting evaluation functions, which can hinder learning efficiency and performance. To address this, the core architecture is modified to use separate policy and value heads for each player role, while maintaining a shared residual trunk to learn common board features. During training, the asymmetric structure introduced training instabilities, notably catastrophic forgetting between the attacker and defender roles. These issues were mitigated by applying C4 data augmentation, increasing the replay buffer size, and having the model play 25 percent of training games against randomly sampled past checkpoints. Over 100 self-play iterations, the modified model demonstrated steady improvement, achieving a BayesElo rating of 1235 relative to a randomly initialized baseline. Training metrics also showed a significant decrease in policy entropy and average remaining pieces, reflecting increasingly focused and decisive play. Ultimately, the experiments confirm that AlphaZero’s self-play framework can transfer to highly asymmetric games, provided that distinct policy/value heads and robust stabilization techniques are employed.
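The shared-trunk, role-specific-heads design can be sketched as a toy forward pass (layer sizes and initialization are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

class DualHeadNet:
    """Shared trunk with separate policy/value heads per role (toy sketch)."""
    def __init__(self, d_in=16, d_hid=32, n_actions=10):
        self.W_trunk = rng.normal(size=(d_in, d_hid)) * 0.1
        # one (policy, value) head pair per role: 0 = attacker, 1 = defender
        self.heads = [
            (rng.normal(size=(d_hid, n_actions)) * 0.1,
             rng.normal(size=d_hid) * 0.1)
            for _ in range(2)
        ]

    def forward(self, x, role):
        h = np.tanh(x @ self.W_trunk)           # shared board features
        Wp, wv = self.heads[role]
        logits = h @ Wp
        policy = np.exp(logits - logits.max())  # per-role move distribution
        policy /= policy.sum()
        value = np.tanh(h @ wv)                 # per-role scalar in [-1, 1]
        return policy, value
```

The trunk learns board features common to both roles, while each role keeps its own evaluation, avoiding the conflict a single shared head would face.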

[544] Channel-wise Retrieval for Multivariate Time Series Forecasting

Junhyeok Kang, Jun Seo, Soyeon Park, Sangjun Han, Seohui Bae, Hyeokjun Choe, Soonyoung Lee

Main category: cs.LG

TL;DR: CRAFT introduces channel-wise retrieval-augmented forecasting for multivariate time series, addressing inter-variable heterogeneity by performing independent retrieval per channel using time-domain pruning and frequency-domain ranking.

Motivation: Existing retrieval-augmented forecasting methods use channel-agnostic strategies that apply the same historical references to all variables, neglecting that different channels have distinct periodicities and spectral profiles, which limits their ability to capture long-range dependencies effectively.

Method: CRAFT performs retrieval independently for each channel using a two-stage pipeline: 1) constructs a sparse relation graph in the time domain to prune irrelevant candidates, and 2) uses spectral similarity in the frequency domain to rank references, emphasizing dominant periodic components while suppressing noise.

Result: Experiments on seven public benchmarks show CRAFT outperforms state-of-the-art forecasting baselines, achieving superior accuracy with practical inference efficiency.

Conclusion: Channel-wise retrieval is more effective than channel-agnostic approaches for multivariate time series forecasting, as it better handles inter-variable heterogeneity and captures long-range dependencies through tailored retrieval per channel.

Abstract: Multivariate time series forecasting often struggles to capture long-range dependencies due to fixed lookback windows. Retrieval-augmented forecasting addresses this by retrieving historical segments from memory, but existing approaches rely on a channel-agnostic strategy that applies the same references to all variables. This neglects inter-variable heterogeneity, where different channels exhibit distinct periodicities and spectral profiles. We propose CRAFT (Channel-wise retrieval-augmented forecasting), a novel framework that performs retrieval independently for each channel. To ensure efficiency, CRAFT adopts a two-stage pipeline: a sparse relation graph constructed in the time domain prunes irrelevant candidates, and spectral similarity in the frequency domain ranks references, emphasizing dominant periodic components while suppressing noise. Experiments on seven public benchmarks demonstrate that CRAFT outperforms state-of-the-art forecasting baselines, achieving superior accuracy with practical inference efficiency.
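The two-stage retrieval can be sketched per channel, with a plain correlation prune standing in for the paper's sparse relation graph:

```python
import numpy as np

def retrieve_per_channel(query, memory, n_prune=8, k=3):
    """For one channel: prune candidate segments by time-domain correlation,
    then rank the survivors by spectral (FFT magnitude) similarity."""
    # stage 1: cheap time-domain pruning
    qz = (query - query.mean()) / (query.std() + 1e-8)
    corr = []
    for seg in memory:
        sz = (seg - seg.mean()) / (seg.std() + 1e-8)
        corr.append(np.mean(qz * sz))
    cand = np.argsort(corr)[-n_prune:]
    # stage 2: frequency-domain ranking, emphasizing dominant periodicities
    qspec = np.abs(np.fft.rfft(query))
    sims = []
    for i in cand:
        mspec = np.abs(np.fft.rfft(memory[i]))
        sims.append(np.dot(qspec, mspec)
                    / (np.linalg.norm(qspec) * np.linalg.norm(mspec) + 1e-8))
    order = np.argsort(sims)[::-1]
    return [int(cand[j]) for j in order[:k]]
```

Running this independently per channel is what distinguishes the channel-wise strategy from channel-agnostic retrieval that would reuse one reference set for every variable.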

[545] Same Graph, Different Likelihoods: Calibration of Autoregressive Graph Generators via Permutation-Equivalent Encodings

Laurits Fredsgaard, Aaron Thomas, Michael Riis Andersen, Mikkel N. Schmidt, Mahito Sugiyama

Main category: cs.LG

TL;DR: The paper introduces Linearization Uncertainty (LU) to measure consistency of autoregressive graph generators across different linearizations, showing that models often learn specific linearizations rather than underlying graph structure, with LU being a better quality metric for generated molecules than negative log-likelihood.

Motivation: Autoregressive graph generators use sequential construction processes, but their likelihoods are only meaningful if consistent across all linearizations of the same graph. Current linearization methods like SENT allow multiple equivalent linearizations, creating potential inconsistencies in model evaluation.

Method: Introduces Linearization Uncertainty (LU) measured via coefficient of variation across equivalent linearizations. Trains transformers under four linearization strategies on two datasets, evaluates using expected calibration error under random permutation, and tests on QM9 molecular graph benchmark.

Result: Biased orderings achieve lower NLL on native order but show ECE two orders of magnitude higher under random permutation, indicating models learn training linearization rather than underlying graph. On QM9, NLL negatively correlates with molecular stability (AUC=0.43), while LU achieves AUC=0.85.

Conclusion: Permutation-based evaluation via Linearization Uncertainty provides more reliable quality assessment for generated graphs/molecules than standard negative log-likelihood, as it measures consistency across linearizations rather than memorization of specific ordering.

Abstract: Autoregressive graph generators define likelihoods via a sequential construction process, but these likelihoods are only meaningful if they are consistent across all linearizations of the same graph. Segmented Eulerian Neighborhood Trails (SENT), a recent linearization method, converts graphs into sequences that can be perfectly decoded and efficiently processed by language models, but admit multiple equivalent linearizations of the same graph. We quantify violations in assigned negative log-likelihood (NLL) using the coefficient of variation across equivalent linearizations, which we call Linearization Uncertainty (LU). Training transformers under four linearization strategies on two datasets, we show that biased orderings achieve lower NLL on their native order but exhibit expected calibration error (ECE) two orders of magnitude higher under random permutation, indicating that these models have learned their training linearization rather than the underlying graph. On the molecular graph benchmark QM9, NLL for generated graphs is negatively correlated with molecular stability (AUC $=0.43$), while LU achieves AUC $=0.85$, suggesting that permutation-based evaluation provides a more reliable quality check for generated molecules. Code is available at https://github.com/lauritsf/linearization-uncertainty
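The LU metric itself is a one-liner: the coefficient of variation of the model's NLL across equivalent linearizations of one graph, with 0 meaning perfectly consistent likelihoods:

```python
import numpy as np

def linearization_uncertainty(nlls):
    """Coefficient of variation of NLL across equivalent linearizations
    of the same graph: 0 means the model assigns identical likelihoods
    to every ordering, i.e. it has learned the graph, not one ordering."""
    nlls = np.asarray(nlls, dtype=float)
    return nlls.std() / (nlls.mean() + 1e-12)
```

A model that memorized its training linearization will show low NLL on that ordering but a large spread, hence high LU, across the others.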

[546] From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

Manish Kumar, Anton Frederik Thielmann, Christoph Weisser, Benjamin Säfken

Main category: cs.LG

TL;DR: Study of spline-based numerical encodings (B-splines, M-splines, I-splines) for tabular deep learning, comparing different knot placement strategies and their impact on performance across various backbones and tasks.

Motivation: Numerical preprocessing is crucial for tabular deep learning, but the role of explicit numerical preprocessing in deep learning models is not well understood compared to classical statistical models.

Method: Investigated three spline families (B-splines, M-splines, I-splines) with four knot placement strategies (uniform, quantile-based, target-aware, learnable-knot). Used differentiable knot parameterization for learnable variants and evaluated on diverse regression/classification datasets with MLP, ResNet, and FT-Transformer backbones.

Result: Effect of numerical encodings depends strongly on task, output size, and backbone. For classification, piecewise-linear encoding (PLE) is most robust overall. For regression, no single encoding dominates uniformly - performance depends on spline family, knot placement, and output size. Learnable-knot variants can be optimized stably but increase training cost substantially.

Conclusion: Numerical encodings should be assessed not only for predictive performance but also computational overhead. The choice depends on specific task, model architecture, and computational constraints.

Abstract: Numerical preprocessing remains an important component of tabular deep learning, where the representation of continuous features can strongly affect downstream performance. Although its importance is well established for classical statistical and machine learning models, the role of explicit numerical preprocessing in tabular deep learning remains less well understood. In this work, we study this question with a focus on spline-based numerical encodings. We investigate three spline families for encoding numerical features, namely B-splines, M-splines, and integrated splines (I-splines), under uniform, quantile-based, target-aware, and learnable-knot placement. For the learnable-knot variants, we use a differentiable knot parameterization that enables stable end-to-end optimization of knot locations jointly with the backbone. We evaluate these encodings on a diverse collection of public regression and classification datasets using MLP, ResNet, and FT-Transformer backbones, and compare them against common numerical preprocessing baselines. Our results show that the effect of numerical encodings depends strongly on the task, output size, and backbone. For classification, piecewise-linear encoding (PLE) is the most robust choice overall, while spline-based encodings remain competitive. For regression, no single encoding dominates uniformly. Instead, performance depends on the spline family, knot-placement strategy, and output size, with larger gains typically observed for MLP and ResNet than for FT-Transformer. We further find that learnable-knot variants can be optimized stably under the proposed parameterization, but may substantially increase training cost, especially for M-spline and I-spline expansions. Overall, the results show that numerical encodings should be assessed not only in terms of predictive performance, but also in terms of computational overhead.
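Quantile knot placement and the piecewise-linear encoding (PLE) baseline can be sketched as follows; the paper's spline expansions replace these linear ramps with B-, M-, or I-spline bases:

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Place knots at empirical quantiles of a numeric feature."""
    qs = np.linspace(0.0, 1.0, n_knots)
    return np.quantile(x, qs)

def ple_encode(x, knots):
    """Piecewise-linear encoding: each value becomes a vector whose bins
    below it are 1, whose current bin is filled fractionally, rest 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros((len(x), len(knots) - 1))
    for j in range(len(knots) - 1):
        lo, hi = knots[j], knots[j + 1]
        out[:, j] = np.clip((x - lo) / (hi - lo + 1e-12), 0.0, 1.0)
    return out
```

The learnable-knot variants studied in the paper would make the knot locations trainable parameters instead of fixed quantiles.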

[547] Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space

Li Kunpeng, Wan Chenguang, Qu Zhisong, Lim Kyungtak, Virginie Grandgirard, Xavier Garbet, Yu Hua, Ong Yew Soon

Main category: cs.LG

TL;DR: FOT-CFM is a functional-space generative framework for turbulent flow modeling that learns resolution-invariant dynamics directly in infinite-dimensional Hilbert space using Optimal Transport theory.

Motivation: Traditional knowledge-based systems struggle with turbulent flow modeling due to complex spatiotemporal dynamics and multi-scale intermittency. While deep generative models show promise, their discrete pixel-based nature limits applicability to functional turbulence data.

Method: Proposes Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a generative framework defined directly in infinite-dimensional function space. Treats physical fields as elements of Hilbert space, learns resolution-invariant generative dynamics at probability measure level, integrates Optimal Transport theory to construct straight-line probability paths between noise and data measures.

Result: Superior fidelity in reproducing high-order turbulent statistics and energy spectra compared to state-of-the-art baselines across diverse chaotic dynamical systems including Navier-Stokes equations, Kolmogorov Flow, and Hasegawa-Wakatani equations.

Conclusion: FOT-CFM addresses fundamental limitations of discrete generative models for turbulence computing by operating directly in functional space, enabling more accurate modeling of turbulent flows with complex spatiotemporal dynamics.

Abstract: High-fidelity modeling of turbulent flows requires capturing complex spatiotemporal dynamics and multi-scale intermittency, posing a fundamental challenge for traditional knowledge-based systems. While deep generative models, such as diffusion models and Flow Matching, have shown promising performance, they are fundamentally constrained by their discrete, pixel-based nature. This limitation restricts their applicability in turbulence computing, where data inherently exists in a functional form. To address this gap, we propose Functional Optimal Transport Conditional Flow Matching (FOT-CFM), a generative framework defined directly in infinite-dimensional function space. Unlike conventional approaches defined on fixed grids, FOT-CFM treats physical fields as elements of an infinite-dimensional Hilbert space, and learns resolution-invariant generative dynamics directly at the level of probability measures. By integrating Optimal Transport (OT) theory, we construct deterministic, straight-line probability paths between noise and data measures in Hilbert space. This formulation enables simulation-free training and significantly accelerates the sampling process. We rigorously evaluate the proposed system on a diverse suite of chaotic dynamical systems, including the Navier-Stokes equations, Kolmogorov Flow, and Hasegawa-Wakatani equations, all of which exhibit rich multi-scale turbulent structures. Experimental results demonstrate that FOT-CFM achieves superior fidelity in reproducing high-order turbulent statistics and energy spectra compared to state-of-the-art baselines.
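The straight-line conditional path underlying OT-guided flow matching reduces to a two-line training-pair construction. This finite-dimensional sketch omits the paper's function-space treatment and the OT coupling between noise and data measures:

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Straight-line probability path between a noise sample x0 and a
    data sample x1: returns the point on the path at time t and the
    time-constant target velocity that flow matching regresses onto."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0   # straight paths give a constant target velocity
    return xt, v_target
```

Because the target velocity is constant along each path, training is simulation-free, and straighter paths let the sampler take larger steps, which is the source of the claimed sampling speed-up.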

[548] Controllable Image Generation with Composed Parallel Token Prediction

Jamie Stirling, Noura Al-Moubayed, Chris G. Willcocks, Hubert P. H. Shum

Main category: cs.LG

TL;DR: A theoretically-grounded method for composing multiple conditions in discrete generative models, enabling novel condition combinations and concept weighting, with applications to text-to-image generation.

Motivation: Conditional discrete generative models struggle to faithfully compose multiple input conditions, limiting their ability to handle novel combinations of conditions not seen during training.

Method: Derives a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Enables precise specification of novel condition combinations and concept weighting for emphasis/negation.

Result: Achieves 63.4% relative error reduction compared to previous SOTA across 3 datasets (positional CLEVR, relational CLEVR, FFHQ), with average absolute FID improvement of -9.58. Offers 2.3× to 12× real-time speed-up and works with pre-trained discrete text-to-image models.

Conclusion: The method enables faithful composition of multiple conditions in discrete generative models, supporting novel condition combinations and fine-grained control, with significant performance improvements and speed advantages.

Abstract: Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a $63.4\%$ relative reduction in error rate compared to the previous state-of-the-art, averaged across 3 datasets (positional CLEVR, relational CLEVR and FFHQ), simultaneously obtaining an average absolute FID improvement of $-9.58$. Meanwhile, our method offers a $2.3\times$ to $12\times$ real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.
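Composing per-condition token distributions in the unnormalized log domain, with weights for emphasis or negation, can be sketched along classifier-free-guidance lines; this is an assumed formulation for illustration, not necessarily the paper's exact composition rule:

```python
import numpy as np

def compose_logits(uncond_logits, cond_logits_list, weights):
    """Compose per-condition token distributions in the unnormalized log
    domain; positive weights emphasize a condition, negative ones negate it.
    Normalization happens once, at the end."""
    u = uncond_logits - np.log(np.sum(np.exp(uncond_logits)))   # log p_uncond
    out = u.copy()
    for logits, w in zip(cond_logits_list, weights):
        c = logits - np.log(np.sum(np.exp(logits)))             # log p_cond
        out = out + w * (c - u)
    p = np.exp(out - out.max())
    return p / p.sum()
```

With a single condition and weight 1 this collapses to the plain conditional distribution; adding more (condition, weight) pairs composes conditions never seen jointly at training time.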

[549] Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

He Zhao, Zhiwei Zeng, Yongwei Wang, Chunyan Miao

Main category: cs.LG

TL;DR: ToGRL is a novel framework for heterogeneous graph representation learning that incorporates graph structure learning to improve downstream task performance by learning task-relevant latent topology information and using prompt tuning for better knowledge utilization.

Motivation: Real-world heterogeneous graphs are noisy and suboptimal for downstream tasks, negatively affecting GRL models. Existing GSL methods are designed for homogeneous graphs, leaving heterogeneous GSL unexplored. Heterogeneous graphs pose additional challenges: graph structure quality has greater impact on GNN-based models, and memory consumption issues arise when applying homogeneous methods to heterogeneous graphs.

Method: ToGRL uses a two-stage approach: 1) A novel GSL module extracts task-relevant topology information from raw graph structure, projects it into topology embeddings, and constructs a new graph with smooth graph signals, separating adjacency matrix optimization from node representation learning to reduce memory consumption. 2) A representation learning module takes the new graph as input to learn embeddings for downstream tasks. The framework also incorporates prompt tuning to better utilize learned knowledge and enhance adaptability.

Result: Extensive experiments on five real-world datasets show that ToGRL outperforms state-of-the-art methods by a large margin.

Conclusion: ToGRL effectively addresses the challenges of heterogeneous graph structure learning by learning high-quality graph structures and representations through task-relevant latent topology information extraction and prompt tuning, demonstrating superior performance over existing methods.

Abstract: Real-world heterogeneous graphs are inherently noisy and usually not in the optimal graph structures for downstream tasks, which often adversely affects the performance of GRL models in downstream tasks. Although Graph Structure Learning (GSL) methods have been proposed to learn graph structures and downstream tasks simultaneously, existing methods are predominantly designed for homogeneous graphs, while GSL for heterogeneous graphs remains largely unexplored. Two challenges arise in this context. Firstly, the quality of the input graph structure has a more profound impact on GNN-based heterogeneous GRL models compared to their homogeneous counterparts. Secondly, most existing homogeneous GRL models encounter memory consumption issues when applied directly to heterogeneous graphs. In this paper, we propose a novel Graph Topology learning Enhanced Heterogeneous Graph Representation Learning framework (ToGRL). ToGRL learns high-quality graph structures and representations for downstream tasks by incorporating task-relevant latent topology information. Specifically, a novel GSL module is first proposed to extract downstream task-related topology information from a raw graph structure and project it into topology embeddings. These embeddings are utilized to construct a new graph with smooth graph signals. This two-stage approach to GSL separates the optimization of the adjacency matrix from node representation learning to reduce memory consumption. Following this, a representation learning module takes the new graph as input to learn embeddings for downstream tasks. ToGRL also leverages prompt tuning to better utilize the knowledge embedded in learned representations, thus enhancing adaptability to downstream tasks. Extensive experiments on five real-world datasets show that our ToGRL outperforms state-of-the-art methods by a large margin.
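Building a new graph from learned topology embeddings can be sketched with a top-k cosine-similarity adjacency; this is an illustrative stand-in for ToGRL's learned GSL module:

```python
import numpy as np

def build_graph_from_embeddings(Z, k=2):
    """Construct a sparse adjacency from topology embeddings by keeping
    each node's top-k most similar neighbours under cosine similarity,
    then symmetrizing. Sparsification keeps memory cost down."""
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    S = Zn @ Zn.T
    np.fill_diagonal(S, -np.inf)          # no self-loops in the candidate set
    A = np.zeros_like(S)
    for i in range(S.shape[0]):
        nn = np.argsort(S[i])[-k:]
        A[i, nn] = 1.0
    return np.maximum(A, A.T)             # symmetrize
```

Separating this adjacency construction from node representation learning mirrors the paper's two-stage design for reducing memory consumption.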

[550] Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach

Tiago Brogueira, Mário A. T. Figueiredo

Main category: cs.LG

TL;DR: RDMDL: A new bivariate causal discovery method using rate-distortion theory to better estimate description length of cause variables in MDL-based causal inference.

Motivation: Current MDL-based causal discovery methods inadequately estimate the description length of cause variables, relying too heavily on the causal mechanism complexity. The paper aims to improve this by properly modeling the cause variable's description length using rate-distortion theory.

Method: Proposes rate-distortion MDL (RDMDL) that uses rate-distortion theory to measure description length of cause variables. Distortion level is determined via histogram-based density estimation rules, and rate is computed using information dimension with asymptotic approximation. Combined with traditional causal mechanism description length.

Result: RDMDL achieves competitive performance on the Tübingen causal discovery benchmark dataset, demonstrating effectiveness of the rate-distortion approach for cause variable description length estimation.

Conclusion: The rate-distortion approach provides a theoretically grounded way to estimate description length of cause variables in MDL-based causal discovery, leading to improved performance in bivariate causal inference.

Abstract: Approaches to bivariate causal discovery based on the minimum description length (MDL) principle approximate the (uncomputable) Kolmogorov complexity of the models in each causal direction, selecting the one with the lower total complexity. The premise is that nature’s mechanisms are simpler in their true causal order. Inherently, the description length (complexity) in each direction includes the description of the cause variable and that of the causal mechanism. In this work, we argue that current state-of-the-art MDL-based methods do not correctly address the problem of estimating the description length of the cause variable, effectively leaving the decision to the description length of the causal mechanism. Based on rate-distortion theory, we propose a new way to measure the description length of the cause, corresponding to the minimum rate required to achieve a distortion level representative of the underlying distribution. This distortion level is deduced using rules from histogram-based density estimation, while the rate is computed using the related concept of information dimension, based on an asymptotic approximation. Combining it with a traditional approach for the causal mechanism, we introduce a new bivariate causal discovery method, termed rate-distortion MDL (RDMDL). We show experimentally that RDMDL achieves competitive performance on the Tübingen dataset. All the code and experiments are publicly available at github.com/tiagobrogueira/Causal-Discovery-In-Exchangeable-Data.

[551] Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

Tillmann Rheude, Stefan Hegselmann, Roland Eils, Benjamin Wild

Main category: cs.LG

TL;DR: Gated Symile improves multimodal contrastive learning by adding attention-based gating to handle unreliable or misaligned modalities beyond image-text pairs, outperforming Symile and CLIP on trimodal tasks.

Motivation: Existing multimodal contrastive methods like Symile treat all modalities symmetrically without modeling reliability differences, which becomes problematic when dealing with more than two modalities that can be misaligned, weakly informative, or missing. This fragility can be hidden in multiplicative interactions, potentially degrading performance silently.

Method: Proposes Gated Symile with a contrastive gating mechanism that adapts modality contributions using attention-based, per-candidate gating. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and includes an explicit NULL option when reliable cross-modal alignment is unlikely.

Result: Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models across a controlled synthetic benchmark and three real-world trimodal datasets, demonstrating robustness to imperfect multimodal inputs.

Conclusion: Gating is an effective approach for robust multimodal contrastive learning when dealing with more than two modalities, especially when some modalities may be unreliable or misaligned. The work highlights the importance of modeling modality reliability in multimodal systems.

Abstract: Multimodal contrastive learning is increasingly enriched by going beyond image-text pairs. Among recent contrastive methods, Symile is a strong approach for this challenge because its multiplicative interaction objective captures higher-order cross-modal dependence. Yet, we find that Symile treats all modalities symmetrically and does not explicitly model reliability differences, a limitation that is especially pronounced in trimodal multiplicative interactions. In practice, modalities beyond image-text pairs can be misaligned, weakly informative, or missing, and treating them uniformly can silently degrade performance. This fragility can be hidden in the multiplicative interaction: Symile may outperform pairwise CLIP even if a single unreliable modality silently corrupts the product terms. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions and incorporating an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets for which such failures could be masked by averages, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned Symile and CLIP models. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning under imperfect and more than two modalities.
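The gating mechanism's core operation, interpolating an unreliable embedding toward a neutral direction so it stops corrupting the multiplicative score, can be sketched as follows (the sigmoid gate and neutral vector here are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trimodal_score(a, b, c):
    """Symile-style multiplicative interaction: sum of elementwise products,
    so one corrupted embedding distorts every product term."""
    return float(np.sum(a * b * c))

def gated(e, neutral, reliability_logit):
    """Interpolate an embedding toward a neutral direction; a low
    reliability logit drives the gate toward the neutral fallback."""
    g = sigmoid(reliability_logit)
    return g * e + (1.0 - g) * neutral
```

With a zero neutral direction, gating an unreliable modality toward neutral drives its contribution to the trimodal product toward zero instead of letting it corrupt the score.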

[552] Expectation Maximization (EM) Converges for General Agnostic Mixtures

Avishek Ghosh

Main category: cs.LG

TL;DR: Gradient EM algorithm for fitting k parametric functions with strongly convex and smooth loss in agnostic setting, extending beyond mixture of linear regression to include mixed linear classifiers and generalized linear regression.

Motivation: Extend the analysis of gradient EM from mixture of linear regression to a broader class of problems including mixed linear classifiers and mixed generalized linear regression in the agnostic setting where no generative model is assumed.

Method: Propose and analyze gradient EM algorithm for fitting k parametric functions with strongly convex and smooth loss functions, requiring proper initialization and separation conditions for convergence.

Result: With proper initialization and separation conditions, gradient EM converges exponentially to population loss minimizers with high probability, showing effectiveness beyond mixture of linear regression.

Conclusion: Gradient EM is effective for a broad class of parametric function fitting problems in agnostic settings, extending theoretical guarantees beyond traditional mixture models.

Abstract: Mixture of linear regression is well studied in statistics and machine learning, where the data points are generated probabilistically using $k$ linear models. Algorithms like Expectation Maximization (EM) may be used to recover the ground truth regressors for this problem. Recently, in \cite{pal2022learning,ghosh_agnostic} the mixed linear regression problem was studied in the agnostic setting, where no generative model on the data is assumed. Rather, given a set of data points, the objective is to fit $k$ lines by minimizing a suitable loss function. It is shown that a modification of EM, namely gradient EM, converges exponentially to an appropriately defined loss minimizer even in the agnostic setting. In this paper, we study the problem of fitting $k$ parametric functions to a given set of data points. We adhere to the agnostic setup. However, instead of fitting lines equipped with a quadratic loss, we consider fitting arbitrary parametric functions equipped with a strongly convex and smooth loss. This framework encompasses a large class of problems, including mixed linear regression (regularized), mixed linear classifiers (mixed logistic regression, mixed Support Vector Machines), and mixed generalized linear regression. We propose and analyze gradient EM for this problem and show that, with proper initialization and a separation condition, the iterates of gradient EM converge exponentially to appropriately defined population loss minimizers with high probability. This shows the effectiveness of EM-type algorithms, which converge to the optimal solution in the non-generative setup beyond mixture of linear regression.
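A hard-assignment variant of gradient EM for the quadratic-loss case can be sketched as follows. The paper analyzes the general strongly convex setting; this toy uses squared loss and hard assignments for clarity, so it is a simplified stand-in rather than the analyzed algorithm:

```python
import numpy as np

def gradient_em_step(X, y, thetas, lr=0.1):
    """One step of a hard-assignment gradient-EM sketch for fitting k
    linear models with squared loss: assign each point to its current
    best-fit model, then take a gradient step per model on its own points."""
    preds = X @ thetas.T                      # (n, k) predictions
    losses = (preds - y[:, None]) ** 2
    assign = losses.argmin(axis=1)            # E-step: best-fit model per point
    new = thetas.copy()
    for j in range(thetas.shape[0]):
        idx = assign == j
        if idx.any():                         # M-step: gradient on own points
            grad = 2 * X[idx].T @ (X[idx] @ thetas[j] - y[idx]) / idx.sum()
            new[j] = thetas[j] - lr * grad
    return new
```

Note the agnostic flavour: nothing assumes the data were generated by $k$ linear models; the iteration simply drives the $k$ fitted functions toward minimizers of the partitioned loss.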

[553] EEG-MFTNet: An Enhanced EEGNet Architecture with Multi-Scale Temporal Convolutions and Transformer Fusion for Cross-Session Motor Imagery Decoding

Panagiotis Andrikopoulos, Siamak Mehrkanoon

Main category: cs.LG

TL;DR: EEG-MFTNet: A novel deep learning model combining EEGNet with multi-scale temporal convolutions and Transformer encoder to improve motor imagery decoding from EEG signals for brain-computer interfaces.

Motivation: Accurate motor imagery decoding from EEG signals remains challenging due to noise and cross-session variability, limiting the effectiveness of brain-computer interfaces for individuals with motor impairments.

Method: Proposes EEG-MFTNet based on EEGNet architecture, enhanced with multi-scale temporal convolutions to capture short-term dependencies and a Transformer encoder stream to capture long-range temporal dependencies in EEG signals.

Result: Achieves average classification accuracy of 58.9% on SHU dataset using subject-dependent cross-session setup, outperforming baseline models including EEGNet and its recent derivatives while maintaining low computational complexity and inference latency.

Conclusion: EEG-MFTNet demonstrates potential for real-time BCI applications and highlights the importance of architectural innovations in improving motor imagery decoding for more robust and adaptive BCI systems.

Abstract: Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices, providing critical support for individuals with motor impairments. However, accurate motor imagery (MI) decoding from electroencephalography (EEG) remains challenging due to noise and cross-session variability. This study introduces EEG-MFTNet, a novel deep learning model based on the EEGNet architecture, enhanced with multi-scale temporal convolutions and a Transformer encoder stream. These components are designed to capture both short and long-range temporal dependencies in EEG signals. The model is evaluated on the SHU dataset using a subject-dependent cross-session setup, outperforming baseline models, including EEGNet and its recent derivatives. EEG-MFTNet achieves an average classification accuracy of 58.9% while maintaining low computational complexity and inference latency. The results highlight the model’s potential for real-time BCI applications and underscore the importance of architectural innovations in improving MI decoding. This work contributes to the development of more robust and adaptive BCI systems, with implications for assistive technologies and neurorehabilitation.
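The multi-scale temporal convolution idea can be sketched in a few lines: run 1-D convolutions with several kernel lengths over the same signal and keep all the outputs. The moving-average kernels and kernel sizes below are stand-ins, not the model's learned filters.

```python
# Hypothetical sketch of multi-scale temporal convolution: the same EEG
# channel is convolved with kernels of several lengths, so short- and
# longer-range temporal patterns are captured in parallel branches.
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def multi_scale(signal, kernel_sizes=(3, 5, 7)):
    feats = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k          # moving-average kernel as a stand-in
        feats.append(conv1d(signal, kernel))
    return feats

sig = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
out = multi_scale(sig)
print([len(f) for f in out])  # [6, 4, 2]: one branch per kernel size
```

In the actual model these branches feed, together with a Transformer encoder stream, into the downstream classifier.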

[554] Modeling Patient Care Trajectories with Transformer Hawkes Processes

Saumya Pandey, Varun Chandola

Main category: cs.LG

TL;DR: Transformer Hawkes Process model for patient healthcare trajectories with imbalance-aware training for rare event prediction

Motivation: Patient healthcare trajectories consist of irregularly timed events that are challenging to model due to temporal irregularity and severe class imbalance, making it difficult to predict future care needs and understand utilization patterns.

Method: Combines Transformer-based history encoding with Hawkes process dynamics to model event dependencies in continuous time, with an inverse square-root class weighting strategy to address extreme class imbalance.

Result: Experiments on real-world data show improved performance in predicting event types and times, with clinically meaningful insights for identifying high-risk patient populations.

Conclusion: The Transformer Hawkes Process with imbalance-aware training effectively models irregular healthcare trajectories and improves prediction of rare but clinically important events.

Abstract: Patient healthcare utilization consists of irregularly time-stamped events, such as outpatient visits, inpatient admissions, and emergency encounters, forming individualized care trajectories. Modeling these trajectories is crucial for understanding utilization patterns and predicting future care needs, but is challenging due to temporal irregularity and severe class imbalance. In this work, we build on the Transformer Hawkes Process framework to model patient trajectories in continuous time. By combining Transformer-based history encoding with Hawkes process dynamics, the model captures event dependencies and jointly predicts event type and time-to-event. To address extreme imbalance, we introduce an imbalance-aware training strategy using inverse square-root class weighting. This improves sensitivity to rare but clinically important events without altering the data distribution. Experiments on real-world data demonstrate improved performance and provide clinically meaningful insights for identifying high-risk patient populations.
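The inverse square-root class weighting is simple to sketch: weight each event type by 1/sqrt(count), then normalize. The normalization to mean 1 and the toy event counts are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter

# Sketch of imbalance-aware weighting: class weight proportional to
# 1/sqrt(class count), normalized here so the weights average to 1.
def inv_sqrt_weights(labels):
    counts = Counter(labels)
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

w = inv_sqrt_weights(["visit"] * 900 + ["admission"] * 90 + ["emergency"] * 10)
print(w)  # rare "emergency" events get the largest weight
```

The square root tempers the correction relative to full inverse-frequency weighting, boosting sensitivity to rare events without distorting the data distribution itself.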

[555] Weight-Informed Self-Explaining Clustering for Mixed-Type Tabular Data

Lehao Li, Qiang Huang, Yihao Ang, Bryan Kian Hsiang Low, Anthony K. H. Tung, Xiaokui Xiao

Main category: cs.LG

TL;DR: WISE is an unsupervised framework for clustering mixed-type tabular data that unifies representation learning, feature weighting, clustering, and interpretability in a transparent pipeline.

Motivation: Clustering mixed-type tabular data is challenging due to misaligned numerical-categorical representations, uneven feature relevance, and disconnected post-hoc explanations from the clustering process.

Method: WISE uses Binary Encoding with Padding (BEP) to align heterogeneous features, Leave-One-Feature-Out (LOFO) for diverse feature-weighting views, two-stage weight-aware clustering, and Discriminative FreqItems (DFI) for interpretability.

Result: Extensive experiments on six real-world datasets show WISE outperforms classical and neural baselines in clustering quality while remaining efficient and producing faithful, human-interpretable explanations.

Conclusion: WISE provides a unified, transparent framework for clustering mixed-type data with intrinsic interpretability, addressing representation, weighting, clustering, and explanation challenges simultaneously.

Abstract: Clustering mixed-type tabular data is fundamental for exploratory analysis, yet remains challenging due to misaligned numerical-categorical representations, uneven and context-dependent feature relevance, and disconnected and post-hoc explanation from the clustering process. We propose WISE, a Weight-Informed Self-Explaining framework that unifies representation, feature weighting, clustering, and interpretation in a fully unsupervised and transparent pipeline. WISE introduces Binary Encoding with Padding (BEP) to align heterogeneous features in a unified sparse space, a Leave-One-Feature-Out (LOFO) strategy to sense multiple high-quality and diverse feature-weighting views, and a two-stage weight-aware clustering procedure to aggregate alternative semantic partitions. To ensure intrinsic interpretability, we further develop Discriminative FreqItems (DFI), which yields feature-level explanations that are consistent from instances to clusters with an additive decomposition guarantee. Extensive experiments on six real-world datasets demonstrate that WISE consistently outperforms classical and neural baselines in clustering quality while remaining efficient, and produces faithful, human-interpretable explanations grounded in the same primitives that drive clustering.
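A hypothetical sketch of a BEP-style encoding, covering only the categorical case (numeric columns would first be binned, which is omitted here; the padding scheme below is an assumption, not the paper's exact construction):

```python
# BEP-style idea: each feature becomes a one-hot block, and every block is
# zero-padded to a common width so heterogeneous features share one sparse space.
def encode_feature(values, width):
    cats = sorted(set(values))
    vecs = []
    for v in values:
        vec = [0] * width
        vec[cats.index(v)] = 1
        vecs.append(vec)
    return vecs

def bep(columns):
    width = max(len(set(col)) for col in columns)  # common padded width
    encoded = [encode_feature(col, width) for col in columns]
    # concatenate the per-feature blocks row by row
    return [sum((enc[i] for enc in encoded), []) for i in range(len(columns[0]))]

rows = bep([["red", "blue", "red"], ["low", "high", "mid"]])
print(rows[0])  # one-hot block for "red", then one-hot block for "low"
```

Aligning all features to equal-width blocks is what lets the downstream weighting and clustering stages treat every feature uniformly.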

[556] The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model

Hongxu Zhou

Main category: cs.LG

TL;DR: Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task despite theoretical expressivity, revealing a gap between representational capacity and learnability via gradient descent.

Motivation: To investigate whether state space models can learn reversible semantic state retrieval, not just monotonic state tracking or structural nesting, by testing their ability to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences.

Method: Introduces the UNDO Flip-Flop task as an extension of the standard Flip-Flop with UNDO operations, requiring models to maintain an implicit stack and recover historical states. Evaluates one-layer and two-layer Mamba-2 models on this task, using adversarial retraction pressure tests within the training length distribution.

Result: Both Mamba-2 variants fail to learn the stack-based rollback mechanism, converging on a local toggle heuristic instead. Under adversarial testing, the two-layer model collapses to 41.10% accuracy (below random chance). Causal ablation shows the bottleneck is in retrieval, not storage.

Conclusion: There’s a clear distinction between what architectures can theoretically represent and what gradient descent reliably learns. Theoretical expressivity analyses alone cannot capture this learnability gap, especially for reversible state retrieval tasks.

Abstract: State space models (SSMs) have been shown to possess the theoretical capacity to model both star-free sequential tasks and bounded hierarchical structures Sarrof et al. (2024). However, formal expressivity results do not guarantee that gradient-based optimisation will reliably discover the corresponding solutions. Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval. We introduce the UNDO Flip-Flop task to fill this gap. By extending the standard Flip-Flop with an UNDO operation, the task requires a model to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences. We evaluate one-layer and two-layer Mamba-2 models under this framework. Both variants fail to acquire the provably expressible stack-based rollback mechanism, converging instead on a local toggle heuristic that inverts the current state rather than retrieving stored history. Under an adversarial retraction pressure test held within the training length distribution, the two-layer model collapses to 41.10% accuracy, which is below random chance. The results confirm systematic rather than incidental failure. Causal ablation shows that the bottleneck lies in retrieval, not storage. These results draw a clear line between what an architecture can in principle represent and what gradient descent reliably learns, a distinction that theoretical expressivity analyses alone cannot capture.
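The intended task semantics can be written down directly (a sketch of the reference behavior, not the paper's data generator; the operation names are made up):

```python
# Reference semantics for an UNDO Flip-Flop: writes push onto a bounded
# stack, UNDO pops, and READ reports the current top of the stack.
def run(ops):
    stack = [0]  # initial state 0
    out = []
    for op in ops:
        if op in ("w0", "w1"):
            stack.append(int(op[1]))
        elif op == "undo" and len(stack) > 1:
            stack.pop()
        elif op == "read":
            out.append(stack[-1])
    return out

# The "toggle heuristic" the models converge to inverts the current bit on
# UNDO; it disagrees with true rollback whenever the popped write repeats:
print(run(["w1", "w1", "undo", "read"]))  # stack rollback answers [1]; toggling would answer [0]
```

Sequences like `w1, w1, undo` are exactly where retrieval of stored history and local toggling diverge, which is what the adversarial retraction pressure test exploits.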

[557] ReLU Networks for Exact Generation of Similar Graphs

Mamoona Ghafoor, Tatsuya Akutsu

Main category: cs.LG

TL;DR: Theoretical characterization of ReLU neural networks that can generate graphs within a prescribed graph edit distance from a source graph, with guaranteed validity and no training data dependency.

Motivation: There's a growing demand for constrained generative models in applications like cheminformatics, network anomaly synthesis, and structured data augmentation, but existing models are data-driven and may not satisfy edit distance constraints. The paper aims to address the lack of neural architectures that can provably generate graphs within bounded graph edit distance.

Method: The authors theoretically characterize ReLU neural networks capable of generating graphs within a specified graph edit distance. They prove the existence of constant depth and O(n²d) size ReLU networks that deterministically generate graphs within edit distance d from an input graph with n vertices.

Result: Experimental evaluations show the proposed network successfully generates valid graphs for instances with up to 1400 vertices and edit distance bounds up to 140, while baseline generative models fail to generate graphs with the desired edit distance constraints.

Conclusion: The work provides a theoretical foundation for constructing compact generative models with guaranteed validity for graph generation under edit distance constraints, eliminating reliance on training data while ensuring constraint satisfaction.

Abstract: Generation of graphs constrained by a specified graph edit distance from a source graph is important in applications such as cheminformatics, network anomaly synthesis, and structured data augmentation. Despite the growing demand for such constrained generative models in areas including molecule design and network perturbation analysis, the neural architectures required to provably generate graphs within a bounded graph edit distance remain largely unexplored. In addition, existing graph generative models are predominantly data-driven and depend heavily on the availability and quality of training data, which may result in generated graphs that do not satisfy the desired edit distance constraints. In this paper, we address these challenges by theoretically characterizing ReLU neural networks capable of generating graphs within a prescribed graph edit distance from a given graph. In particular, we show the existence of constant depth and O(n^2 d) size ReLU networks that deterministically generate graphs within edit distance d from a given input graph with n vertices, eliminating reliance on training data while guaranteeing validity of the generated graphs. Experimental evaluations demonstrate that the proposed network successfully generates valid graphs for instances with up to 1400 vertices and edit distance bounds up to 140, whereas baseline generative models fail to generate graphs with the desired edit distance. These results provide a theoretical foundation for constructing compact generative models with guaranteed validity.
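To make the target behavior concrete, here is a plain-Python sketch of the generation constraint, treating vertices as fixed (labeled): flip exactly d distinct off-diagonal adjacency entries. This illustrates what the paper's ReLU networks compute deterministically, not how they are constructed.

```python
import random

# Illustrative sketch (not the ReLU-network construction from the paper):
# produce a graph at edge-edit distance exactly d from a source graph by
# flipping d distinct off-diagonal entries of its adjacency matrix.
def perturb(adj, d, seed=0):
    n = len(adj)
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    new = [row[:] for row in adj]
    for i, j in rng.sample(pairs, d):   # d distinct edge slots
        new[i][j] ^= 1
        new[j][i] ^= 1                  # keep the matrix symmetric
    return new

def edit_distance(a, b):
    n = len(a)
    return sum(a[i][j] != b[i][j] for i in range(n) for j in range(i + 1, n))

g = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # a 3-vertex path
h = perturb(g, 2)
print(edit_distance(g, h))  # exactly 2 by construction
```

The paper's contribution is realizing such a constraint-satisfying map inside a constant-depth, O(n²d)-size ReLU network, so validity is guaranteed architecturally rather than learned from data.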

[558] A Mixture of Experts Foundation Model for Scanning Electron Microscopy Image Analysis

Sk Miraj Ahmed, Yuewei Lin, Chuntian Cao, Shinjae Yoo, Xinpei Wu, Won-Il Lee, Nikhil Tiwale, Dan N. Le, Thi Thu Huong Chu, Jiyoung Kim, Kevin G. Yager, Chang-Yong Nam

Main category: cs.LG

TL;DR: First foundation model for SEM images using self-supervised transformer pretrained on multi-instrument, multi-condition micrographs, enabling generalization across materials and imaging conditions, with defocus-to-focus translation as key demonstration.

Motivation: SEM imaging is constrained by task-specific models and labor-intensive acquisition processes that limit scalability. There's a need for adaptable foundation models that can generalize across diverse material systems and imaging conditions.

Method: Developed a self-supervised transformer architecture pretrained on a large corpus of multi-instrument, multi-condition scientific micrographs. The model learns transferable representations that can be fine-tuned for downstream tasks, with defocus-to-focus image translation as a key demonstration using unpaired supervision.

Result: The model successfully restores focused detail from defocused inputs without paired supervision and outperforms state-of-the-art techniques across multiple evaluation metrics. It demonstrates generalization across diverse material systems and imaging conditions.

Conclusion: This work establishes the first foundation model for SEM images, laying groundwork for adaptable SEM models that accelerate materials discovery by bridging foundational representation learning with real-world imaging needs.

Abstract: Scanning Electron Microscopy (SEM) is indispensable in modern materials science, enabling high-resolution imaging across a wide range of structural, chemical, and functional investigations. However, SEM imaging remains constrained by task-specific models and labor-intensive acquisition processes that limit its scalability across diverse applications. Here, we introduce the first foundation model for SEM images, pretrained on a large corpus of multi-instrument, multi-condition scientific micrographs, enabling generalization across diverse material systems and imaging conditions. Leveraging a self-supervised transformer architecture, our model learns rich and transferable representations that can be fine-tuned or adapted to a wide range of downstream tasks. As a compelling demonstration, we focus on defocus-to-focus image translation-an essential yet underexplored challenge in automated microscopy pipelines. Our method not only restores focused detail from defocused inputs without paired supervision but also outperforms state-of-the-art techniques across multiple evaluation metrics. This work lays the groundwork for a new class of adaptable SEM models, accelerating materials discovery by bridging foundational representation learning with real-world imaging needs.

[559] On Dominant Manifolds in Reservoir Computing Networks

Noa Kaplan, Alberto Padoan, Anastasia Bizyaeva

Main category: cs.LG

TL;DR: Training of Reservoir Computing networks for temporal forecasting leads to emergence of low-dimensional manifolds, with dominant modes linked to data’s intrinsic dimensionality and information content, connecting RC to Koopman operator theory and Dynamic Mode Decomposition.

Motivation: To understand how training shapes the geometry of recurrent network dynamics, particularly the emergence of low-dimensional dominant manifolds in Reservoir Computing networks for temporal forecasting tasks, and to establish connections between reservoir computing and Koopman operator theory.

Method: Study a simplified linear and continuous-time reservoir model, analyze dimensionality and structure of dominant modes in relation to training data’s intrinsic dimensionality and information content. For data from autonomous dynamical systems, relate dominant modes to approximations of Koopman eigenfunctions, connecting RC to Dynamic Mode Decomposition. Illustrate eigenvalue motion during training via simulation, and discuss generalization to nonlinear RC through tangent dynamics and differential p-dominance.

Result: Established explicit connection between reservoir computing and Dynamic Mode Decomposition algorithm, showing how dominant modes of trained reservoir approximate Koopman eigenfunctions of original system. Demonstrated eigenvalue motion generating dominant manifolds during training, with theoretical framework for understanding geometry of recurrent network dynamics.

Conclusion: Training shapes reservoir network dynamics by creating low-dimensional dominant manifolds that capture essential temporal patterns, with theoretical links to Koopman operator theory providing deeper understanding of how recurrent networks learn temporal dependencies and generalize to nonlinear systems.

Abstract: Understanding how training shapes the geometry of recurrent network dynamics is a central problem in time-series modeling. We study the emergence of low-dimensional dominant manifolds in the training of Reservoir Computing (RC) networks for temporal forecasting tasks. For a simplified linear and continuous-time reservoir model, we link the dimensionality and structure of the dominant modes directly to the intrinsic dimensionality and information content of the training data. In particular, for training data generated by an autonomous dynamical system, we relate the dominant modes of the trained reservoir to approximations of the Koopman eigenfunctions of the original system, illuminating an explicit connection between reservoir computing and the Dynamic Mode Decomposition algorithm. We illustrate the eigenvalue motion that generates the dominant manifolds during training in simulation, and discuss generalization to nonlinear RC via tangent dynamics and differential p-dominance.
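The Dynamic Mode Decomposition connection can be illustrated in the simplest possible setting, a scalar linear system: fit the operator mapping each snapshot to the next by least squares and read off its eigenvalue. This is a minimal sketch of the DMD idea, far simpler than the reservoir analysis in the paper.

```python
# Minimal scalar DMD: estimate the linear operator A in x_{t+1} = A x_t
# from snapshot pairs by least squares; A's eigenvalue is the dominant mode.
def dmd_eigenvalue(series):
    num = sum(series[t + 1] * series[t] for t in range(len(series) - 1))
    den = sum(series[t] ** 2 for t in range(len(series) - 1))
    return num / den

# geometric decay x_{t+1} = 0.9 x_t: the recovered eigenvalue is 0.9
xs = [0.9 ** t for t in range(10)]
print(dmd_eigenvalue(xs))
```

In the matrix case the eigenvectors of the fitted operator approximate Koopman eigenfunctions of the underlying system, which is the link the paper draws between trained reservoir dominant modes and DMD.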

[560] Data Distribution Valuation Using Generalized Bayesian Inference

Cuong N. Nguyen, Cuong V. Nguyen

Main category: cs.LG

TL;DR: A novel Bayesian framework for quantifying the value of data distributions from samples, applicable to problems like annotator evaluation and data augmentation.

Motivation: The paper addresses the data distribution valuation problem - quantifying the value of data distributions from their samples, which is distinct from classical data valuation and has applications in various practical scenarios.

Method: Developed Generalized Bayes Valuation framework using generalized Bayesian inference with loss constructed from transferability measures, extended to continuous data stream settings using Bayesian principles.

Result: Experimental results confirm the effectiveness and efficiency of the framework in different real-world scenarios, demonstrating its practical utility.

Conclusion: The proposed framework provides a unified solution to seemingly unrelated practical problems like annotator evaluation and data augmentation through data distribution valuation.

Abstract: We investigate the data distribution valuation problem, which aims to quantify the values of data distributions from their samples. This is a recently proposed problem that is related to but distinct from classical data valuation and can be applied to various applications. For this problem, we develop a novel framework called Generalized Bayes Valuation that utilizes generalized Bayesian inference with a loss constructed from transferability measures. This framework allows us to solve, in a unified way, seemingly unrelated practical problems, such as annotator evaluation and data augmentation. Using Bayesian principles, we further enhance the applicability of our framework by extending it to the continuous data stream setting. Our experimental results confirm the effectiveness and efficiency of our framework in different real-world scenarios.
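Generalized Bayesian inference replaces the likelihood with an exponentiated loss, giving a Gibbs posterior. A minimal sketch under assumed inputs (two candidate distributions, illustrative prior and loss values, and a learning rate eta not taken from the paper):

```python
import math

# Gibbs ("generalized Bayes") posterior over candidate data distributions:
# weight each candidate by exp(-eta * loss) times its prior probability.
def gibbs_posterior(priors, losses, eta=1.0):
    unnorm = [p * math.exp(-eta * l) for p, l in zip(priors, losses)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = gibbs_posterior(priors=[0.5, 0.5], losses=[0.2, 1.0])
print(post)  # the lower-loss candidate receives the larger posterior mass
```

In the paper's framework the loss is constructed from transferability measures, so candidates whose samples transfer well to the task at hand are assigned higher value.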

[561] Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

Dipan Maity, Suman Mondal, Arindam Roy

Main category: cs.LG

TL;DR: Gated-SwinRMT combines Swin Transformer’s shifted-window attention with Retentive Networks’ spatial decay and input-dependent gating, showing improved accuracy on image classification tasks.

Motivation: To create a hybrid vision transformer that combines the strengths of Swin Transformer's shifted-window attention with Retentive Networks' spatial decay mechanisms, enhanced by input-dependent gating to improve efficiency and performance on vision tasks.

Method: Two variants: Gated-SwinRMT-SWAT uses sigmoid activation with balanced ALiBi slopes and SwiGLU gating; Gated-SwinRMT-Retention uses softmax-normalized retention with additive decay bias and explicit G1 sigmoid gate. Both decompose self-attention into width-wise and height-wise retention passes within shifted windows.

Result: On Mini-ImageNet (224×224), Gated-SwinRMT-SWAT achieves 80.22% and Gated-SwinRMT-Retention 78.20% top-1 accuracy vs 73.74% for RMT baseline. On CIFAR-10 (32×32), accuracy advantage compresses from +6.48pp to +0.56pp due to small feature maps causing global attention collapse.

Conclusion: Gated-SwinRMT variants show significant improvements over RMT baseline on larger images, but performance gains diminish on smaller images where adaptive windowing collapses to global attention, suggesting the method is particularly effective for medium-to-large resolution vision tasks.

Abstract: We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. \textbf{Gated-SwinRMT-SWAT} substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. \textbf{Gated-SwinRMT-Retention} retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate – projected from the block input and applied after local context enhancement (LCE) but prior to the output projection $W_O$ – to alleviate the low-rank $W_V \cdot W_O$ bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet ($224{\times}224$, 100 classes) and CIFAR-10 ($32{\times}32$, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At ${\approx}77$–$79$M parameters, Gated-SwinRMT-SWAT achieves 80.22% and Gated-SwinRMT-Retention 78.20% top-1 test accuracy on Mini-ImageNet, compared with 73.74% for the RMT baseline. On CIFAR-10 – where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope – the accuracy advantage compresses from +6.48 pp to +0.56 pp.
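The Manhattan-distance decay prior can be sketched concretely: each pair of grid positions gets the factor exp(-gamma · d_Manhattan). The grid size and gamma below are illustrative, and real RMT-style models use a different per-head decay parameterization.

```python
import math

# Sketch of an RMT-style spatial decay mask: attention between positions
# (a,b) and (c,d) on an h-by-w grid is damped by exp(-gamma * |a-c| + |b-d|
# Manhattan distance), giving a 2-D locality prior with no learned biases.
def decay_mask(h, w, gamma):
    coords = [(i, j) for i in range(h) for j in range(w)]
    return [[math.exp(-gamma * (abs(a - c) + abs(b - d)))
             for (c, d) in coords] for (a, b) in coords]

m = decay_mask(2, 2, gamma=1.0)
print(m[0][3])  # (0,0) to (1,1) has Manhattan distance 2 -> exp(-2)
```

In Gated-SwinRMT this mask is applied multiplicatively (SWAT variant) or as an additive log-space bias (Retention variant) inside each shifted window.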

[562] PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space

Asaf Buchnick, Aviv Shamsian, Aviv Navon, Ethan Fetaya

Main category: cs.LG

TL;DR: PromptEvolver uses genetic algorithms and vision-language models to generate natural-language prompts that faithfully reconstruct target images from black-box text-to-image models.

Motivation: Current text-to-image generation requires extensive trial-and-error to find exact prompts, and existing prompt inversion methods produce suboptimal reconstructions with unnatural, hard-to-interpret prompts that hinder transparency and controllability.

Method: Uses a genetic algorithm to optimize prompts, leveraging a strong vision-language model to guide the evolution process. Works on black-box generation models requiring only image outputs.

Result: Outperforms competing methods across multiple prompt inversion benchmarks, achieving high-fidelity reconstructions with natural-language prompts.

Conclusion: PromptEvolver successfully generates interpretable natural-language prompts while achieving superior reconstruction fidelity compared to existing methods.

Abstract: Text-to-image generation has progressed rapidly, but faithfully generating complex scenes requires extensive trial-and-error to find the exact prompt. In the prompt inversion task, the goal is to recover a textual prompt that can faithfully reconstruct a given target image. Currently, existing methods frequently yield suboptimal reconstructions and produce unnatural, hard-to-interpret prompts that hinder transparency and controllability. In this work, we present PromptEvolver, a prompt inversion approach that generates natural-language prompts while achieving high-fidelity reconstructions of the target image. Our method uses a genetic algorithm to optimize the prompt, leveraging a strong vision-language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods.
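The evolutionary loop can be sketched with a toy fitness function (word overlap with a hidden target stands in for the VLM-guided image-similarity scoring the paper actually uses; population size, mutation rate, and vocabulary are made up):

```python
import random

# Toy genetic-algorithm loop over word-list "prompts": selection keeps the
# top half, crossover splices two parents, mutation swaps in a random word.
def evolve(vocab, fitness, pop_size=20, gens=40, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(vocab) for _ in range(4)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))
            child = a[:cut] + b[cut:]           # one-point crossover
            if rng.random() < 0.3:              # mutation
                child[rng.randrange(len(child))] = rng.choice(vocab)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

vocab = ["red", "fox", "snow", "forest", "cat", "city", "blue", "lake"]
target = set(vocab[:4])  # hidden "target image" content, for illustration
best = evolve(vocab, lambda p: len(set(p) & target))
print(best)
```

Because the loop only needs a scalar score per candidate, the same skeleton works against a black-box generator: render each prompt, score the output against the target image, and evolve.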

[563] A machine learning framework for uncovering stochastic nonlinear dynamics from noisy data

Matteo Bosso, Giovanni Franzese, Kushal Swamy, Maarten Theulings, Alejandro M. Aragón, Farbod Alijani

Main category: cs.LG

TL;DR: Hybrid symbolic regression-probabilistic ML framework that recovers symbolic governing equations while inferring uncertainty in system parameters, combining deep symbolic regression with Gaussian processes for both deterministic dynamics and noise structure.

Motivation: Real-world systems have inherent noise that affects parameter inference and dynamics understanding. Traditional symbolic regression ignores uncertainty, while Gaussian processes lack insight into underlying dynamics. Need a framework that bridges this gap.

Method: Combines deep symbolic regression with Gaussian process-based maximum likelihood estimation to separately model deterministic dynamics and noise structure without prior assumptions about functional forms.

Result: Validated on numerical benchmarks (harmonic, Duffing, van der Pol oscillators) and experimental coupled biological oscillators. Successfully identifies both symbolic and stochastic components, data-efficient (100-1000 data points), robust to noise.

Conclusion: Framework enables understanding of both structure and variability in dynamical systems where uncertainty is intrinsic, with broad potential across domains.

Abstract: Modeling real-world systems requires accounting for noise - whether it arises from unpredictable fluctuations in financial markets, irregular rhythms in biological systems, or environmental variability in ecosystems. While the behavior of such systems can often be described by stochastic differential equations, a central challenge is understanding how noise influences the inference of system parameters and dynamics from data. Traditional symbolic regression methods can uncover governing equations but typically ignore uncertainty. Conversely, Gaussian processes provide principled uncertainty quantification but offer little insight into the underlying dynamics. In this work, we bridge this gap with a hybrid symbolic regression-probabilistic machine learning framework that recovers the symbolic form of the governing equations while simultaneously inferring uncertainty in the system parameters. The framework combines deep symbolic regression with Gaussian process-based maximum likelihood estimation to separately model the deterministic dynamics and the noise structure, without requiring prior assumptions about their functional forms. We verify the approach on numerical benchmarks, including harmonic, Duffing, and van der Pol oscillators, and validate it on an experimental system of coupled biological oscillators exhibiting synchronization, where the algorithm successfully identifies both the symbolic and stochastic components. The framework is data-efficient, requiring as few as 100-1000 data points, and robust to noise - demonstrating its broad potential in domains where uncertainty is intrinsic and both the structure and variability of dynamical systems must be understood.
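A much-simplified illustration of the deterministic/stochastic separation the framework performs: for a known drift form dx = -a·x dt + σ dW, estimate a by least squares on the increments and σ from the residual variance. The paper recovers the symbolic drift itself and quantifies parameter uncertainty; this sketch assumes the functional form and uses made-up parameters.

```python
import math, random

# Separate drift and noise in a simulated Ornstein-Uhlenbeck trajectory:
# dx = -a*x dt + sigma dW, sampled with the Euler-Maruyama scheme.
def fit_sde(xs, dt):
    dxs = [xs[t + 1] - xs[t] for t in range(len(xs) - 1)]
    # least squares for the drift coefficient a
    a = -sum(dx * x for dx, x in zip(dxs, xs)) / (dt * sum(x * x for x in xs[:-1]))
    # noise amplitude from the variance of the drift-corrected residuals
    resid = [dx + a * xs[t] * dt for t, dx in enumerate(dxs)]
    sigma = math.sqrt(sum(r * r for r in resid) / (len(resid) * dt))
    return a, sigma

rng = random.Random(1)
dt, a_true, sigma_true = 0.01, 2.0, 0.5
xs, x = [1.0], 1.0
for _ in range(5000):
    x += -a_true * x * dt + sigma_true * math.sqrt(dt) * rng.gauss(0, 1)
    xs.append(x)
a_hat, sigma_hat = fit_sde(xs, dt)
print(a_hat, sigma_hat)  # estimates land near a=2.0 and sigma=0.5
```

The hybrid framework replaces the assumed drift with deep symbolic regression and the residual fit with a Gaussian-process maximum-likelihood model of the noise structure.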

[564] Learning $\mathsf{AC}^0$ Under Graphical Models

Gautam Chandrasekaran, Jason Gaitonde, Ankur Moitra, Arsen Vasilyan

Main category: cs.LG

TL;DR: Quasipolynomial-time algorithm for learning AC⁰ circuits beyond product distributions using graphical models with spatial mixing, overcoming Fourier analysis limitations.

Motivation: The paper addresses the longstanding challenge in computational learning theory of extending learning guarantees from unrealistic product distributions to more natural correlated distributions, particularly for constant-depth circuits.

Method: Develops new sampling algorithms that transfer low-degree polynomial approximation results from uniform distributions to graphical models with polynomial growth and strong spatial mixing properties.

Result: Achieves quasipolynomial-time algorithms for learning AC⁰ circuits under correlated distributions from graphical models, extending to other function classes like monotone functions and halfspaces.

Conclusion: The work provides a significant advance in learning theory by overcoming the product distribution limitation and opening new directions for learning under correlated distributions.

Abstract: In a landmark result, Linial, Mansour and Nisan (J. ACM 1993) gave a quasipolynomial-time algorithm for learning constant-depth circuits given labeled i.i.d. samples under the uniform distribution. Their work has had a deep and lasting legacy in computational learning theory, in particular introducing the $\textit{low-degree algorithm}$. However, an important critique of many results and techniques in the area is the reliance on product structure, which is unlikely to hold in realistic settings. Obtaining similar learning guarantees for more natural correlated distributions has been a longstanding challenge in the field. In particular, we give quasipolynomial-time algorithms for learning $\mathsf{AC}^0$ substantially beyond the product setting, when the inputs come from any graphical model with polynomial growth that exhibits strong spatial mixing. The main technical challenge is in giving a workaround to Fourier analysis, which we do by showing how new sampling algorithms allow us to transfer statements about low-degree polynomial approximation under the uniform setting to graphical models. Our approach is general enough to extend to other well-studied function classes, like monotone functions and halfspaces.
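For context, the uniform-distribution low-degree algorithm the paper generalizes can be sketched directly: estimate the Fourier coefficients $\hat f(S) = \mathbb{E}[f(x)\chi_S(x)]$ for all $|S| \le d$ and predict with the sign of the truncated expansion. The toy target (a degree-2 parity learned from exhaustive samples) is chosen only to keep the example exact.

```python
import itertools

# Low-degree algorithm (Linial-Mansour-Nisan) under the uniform distribution
# on {-1,1}^n: estimate Fourier coefficients for all subsets of size <= d.
def prod_chi(x, S):
    p = 1
    for i in S:
        p *= x[i]          # parity character chi_S(x) = prod_{i in S} x_i
    return p

def low_degree_learn(samples, n, d):
    coeffs = {}
    for size in range(d + 1):
        for S in itertools.combinations(range(n), size):
            coeffs[S] = sum(y * prod_chi(x, S) for x, y in samples) / len(samples)
    return coeffs

def predict(coeffs, x):
    v = sum(c * prod_chi(x, S) for S, c in coeffs.items())
    return 1 if v >= 0 else -1

# learn the degree-2 parity x0*x1 from all samples of the 3-cube
samples = [(x, x[0] * x[1]) for x in itertools.product([-1, 1], repeat=3)]
coeffs = low_degree_learn(samples, n=3, d=2)
print(coeffs[(0, 1)])  # -> 1.0, the lone nonzero Fourier coefficient
```

The paper's technical contribution is exactly a workaround for the step this sketch takes for granted: under correlated graphical-model inputs the characters are no longer orthonormal, so new sampling algorithms are needed to transfer the low-degree approximation guarantees.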

[565] Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal, Graham Neubig, Sean Welleck

Main category: cs.LG

TL;DR: Gym-Anything framework converts any software into interactive environments for computer-use agents, creating CUA-World with 10K+ long-horizon tasks across 200+ applications using multi-agent setup and verification.

DetailsMotivation: Current computer-use agent research focuses on short-horizon tasks with limited economic value due to high environment creation costs. Need scalable framework for complex software environments to enable research on economically valuable digital activities.

Method: Multi-agent pipeline: coding agent writes setup scripts, downloads real-world data, configures software and produces evidence; audit agent verifies setup against quality checklist. Applied to 200+ software applications across economically valuable occupations.

Result: Created CUA-World with 10K+ long-horizon tasks spanning medical science, astronomy, engineering, enterprise systems. CUA-World-Long benchmark with 500+ step tasks. Distilled trajectories into 2B VLM outperforming larger models. Audit feedback improved Gemini-3-Flash performance from 11.5% to 14.0%.

Conclusion: Gym-Anything enables scalable creation of realistic computer-use environments. CUA-World provides comprehensive benchmark for long-horizon agent research. Multi-agent setup/audit pipeline ensures quality. Released code, infrastructure, and benchmark data to advance field.

Abstract: Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2$\times$ its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

[566] Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, Naipeng Chao

Main category: cs.LG

TL;DR: LSE-MTP improves multi-token prediction by anchoring predictions to ground-truth hidden states, reducing structural hallucinations and improving representation alignment between discrete tokens and continuous state representations.

DetailsMotivation: The paper addresses the debate about whether LLMs develop coherent internal world models. While Multi-Token Prediction (MTP) shows promise for learning structured representations, standard MTP suffers from structural hallucinations where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints.

Method: Proposes Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories rather than just discrete tokens. This bridges the gap between discrete tokens and continuous state representations by providing supervision at the latent semantic level.

Result: Experiments on synthetic graphs and real-world Manhattan Taxi Ride dataset show that LSE-MTP enhances representation alignment, reduces structural hallucinations, and improves robustness to perturbations compared to standard MTP.

Conclusion: LSE-MTP effectively addresses the limitations of standard MTP by providing latent semantic supervision, leading to better internal world model development in LLMs with improved representation quality and reduced hallucination issues.

Abstract: Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.
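A toy numpy reading of the LSE-MTP objective as described above: token-level cross-entropy on the k future tokens plus an MSE anchor that pulls the predicted latent trajectory toward the ground-truth hidden states. The weighting `lam` and the exact way the two terms combine are our assumptions, not the paper's specification.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lse_mtp_loss(logits, targets, pred_hidden, true_hidden, lam=0.5):
    # Cross-entropy on the k future tokens, plus an MSE "latent anchor" that
    # pulls the predicted hidden trajectory toward the ground-truth states.
    p = softmax(logits)                                    # (k, V)
    ce = -np.mean(np.log(p[np.arange(len(targets)), targets] + 1e-12))
    anchor = np.mean((pred_hidden - true_hidden) ** 2)     # (k, d) trajectories
    return float(ce + lam * anchor)

rng = np.random.default_rng(0)
k, V, d = 3, 10, 8
logits = rng.normal(size=(k, V))
tokens = rng.integers(0, V, size=k)
h_pred = rng.normal(size=(k, d))
loss_far = lse_mtp_loss(logits, tokens, h_pred, h_pred + 1.0)   # off-trajectory
loss_near = lse_mtp_loss(logits, tokens, h_pred, h_pred)        # on-trajectory
```

The anchor term is what distinguishes this from standard MTP: predictions that take "illegal shortcuts" in latent space pay a penalty even when their token outputs look plausible.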

[567] Target Policy Optimization

Jean Kaddour

Main category: cs.LG

TL;DR: TPO is a new RL method that separates target distribution construction from policy optimization, using cross-entropy to fit policies to constructed targets rather than traditional policy gradient updates.

DetailsMotivation: Standard policy-gradient methods combine target construction and parameter updates, leading to overshooting/undershooting issues dependent on learning rates and optimizer choices. The paper aims to separate these two questions for more stable optimization.

Method: Target Policy Optimization (TPO) constructs a target distribution q_i ∝ p_i^old exp(u_i) from scored completions, then fits the policy to this target using cross-entropy loss. The gradient on sampled-completion logits becomes p^θ - q, vanishing when policy matches target.

Result: TPO matches performance of PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward conditions across tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR.

Conclusion: TPO provides a principled separation of target construction and policy optimization that improves performance, especially in sparse reward scenarios, while maintaining competitive performance on standard tasks.

Abstract: In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p^\theta - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
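The TPO construction is fully specified by the abstract, so a small numpy sketch can check both claims: the target is q_i ∝ p_i^old exp(u_i), and the cross-entropy gradient on the logits is p^θ − q, vanishing when the policy matches the target. The toy numbers and the plain gradient-descent fit are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tpo_target(p_old, u):
    # Target over the sampled completions: q_i proportional to p_i^old * exp(u_i).
    w = p_old * np.exp(u - u.max())   # shift u for numerical stability
    return w / w.sum()

p_old = np.array([0.5, 0.3, 0.2])     # old policy mass on three completions
u = np.array([1.0, 0.0, -1.0])        # their scores
q = tpo_target(p_old, u)

# Fit by cross-entropy: with p^theta = softmax(z), the gradient of
# -sum_i q_i log p_i^theta with respect to z is exactly p^theta - q.
z = np.log(p_old)                     # start the policy at the old distribution
for _ in range(1000):
    z = z - 0.5 * (softmax(z) - q)    # plain gradient descent
p_theta = softmax(z)
```

The separation is visible in the code: the target construction (`tpo_target`) never touches model parameters, and the fitting loop never touches the scores.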

[568] Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertical Pipes

Brady Koenig, Sushovan Majhi, Atish Mitra, Abigail Stein, Burt Todd

Main category: cs.LG

TL;DR: First topology-based mathematical definition of churn flow using Euler Characteristic Surfaces, with unsupervised regime discovery via Multiple Kernel Learning that achieves high accuracy without labeled data.

DetailsMotivation: Churn flow has lacked a quantitative mathematical definition for over 40 years, and existing mechanistic models under-predict slug persistence in small-diameter pipes where interfacial tension and wall-to-wall interactions dominate flow.

Method: Introduces Euler Characteristic Surfaces (ECS) for topological characterization, formulates unsupervised regime discovery as Multiple Kernel Learning (MKL) blending ECS-derived kernels (temporal alignment and amplitude statistics) with gas velocity.

Result: The framework places 64% weight on topology-derived features, identifies churn flow transition 3.81 m/s above existing predictions, shows 1.9× higher topological complexity in churn vs. slug, and achieves 95.6% 4-class accuracy with 100% churn recall without labeled training data.

Conclusion: Provides first mathematical definition of churn flow and demonstrates that unsupervised topological descriptors can challenge and correct widely adopted mechanistic models, matching or exceeding supervised baselines without labeled data.

Abstract: Churn flow, the chaotic, oscillatory regime in vertical two-phase flow, has lacked a quantitative mathematical definition for over $40$ years. We introduce the first topology-based characterization using Euler Characteristic Surfaces (ECS). We formulate unsupervised regime discovery as Multiple Kernel Learning (MKL), blending two complementary ECS-derived kernels, temporal alignment ($L^1$ distance on the $\chi(s,t)$ surface) and amplitude statistics (scale-wise mean, standard deviation, max, min), with gas velocity. Applied to $37$ unlabeled air-water trials from Montana Tech, the self-calibrating framework learns weights $\beta_{ECS}=0.14$, $\beta_{amp}=0.50$, $\beta_{ugs}=0.36$, placing $64\%$ of total weight on topology-derived features ($\beta_{ECS} + \beta_{amp}$). The ECS-inferred slug/churn transition lies $+3.81$ m/s above Wu et al.'s (2017) prediction in $2$-in. tubing, quantifying reports that existing models under-predict slug persistence in small-diameter pipes where interfacial tension and wall-to-wall interactions dominate flow. Cross-facility validation on $947$ Texas A&M University images confirms $1.9\times$ higher topological complexity in churn vs. slug ($p < 10^{-5}$). Applied to $45$ TAMU pseudo-trials, the same unsupervised framework achieves $95.6\%$ $4$-class accuracy and $100\%$ churn recall, without any labeled training data, matching or exceeding supervised baselines that require thousands of annotated examples. This work provides the first mathematical definition of churn flow and demonstrates that unsupervised topological descriptors can challenge and correct widely adopted mechanistic models.
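The MKL blend itself is easy to illustrate. The sketch below forms a convex combination of three Gram matrices using the weights reported in the abstract; the RBF base kernels and the random stand-in features are our assumptions (the paper's kernels are the ECS $L^1$ distance and amplitude statistics, not RBFs).

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def blend_kernels(kernels, betas):
    # Convex combination of base Gram matrices, the core of an MKL model.
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()
    return sum(b * K for b, K in zip(betas, kernels)), betas

rng = np.random.default_rng(0)
n = 20
ecs_feats = rng.normal(size=(n, 5))   # stand-in ECS temporal-alignment features
amp_feats = rng.normal(size=(n, 4))   # stand-in amplitude statistics
ugs = rng.normal(size=(n, 1))         # stand-in superficial gas velocity

# Weights reported in the abstract: beta_ECS=0.14, beta_amp=0.50, beta_ugs=0.36.
K, betas = blend_kernels(
    [rbf_kernel(ecs_feats), rbf_kernel(amp_feats), rbf_kernel(ugs)],
    [0.14, 0.50, 0.36],
)
topology_weight = float(betas[0] + betas[1])   # topology-derived share: 64%
```

Since each base kernel is positive semidefinite and the weights are non-negative, the blended Gram matrix stays a valid kernel.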

[569] In-Place Test-Time Training

Guhao Feng, Shengjie Luo, Kai Hua, Ge Zhang, Di He, Wenhao Huang, Tianle Cai

Main category: cs.LG

TL;DR: In-Place Test-Time Training enables LLMs to dynamically adapt weights during inference by updating MLP projection matrices with a next-token-prediction aligned objective, achieving efficient continual learning.

DetailsMotivation: Current LLMs are limited by static "train then deploy" paradigm and cannot adapt to new information during inference. Test-Time Training offers adaptation but faces architectural incompatibility, computational inefficiency, and misaligned objectives for language modeling.

Method: Proposes In-Place TTT that treats final projection matrices of MLP blocks as adaptable fast weights. Uses a theoretically-grounded objective aligned with next-token-prediction task and implements chunk-wise update mechanism compatible with context parallelism.

Result: Enables 4B-parameter model to achieve superior performance on tasks with contexts up to 128k. When pretrained from scratch, consistently outperforms competitive TTT-related approaches. Provides efficient in-place enhancement without costly retraining.

Conclusion: In-Place TTT establishes a promising step toward continual learning in LLMs by enabling dynamic weight adaptation during inference with efficient, architecture-compatible updates aligned with language modeling objectives.

Abstract: The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
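The chunk-wise fast-weight idea can be caricatured in a few lines of numpy: one SGD step on the next-token cross-entropy per chunk, applied to a single projection matrix. The linear model, chunk size, and learning rate below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def inplace_ttt(W, hidden_chunks, token_chunks, lr=0.1):
    # Chunk-wise test-time updates of the fast weights W: one SGD step on the
    # next-token cross-entropy per chunk, so W adapts as the context streams in.
    for H, y in zip(hidden_chunks, token_chunks):   # H: (c, d), y: (c,)
        p = softmax(H @ W)                          # (c, V) token probabilities
        p[np.arange(len(y)), y] -= 1.0              # dCE/dlogits = p - onehot(y)
        W = W - lr * (H.T @ p) / len(y)
    return W

rng = np.random.default_rng(0)
d, V, c = 6, 5, 16
W0 = np.zeros((d, V))
chunks = [rng.normal(size=(c, d)) for _ in range(8)]
W_true = rng.normal(size=(d, V))                    # the stream's "true" mapping
targets = [np.argmax(H @ W_true, axis=1) for H in chunks]
W1 = inplace_ttt(W0, chunks, targets)

def stream_ce(W):
    # Average next-token cross-entropy over all chunks.
    return float(np.mean([
        -np.mean(np.log(softmax(H @ W)[np.arange(c), y] + 1e-12))
        for H, y in zip(chunks, targets)]))
```

Because each chunk produces one dense update, the loop parallelizes over context shards in the way the abstract's "compatible with context parallelism" suggests.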

[570] Understanding Uncertainty Sampling via Equivalent Loss

Shang Liu, Xiaocheng Li

Main category: cs.LG

TL;DR: Uncertainty sampling for active learning lacks theoretical foundations; this work establishes a framework connecting uncertainty measures to equivalent loss functions, providing theoretical guarantees and showing asymptotic superiority over passive learning.

DetailsMotivation: Uncertainty sampling is widely used in active learning but lacks theoretical foundations - there's no consensus on proper uncertainty definitions for specific tasks/losses, nor theoretical guarantees for implementation protocols.

Method: Systematically examines uncertainty sampling in binary classification via notion of equivalent loss that depends on uncertainty measure and original loss function; establishes that uncertainty sampling optimizes this equivalent loss; analyzes properness of uncertainty measures via surrogate property and loss convexity.

Result: When convexity is preserved, provides sample complexity result for equivalent loss and translates to binary loss guarantee via surrogate link function; proves asymptotic superiority of uncertainty sampling over passive learning under mild conditions.

Conclusion: Provides theoretical foundation for uncertainty sampling, connecting uncertainty measures to loss functions, establishing guarantees, and showing superiority over passive learning; discusses extensions to pool-based setting, multi-class classification, and regression.

Abstract: Uncertainty sampling is a prevalent active learning algorithm that queries sequentially the annotations of data samples which the current prediction model is uncertain about. However, the usage of uncertainty sampling has been largely heuristic: There is no consensus on the proper definition of "uncertainty" for a specific task under a specific loss, nor a theoretical guarantee that prescribes a standard protocol to implement the algorithm. In this work, we systematically examine uncertainty sampling algorithms in the binary classification problem via a notion of equivalent loss which depends on the used uncertainty measure and the original loss function, and establish that an uncertainty sampling algorithm is optimizing against such an equivalent loss. The perspective verifies the properness of existing uncertainty measures from two aspects: surrogate property and loss convexity. When the convexity is preserved, we give a sample complexity result for the equivalent loss, and later translate it into a binary loss guarantee via the surrogate link function. We prove the asymptotic superiority of the uncertainty sampling against the passive learning via this approach under mild conditions. We also discuss some potential extensions, including pool-based setting and potential generalization to the multi-class classification as well as the regression problems.
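For readers unfamiliar with the algorithm being analyzed, here is a minimal pool-based uncertainty-sampling loop for binary logistic regression: query the pool point whose predicted probability is closest to 1/2 (one common uncertainty measure among those the paper compares). The toy data and model are our assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic(X, y, steps=200, lr=0.5):
    # Plain gradient-descent logistic regression, labels in {0, 1}.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def most_uncertain(pool_X, w):
    # Uncertainty sampling for binary classification: query the pool point
    # whose predicted probability is closest to 1/2.
    return int(np.argmin(np.abs(sigmoid(pool_X @ w) - 0.5)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)           # a clean linear concept
labeled = list(range(5))                   # small initial seed set
for _ in range(10):                        # sequential pool-based queries
    w = fit_logistic(X[labeled], y[labeled])
    pool = [i for i in range(100) if i not in labeled]
    labeled.append(pool[most_uncertain(X[pool], w)])
w = fit_logistic(X[labeled], y[labeled])
acc = float(np.mean((sigmoid(X @ w) > 0.5) == y))
```

The paper's equivalent-loss lens asks what objective loops like this are implicitly optimizing, as a function of the chosen uncertainty measure and the original loss.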

[571] CROSS-Net: Region-Agnostic Taxi-Demand Prediction Using Feature Disentanglement

Ren Ozeki, Haruki Yonekura, Aidana Baimbetova, Hamada Rizk, Hirozumi Yamaguchi

Main category: cs.LG

TL;DR: CROSS-Net: A spatially transferable taxi demand prediction system using multiview graph neural networks and variational autoencoders to disentangle region-specific and region-agnostic features for cross-region generalization.

DetailsMotivation: The growing demand for ride-hailing services requires accurate taxi demand prediction, but existing systems are limited to specific regions and lack generality for unseen areas.

Method: Proposes CROSS-Net using multiview graph neural networks to capture spatial-temporal dependencies, combined with a Variational Autoencoder to disentangle input features into region-specific and region-agnostic components for cross-region transferability.

Result: Experimental results demonstrate effectiveness in accurately forecasting taxi demand even in previously unobserved regions, showing strong generalization capabilities.

Conclusion: CROSS-Net shows potential for optimizing taxi services and improving transportation efficiency on a broader scale through its spatially transferable approach.

Abstract: The growing demand for ride-hailing services has led to an increasing need for accurate taxi demand prediction. Existing systems are limited to specific regions, lacking generality to unseen areas. This paper presents a novel taxi demand prediction system, harnessing the strengths of multiview graph neural networks to capture spatial-temporal dependencies and patterns in urban environments. Additionally, the proposed system CROSS-Net employs a spatially transferable approach, enabling it to train a model that can be deployed to previously unseen regions. To achieve this, the framework incorporates the power of a Variational Autoencoder to disentangle the input features into region-specific and region-agnostic components. The region-agnostic features facilitate cross-region taxi demand predictions, allowing the model to generalize well across different urban areas. Experimental results demonstrate the effectiveness of CROSS-Net in accurately forecasting taxi demand, even in previously unobserved regions, thus showcasing its potential for optimizing taxi services and improving transportation efficiency on a broader scale.
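The disentanglement idea can be sketched schematically. Below, a toy linear VAE encoder splits its latent code into a region-specific and a region-agnostic half, and only the agnostic half feeds the demand head; everything here (the linear encoder, the even split, the names) is our illustrative assumption about the architecture, not CROSS-Net itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    # Toy linear VAE encoder. The latent code is split in half:
    # first half = region-specific, second half = region-agnostic.
    mu, logvar = x @ W_mu, x @ W_logvar
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterize
    half = z.shape[-1] // 2
    return z[..., :half], z[..., half:]

def predict_demand(z_agnostic, W_head):
    # Only the region-agnostic code feeds the demand head, which is what
    # lets the predictor transfer to regions unseen during training.
    return z_agnostic @ W_head

d_in, d_z = 10, 8
x = rng.normal(size=(4, d_in))                 # features from an unseen region
W_mu = 0.1 * rng.normal(size=(d_in, d_z))
W_logvar = 0.1 * rng.normal(size=(d_in, d_z))
W_head = rng.normal(size=(d_z // 2, 1))
z_specific, z_agnostic = encode(x, W_mu, W_logvar)
demand = predict_demand(z_agnostic, W_head)
```

The design choice to show: the region-specific code absorbs local idiosyncrasies during training, so the head trained on the agnostic code can be deployed to a new city without it.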

[572] Towards Better Statistical Understanding of Watermarking LLMs

Zhongze Cai, Shang Liu, Hanzhao Wang, Huaiyang Zhong, Xiaocheng Li

Main category: cs.LG

TL;DR: This paper presents an optimization framework for watermarking large language models that balances model distortion and detection ability, with an online dual gradient ascent algorithm achieving asymptotic Pareto optimality.

DetailsMotivation: The paper addresses the need for effective watermarking of LLMs to track AI-generated content while managing the trade-off between preserving model quality (minimizing distortion) and ensuring reliable detection of watermarked content.

Method: The authors formulate watermarking as a constrained optimization problem based on red-green list algorithms, develop an online dual gradient ascent algorithm, and provide theoretical analysis showing asymptotic Pareto optimality between distortion and detection ability.

Result: The proposed algorithm achieves better trade-offs between model distortion and detection ability compared to benchmarks, with theoretical guarantees of averaged increased green list probability and explicit detection ability improvements.

Conclusion: The paper provides a principled optimization framework for LLM watermarking with theoretical guarantees, addresses limitations of existing distortion metrics, and demonstrates practical effectiveness through empirical evaluation.

Abstract: In this paper, we study the problem of watermarking large language models (LLMs). We consider the trade-off between model distortion and detection ability and formulate it as a constrained optimization problem based on the red-green list watermarking algorithm. We show that the optimal solution to the optimization problem enjoys a nice analytical property which provides a better understanding and inspires the algorithm design for the watermarking process. We develop an online dual gradient ascent watermarking algorithm in light of this optimization formulation and prove its asymptotic Pareto optimality between model distortion and detection ability. Such a result guarantees an averaged increased green list probability and henceforth detection ability explicitly (in contrast to previous results). Moreover, we provide a systematic discussion on the choice of the model distortion metrics for the watermarking problem. We justify our choice of KL divergence and present issues with the existing criteria of "distortion-free" and perplexity. Finally, we empirically evaluate our algorithms on extensive datasets against benchmark algorithms.
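For context on the red-green list scheme the paper builds on, here is the standard detection statistic: without a watermark the green-token count is approximately Binomial(T, γ), so a large binomial z-score flags watermarked text. This is background for the family of algorithms, not the paper's new dual gradient ascent method; the token lists are toy data.

```python
import math

def green_zscore(tokens, green_set, gamma=0.5):
    # Red-green detection statistic: without a watermark the green-token count
    # is ~ Binomial(T, gamma), so a large z-score flags watermarked text.
    T = len(tokens)
    n_green = sum(t in green_set for t in tokens)
    return (n_green - gamma * T) / math.sqrt(T * gamma * (1.0 - gamma))

green = set(range(50))                       # green half of a 100-token vocab
watermarked = [3, 7, 12, 44, 21, 9, 48, 30, 5, 61, 2, 17, 40, 33, 8, 26]
plain = list(range(42, 58))                  # half green, half red
z_wm = green_zscore(watermarked, green)      # 15 of 16 tokens are green
z_plain = green_zscore(plain, green)         # exactly the expected count
```

The paper's trade-off lives in how strongly generation is biased toward the green list: more bias raises this z-score (detection) but also raises the KL distortion from the unwatermarked model.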

[573] Interpreting Temporal Graph Neural Networks with Koopman Theory

Michele Guerra, Simone Scardapane, Filippo Maria Bianchi

Main category: cs.LG

TL;DR: Novel explainability methods for spatiotemporal graph neural networks using Koopman theory, with DMD and SINDy approaches to identify relevant spatial-temporal patterns in dynamic graph data.

DetailsMotivation: Spatiotemporal graph neural networks (STGNNs) are powerful but difficult to interpret compared to static models. There's a need for explainability methods that can reveal the learned dynamics and decision processes in temporal graph models.

Method: Two Koopman theory-inspired approaches: 1) Dynamic Mode Decomposition (DMD) for dimensionality reduction and pattern identification, and 2) Sparse Identification of Nonlinear Dynamics (SINDy) for discovering governing equations, applied for the first time as general explainability tools for STGNNs.

Result: On semi-synthetic dissemination datasets, methods correctly identified interpretable features like infection times and infected nodes. Qualitative validation on real-world human motion dataset showed explanations highlighting relevant body parts for action recognition.

Conclusion: The paper presents effective explainability approaches for STGNNs using Koopman theory, successfully identifying relevant spatial-temporal patterns in both synthetic and real-world applications.

Abstract: Spatiotemporal graph neural networks (STGNNs) have shown promising results in many domains, from forecasting to epidemiology. However, understanding the dynamics learned by these models and explaining their behaviour is significantly more difficult than for models that deal with static data. Inspired by Koopman theory, which allows a simple description of intricate, nonlinear dynamical systems, we introduce new explainability approaches for temporal graphs. Specifically, we present two methods to interpret the STGNN’s decision process and identify the most relevant spatial and temporal patterns in the input for the task at hand. The first relies on dynamic mode decomposition (DMD), a Koopman-inspired dimensionality reduction method. The second relies on sparse identification of nonlinear dynamics (SINDy), a popular method for discovering governing equations of dynamical systems, which we use for the first time as a general tool for explainability. On semi-synthetic dissemination datasets, our methods correctly identify interpretable features such as the times at which infections occur and the infected nodes. We also validate the methods qualitatively on a real-world human motion dataset, where the explanations highlight the body parts most relevant for action recognition.
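Of the two tools, DMD is compact enough to show end to end: fit the best linear one-step operator from snapshot pairs and eigendecompose it. The planar-rotation toy example below is ours; the paper applies the same decomposition to STGNN hidden states rather than raw dynamics.

```python
import numpy as np

def dmd(snapshots):
    # Exact DMD: fit the best linear one-step operator A with x_{t+1} = A x_t
    # in least squares, then eigendecompose it. Eigenvalues give growth and
    # oscillation rates; eigenvectors give the spatial modes.
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    A = Y @ np.linalg.pinv(X)
    eigvals, modes = np.linalg.eig(A)
    return A, eigvals, modes

# Toy check: a planar rotation is recovered exactly, with eigenvalues on the
# unit circle (pure oscillation, no growth or decay).
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
traj = [np.array([1.0, 0.0])]
for _ in range(20):
    traj.append(R @ traj[-1])
A, lam, modes = dmd(np.stack(traj, axis=1))
```

In the explainability setting, the modes with the largest-magnitude eigenvalues point at the spatial patterns (here, nodes or body parts) that dominate the learned dynamics.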

[574] Retrieval Augmented Time Series Forecasting

Kutay Tire, Ege Onur Taga, Muhammed Emrullah Ildiz, Samet Oymak

Main category: cs.LG

TL;DR: RAF introduces retrieval-augmented generation for time-series forecasting, showing improved accuracy across domains, especially for larger foundation models.

DetailsMotivation: The paper is motivated by the success of retrieval-augmented generation (RAG) in LLMs and the emergence of time-series foundation models (TSFMs). The authors question whether RAG benefits extend to time-series forecasting, given the dynamic, event-driven nature of time-series data.

Method: The authors introduce Retrieval Augmented Forecasting (RAF), a principled RAG framework for time-series forecasting. They develop efficient strategies for retrieving related time-series examples and incorporating them into forecasts.

Result: Experiments show RAF improves forecasting accuracy across diverse time series domains. The improvement is more significant for larger TSFM sizes, as demonstrated through mechanistic studies.

Conclusion: RAG is a crucial component for time-series foundation models, and RAF effectively demonstrates how retrieval-augmented approaches can enhance forecasting performance in dynamic, event-driven time-series scenarios.

Abstract: Retrieval-augmented generation (RAG) is a central component of modern LLM systems, particularly in scenarios where up-to-date information is crucial for accurately responding to user queries or when queries exceed the scope of the training data. The advent of time-series foundation models (TSFM), such as Chronos, and the need for effective zero-shot forecasting performance across various time-series domains motivates the question: Do benefits of RAG similarly carry over to time series forecasting? In this paper, we advocate that the dynamic and event-driven nature of time-series data makes RAG a crucial component of TSFMs and introduce a principled RAG framework for time-series forecasting, called Retrieval Augmented Forecasting (RAF). Within RAF, we develop efficient strategies for retrieving related time-series examples and incorporating them into forecasts. Through experiments and mechanistic studies, we demonstrate that RAF indeed improves the forecasting accuracy across diverse time series domains and the improvement is more significant for larger TSFM sizes.
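The retrieval step can be sketched in its simplest form: find the k historical windows closest to the current context and average what followed each of them. The mean-centered Euclidean distance and the averaging rule are our assumptions; RAF develops more efficient retrieval strategies and feeds retrieved examples to a TSFM rather than averaging directly.

```python
import numpy as np

def retrieve_and_forecast(history, context, window, horizon, k=3):
    # Minimal RAF-style forecast: find the k historical windows closest to the
    # current context (mean-centered Euclidean distance) and average what
    # followed each of them.
    ctx = context - context.mean()
    cands = []
    for t in range(len(history) - window - horizon + 1):
        w = history[t:t + window]
        dist = np.linalg.norm((w - w.mean()) - ctx)
        cands.append((dist, history[t + window:t + window + horizon]))
    cands.sort(key=lambda c: c[0])
    return np.mean([f for _, f in cands[:k]], axis=0)

# Toy check: on a clean periodic series, retrieval recovers the continuation.
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50)
context = series[-24:]
forecast = retrieve_and_forecast(series[:-24], context, window=24, horizon=12)
truth = np.sin(2 * np.pi * np.arange(400, 412) / 50)
err = float(np.max(np.abs(forecast - truth)))
```

On genuinely event-driven data the retrieved exemplars carry information the model's parameters cannot, which is the abstract's argument for why RAG should help TSFMs at all.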

[575] Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Naman Deep Singh, Francesco Croce, Matthias Hein

Main category: cs.LG

TL;DR: PAR (Perturb and Recover) is a simple yet effective method to remove backdoors from CLIP models by perturbing inputs and recovering clean representations, working even with synthetic data.

DetailsMotivation: CLIP models are vulnerable to backdoor attacks due to web-scraped training data and their foundational nature. Existing cleaning techniques fail against simple triggers, creating security risks for real-world deployment.

Method: PAR introduces a two-step approach: 1) Perturb inputs with noise to activate backdoor triggers, 2) Recover clean representations by fine-tuning the model to map perturbed inputs back to original clean representations.

Result: PAR achieves high backdoor removal rates across different CLIP encoders and attack types while preserving standard performance. It works effectively even with only synthetic text-image pairs.

Conclusion: PAR provides a practical defense against backdoor attacks in vision-language models, addressing a critical security vulnerability for real-world deployment of foundational models like CLIP.

Abstract: Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, make them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even only with synthetic text-image pairs, i.e. without access to real training data. The code and models are available on \href{https://github.com/nmndeep/PerturbAndRecover}{GitHub}.

[576] Regional climate risk assessment from climate models using probabilistic machine learning

Zhong Yi Wan, Ignacio Lopez-Gomez, Robert Carver, Tapio Schneider, John Anderson, Fei Sha, Leonardo Zepeda-Núñez

Main category: cs.LG

TL;DR: GenFocal is an AI framework that generates fine-scale weather data from coarse climate projections to bridge the resolution gap for regional climate risk assessment, without requiring paired training data.

DetailsMotivation: Climate risk assessment faces a resolution gap between coarse global climate models and the fine-scale information needed for regional decision-making. Existing methods struggle to generate statistically accurate, localized weather data from low-resolution projections.

Method: GenFocal uses an AI framework that generates fine-scale weather from coarse climate projections without requiring paired simulated and observed events during training. It can synthesize complex, long-lived hazards like heat waves and tropical cyclones even when poorly represented in coarse projections.

Result: The framework produces statistically accurate fine-scale weather data and samples high-impact, rare events more accurately than leading methods. It can generate localized information from large-scale climate projections.

Conclusion: GenFocal provides a powerful new paradigm for translating climate projections into actionable, localized information to improve climate adaptation and resilience strategies by bridging the resolution gap in climate modeling.

Abstract: Effective climate risk assessment is hindered by the resolution gap between coarse global climate models and the fine-scale information needed for regional decisions. We introduce GenFocal, an AI framework that generates statistically accurate, fine-scale weather from coarse climate projections, without requiring paired simulated and observed events during training. GenFocal synthesizes complex and long-lived hazards, such as heat waves and tropical cyclones, even when they are not well represented in the coarse climate projections. It also samples high-impact, rare events more accurately than leading methods. By translating large-scale climate projections into actionable, localized information, GenFocal provides a powerful new paradigm to improve climate adaptation and resilience strategies.

[577] How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

Shang Liu, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li

Main category: cs.LG

Summary unavailable: the arXiv API request for 2502.06387 returned HTTP 429 (rate limited).

[578] Towards Reliable Forgetting: A Survey on Machine Unlearning Verification

Lulu Xue, Shengshan Hu, Wei Lu, Yan Shen, Dongxu Li, Peijin Guo, Ziqi Zhou, Minghui Li, Yanjun Zhang, Leo Yu Zhang

Main category: cs.LG

Summary unavailable: the arXiv API request for 2506.15115 returned HTTP 429 (rate limited).

[579] Distribution-dependent Generalization Bounds for Tuning Linear Regression Across Tasks

Maria-Florina Balcan, Saumya Goyal, Dravyansh Sharma

Main category: cs.LG

Summary unavailable: the arXiv API request for 2507.05084 returned HTTP 429 (rate limited).

[580] Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement

Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon

Main category: cs.LG

Summary unavailable: the arXiv API request for 2507.08390 returned HTTP 429 (rate limited).

[581] Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning

Oluwaseyi Giwa, Jonathan Shock, Jaco Du Toit, Tobi Awodumila

Main category: cs.LG

Summary unavailable: the arXiv API request for 2509.25284 returned HTTP 429 (rate limited).

[582] Supporting Evidence for the Adaptive Feature Program across Diverse Models

Yicheng Li, Qian Lin

Main category: cs.LG

Summary unavailable: the arXiv API request for 2511.09425 returned HTTP 429 (rate limited).

[583] Near-optimal Linear Predictive Clustering in Non-separable Spaces via MIP and QPBO Reductions

Jiazhou Liang, Hassan Khurram, Scott Sanner

Main category: cs.LG

Summary unavailable: the arXiv API request for 2511.10809 returned HTTP 429 (rate limited).

[584] Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models

Manh Nguyen, Sunil Gupta, Hung Le

Main category: cs.LG

Summary unavailable: the arXiv API request for 2512.04351 returned HTTP 429 (rate limited).

[585] HOLE: Homological Observation of Latent Embeddings for Neural Network Interpretability

Sudhanva Manjunath Athreya, Paul Rosen

Main category: cs.LG

Summary unavailable: the arXiv API request for 2512.07988 returned HTTP 429 (rate limited).

[586] Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet

Main category: cs.LG

Summary unavailable: the arXiv API request for 2512.15605 returned HTTP 429 (rate limited).

[587] Energy-Balanced Hyperspherical Graph Representation Learning via Structural Binding and Entropic Dispersion

Rui Chen, Junjun Guo, Hongbin Wang, Yan Xiang, Yantuan Xian, Zhengtao Yu

Main category: cs.LG

Summary unavailable: the arXiv API request for 2512.24062 returned HTTP 429 (rate limited).

[588] Meta-probabilistic Modeling

Kevin Zhang, Yixin Wang

Main category: cs.LG

Summary unavailable: the arXiv API request for 2601.04462 returned HTTP 429 (rate limited).

[589] A Hessian-Free Actor-Critic Algorithm for Bi-Level Reinforcement Learning with Applications to LLM Fine-Tuning

Sihan Zeng, Sujay Bhatt, Sumitra Ganesh, Alec Koppel

Main category: cs.LG

Summary unavailable: the arXiv API request for 2601.16399 returned HTTP 429 (rate limited).

[590] Information Hidden in Gradients of Regression with Target Noise

Arash Jamshidi, Katsiaryna Haitsiukevich, Kai Puolamäki

Main category: cs.LG

Summary unavailable: the arXiv API request for 2601.18546 returned HTTP 429 (rate limited).

[591] Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking

Alexander Häußer

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.03912 returned HTTP 429 (rate limited).

[592] Triplet Feature Fusion for Equipment Anomaly Prediction : An Open-Source Methodology Using Small Foundation Models

Takato Yasuno

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.15089 returned HTTP 429 (rate limited).

[593] From Human-Level AI Tales to AI Leveling Human Scales

Peter Romero, Fernando Martínez-Plumed, Zachary R. Tidler, Matthieu Téhénan, Sipeng Chen, Álvaro David Gómez Antón, Luning Sun, Manuel Cebrian, Lexin Zhou, Yael Moros Daval, Daniel Romero-Alvarado, Félix Martí Pérez, Kevin Wei, José Hernández-Orallo

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.18911 returned HTTP 429 (rate limited).

[594] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Jingxuan Zhang, Yunta Hsieh, Zhongwei Wan, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang

Main category: cs.LG

Summary unavailable: the arXiv API request for 2602.20309 returned HTTP 429 (rate limited).

[595] Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Maria-Florina Balcan, Avrim Blum, Kiriaki Fragkia, Zhiyuan Li, Dravyansh Sharma

Main category: cs.LG

Summary unavailable: the arXiv API request for 2603.03538 returned HTTP 429 (rate limited).

[596] CLaRE-ty Amid Chaos: Quantifying Representational Entanglement to Predict Ripple Effects in LLM Editing

Manit Baser, Alperen Yildiz, Dinil Mon Divakaran, Mohan Gurusamy

Main category: cs.LG

Summary unavailable: the arXiv API request for 2603.19297 returned HTTP 429 (rate limited).

[597] Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal

Main category: cs.LG

TL;DR: Hybrid models mixing attention and recurrent layers outperform pure transformers in scaling efficiency and performance, offering more expressive models that scale better during pretraining.

DetailsMotivation: To determine whether non-transformer architectures (especially hybrid models mixing recurrence and attention) justify scaling efforts by comparing their benefits against pure transformers.

Method: Theoretical analysis of hybrid model expressivity, plus practical training of Olmo Hybrid (7B-parameter model with Gated DeltaNet layers replacing sliding window layers) compared to Olmo 3 7B transformer.

Result: Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, scales significantly more efficiently than transformer, and shows greater expressivity on formal problems.

Conclusion: Hybrid models mixing attention and recurrent layers are a powerful extension to language modeling, providing more expressive models that scale better during pretraining, not just for inference memory reduction.

Abstract: Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.
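As a toy illustration (my own sketch, not the Olmo Hybrid code), the delta-rule fast-weight update behind Gated DeltaNet-style layers keeps a fixed-size matrix state S instead of a growing KV cache: each token decays the state with a gate and overwrites the value stored under its key, roughly S_t = g_t * S_{t-1} (I - b_t k_t k_t^T) + b_t v_t k_t^T, then reads out with the query. The dimensions, gate values, and function names below are illustrative.

```python
def matvec(M, x):
    # matrix-vector product for plain list-of-lists matrices
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def gated_delta_step(S, k, v, q, beta=1.0, gate=1.0):
    """One token step: decay the state, apply the delta-rule correction
    toward the pair (k, v), then read out with query q."""
    d = len(k)
    # M = I - beta * k k^T : erases whatever was stored under key k
    M = [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
         for i in range(d)]
    # S_t = gate * S_{t-1} M + beta * v k^T : writes the new value
    S = [[gate * sum(S[i][l] * M[l][j] for l in range(d)) + beta * v[i] * k[j]
          for j in range(d)] for i in range(d)]
    return S, matvec(S, q)

# Associative-memory behaviour: store two (key, value) pairs, then recall.
S = [[0.0, 0.0], [0.0, 0.0]]
S, _ = gated_delta_step(S, k=[1.0, 0.0], v=[0.0, 3.0], q=[1.0, 0.0])
S, out = gated_delta_step(S, k=[0.0, 1.0], v=[5.0, 0.0], q=[1.0, 0.0])
print(out)  # the first pair is still retrievable: [0.0, 3.0]
```

Because the state is a d x d matrix updated in place, per-token cost and memory stay constant in sequence length, which is the efficiency argument for mixing such layers with attention.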

[598] Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version

Masoud S. Sakha, Rushikesh Kamalapurkar, Sean Meyn

Main category: cs.LG

Summary unavailable: the arXiv API request for 2603.27874 returned HTTP 429 (rate limited).

[599] Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial

Zhongwei Yu, Rasul Tutunov, Alexandre Max Maraval, Zikai Xie, Zhenzhi Tan, Jiankang Wang, Bin Cao, Zijing Li, Liangliang Xu, Qi Yang, Jun Jiang, Sanzhong Luo, Zhenxiao Guo, Tongyi Zhang, Haitham Bou-Ammar, Jun Wang

Main category: cs.LG

Summary unavailable: the arXiv API request for 2604.01328 returned HTTP 429 (rate limited).

[600] PI-JEPA: Label-Free Surrogate Pretraining for Coupled Multiphysics Simulation via Operator-Split Latent Prediction

Brandon Yee, Pairie Koh

Main category: cs.LG

Summary unavailable: the arXiv API request for 2604.01349 returned HTTP 429 (rate limited).

[601] YC Bench: a Live Benchmark for Forecasting Startup Outperformance in Y Combinator Batches

Mostapha Benhenda

Main category: cs.LG

Summary unavailable: the arXiv API request for 2604.02378 returned HTTP 429 (rate limited).

[602] WGFINNs: Weak formulation-based GENERIC formalism informed neural networks

Jun Sur Richard Park, Auroni Huque Hashim, Siu Wun Cheung, Youngsoo Choi, Yeonjong Shin

Main category: cs.LG

Summary unavailable: the arXiv API request for 2604.02601 returned HTTP 429 (rate limited).

Benjamin S. Knight, Ahsaas Bajaj

Main category: cs.LG

TL;DR: Survey of regularization evolution with empirical evaluation showing Ridge, Lasso, and ElasticNet are interchangeable for prediction when n/p ≥ 78, but Lasso recall collapses under multicollinearity while ElasticNet maintains performance.

DetailsMotivation: To provide a comprehensive historical survey of regularization techniques and empirically evaluate their performance across different conditions to offer practical guidance for machine learning practitioners.

Method: Historical survey tracing regularization evolution from 1960s to modern techniques, plus empirical evaluation of Ridge, Lasso, ElasticNet, and Post-Lasso OLS across 134,400 simulations spanning 7-dimensional manifold based on eight production-grade ML models.

Result: Ridge, Lasso, and ElasticNet are nearly interchangeable for prediction accuracy when sample-to-feature ratio is sufficient (n/p ≥ 78). However, Lasso recall collapses to 0.18 under high multicollinearity and low SNR, while ElasticNet maintains 0.93 recall.

Conclusion: Practitioners should avoid Lasso or Post-Lasso OLS at high condition numbers with small sample sizes. The study also provides an objective-driven decision guide for selecting the optimal regularization framework based on observable feature space attributes.

Abstract: This study surveys the historical development of regularization, tracing its evolution from stepwise regression in the 1960s to recent advancements in formal error control, structured penalties for non-independent features, Bayesian methods, and l0-based regularization (among other techniques). We empirically evaluate the performance of four canonical frameworks – Ridge, Lasso, ElasticNet, and Post-Lasso OLS – across 134,400 simulations spanning a 7-dimensional manifold grounded in eight production-grade machine learning models. Our findings demonstrate that for prediction accuracy when the sample-to-feature ratio is sufficient (n/p >= 78), Ridge, Lasso, and ElasticNet are nearly interchangeable. However, we find that Lasso recall is highly fragile under multicollinearity; at high condition numbers (kappa) and low SNR, Lasso recall collapses to 0.18 while ElasticNet maintains 0.93. Consequently, we advise practitioners against using Lasso or Post-Lasso OLS at high kappa with small sample sizes. The analysis concludes with an objective-driven decision guide to assist machine learning engineers in selecting the optimal scikit-learn-supported framework based on observable feature space attributes.

[604] Edgeworth Accountant: An Analytical Approach to Differential Privacy Composition

Hua Wang, Sheng Gao, Huanyu Zhang, Milan Shen, Weijie J. Su, Jiayuan Wu

Main category: cs.LG

Summary unavailable: the arXiv API request for 2206.04236 returned HTTP 429 (rate limited).

Yicheng Li, Zixiong Yu, Guhan Chen, Qian Lin

Main category: cs.LG

Summary unavailable: the arXiv API request for 2305.02657 returned HTTP 429 (rate limited).

[606] The Umeyama algorithm for matching correlated Gaussian geometric models in the low-dimensional regime

Shuyang Gong, Zhangsong Li

Main category: cs.LG

Summary unavailable: the arXiv API request for 2402.15095 returned HTTP 429 (rate limited).

[607] Approximation properties of neural ODEs

Arturo De Marinis, Davide Murari, Elena Celledoni, Nicola Guglielmi, Brynjulf Owren, Francesco Tudisco

Main category: cs.LG

Summary unavailable: the arXiv API request for 2503.15696 returned HTTP 429 (rate limited).

[608] STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization

Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, Liqiang Nie

Main category: cs.LG

Summary unavailable: the arXiv API request for 2506.03863 returned HTTP 429 (rate limited).

[609] Service Placement in Small Cell Networks Using Distributed Best Arm Identification in Linear Bandits

Mariam Yahya, Aydin Sezgin, Setareh Maghsudi

Main category: cs.LG

Summary unavailable: the arXiv API request for 2506.22480 returned HTTP 429 (rate limited).

[610] Gaussian process surrogate with physical law-corrected prior for multi-coupled PDEs defined on irregular geometry

Pucheng Tang, Hongqiao Wang, Wenzhou Lin, Qian Chen, Heng Yong

Main category: cs.LG

Summary unavailable: the arXiv API request for 2509.02617 returned HTTP 429 (rate limited).

[611] Supersimulators

Cynthia Dwork, Pranay Tankala

Main category: cs.LG

Summary unavailable: the arXiv API request for 2509.17994 returned HTTP 429 (rate limited).

[612] Smart Paste: Automatically Fixing Copy/Paste for Google Developers

Vincent Nguyen, Guilherme Herzog, José Cambronero, Marcus Revaj, Aditya Kini, Alexander Frömmgen, Maxim Tabachnyk

Main category: cs.LG

Summary unavailable: the arXiv API request for 2510.03843 returned HTTP 429 (rate limited).

[613] UAV-Assisted Resilience in 6G and Beyond Network Energy Saving: A Multi-Agent DRL Approach

Dao Lan Vy Dinh, Anh Nguyen Thi Mai, Hung Tran, Giang Quynh Le Vu, Tu Dac Ho, Zhenni Pan, Vo Nhan Van, Symeon Chatzinotas, Dinh-Hieu Tran

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2511.07366 returned HTTP 429 (rate limited).

[614] Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy

Gauri Pradhan, Joonas Jälkö, Santiago Zanella-Béguelin, Antti Honkela

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2511.21804 returned HTTP 429 (rate limited).

[615] GLUE: Coordinating Pre-Trained Generative Models for System-Level Design

Tim Aebersold, Soheyl Massoudi, Mark D. Fuge

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2512.19469 returned HTTP 429 (rate limited).

[616] Deterministic and probabilistic neural surrogates of global hybrid-Vlasov simulations

Daniel Holmberg, Ivan Zaitsev, Markku Alho, Ioanna Bouri, Fanni Franssila, Haewon Jeong, Minna Palmroth, Teemu Roos

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.12614 returned HTTP 429 (rate limited).

[617] Enhanced Climbing Image Nudged Elastic Band method with Hessian Eigenmode Alignment

Rohit Goswami, Miha Gunde, Hannes Jónsson

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.12630 returned HTTP 429 (rate limited).

[618] The Turing Synthetic Radar Dataset: A dataset for pulse deinterleaving

Edward Gunn, Adam Hosford, Robert Jones, Leo Zeitler, Ian Groves, Victoria Nockles

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.03856 returned HTTP 429 (rate limited).

Xavier Tardy, Grégoire Lefebvre, Apostolos Kountouris, Haïfa Fares, Amor Nafkha

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.04728 returned HTTP 429 (rate limited).

[620] Causal Effect Estimation with Learned Instrument Representations

Frances Dean, Jenna Fields, Radhika Bhalerao, Marie Charpignon, Ahmed Alaa

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.10370 returned HTTP 429 (rate limited).

[621] Imaging-Derived Coronary Fractional Flow Reserve: Advances in Physics-Based, Machine Learning, and Physics-Informed Methods

Tanxin Zhu, Emran Hossen, Chen Zhao, Jingfeng Jiang, Michele Esposito, Jiguang Sun, Weihua Zhou

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.16000 returned HTTP 429 (rate limited).

[622] LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.03959 returned HTTP 429 (rate limited).

[623] MOELIGA: a multi-objective evolutionary approach for feature selection with local improvement

Leandro Vignolo, Matias Gerard

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.20934 returned HTTP 429 (rate limited).

[624] Symmetrizing Bregman Divergence on the Cone of Positive Definite Matrices: Which Mean to Use and Why

Tushar Sial, Abhishek Halder

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.28917 returned HTTP 429 (rate limited).

[625] MVNN: A Measure-Valued Neural Network for Learning McKean-Vlasov Dynamics from Particle Data

Liyao Lyu, Xinyue Yu, Hayden Schaeffer

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2604.00333 returned HTTP 429 (rate limited).

[626] Transfer Learning for Meta-analysis Under Covariate Shift

Zilong Wang, Ali Abdeen, Turgay Ayer

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2604.02656 returned HTTP 429 (rate limited).

[627] Expressibility of neural quantum states: a Walsh-complexity perspective

Taige Wang

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2604.03294 returned HTTP 429 (rate limited).

[628] ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs

Jinwu Yang, Jiaan Wu, Zedong Liu, Xinyang Ma, Hairui Zhao, Yida Gu, Yuanhong Huang, Xingchen Liu, Wenjing Huang, Zheng Wei, Jing Xing, Yili Ma, Qingyi Zhang, Baoyi An, Zhongzhe Hu, Shaoteng Liu, Xia Zhu, Jiaxun Lu, Guangming Tan, Dingwen Tao

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2604.03298 returned HTTP 429 (rate limited).

[629] PhaseFlow4D: Physically Constrained 4D Beam Reconstruction via Feedback-Guided Latent Diffusion

Alexander Scheinker, Alexander Plastun, Peter Ostroumov

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2604.03885 returned HTTP 429 (rate limited).

cs.MA

[630] GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

Zihao Lin, Haibo Wang, Zhiyang Xu, Siyao Dai, Huanjie Dong, Xiaohan Wang, Yolo Y. Tang, Yixin Wang, Qifan Wang, Lifu Huang

Main category: cs.MA

TL;DR: GLANCE is a multi-agent framework for music-grounded mashup video creation that uses global-local coordination and bi-loop architecture to address complex video editing challenges.

DetailsMotivation: Existing approaches for music-grounded mashup video creation rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting adaptability to diverse prompts and heterogeneous source materials.

Method: GLANCE uses a bi-loop architecture with outer loop for long-horizon planning and inner loop for segment-wise editing. It introduces global-local coordination with context controller, conflict region decomposition, and bottom-up dynamic negotiation mechanisms.

Result: GLANCE outperforms prior baselines by 33.2% and 15.6% on two task settings using GPT-4o-mini as backbone, with human evaluation confirming video quality and evaluation framework effectiveness.

Conclusion: The GLANCE framework effectively addresses music-grounded video editing challenges through multi-agent coordination and introduces a comprehensive benchmark for rigorous evaluation.

Abstract: Music-grounded mashup video creation is a challenging form of video non-linear editing, where a system must compose a coherent timeline from large collections of source videos while aligning with music rhythm, user intent, story completeness, and long-range structural constraints. Existing approaches typically rely on fixed pipelines or simplified retrieval-and-concatenation paradigms, limiting their ability to adapt to diverse prompts and heterogeneous source materials. In this paper, we present GLANCE, a global-local coordination multi-agent framework for music-grounded nonlinear video editing. GLANCE adopts a bi-loop architecture for better editing practice: an outer loop performs long-horizon planning and task-graph construction, and an inner loop adopts the “Observe-Think-Act-Verify” flow for segment-wise editing tasks and their refinements. To address the cross-segment and global conflict emerging after subtimelines composition, we introduce a dedicated global-local coordination mechanism with both preventive and corrective components, which includes a novelly designed context controller, conflict region decomposition module, and a bottom-up dynamic negotiation mechanism. To support rigorous evaluation, we construct MVEBench, a new benchmark that factorizes editing difficulty along task type, prompt specificity, and music length, and propose an agent-as-a-judge evaluation framework for scalable multi-dimensional assessment. Experimental results show that GLANCE consistently outperforms prior research baselines and open-source product baselines under the same backbone models. With GPT-4o-mini as the backbone, GLANCE improves over the strongest baseline by 33.2% and 15.6% on two task settings, respectively. Human evaluation further confirms the quality of the generated videos and validates the effectiveness of the proposed evaluation framework.
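The bi-loop control flow described in the Method summary can be sketched in a few lines. This is a toy illustration only: the function names, the segment splitting, and the duration-matching "verify" heuristic are hypothetical stand-ins, not GLANCE's actual agents or planner.

```python
# Toy sketch of GLANCE's bi-loop flow (illustrative; all names are
# hypothetical stand-ins). The outer loop builds a per-segment plan from the
# music; the inner loop runs an Observe-Think-Act-Verify cycle per segment,
# retrying with a looser tolerance when verification fails.

def plan_task_graph(music_beats, segment_len=2):
    # Outer loop: group beat durations into fixed-size segments.
    return [music_beats[i:i + segment_len]
            for i in range(0, len(music_beats), segment_len)]

def edit_timeline(music_beats, clips, max_retries=2):
    timeline = []
    for seg in plan_task_graph(music_beats):
        for attempt in range(max_retries + 1):
            # Observe: clips not yet placed on the timeline.
            candidates = [c for c in clips if c not in timeline]
            target = sum(seg)
            # Think/Act: pick the clip whose duration best fits the segment.
            pick = min(candidates, key=lambda c: abs(c - target))
            # Verify: accept if close enough, else retry with wider tolerance.
            if abs(pick - target) <= 0.5 * (attempt + 1):
                timeline.append(pick)
                break
    return timeline

timeline = edit_timeline([1.0, 1.0, 0.5, 0.5], clips=[2.1, 0.9, 3.0])
```

The real system would add the global-local coordination pass over the composed timeline after the inner loop finishes.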

[631] Governance-Aware Agent Telemetry for Closed-Loop Enforcement in Multi-Agent AI Systems

Anshul Pathak, Nishant Jain

Main category: cs.MA

TL;DR: GAAT is a governance-aware telemetry system for multi-agent AI that enables real-time policy enforcement through extended telemetry schema, violation detection, and graduated interventions.

DetailsMotivation: Current observability tools for enterprise multi-agent AI systems only capture dependencies without real-time enforcement, creating an "observe-but-do-not-act" gap where policy violations are detected only after damage occurs.

Method: GAAT introduces four components: 1) Governance Telemetry Schema extending OpenTelemetry with governance attributes, 2) real-time policy violation detection engine with OPA-compatible rules under 200ms latency, 3) Governance Enforcement Bus with graduated interventions, and 4) Trusted Telemetry Plane with cryptographic provenance.

Result: The system closes the loop between telemetry collection and automated policy enforcement for multi-agent systems, enabling real-time governance rather than post-hoc analysis.

Conclusion: GAAT provides a reference architecture for governance-aware agent telemetry that addresses the critical gap in current observability tools by enabling real-time policy enforcement in multi-agent AI systems.

Abstract: Enterprise multi-agent AI systems produce thousands of inter-agent interactions per hour, yet existing observability tools capture these dependencies without enforcing anything. OpenTelemetry and Langfuse collect telemetry but treat governance as a downstream analytics concern, not a real-time enforcement target. The result is an “observe-but-do-not-act” gap where policy violations are detected only after damage is done. We present Governance-Aware Agent Telemetry (GAAT), a reference architecture that closes the loop between telemetry collection and automated policy enforcement for multi-agent systems. GAAT introduces (1) a Governance Telemetry Schema (GTS) extending OpenTelemetry with governance attributes; (2) a real-time policy violation detection engine using OPA-compatible declarative rules under sub-200 ms latency; (3) a Governance Enforcement Bus (GEB) with graduated interventions; and (4) a Trusted Telemetry Plane with cryptographic provenance.
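The graduated-intervention idea above can be illustrated with a minimal rule check. The rule schema and span attribute names below are hypothetical, not GAAT's actual GTS fields or OPA rule syntax; the point is only that a telemetry span is evaluated in-line and mapped to a graded response rather than logged for later analysis.

```python
# Illustrative closed-loop enforcement check (hypothetical schema). Each
# declarative rule names a span attribute, a forbidden value, and a severity;
# the most severe triggered intervention wins.

RULES = [
    ("data_scope", "pii", "block"),    # hard stop on PII access
    ("tool_call", "shell", "warn"),    # graduated step: flag but allow
]

def check_span(span):
    """Return the intervention for one telemetry span: block > warn > allow."""
    actions = [sev for attr, bad, sev in RULES if span.get(attr) == bad]
    if "block" in actions:
        return "block"
    if "warn" in actions:
        return "warn"
    return "allow"

decision = check_span({"agent": "a1", "tool_call": "shell",
                       "data_scope": "public"})
```

In GAAT's terms, this check would run in the detection engine under its latency budget, with the Governance Enforcement Bus carrying out the returned intervention.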

[632] Nash Approximation Gap in Truncated Infinite-horizon Partially Observable Markov Games

Lan Sang, Chinmay Maheshwari

Main category: cs.MA

TL;DR: A framework for approximating infinite-horizon partially observable Markov games using finite-memory truncation, showing that Nash equilibria of the truncated game are ε-Nash equilibria of the original game.

DetailsMotivation: Partially Observable Markov Games (POMGs) are intractable in infinite-horizon settings because belief states and action spaces grow indefinitely with accumulated information over time, making the standard belief-state reformulation computationally infeasible.

Method: Proposes a finite-memory truncation framework that approximates infinite-horizon POMGs by a finite-state, finite-action Markov game where agents condition decisions only on finite windows of common and private information.

Result: Under suitable filter stability (forgetting) conditions, any Nash equilibrium of the truncated game is an ε-Nash equilibrium of the original POMG, with ε → 0 as the truncation length increases.

Conclusion: The finite-memory truncation provides a computationally tractable approximation for infinite-horizon POMGs while maintaining theoretical guarantees on solution quality.

Abstract: Partially Observable Markov Games (POMGs) provide a general framework for modeling multi-agent sequential decision-making under asymmetric information. A common approach is to reformulate a POMG as a fully observable Markov game over belief states, where the state is the conditional distribution of the system state and agents’ private information given common information, and actions correspond to mappings (prescriptions) from private information to actions. However, this reformulation is intractable in infinite-horizon settings, as both the belief state and action spaces grow with the accumulation of information over time. We propose a finite-memory truncation framework that approximates infinite-horizon POMGs by a finite-state, finite-action Markov game, where agents condition decisions only on finite windows of common and private information. Under suitable filter stability (forgetting) conditions, we show that any Nash equilibrium of the truncated game is an $\varepsilon$-Nash equilibrium of the original POMG, where $\varepsilon \to 0$ as the truncation length increases.
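The ε-Nash guarantee stated in the abstract can be written out in standard notation (our notation, not copied from the paper): with $J_i$ the expected payoff of agent $i$ and $\sigma^k$ a Nash equilibrium of the game truncated to memory length $k$, no agent gains more than $\varepsilon(k)$ by unilaterally deviating in the original game.

```latex
% epsilon-Nash condition for the truncated-game equilibrium sigma^k:
J_i(\sigma_i, \sigma^k_{-i}) \;\le\; J_i(\sigma^k_i, \sigma^k_{-i}) + \varepsilon(k)
\quad \text{for all agents } i \text{ and all policies } \sigma_i,
\qquad \varepsilon(k) \to 0 \text{ as } k \to \infty.
```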

[633] DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux

Naibo Wang, Yifan Zhang, Sai Liu, Xinkui Zhao, Guanjie Cheng, Yueshen Xu

Main category: cs.MA

TL;DR: DRAMA: Dynamic multi-agent system with modular architecture for resilient collaboration in changing environments through real-time monitoring and flexible task allocation.

DetailsMotivation: Existing multi-agent systems have static architectures with fixed capabilities and rigid task allocation, limiting adaptability to dynamic real-world environments with frequent changes, uncertainty, and variability.

Method: Proposes DRAMA with modular architecture separating control plane (real-time monitoring, centralized planning) and worker plane (autonomous agents). Agents and tasks abstracted as resource objects with defined lifecycles. Task allocation uses affinity-based, loosely coupled mechanism for flexible reassignment.

Result: Enables continuous and robust task execution through real-time monitoring, flexible task reassignment as agents join/depart/become unavailable, and autonomous agents capable of taking over unfinished tasks.

Conclusion: DRAMA addresses limitations of static MAS frameworks by providing dynamic, robust allocation-based system for resilient collaboration in rapidly changing environments.

Abstract: Multi-agent systems (MAS) have demonstrated significant effectiveness in addressing complex problems through coordinated collaboration among heterogeneous agents. However, real-world environments and task specifications are inherently dynamic, characterized by frequent changes, uncertainty, and variability. Despite this, most existing MAS frameworks rely on static architectures with fixed agent capabilities and rigid task allocation strategies, which greatly limits their adaptability to evolving conditions. This inflexibility poses substantial challenges for sustaining robust and efficient multi-agent cooperation in dynamic and unpredictable scenarios. To address these limitations, we propose DRAMA: a Dynamic and Robust Allocation-based Multi-Agent System designed to facilitate resilient collaboration in rapidly changing environments. DRAMA features a modular architecture with a clear separation between the control plane and the worker plane. Both agents and tasks are abstracted as resource objects with well-defined lifecycles, while task allocation is achieved via an affinity-based, loosely coupled mechanism. The control plane enables real-time monitoring and centralized planning, allowing flexible and efficient task reassignment as agents join, depart, or become unavailable, thereby ensuring continuous and robust task execution. The worker plane comprises a cluster of autonomous agents, each with local reasoning, task execution, the ability to collaborate, and the capability to take over unfinished tasks from other agents when needed.
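The affinity-based, loosely coupled allocation described above can be sketched as a capability-overlap score that the control plane re-evaluates whenever the agent pool changes. The scoring function and data shapes below are assumptions for illustration; the paper does not publish its actual affinity metric.

```python
# Toy sketch of DRAMA-style affinity allocation (hypothetical scoring).
# Tasks go to the live agent with the largest capability overlap; when an
# agent departs, re-running the assignment reassigns its tasks.

def affinity(task_caps, agent_caps):
    # Overlap between required and offered capabilities.
    return len(set(task_caps) & set(agent_caps))

def assign(tasks, agents):
    """Map each task name to the best-matching live agent (None if no match)."""
    out = {}
    for name, caps in tasks.items():
        scored = [(affinity(caps, a_caps), a) for a, a_caps in agents.items()]
        best = max(scored, default=(0, None))
        out[name] = best[1] if best[0] > 0 else None
    return out

tasks = {"t1": ["vision"], "t2": ["planning"]}
agents = {"a1": ["vision"], "a2": ["planning"]}
plan = assign(tasks, agents)
del agents["a1"]                 # a1 departs; control plane re-plans
plan_after = assign(tasks, agents)
```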

[634] Decoupling Geometric Planning and Execution in Scalable Multi-Agent Path Finding

Fernando Salanova, Eduardo Montijano, Cristian Mahulea

Main category: cs.MA

TL;DR: Hybrid prioritized framework for Multi-Agent Path Finding separates geometric planning from execution-time conflict resolution using cost inflation and decentralized local control

DetailsMotivation: Traditional MAPF solvers use time-expanded models and centralized conflict resolution, which limits scalability in large or dense instances. There's a need for more scalable approaches that can handle many agents efficiently.

Method: Two-stage approach: 1) Geometric Conflict Preemption (GCP) plans agents sequentially with A* on original graph while inflating costs for transitions entering vertices used by higher-priority paths; 2) Decentralized Local Controller (DLC) executes geometric paths using per-vertex FIFO authorization queues and inserts wait actions to avoid conflicts.

Result: Experiments on standard benchmark maps with up to 1000 agents show near-linear runtime scaling and 100% success rate on instances satisfying geometric feasibility assumption.

Conclusion: The hybrid prioritized framework provides a scalable solution for MAPF by separating geometric planning from execution-time conflict resolution, enabling efficient handling of large numbers of agents.

Abstract: Multi-Agent Path Finding (MAPF) requires collision-free trajectories for multiple agents on a shared graph, often with the objective of minimizing the sum-of-costs (SOC). Many optimal and bounded-suboptimal solvers rely on time-expanded models and centralized conflict resolution, which limits scalability in large or dense instances. We propose a hybrid prioritized framework that separates \emph{geometric planning} from \emph{execution-time conflict resolution}. In the first stage, \emph{Geometric Conflict Preemption (GCP)} plans agents sequentially with A* on the original graph while inflating costs for transitions entering vertices used by higher-priority paths, encouraging spatial detours without explicit time reasoning. In the second stage, a \emph{Decentralized Local Controller (DLC)} executes the geometric paths using per-vertex FIFO authorization queues and inserts wait actions to avoid vertex and edge-swap conflicts. Experiments on standard benchmark maps with up to 1000 agents show that the method scales with a near-linear runtime trend and attains a 100% success rate on instances satisfying the geometric feasibility assumption. Project page: https://sites.google.com/unizar.es/multi-agent-path-finding/home
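The cost-inflation idea behind GCP can be sketched directly: vertices claimed by higher-priority paths are not forbidden, only made expensive, so the planner takes a spatial detour without any time-expanded search. The sketch below uses plain Dijkstra on a 4-connected grid instead of the paper's A*, and the penalty value is an arbitrary assumption.

```python
# GCP-style cost inflation (simplified: Dijkstra, no heuristic). Entering a
# vertex in `claimed` (used by a higher-priority path) costs 1 + penalty,
# so the cheapest route detours around the claimed corridor.
import heapq

def plan(grid_w, grid_h, start, goal, claimed, penalty=10.0):
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    while pq:
        d, v = heapq.heappop(pq)
        if v == goal:
            break
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        x, y = v
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if not (0 <= nx < grid_w and 0 <= ny < grid_h):
                continue
            # Inflate the step cost when entering a claimed vertex.
            step = 1.0 + (penalty if nxt in claimed else 0.0)
            nd = d + step
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = v
                heapq.heappush(pq, (nd, nxt))
    path, v = [goal], goal
    while v != start:
        v = prev[v]
        path.append(v)
    return path[::-1]

# A higher-priority agent holds (1, 0) on the straight line; ours detours.
route = plan(3, 3, (0, 0), (2, 0), claimed={(1, 0)})
```

With the claimed cell on the direct route, the returned path goes around it via the row above, at cost 4 instead of the inflated cost 12.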

cs.MM

[635] LLM2Manim: Pedagogy-Aware AI Generation of STEM Animations

Aastha Joshi, Hongyi Ke, Meet Gajjar, Aaron Christian, Qi Wang, Jun Chen

Main category: cs.MM

TL;DR: LLM-assisted pipeline converts STEM concepts into narrated animations using Manim, following multimedia learning principles, with evaluation showing better learning outcomes than PowerPoint.

DetailsMotivation: High-quality STEM animations are valuable for learning but difficult to create due to time and skill requirements, limiting their use in daily teaching.

Method: Semi-automated human-in-the-loop pipeline using LLM to convert math/physics concepts into Manim animations with constrained prompts, symbol consistency, error regeneration, and expert review.

Result: Animation-based instruction yielded better post-test scores (83% vs 78%), higher learning gains (d=0.67), higher engagement (d=0.94), lower cognitive load (d=0.41), and faster task completion.

Conclusion: LLM-assisted animation generation can make STEM content creation more accessible and practical for classroom use, potentially increasing adoption of multimedia learning.

Abstract: High-quality STEM animations can be useful for learning, but they are still not common in daily teaching, mostly because they take time and special skills to make. In this paper, we present a semi-automated, human-in-the-loop (HITL) pipeline that uses a large language model (LLM) to help convert math and physics concepts into narrated animations with the Python library Manim. The pipeline also tries to follow multimedia learning ideas like segmentation, signaling, and dual coding, so the narration and the visuals are more aligned. To keep the outputs stable, we use constrained prompt templates, a symbol ledger to keep symbols consistent, and we regenerate only the parts that have errors. We also include expert review before the final rendering, because sometimes the generated code or explanation is not fully correct. We tested the approach with 100 undergraduate students in a within-subject A-B study. Each student learned two similar STEM topics, one with the LLM-generated animations and one with PowerPoint slides. In general, the animation-based instruction gives slightly better post-test scores (83% vs. 78%, p < .001), and students show higher learning gains (d=0.67). They also report higher engagement (d=0.94) and lower cognitive load (d=0.41). Students finished the tasks faster, and many of them said they prefer the animated format. Overall, these results suggest LLM-assisted animation can make STEM content creation easier, and it may be a practical option for more classrooms.
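The effect sizes quoted above (d = 0.67, 0.94, 0.41) are Cohen's d values. For readers unfamiliar with the measure, the pooled-standard-deviation form is easy to compute; the numbers below are toy samples, not the study's data.

```python
# Cohen's d with a pooled standard deviation: the standardized difference
# between two group means. Toy inputs for illustration only.
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

d = cohens_d([85, 80, 84, 83], [78, 76, 80, 74])
```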

[636] DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems

Qi Guo, Zheming Yang, Yunqing Hu, Chang Zhao, Wen Ji

Main category: cs.MM

TL;DR: DAT: A system for efficient multimodal event detection in edge-cloud environments using small-large model cascades and adaptive transmission optimization.

DetailsMotivation: Current MLLMs have high computation/communication overhead for continuous video streams in bandwidth-constrained edge-cloud systems, hindering low-latency alerting and effective visual evidence delivery.

Method: 1) Collaborative small-large model cascade: lightweight edge-side model filters non-target frames and performs object detection, triggering MLLM only for suspicious frames. 2) Efficient fine-tuning with visual guidance and semantic prompting. 3) Semantics and bandwidth-aware multi-stream adaptive transmission optimization.

Result: Achieves 98.83% recognition accuracy and 100% output consistency. Reduces weighted semantic alert delay by up to 77.5% under severe congestion, delivers 98.33% of visual evidence within 0.5s.

Conclusion: DAT demonstrates effectiveness of jointly optimizing cascade inference and elastic transmission for efficient multimodal event detection in edge-cloud systems.

Abstract: Multimodal large language models (MLLMs) have shown strong capability in semantic understanding and visual reasoning, yet their use on continuous video streams in bandwidth-constrained edge-cloud systems incurs prohibitive computation and communication overhead and hinders low-latency alerting and effective visual evidence delivery. To address this challenge, we propose DAT to achieve high-quality semantic generation, low-latency event alerting, and effective visual evidence supplementation. To reduce unnecessary deep reasoning costs, we propose a collaborative small-large model cascade. A lightweight edge-side small model acts as a gating module to filter non-target-event frames and perform object detection, triggering MLLM inference only for suspicious frames. Building on this, we introduce an efficient fine-tuning strategy with visual guidance and semantic prompting, which improves structured event understanding, object detection, and output consistency. To ensure low-latency semantic alerting and effective visual evidence supplementation under bandwidth constraints, we further devise a semantics and bandwidth-aware multi-stream adaptive transmission optimization method. Experimental results show that DAT achieves 98.83% recognition accuracy and 100% output consistency. Under severe congestion, it reduces weighted semantic alert delay by up to 77.5% and delivers 98.33% of visual evidence within 0.5 s, demonstrating the effectiveness of jointly optimizing cascade inference and elastic transmission.
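The small-large cascade gating described in the Method summary reduces to a cheap per-frame score that decides whether the expensive MLLM runs at all. The scoring stub and threshold below are illustrative assumptions; DAT's actual edge detector and trigger policy are not reproduced here.

```python
# Toy cascade gate: a lightweight edge-side score filters frames so only
# suspicious ones pay for deep MLLM inference (here just recorded as a tag).

def edge_score(frame):
    # Stand-in for the edge-side small model: fraction of "active" pixels.
    return sum(frame) / len(frame)

def cascade(frames, threshold=0.5):
    alerts = []
    for i, frame in enumerate(frames):
        if edge_score(frame) < threshold:
            continue                          # non-target frame: skip MLLM
        alerts.append((i, "mllm_inference"))  # would invoke the MLLM here
    return alerts

hits = cascade([[0, 0, 1], [1, 1, 1], [0, 1, 1]])
```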

[637] Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

Chen Su, Yuanhe Tian, Yan Song

Main category: cs.MM

TL;DR: PRISM is a multimodal sentiment analysis framework that organizes sentiment evidence in a shared prototype space and uses dynamic modality reweighting during reasoning to better handle complementary, conflicting, or varying-reliability cues.

Motivation: Existing multimodal sentiment analysis methods compress diverse sentiment cues into single representations early, losing internal structure where cues may complement, conflict, or differ in reliability. Modality importance is typically fixed during fusion, preventing later adjustments.

Method: PRISM organizes multimodal evidence in a shared prototype space for structured cross-modal comparison and adaptive fusion. It applies dynamic modality reweighting during reasoning, allowing continuous refinement of modality contributions as semantic interactions deepen.

Result: Experiments on three benchmark datasets show that PRISM outperforms representative baselines in multimodal sentiment analysis.

Conclusion: PRISM effectively addresses limitations of early aggregation and fixed modality weighting by providing structured affective extraction and adaptive modality evaluation through prototype space organization and dynamic reweighting.

Abstract: Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse sentiment cues into a single compact representation before sentiment reasoning. This early aggregation makes it difficult to preserve the internal structure of sentiment evidence, where different cues may complement, conflict with, or differ in reliability from each other. In addition, modality importance is often determined only once during fusion, so later reasoning cannot further adjust modality contributions. To address these issues, we propose PRISM, a framework that unifies structured affective extraction and adaptive modality evaluation. PRISM organizes multimodal evidence in a shared prototype space, which supports structured cross-modal comparison and adaptive fusion. It further applies dynamic modality reweighting during reasoning, allowing modality contributions to be continuously refined as semantic interactions become deeper. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
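The prototype-space comparison and dynamic reweighting can be illustrated with a toy sketch. The cosine scoring, softmax weighting, and single-prototype setup are illustrative assumptions, not PRISM's actual architecture:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_fusion(modalities, prototypes):
    """Score each modality by its best prototype match, then fuse with softmax weights."""
    scores = [max(cosine(m, p) for p in prototypes) for m in modalities]
    weights = softmax(scores)
    dim = len(modalities[0])
    fused = [sum(w * m[i] for w, m in zip(weights, modalities)) for i in range(dim)]
    return fused, weights

# Toy 2-d features: text and audio align with the "positive" prototype, video does not.
text, audio, video = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
prototypes = [[1.0, 0.0]]
fused, weights = reweighted_fusion([text, audio, video], prototypes)
```

The point of the sketch: modality weights are recomputed from prototype similarity at fusion time, so an unreliable modality (here, video) is down-weighted rather than fixed in advance.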

eess.AS

[638] Exploring Speech Foundation Models for Speaker Diarization Across Lifespan

Anfeng Xu, Tiantian Feng, Shrikanth Narayanan

Main category: eess.AS

TL;DR: Cross-lifespan evaluation of speech foundation models for speaker diarization shows performance degradation when adult-trained models are applied to child/older-adult speech, with joint multi-age training improving robustness and targeted adaptation yielding further gains.

Motivation: Speech foundation models show strong transferability but their robustness to age-related domain shifts in speaker diarization remains underexplored, particularly for conversations involving children, adults, and older adults.

Method: Cross-lifespan evaluation within unified end-to-end neural diarization framework (EEND-VC), comparing models under zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation using Whisper encoder.

Result: Substantial performance degradation when adult-trained models applied to child/older-adult data; joint multi-age training improves robustness without reducing adult performance; targeted age group adaptation yields further gains, especially with Whisper encoder.

Conclusion: Age-related domain shifts significantly impact speech foundation model performance in diarization, but joint training and targeted adaptation strategies can mitigate these issues and improve robustness across lifespan.

Abstract: Speech foundation models have shown strong transferability across a wide range of speech applications. However, their robustness to age-related domain shift in speaker diarization remains underexplored. In this work, we present a cross-lifespan evaluation within a unified end-to-end neural diarization framework (EEND-VC), covering speech samples from conversations involving children, adults, and older adults. We compare models under zero-shot cross-age inference, joint multi-age training, and domain-specific adaptation. Results show substantial performance degradation when models trained on adult-specific speech are applied to child and older-adult conversational data. Moreover, joint multi-age training across different age groups improves robustness without reducing diarization performance in canonical adult conversations, while targeted age group adaptation yields further gains in diarization performance, particularly when using the Whisper encoder.

[639] Active noise cancellation on open-ear smart glasses

Kuang Yuan, Freddy Yifei Liu, Tong Xiao, Yiwen Song, Chengyi Shen, Saksham Bhutani, Justin Chan, Swarun Kumar

Main category: eess.AS

TL;DR: First real-time active noise cancellation system for open-ear smart glasses using microphone array and computational pipeline to estimate and cancel environmental noise.

Motivation: Smart glasses with open-ear speakers struggle in noisy environments because their open-ear design is incompatible with conventional ANC techniques, which require a sealed ear canal.

Method: Developed a low-latency computational pipeline using an array of eight microphones distributed around the glasses frame to estimate noise at the ear, then generates anti-noise signals in real-time using miniaturized open-ear speakers. Created a custom glasses prototype and evaluated across 8 environments under mobility.

Result: Achieved mean noise reduction of 9.6 dB without calibration and 11.2 dB with brief user-specific calibration in the 100-1000 Hz frequency range where environmental noise is concentrated.

Conclusion: Demonstrated the first real-time ANC system for open-ear smart glasses, showing significant noise reduction in challenging mobile environments without requiring sealed ear canals.

Abstract: Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real-time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100–1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
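The paper's eight-microphone pipeline is more sophisticated, but the classical building block it generalizes is adaptively filtering a reference signal to predict and subtract noise. A plain LMS filter sketches the idea (all parameters below are illustrative, not the paper's design):

```python
import math

def lms_cancel(reference, noisy, taps=4, mu=0.05):
    """Adapt an FIR filter so the filtered reference cancels the noise; return residual."""
    w = [0.0] * taps
    residual = []
    for n in range(len(noisy)):
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        anti = sum(wk * xk for wk, xk in zip(w, x))  # predicted noise at the ear
        e = noisy[n] - anti                          # what the listener would hear
        w = [wk + mu * e * xk for wk, xk in zip(w, x)]
        residual.append(e)
    return residual

# A pure tone picked up both by the frame microphones and at the ear.
noise = [math.sin(0.3 * n) for n in range(2000)]
residual = lms_cancel(noise, noise)
```

After the filter converges, the residual (the sound reaching the ear) is far smaller than the incoming noise; the hard part the paper tackles is doing this without an error microphone inside the ear canal.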

[640] Multimodal Deep Learning Method for Real-Time Spatial Room Impulse Response Computing

Zhiyu Li, Xinwen Yue, Shenghui Zhao, Jing Wang

Main category: eess.AS

TL;DR: A multimodal deep learning model for VR auralization that generates spatial room impulse responses in real time using scene information and low-order reflection waveforms as inputs.

Motivation: To enable real-time, scene-specific auditory perception in VR by generating spatial room impulse responses that can be integrated with personalized head-related transfer functions, addressing the computational complexity of traditional methods.

Method: Uses a multimodal approach combining scene information (geometry, acoustic properties, source/listener coordinates) with low-order reflection waveforms computed via geometrical acoustics. These inputs are fed into a deep learning model to generate spatial room impulse responses.

Result: The proposed model demonstrates superior performance, with a new diverse dataset constructed for training and evaluation. The approach reduces computational complexity while maintaining accuracy.

Conclusion: The multimodal deep learning model effectively generates spatial room impulse responses for VR auralization in real time, offering a practical solution for scene-specific auditory reconstruction in virtual environments.

Abstract: We propose a multimodal deep learning model for VR auralization that generates spatial room impulse responses (SRIRs) in real time to reconstruct scene-specific auditory perception. Employing SRIRs as the output reduces computational complexity and facilitates integration with personalized head-related transfer functions. The model takes two modalities as input: scene information and waveforms, where the waveform corresponds to the low-order reflections (LoR). LoR can be efficiently computed using geometrical acoustics (GA) but remains difficult for deep learning models to predict accurately. Scene geometry, acoustic properties, source coordinates, and listener coordinates are first used to compute LoR in real time via GA, and both LoR and these features are subsequently provided as inputs to the model. A new dataset was constructed, consisting of multiple scenes and their corresponding SRIRs. The dataset exhibits greater diversity. Experimental results demonstrate the superior performance of the proposed model.
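The low-order reflections (LoR) the model consumes come from geometrical acoustics. A minimal image-source sketch for first-order reflections in a shoebox room looks like this (the room, positions, and shoebox assumption are mine, not the paper's setup):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s

def first_order_arrivals(src, lst, room):
    """Direct path plus the six first-order wall reflections of a shoebox room.

    Returns (delay_seconds, distance_metres) pairs, earliest arrival first.
    """
    images = [list(src)]
    for axis, size in enumerate(room):
        near = list(src); near[axis] = -src[axis]           # mirror across wall at 0
        far = list(src); far[axis] = 2 * size - src[axis]   # mirror across far wall
        images += [near, far]
    arrivals = [(math.dist(img, lst) / SPEED_OF_SOUND, math.dist(img, lst))
                for img in images]
    return sorted(arrivals)

room = (5.0, 4.0, 3.0)  # metres
src, lst = (1.0, 1.0, 1.5), (4.0, 3.0, 1.5)
arrivals = first_order_arrivals(src, lst, room)
```

Each mirrored source contributes a delayed, attenuated copy of the signal; these cheap-to-compute early arrivals are exactly what the paper feeds to the network, which then predicts the harder late reverberation.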

[641] ML-ARIS: Multilayer Underwater Acoustic Reconfigurable Intelligent Surface with High-Resolution Reflection Control

Lina Pu, Yu Luo, Aijun Song

Main category: eess.AS

TL;DR: ML-ARIS is a multilayered acoustic reconfigurable intelligent surface for underwater communications using piezoelectric layers with adjustable impedance for precise beam steering and synthetic reflection.

Motivation: To enhance underwater communication systems by developing a more flexible and precise acoustic beam steering technology that can improve signal directionality while reducing environmental interference.

Method: Proposes a multilayered architecture with piezoelectric material layers where each layer’s load impedance is independently adjustable via control circuits, enabling generation of reflected signals with desired amplitudes and orthogonal phases for synthetic reflection.

Result: Simulations and tank experiments confirm the feasibility of ML-ARIS, demonstrating that synthetic reflection with multilayer structures is practical and enables single reflection units to generate high-resolution amplitude and phase reflected waves.

Conclusion: ML-ARIS architecture provides a practical solution for precise acoustic beam steering in underwater communications, offering improved flexibility and performance over traditional approaches.

Abstract: This article introduces a multilayered acoustic reconfigurable intelligent surface (ML-ARIS) architecture designed for the next generation of underwater communications. ML-ARIS incorporates multiple layers of piezoelectric material in each acoustic reflector, with the load impedance of each layer independently adjustable via a control circuit. This design increases the flexibility in generating reflected signals with desired amplitudes and orthogonal phases, enabling passive synthetic reflection using a single acoustic reflector. Such a feature enables precise beam steering, enhancing sound levels in targeted directions while minimizing interference in surrounding environments. Extensive simulations and tank experiments were conducted to verify the feasibility of ML-ARIS. The experimental results indicate that implementing synthetic reflection with a multilayer structure is indeed practical in real-world scenarios, making it possible to use a single reflection unit to generate reflected waves with high-resolution amplitudes and phases.
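The "orthogonal phases" trick rests on the identity a*cos(wt) + b*sin(wt) = R*cos(wt - phi) with R = hypot(a, b) and phi = atan2(b, a): two layer responses in quadrature can synthesize any reflection amplitude and phase. A toy numeric check of the math only (not of the piezo control circuit):

```python
import math

def synthesize(a, b):
    """Amplitude and phase of a*cos(wt) + b*sin(wt) viewed as one reflection."""
    return math.hypot(a, b), math.atan2(b, a)

def compare(a, b, w=2.0, t=0.37):
    """Evaluate both sides of the identity at one arbitrary time instant."""
    combined = a * math.cos(w * t) + b * math.sin(w * t)
    amp, phase = synthesize(a, b)
    return combined, amp * math.cos(w * t - phase)

lhs, rhs = compare(0.6, 0.8)
amp, phase = synthesize(0.6, 0.8)
```

Independently adjusting the two quadrature amplitudes (via the layers' load impedances) therefore gives continuous control over both the reflected amplitude and phase from a single reflector.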

[642] Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S

Ranjith M. S., Akshat Mandloi, Sudarshan Kamath

Main category: eess.AS

TL;DR: Lightning V2 is a production-grade TTS model optimized for Tenstorrent hardware that achieves high computational fidelity with aggressive precision reduction (BFP8, LoFi) while maintaining audio quality, enabling 4x lower accelerator costs compared to NVIDIA L40S.

Motivation: TTS models are more numerically fragile than LLMs due to continuous waveform generation and perceptual sensitivity to small perturbations. While aggressive precision reduction techniques work for language models, applying them to TTS systems often causes audible artifacts, phase instability, and spectral distortion.

Method: Precision-aware architectural design and hardware-software co-optimization for Tenstorrent hardware, leveraging Network-on-Chip (NoC), distributed SRAM, and deterministic execution model to reduce memory movement and redundant weight fetches for efficient low-precision inference.

Result: Achieved over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable degradation in audio quality. Compared to NVIDIA L40S baseline, achieves approximately 4x lower on-prem accelerator cost at equivalent throughput while maintaining production audio fidelity.

Conclusion: Precision co-design combined with hardware-aware optimization can fundamentally reshape the economics of real-time speech inference, demonstrating that TTS models can achieve efficient low-precision deployment without sacrificing audio quality.

Abstract: Text-to-Speech (TTS) models are significantly more numerically fragile than Large Language Models (LLMs) due to their continuous waveform generation and perceptual sensitivity to small numerical perturbations. While aggressive precision reduction techniques such as BlockFloat8 (BFP8) and low-fidelity (LoFi) compute have been widely adopted in language models, applying similar strategies to TTS systems often results in audible artifacts, phase instability, and spectral distortion. In this work, we present Lightning V2, a production-grade TTS model co-optimized for Tenstorrent hardware. Through precision-aware architectural design and hardware-software co-optimization, we achieve over 95% LoFi computational fidelity and more than 80% BlockFloat8 deployment without measurable degradation in audio quality. Leveraging Tenstorrent’s Network-on-Chip (NoC), distributed SRAM, and deterministic execution model, we reduce memory movement and redundant weight fetches, enabling efficient low-precision inference. Compared to an NVIDIA L40S baseline, Lightning V2 achieves approximately 4x lower on-prem accelerator cost at equivalent throughput, while maintaining production audio fidelity. Our results demonstrate that precision co-design, combined with hardware-aware optimization, can fundamentally reshape the economics of real-time speech inference.
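Block floating point keeps one shared exponent per block of values with narrow per-value mantissas. A generic sketch of the idea (this is not Tenstorrent's exact BFP8 layout, which may differ in block size and encoding):

```python
import math

def bfp_quantize(block, mantissa_bits=7):
    """Quantize a block to integer mantissas sharing one power-of-two scale."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0] * len(block), 0.0
    shared_exp = math.ceil(math.log2(peak))            # one exponent per block
    scale = 2.0 ** shared_exp / (2 ** mantissa_bits)   # size of one mantissa step
    return [round(x / scale) for x in block], scale

def bfp_dequantize(mantissas, scale):
    return [m * scale for m in mantissas]

block = [0.11, -0.52, 0.93, 0.004]
mantissas, scale = bfp_quantize(block)
recon = bfp_dequantize(mantissas, scale)
```

The quantization error is bounded by half a mantissa step of the block's largest value, which is why small values sharing a block with large ones (common in waveform-generating layers) are where TTS fidelity is most at risk.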

eess.IV

[643] CI-ICM: Channel Importance-driven Learned Image Coding for Machines

Yun Zhang, Junle Liu, Huan Zhang, Zhaoqing Pan, Gangyi Jiang, Weisi Lin

Main category: eess.IV

TL;DR: CI-ICM: Channel Importance-driven learned Image Coding for Machines that optimizes image compression for machine vision tasks by quantifying channel importance, grouping channels non-uniformly, and adapting to multiple downstream tasks.

Motivation: Traditional human vision-centric image compression methods are suboptimal for machine vision tasks due to different visual properties and feature characteristics. There's a need for compression methods specifically designed to maximize machine vision performance under bitrate constraints.

Method: 1) Channel Importance Generation (CIG) module quantifies channel importance with channel order loss; 2) Feature Channel Grouping and Scaling (FCGS) module non-uniformly groups channels based on importance and adjusts dynamic range; 3) Channel Importance-based Context (CI-CTX) module allocates bits among feature groups; 4) Task-Specific Channel Adaptation (TSCA) module adaptively enhances features for multiple downstream tasks.

Result: On COCO2017 dataset, CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over baseline codec. Ablation studies validate each component’s effectiveness, and complexity analysis shows practicability.

Conclusion: The work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception by prioritizing important channels for downstream vision tasks.

Abstract: Traditional human vision-centric image compression methods are suboptimal for machine vision centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI-ICM. This work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.
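The non-uniform grouping step can be sketched as follows. The importance scores and group sizes below are illustrative choices, not the paper's FCGS configuration:

```python
def group_channels(importance, group_sizes=(2, 4, 10)):
    """Rank channels by importance and split them into non-uniform groups,
    with the most important (and, in the codec, highest-fidelity) group first."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    groups, start = [], 0
    for size in group_sizes:
        groups.append(order[start:start + size])
        start += size
    return groups

importance = [0.9, 0.1, 0.7, 0.05, 0.8, 0.3, 0.2, 0.6,
              0.15, 0.4, 0.25, 0.35, 0.02, 0.5, 0.12, 0.08]
groups = group_channels(importance)  # groups[0] holds the two top-ranked channels
```

A small leading group of critical channels can then be coded with higher fidelity, while the large tail group absorbs the bitrate savings.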

[644] Image-Based Metrics in Ultrasound for Estimation of Global Speed-of-Sound

Roman Denkin, Orcun Goksel

Main category: eess.IV

TL;DR: Proposes using conventional image analysis metrics to estimate tissue speed-of-sound from ultrasound images, testing 11 metrics across image quality, similarity, and variation categories for computational efficiency.

Motivation: Current ultrasound systems rely on assumed speed-of-sound values for imaging, which can lead to inaccuracies. Existing physics- and model-based SoS estimation methods require raw channel data and are computationally intensive. The authors seek a simpler, more accessible approach using conventional image metrics that can operate on post-beamformed/B-mode data.

Method: Study 11 metrics in three categories: image quality metrics (Focus, Tenengrad variation), image similarity metrics (mutual information, correlation), and multi-frame variation metrics. Test these metrics in numerical simulations, phantom experiments, and in vivo scenarios. Evaluate performance on single frames versus compounded frames, and test robustness on small image patches for focal estimation.

Result: Single-frame image quality metrics achieved 5-8 m/s accuracy on phantoms only when applied after compounding multiple frames. Differential image comparison metrics performed better with errors consistently under 8 m/s even on single frame pairs. Mutual information and correlation metrics were robust for small image patches, suitable for focal estimation. Demonstrated clinical applicability through breast density classification based on SoS.

Conclusion: Image-based methods offer a computationally efficient and data-accessible alternative to existing physics- and model-based approaches for SoS estimation. These methods don’t require raw channel data and can operate on post-beamformed/B-mode data, making them practical for clinical applications.

Abstract: Accurate speed-of-sound (SoS) estimation is crucial for ultrasound image formation, yet conventional systems often rely on an assumed value for imaging. We propose to leverage conventional image analysis techniques and metrics as a novel and simple approach to estimate tissue SoS. We study eleven metrics in three categories for assessing image quality, image similarity and multi-frame variation, by testing them in numerical simulations and phantom experiments, as well as testing in an in vivo scenario. Among single-frame image quality metrics, conventional Focus and a proposed metric variation on Tenengrad present satisfactory accuracy (5-8 m/s on phantoms), but only when the metrics are applied after compounding multiple frames. Differential image comparison metrics were more successful overall with errors consistently under 8 m/s even applied on a single pair of frames. Mutual information and correlation metrics were found to be robust in processing relatively small image patches, making them suitable for focal estimation. We present an in vivo study on breast density classification based on SoS, to showcase clinical applicability. The studied metrics do not require access to raw channel data as they can operate on post-beamformed and/or B-mode data. These image-based methods offer a computationally efficient and data-accessible alternative to existing physics- and model-based approaches for SoS estimation.
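The Tenengrad family of focus metrics scores an image by its Sobel gradient energy; the premise is that the reconstruction beamformed with the correct speed-of-sound looks sharpest. A minimal sketch on toy images (2D lists stand in for B-mode frames):

```python
def tenengrad(img):
    """Sum of squared Sobel gradient magnitudes over interior pixels."""
    h, w = len(img), len(img[0])
    score = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            score += gx * gx + gy * gy
    return score

sharp = [[0, 0, 1, 1] for _ in range(4)]         # hard vertical edge
blurry = [[0, 0.33, 0.66, 1] for _ in range(4)]  # same edge, smeared
```

An SoS search then reduces to beamforming candidate images over a range of assumed speeds and picking the speed whose image maximizes the metric.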

[645] Ultrasound-based detection and malignancy prediction of breast lesions eligible for biopsy: A multi-center clinical-scenario study using nomograms, large language models, and radiologist evaluation

Ali Abbasian Ardakani, Afshin Mohammadi, Taha Yusuf Kuzan, Beyza Nur Kuzan, Hamid Khorshidi, Ashkan Ghorbani, Alisa Mohebbi, Fariborz Faeghi, Sepideh Hatamikia, U Rajendra Acharya

Main category: eess.IV

TL;DR: Integrated ultrasound nomogram combining BIRADS features and quantitative morphometric characteristics outperforms radiologists and ChatGPT models for breast lesion biopsy recommendation and malignancy prediction.

Motivation: To develop and validate integrated ultrasound nomograms that combine BIRADS features with quantitative morphometric characteristics to improve breast lesion assessment, reduce unnecessary biopsies, and enhance personalized decision-making in breast imaging.

Method: Retrospective multicenter study of 1747 women with pathologically confirmed breast lesions across three centers in Iran and Turkey. Extracted 10 BIRADS and 26 morphological features from each lesion. Constructed three nomograms via logistic regression: BIRADS-only, morphometric-only, and fused (both feature sets). Compared performance with three radiologists and two ChatGPT variants interpreting deidentified images.

Result: Fused nomogram achieved highest accuracy for biopsy recommendation (83.0%) and malignancy prediction (83.8%), outperforming morphometric nomogram, three radiologists, and both ChatGPT models. AUCs were 0.901 and 0.853 for the two tasks respectively. BIRADS nomogram also outperformed morphometric nomogram, radiologists, and ChatGPT models. External validation confirmed robust generalizability across different ultrasound platforms and populations.

Conclusion: Integrated BIRADS-morphometric nomogram consistently outperforms standalone models, large language models, and radiologists in guiding biopsy decisions and predicting malignancy. These interpretable, externally validated tools have potential to reduce unnecessary biopsies and enhance personalized decision-making in breast imaging.

Abstract: To develop and externally validate integrated ultrasound nomograms combining BIRADS features and quantitative morphometric characteristics, and to compare their performance with expert radiologists and state of the art large language models in biopsy recommendation and malignancy prediction for breast lesions. In this retrospective multicenter, multinational study, 1747 women with pathologically confirmed breast lesions underwent ultrasound across three centers in Iran and Turkey. A total of 10 BIRADS and 26 morphological features were extracted from each lesion. A BIRADS, morphometric, and fused nomogram integrating both feature sets was constructed via logistic regression. Three radiologists (one senior, two general) and two ChatGPT variants independently interpreted deidentified breast lesion images. Diagnostic performance for biopsy recommendation (BIRADS 4,5) and malignancy prediction was assessed in internal and two external validation cohorts. In pooled analysis, the fused nomogram achieved the highest accuracy for biopsy recommendation (83.0%) and malignancy prediction (83.8%), outperforming the morphometric nomogram, three radiologists and both ChatGPT models. Its AUCs were 0.901 and 0.853 for the two tasks, respectively. In addition, the performance of the BIRADS nomogram was significantly higher than the morphometric nomogram, three radiologists and both ChatGPT models for biopsy recommendation and malignancy prediction. External validation confirmed the robust generalizability across different ultrasound platforms and populations. An integrated BIRADS morphometric nomogram consistently outperforms standalone models, LLMs, and radiologists in guiding biopsy decisions and predicting malignancy. These interpretable, externally validated tools have the potential to reduce unnecessary biopsies and enhance personalized decision making in breast imaging.
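A nomogram is essentially a graphical logistic-regression calculator: each feature contributes points proportional to its fitted coefficient, and the point total maps through a sigmoid to a risk estimate. A sketch with made-up feature names and coefficients (not the paper's fitted model):

```python
import math

def malignancy_risk(features, coefs, intercept):
    """Logistic model underlying a nomogram: weighted feature points -> sigmoid risk."""
    z = intercept + sum(c * f for c, f in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: [irregular_margin, microlobulation, aspect_ratio]
coefs, intercept = [1.8, 1.2, 0.9], -2.5
low_risk = malignancy_risk([0, 0, 0.4], coefs, intercept)
high_risk = malignancy_risk([1, 1, 1.6], coefs, intercept)
```

The fused nomogram in the paper works the same way, just over the combined BIRADS and morphometric feature sets.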

[646] SatFusion: A Unified Framework for Enhancing Remote Sensing Images via Multi-Frame and Multi-Source Images Fusion

Yufei Tong, Guanjie Cheng, Peihan Wu, Feiyi Chen, Xinkui Zhao, Shuiguang Deng

Main category: eess.IV

TL;DR: SatFusion is a unified framework that bridges multi-frame super-resolution and pansharpening for remote sensing images by combining complementary information from multiple low-resolution multispectral frames with high-resolution panchromatic structural details.

Motivation: Current remote sensing image enhancement methods like Multi-Frame Super-Resolution (MFSR) and Pansharpening are studied in isolation. MFSR lacks high-resolution structural priors for fine texture recovery, while Pansharpening relies on upsampled low-resolution inputs and is sensitive to noise and misalignment. There's a need for a unified approach that synergistically combines both paradigms.

Method: SatFusion uses two main modules: 1) Multi-Frame Image Fusion (MFIF) extracts HR semantic features by aggregating complementary information from multiple LR multispectral frames, and 2) Multi-Source Image Fusion (MSIF) integrates fine-grained structural details from an HR panchromatic image with implicit pixel-level alignment. An advanced variant SatFusion* adds panchromatic-guided mechanism to MFIF with structure-aware feature embedding and transformer-based adaptive aggregation for spatially adaptive feature selection.

Result: Extensive experiments on four benchmark datasets show that SatFusion effectively resolves the fragility of existing paradigms, delivering superior reconstruction fidelity, robustness, and generalizability by synergistically coupling multi-frame and multi-source priors.

Conclusion: The proposed SatFusion framework successfully bridges multi-frame and multi-source remote sensing image fusion, demonstrating that coupling these complementary priors leads to better image reconstruction than isolated approaches.

Abstract: High-quality remote sensing (RS) image acquisition is fundamentally constrained by physical limitations. While Multi-Frame Super-Resolution (MFSR) and Pansharpening address this by exploiting complementary information, they are typically studied in isolation: MFSR lacks high-resolution (HR) structural priors for fine-grained texture recovery, whereas Pansharpening relies on upsampled low-resolution (LR) inputs and is sensitive to noise and misalignment. In this paper, we propose SatFusion, a novel and unified framework that seamlessly bridges multi-frame and multi-source RS image fusion. SatFusion extracts HR semantic features by aggregating complementary information from multiple LR multispectral frames via a Multi-Frame Image Fusion (MFIF) module, and integrates fine-grained structural details from an HR panchromatic image through a Multi-Source Image Fusion (MSIF) module with implicit pixel-level alignment. To further alleviate the lack of structural priors during multi-frame fusion, we introduce an advanced variant, SatFusion*, which integrates a panchromatic-guided mechanism into the MFIF stage. Through structure-aware feature embedding and transformer-based adaptive aggregation, SatFusion* enables spatially adaptive feature selection, strengthening the coupling between multi-frame and multi-source representations. Extensive experiments on four benchmark datasets validate our core insight: synergistically coupling multi-frame and multi-source priors effectively resolves the fragility of existing paradigms, delivering superior reconstruction fidelity, robustness, and generalizability.

[647] In search of truth: Evaluating concordance of AI-based anatomy segmentation models

Lena Giebeler, Deepa Krishnaswamy, David Clunie, Jakob Wasserthal, Lalith Kumar Shiyam Sundar, Andres Diaz-Pinto, Klaus H. Maier-Hein, Murong Xu, Bjoern Menze, Steve Pieper, Ron Kikinis, Andrey Fedorov

Main category: eess.IV

TL;DR: A framework for comparing AI-based anatomy segmentation models without ground truth using standardized representations and visualization tools.

Motivation: The growing number of similar anatomy segmentation models creates challenges for evaluation when ground truth annotations are unavailable, requiring practical tools for model comparison and selection.

Method: Harmonize segmentation results into standard interoperable representations, extend 3D Slicer for loading/comparison, use OHIF Viewer for visualization, and apply to evaluate 6 models on 31 anatomical structures from CT scans.

Result: Framework enables automated loading, structure-wise inspection, and model comparison; shows excellent agreement for some structures (lungs) but not others (vertebrae, ribs); allows quick detection of problematic results.

Conclusion: Provides practical tools for model evaluation without ground truth, enabling informed model selection; resources available online including harmonization scripts and visualization tools.

Abstract: Purpose AI-based methods for anatomy segmentation can help automate characterization of large imaging datasets. The growing number of similar in functionality models raises the challenge of evaluating them on datasets that do not contain ground truth annotations. We introduce a practical framework to assist in this task. Approach We harmonize the segmentation results into a standard, interoperable representation, which enables consistent, terminology-based labeling of the structures. We extend 3D Slicer to streamline loading and comparison of these harmonized segmentations, and demonstrate how standard representation simplifies review of the results using interactive summary plots and browser-based visualization using OHIF Viewer. To demonstrate the utility of the approach we apply it to evaluating segmentation of 31 anatomical structures (lungs, vertebrae, ribs, and heart) by six open-source models - TotalSegmentator 1.5 and 2.6, Auto3DSeg, MOOSE, MultiTalent, and CADS - for a sample of Computed Tomography (CT) scans from the publicly available National Lung Screening Trial (NLST) dataset. Results We demonstrate the utility of the framework in enabling automating loading, structure-wise inspection and comparison across models. Preliminary results ascertain practical utility of the approach in allowing quick detection and review of problematic results. The comparison shows excellent agreement segmenting some (e.g., lung) but not all structures (e.g., some models produce invalid vertebrae or rib segmentations). Conclusions The resources developed are linked from https://imagingdatacommons.github.io/segmentation-comparison/ including segmentation harmonization scripts, summary plots, and visualization tools. This work assists in model evaluation in absence of ground truth, ultimately enabling informed model selection.
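Without ground truth, concordance reduces to pairwise overlap between models' masks for the same structure. A minimal Dice sketch over voxel-index sets (the set representation is an illustrative simplification of the harmonized segmentations):

```python
def dice(a, b):
    """Dice overlap between two voxel-index sets (1.0 = identical masks)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Two models' masks for the same structure, as sets of (row, col) voxels.
model_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
model_b = {(0, 1), (1, 0), (1, 1), (2, 1)}
agreement = dice(model_a, model_b)
```

Structures where all model pairs agree (as the paper reports for lungs) score near 1.0, while low pairwise scores flag structures, such as vertebrae or ribs, that warrant visual review.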

Last updated: 2026-05-04
Built with Hugo, theme modified from Stack