Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
Xiangyu Zhang, Benjamin John Southwell, Siqi Pan, Xinlei Niu, Beena Ahmed, Julien Epps
Main category: eess.AS
TL;DR: Video-enhanced audio tokenization that integrates visual information while preserving audio reconstruction quality for multimodal understanding tasks.
Details
Motivation: Existing audio tokenizers have limitations in understanding tasks due to single-modality constraints, especially when audio signals are ambiguous. While adding visual information helps understanding, current multimodal fusion approaches degrade audio reconstruction quality, which is unacceptable for end-to-end audio systems requiring high-fidelity generation.
Method: Proposes Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization. Key findings: 1) Fusion location in the tokenizer architecture is crucial for preserving reconstruction quality, 2) Contrastive learning is unsuitable for discrete tokenizers, 3) Temporal-axis fusion guided by distinctive features works best. The method integrates visual information into the audio tokenizer while maintaining reconstruction fidelity.
Result: The approach maintains high-fidelity audio reconstruction while achieving superior performance on downstream understanding tasks compared to audio-only tokenizers and established multimodal fusion baselines.
Conclusion: Successfully demonstrates that visual information can be integrated into audio tokenizers without degrading reconstruction quality, enabling better multimodal understanding while preserving audio generation capabilities.
Abstract: Audio tokenization has emerged as a critical component in end-to-end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding tasks due to single-modality constraints, particularly when audio signals contain ambiguous or incomplete information. While incorporating additional modality information can significantly enhance audio understanding, current multimodal fusion approaches invariably degrade reconstruction quality. This degradation is unacceptable for end-to-end audio systems that require high-fidelity audio generation capabilities. In this work, we investigate the root causes of reconstruction quality degradation in video-enhanced audio tokenization and present three key findings. First, the location of fusion within the tokenizer architecture is crucial for preserving reconstruction quality. Second, we show that contrastive learning, though effective in continuous representation fusion, is unsuitable for discrete tokenizers as it fails to enhance downstream task performance. Third, while feature-dimension fusion approaches achieve moderate success, we discover that fusing along the temporal axis – guided by the concept of distinctive features – yields significantly better results. Building on these insights, we introduce the Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization, the first approach to successfully integrate visual information into audio tokenizer architectures while preserving reconstruction fidelity. Our approach not only maintains high-fidelity reconstruction but also achieves superior performance on downstream understanding tasks compared with audio-only tokenizers and established multimodal fusion baselines.
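The third finding above, that fusing along the temporal axis beats feature-dimension fusion, can be illustrated at the shape level. The sketch below is a hypothetical toy, not the paper's architecture: `resample`, the max-based "distinctive frame" heuristic, and all dimensions are illustrative assumptions.

```python
def resample(frames, target_len):
    """Nearest-neighbour resample a frame sequence to target_len steps."""
    n = len(frames)
    return [frames[min(i * n // target_len, n - 1)] for i in range(target_len)]

def fuse_feature_dim(audio, video):
    """Feature-axis fusion: concatenate per-timestep, (T, D) and (T, Dv) -> (T, D+Dv)."""
    video = resample(video, len(audio))
    return [a + v for a, v in zip(audio, video)]

def fuse_temporal_axis(audio, video, keep):
    """Temporal-axis fusion: append a few distinctive video frames, -> (T+K, D)."""
    distinctive = sorted(video, key=max, reverse=True)[:keep]  # toy "distinctive" heuristic
    return audio + distinctive

audio = [[0.1] * 8 for _ in range(20)]  # T=20 audio frames, D=8
video = [[0.2] * 8 for _ in range(5)]   # Tv=5 video frames, projected to D=8

assert len(fuse_feature_dim(audio, video)) == 20       # same length, wider features
assert len(fuse_feature_dim(audio, video)[0]) == 16
assert len(fuse_temporal_axis(audio, video, keep=3)) == 23  # longer sequence, same width
```

Temporal-axis fusion leaves the per-step feature dimension untouched, which is one plausible reason it can sit before quantization without disturbing the audio codebook.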
Relevance: 9/10
[2] CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
Gaoxiang Cong, Liang Li, Jiaxin Ye, Zhedong Zhang, Hongming Shan, Yuankai Qi, Qingming Huang
Main category: cs.SD
TL;DR: A novel movie dubbing framework using Cognitive Synchronous Diffusion Transformer (CoSync-DiT) with flow matching and Joint Semantic and Alignment Regularization for precise lip-sync and natural speech generation.
Details
Motivation: Existing movie dubbing methods fail to achieve precise lip-sync and naturalness due to explicit duration alignment. Implicit alignment solutions suffer from reference audio interference, causing timbre and pronunciation degradation in real-world scenarios.
Method: Proposes a flow matching-based movie dubbing framework with a Cognitive Synchronous Diffusion Transformer (CoSync-DiT) that progressively guides noise-to-speech generation through acoustic style adaptation, fine-grained visual calibration, and time-aware context alignment. Includes Joint Semantic and Alignment Regularization (JSAR) to constrain frame-level temporal consistency and semantic consistency.
Result: Extensive experiments on standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate state-of-the-art performance across multiple metrics.
Conclusion: The proposed CoSync-DiT framework with JSAR effectively addresses lip-sync precision and naturalness issues in movie dubbing, outperforming existing methods in both controlled and real-world scenarios.
Abstract: Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, triggering timbre and pronunciation degradation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory by executing acoustic style adapting, fine-grained visual calibrating, and time-aware context aligning. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism to simultaneously constrain frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves the state-of-the-art performance across multiple metrics.
Relevance: 9/10
[3] SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
Luoyi Sun, Xiao Zhou, Zeqian Li, Ya Zhang, Yanfeng Wang, Weidi Xie
Main category: cs.SD
TL;DR: SpotSound is an audio-language model designed for temporal grounding of audio events, addressing limitations in existing ALMs by introducing a novel training objective to suppress hallucinated timestamps and creating a challenging benchmark with sparse target events.
Details
Motivation: Current Large Audio-Language Models (ALMs) excel at holistic audio understanding but are unreliable for temporal grounding, i.e., pinpointing when events occur in long-form audio. This stems from training data lacking precise timestamps and benchmarks that don't simulate real-world scenarios where short events are obscured by dense background sounds.
Method: SpotSound introduces a novel training objective specifically designed to suppress hallucinated timestamps for events absent from the input. The authors also present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than ~10% of each clip, creating a rigorous 'needle-in-a-haystack' evaluation scenario.
Result: Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks.
Conclusion: SpotSound addresses key limitations in ALMs for temporal grounding through innovative training objectives and challenging benchmarks, advancing the field of audio event localization in complex acoustic environments.
Abstract: Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio language model designed for grounding audio events. SpotSound incorporates a novel training objective, specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than ~10% of each clip, creating a rigorous 'needle-in-a-haystack' evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models and benchmark are released on https://loiesun.github.io/spotsound/
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 141]
- cs.CV [Total: 207]
- cs.AI [Total: 154]
- cs.SD [Total: 8]
- cs.LG [Total: 151]
- cs.MA [Total: 5]
- cs.MM [Total: 0]
- eess.AS [Total: 13]
- eess.IV [Total: 6]
cs.CL
[1] Filtered Reasoning Score: Evaluating Reasoning Quality on a Model’s Most-Confident Traces
Manas Pathak, Xingyao Chen, Shuozhe Li, Amy Zhang, Liu Leqi
Main category: cs.CL
TL;DR: Proposes Filtered Reasoning Score (FRS) to evaluate reasoning quality beyond accuracy, addressing limitations of outcome-based evaluation for LLMs.
Details
Motivation: Current LLM evaluation focuses on accuracy but doesn't assess reasoning quality. Models can achieve correct answers through flawed reasoning (memorization, over-optimization), and models with different reasoning capabilities can have similar accuracy scores.
Method: Proposes a reasoning score evaluating traces along dimensions like faithfulness, coherence, utility, and factuality. Introduces the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces to avoid issues with averaging across many possible trajectories.
Result: Models indistinguishable under standard accuracy show significant differences in reasoning quality under FRS. Models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks in both accuracy and reasoning quality.
Conclusion: FRS complements accuracy by capturing transferable reasoning capabilities, providing a more nuanced evaluation of LLM reasoning beyond simple outcome-based metrics.
Abstract: Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model’s transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.
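The FRS aggregation described above reduces to a small filtering step: rank sampled traces by confidence, keep the top K%, and average their reasoning scores. A minimal sketch, with illustrative field names (the paper's actual scorer covers faithfulness, coherence, utility, and factuality):

```python
def filtered_reasoning_score(traces, top_k_pct=20):
    """FRS sketch: average reasoning quality over only the top-K% most
    confident traces, so low-confidence (likely coincidental) traces
    do not dilute the score."""
    ranked = sorted(traces, key=lambda t: t["confidence"], reverse=True)
    k = max(1, round(len(ranked) * top_k_pct / 100))
    return sum(t["reasoning_score"] for t in ranked[:k]) / k

traces = [
    {"confidence": 0.95, "reasoning_score": 0.9},
    {"confidence": 0.90, "reasoning_score": 0.8},
    {"confidence": 0.30, "reasoning_score": 0.2},  # low-confidence, filtered out
    {"confidence": 0.10, "reasoning_score": 0.1},
]
# With K = 50%, only the two most confident traces contribute.
assert filtered_reasoning_score(traces, top_k_pct=50) == (0.9 + 0.8) / 2
```

Note how a plain average over all four traces would be pulled down by the two low-confidence samples, which is exactly the aggregation problem the filtering is meant to avoid.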
[2] Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora
Main category: cs.CL
TL;DR: SD-Zero is a self-distillation method that trains a single model to act as both generator and reviser, using binary rewards to create dense token-level supervision without external teachers or demonstrations.
Details
Motivation: Current verifiable post-training methods have limitations: RLVR uses sparse binary rewards, while distillation requires costly external supervision. There's a need for sample-efficient training that doesn't rely on external resources.
Method: SD-Zero trains one model in two roles: a Generator produces initial responses, and a Reviser conditions on those responses and their binary rewards to produce improved responses. On-policy self-distillation then transfers reviser knowledge back to the generator, using token distributions as supervision.
Result: On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over base models and outperforms RFT, GRPO, and SDFT under same training budgets.
Conclusion: SD-Zero effectively transforms binary rewards into dense token-level self-supervision, demonstrating token-level self-localization and iterative self-evolution capabilities for improved reasoning performance.
Abstract: Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser’s token distributions conditioned on the generator’s response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator’s response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
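The core mechanism, using the reviser's token distributions as a dense target for the generator, is a per-token distillation loss. A hedged toy version over hand-made distributions (the actual objective, tokenizer, and reward conditioning come from the paper's full pipeline and are not reproduced here):

```python
import math

def kl(p, q):
    """KL(p || q) over a discrete vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distill_loss(reviser_dists, generator_dists):
    """Average token-level KL from reviser (teacher role) to generator
    (student role) -- dense supervision at every token, unlike a single
    binary reward for the whole response."""
    pairs = list(zip(reviser_dists, generator_dists))
    return sum(kl(p, q) for p, q in pairs) / len(pairs)

# Two-token toy sequence over a 3-symbol vocabulary.
reviser = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
generator = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]

assert self_distill_loss(reviser, generator) > 0   # generator hasn't matched the reviser yet
assert self_distill_loss(reviser, reviser) == 0.0  # zero loss once fully distilled
```

Because the loss is defined at every token position, the reviser can in effect point at exactly which tokens of the generator's response need revision, which matches the "token-level self-localization" behavior reported in the ablations.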
[3] LLMs Struggle with Abstract Meaning Comprehension More Than Expected
Hamoud Alhazmi, Jiachen Jiang
Main category: cs.CL
TL;DR: A study on abstract meaning comprehension in language models, showing that fine-tuned models outperform LLMs on abstract concept tasks, with a proposed bidirectional attention classifier improving performance by 4.06% and 3.41% on two tasks.
Details
Motivation: Abstract words pose significant challenges for language comprehension due to their non-concrete, high-level semantics. The SemEval-2021 Task 4 (ReCAM) was designed to evaluate models' ability to interpret abstract concepts, revealing limitations in current language models' understanding of abstract meanings.
Method: The study evaluates various models, including GPT-4o, BERT, and RoBERTa, under zero-shot, one-shot, and few-shot settings on the ReCAM task (cloze-style passages with abstract-concept questions). A bidirectional attention classifier is proposed, inspired by human cognitive strategies, which dynamically attends to both passages and options to enhance comprehension.
Result: Most LLMs (including GPT-4o) struggle with abstract meaning comprehension, while fine-tuned models like BERT and RoBERTa perform better. The proposed bidirectional attention classifier improves accuracy by 4.06% on Task 1 and 3.41% on Task 2 compared to baseline fine-tuned models.
Conclusion: Abstract meaning comprehension remains challenging for language models, with fine-tuned models outperforming LLMs. The bidirectional attention approach shows promise for enhancing abstract concept understanding by mimicking human cognitive strategies of dynamic attention between passages and options.
Abstract: Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models’ ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.
[4] Benchmarking Deflection and Hallucination in Large Vision-Language Models
Nicholas Moratelli, Christopher Davis, Leonardo F. R. Ribeiro, Bill Byrne, Gonzalo Iglesias
Main category: cs.CL
TL;DR: VLM-DeflectionBench: A benchmark for evaluating multimodal retrieval systems that tests how vision-language models handle conflicting evidence and generate appropriate deflections when knowledge is insufficient.
Details
Motivation: Existing benchmarks for knowledge-based visual question answering (KB-VQA) overlook critical issues: they don't test how models handle conflicts between visual and textual evidence, don't evaluate proper deflection behavior when knowledge is incomplete, and become obsolete quickly as models memorize training data.
Method: Three main contributions: 1) a dynamic data curation pipeline that filters for genuinely retrieval-dependent samples to maintain benchmark difficulty over time; 2) VLM-DeflectionBench, with 2,775 samples spanning diverse multimodal retrieval settings; 3) a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness.
Result: Experiments across 20 state-of-the-art LVLMs show that models usually fail to deflect appropriately when faced with noisy or misleading evidence. The benchmark reveals significant gaps in how models handle uncertainty and conflicting multimodal information.
Conclusion: The work highlights the need to evaluate not just what models know, but how they behave when they don’t know. VLM-DeflectionBench serves as a reusable and extensible benchmark for reliable KB-VQA evaluation, addressing critical gaps in current multimodal retrieval assessment.
Abstract: Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer…) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.
[5] Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration
Xin Liu, Lu Wang
Main category: cs.CL
TL;DR: CURE is a framework that improves long-form factuality in LLMs by teaching models to reason about uncertainty at the claim level through structured outputs with explicit confidence estimates.
Details
Motivation: LLMs often hallucinate in long-form generation, and existing approaches only provide a single scalar confidence for an entire response, which is insufficient when uncertainty varies across individual claims in long-form content.
Method: Introduces a Claim-Aware Reasoning Protocol that structures outputs into atomic claims paired with explicit confidence estimates, then uses a multi-stage training pipeline to align model confidence with claim correctness and optimize for factuality.
Result: Improves claim-level accuracy by up to 39.9% on Biography generation, with 16.0% increase in AUROC on FactBench, consistently outperforming supervised and RL baselines on four long-form factuality benchmarks.
Conclusion: CURE effectively improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level, enabling better calibration and selective prediction for uncertain claims.
Abstract: Large language models (LLMs) often hallucinate in long-form generation. Existing approaches mainly improve factuality through post-hoc revision or reinforcement learning (RL) with correctness-based rewards, but they do not teach the model to estimate which parts of its generation are reliable. As a result, models may still state incorrect claims confidently in their responses. Recent advances in reasoning have significantly improved LLM performance, and have been leveraged to estimate confidence by incorporating calibration into RL objectives. However, existing approaches remain limited to a single scalar confidence for the entire response, which is insufficient for long-form generation where uncertainty varies across individual claims. To mitigate this problem, we propose CURE, a framework that improves long-form factuality by teaching LLMs to reason about uncertainty at the claim level. We first introduce a Claim-Aware Reasoning Protocol, which structures outputs into atomic claims paired with explicit confidence estimates. We then develop a multi-stage training pipeline that aligns model confidence with claims’ correctness and then optimizes on factuality. The resulting calibrated confidence further enables selective prediction, allowing the model to abstain from uncertain claims at inference time. Experiments on four long-form factuality benchmarks show that CURE consistently improves factual accuracy over competitive supervised and RL baselines, while maintaining factual recall. In particular, it improves claim-level accuracy by up to 39.9% on Biography generation. These gains are accompanied by improved calibration, as reflected by a 16.0% increase in AUROC on FactBench.
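The selective-prediction step that CURE's calibrated confidences enable is simple to picture: structure the output as (claim, confidence) pairs and abstain from anything below a threshold at inference time. A sketch with an illustrative threshold and made-up claims:

```python
def selective_answer(claims, threshold=0.7):
    """Keep claims at or above the confidence threshold; abstain from the rest.
    `claims` is a list of (claim_text, confidence) pairs, following the
    atomic-claim structure of the Claim-Aware Reasoning Protocol."""
    kept = [c for c, conf in claims if conf >= threshold]
    abstained = [c for c, conf in claims if conf < threshold]
    return kept, abstained

claims = [
    ("Born in 1912 in London.", 0.95),
    ("Won the Turing Award.", 0.40),   # uncertain claim -> abstain
    ("Published three novels.", 0.82),
]
kept, abstained = selective_answer(claims)
assert kept == ["Born in 1912 in London.", "Published three novels."]
assert abstained == ["Won the Turing Award."]
```

This only works as well as the calibration: the filtering improves factual precision precisely when high confidence actually correlates with claim correctness, which is what the multi-stage training targets.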
[6] Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Omar El Bachyr, Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Anas Zilali, Ulrick Ble, Anne Goujon
Main category: cs.CL
TL;DR: Systematic study of RAG components for PDF question answering, focusing on parsers, chunking strategies, and their impact on answer correctness using financial domain benchmarks.
Details
Motivation: PDFs are designed for human reading but contain heterogeneous content (text, tables, images) that's challenging for automated processing. While RAG systems are promising for PDF understanding, there's no comprehensive study on how different components and design choices affect RAG performance for PDFs.
Method: Focuses on Question Answering as a specific language understanding task, using two benchmarks from the financial domain (including the newly created TableQuest). Systematically examines multiple PDF parsers and chunking strategies with varied overlap, analyzing their synergies in preserving document structure and ensuring answer correctness.
Result: Results provide practical guidelines for building robust RAG pipelines for PDF understanding, showing how different parser and chunking strategy combinations affect performance on financial document QA tasks.
Conclusion: The study offers systematic insights into RAG system design for PDF processing, with practical recommendations for component selection and configuration to optimize PDF understanding performance.
Abstract: PDF files are primarily intended for human reading rather than automated processing. In addition, the heterogeneous content of PDFs, such as text, tables, and images, poses significant challenges for parsing and information extraction. To address these difficulties, both practitioners and researchers are increasingly developing new methods, including the promising Retrieval-Augmented Generation (RAG) systems to automated PDF processing. However, there is no comprehensive study investigating how different components and design choices affect the performance of a RAG system for understanding PDFs. In this paper, we propose such a study (1) by focusing on Question Answering, a specific language understanding task, and (2) by leveraging two benchmarks from the financial domain, including TableQuest, our newly generated, publicly available benchmark. We systematically examine multiple PDF parsers and chunking strategies (with varied overlap), along with their potential synergies in preserving document structure and ensuring answer correctness. Overall, our results offer practical guidelines for building robust RAG pipelines for PDF understanding.
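The chunking strategies under study vary chunk size and overlap. A minimal sliding-window chunker over parsed PDF text might look like the following (token = whitespace word here; real pipelines typically chunk by model tokens and respect section and table boundaries, which the paper's parser comparison is about):

```python
def chunk(words, size, overlap):
    """Split a word list into chunks of `size` words, with consecutive
    chunks sharing `overlap` words so answers spanning a boundary are
    not lost to retrieval."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

words = [f"w{i}" for i in range(10)]
chunks = chunk(words, size=4, overlap=2)

assert chunks[0] == ["w0", "w1", "w2", "w3"]
assert chunks[1][:2] == chunks[0][2:]  # consecutive chunks share `overlap` words
assert chunks[-1][-1] == "w9"          # every word is covered
```

Larger overlap improves recall at a boundary but duplicates content in the index; the trade-off between the two is one of the design choices the study measures against answer correctness.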
[7] Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs
Shreeya Verma Kathuria, Nitin Mayande, Sharookh Daruwalla, Nitin Joglekar, Charles Weber
Main category: cs.CL
TL;DR: wSSAS is a deterministic framework that improves LLM-based text categorization by using hierarchical classification and signal-to-noise ratio scoring to reduce stochastic behavior and enhance analytical precision.
Details
Motivation: LLMs suffer from stochastic attention mechanisms and noise sensitivity that compromise analytical precision and reproducibility for enterprise-grade text categorization tasks.
Method: A two-phased validation framework: 1) organizes text into a hierarchical classification structure (Themes, Stories, Clusters); 2) uses a Signal-to-Noise Ratio to prioritize high-value semantic features, incorporating the scoring into a Summary-of-Summaries architecture.
Result: wSSAS significantly improves clustering integrity and categorization accuracy across diverse datasets (Google Business reviews, Amazon Product reviews, Goodreads Book reviews), reduces categorization entropy, and provides reproducible LLM-based summaries.
Conclusion: wSSAS offers a high-precision, deterministic process for large-scale text categorization that addresses LLM limitations in enterprise analytics applications.
Abstract: The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and sensitivity to noise that compromise their analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model’s attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation. Experimental results using Gemini 2.0 Flash Lite across diverse datasets - including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews - demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM based summaries based on a high-precision, deterministic process for large-scale text categorization.
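The abstract describes ranking data points by a Signal-to-Noise Ratio so that summarization attends to the most representative items in a cluster. A hedged toy version, with an illustrative SNR (alignment with the cluster centroid over deviation from it) that is an assumption, not the paper's definition:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def snr(vec, center):
    """Toy SNR: alignment with the centroid (signal) over distance from it (noise)."""
    signal = sum(v * c for v, c in zip(vec, center))
    noise = math.dist(vec, center) + 1e-9
    return signal / noise

def top_representatives(vectors, keep):
    """Indices of the `keep` items ranked most representative of the cluster."""
    c = centroid(vectors)
    ranked = sorted(range(len(vectors)), key=lambda i: snr(vectors[i], c), reverse=True)
    return ranked[:keep]

# Toy review embeddings: the last one is an outlier (noise).
reviews = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0], [-1.0, 0.1]]
assert 3 not in top_representatives(reviews, keep=3)  # outlier filtered before summarization
```

Feeding only the top-ranked items into the Summary-of-Summaries stage is what keeps background noise out of the aggregated summary.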
[8] LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami
Main category: cs.CL
TL;DR: LOSA introduces locality-aware sparse attention for block-wise diffusion language models to reduce memory-bound attention bottlenecks in long-context scenarios by reusing cached prefix-attention results for stable tokens.
Details
Motivation: Block-wise diffusion language models (DLMs) face memory-bound attention bottlenecks in long-context scenarios. Naive sparse attention fails due to KV Inflation, where different queries select different prefix positions, making the union of accessed KV pages large.
Method: LOSA (Locality-aware Sparse Attention) leverages the observation that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while most stable tokens remain nearly constant. It reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens.
Result: LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It achieves up to 4.14x attention speedup on RTX A6000 GPUs.
Conclusion: LOSA effectively addresses the KV Inflation problem in block-wise DLMs through locality-aware sparse attention, substantially reducing the number of KV indices that must be loaded while maintaining accuracy and improving computational efficiency.
Abstract: Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context scenarios. Naive sparse attention fails on DLMs due to a KV Inflation problem, where different queries select different prefix positions, making the union of accessed KV pages large. To address this, we observe that between consecutive denoising steps, only a small fraction of active tokens exhibit significant hidden-state changes, while the majority of stable tokens remain nearly constant. Based on this insight, we propose LOSA (Locality-aware Sparse Attention), which reuses cached prefix-attention results for stable tokens and applies sparse attention only to active tokens. This substantially shrinks the number of KV indices that must be loaded, yielding both higher speedup and higher accuracy. Across multiple block-wise DLMs and benchmarks, LOSA preserves near-dense accuracy while significantly improving efficiency, achieving up to +9 points in average accuracy at aggressive sparsity levels while maintaining 1.54x lower attention density. It also achieves up to 4.14x attention speedup on RTX A6000 GPUs, demonstrating the effectiveness of the proposed method.
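The key observation behind LOSA is that between consecutive denoising steps only a few "active" tokens change their hidden states appreciably. A minimal sketch of that selection step (the threshold, the L2 norm, and what happens downstream, cached prefix-attention reuse versus sparse attention, are simplified assumptions):

```python
import math

def split_active_stable(prev_hidden, curr_hidden, threshold=0.1):
    """Partition token indices by hidden-state movement between two
    consecutive denoising steps: large delta -> active (recompute sparse
    attention), small delta -> stable (reuse cached prefix-attention)."""
    active, stable = [], []
    for i, (p, c) in enumerate(zip(prev_hidden, curr_hidden)):
        delta = math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))
        (active if delta > threshold else stable).append(i)
    return active, stable

prev = [[0.5, 0.5], [0.2, 0.2], [0.9, 0.1]]
curr = [[0.5, 0.5], [0.8, 0.6], [0.9, 0.1]]  # only token 1 moved this step

active, stable = split_active_stable(prev, curr)
assert active == [1]     # sparse attention recomputed only here
assert stable == [0, 2]  # cached results reused
```

Because only the active subset issues new KV lookups, the union of accessed KV pages stays small, which is how the method sidesteps the KV Inflation problem described in the abstract.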
[9] Robust Explanations for User Trust in Enterprise NLP Systems
Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun, Jerry Ting
Main category: cs.CL
TL;DR: A framework for evaluating black-box explanation robustness in NLP models under realistic noise perturbations, comparing encoder and decoder architectures; decoder LLMs yield more stable explanations, and stability improves with scale.
Details
Motivation: Enterprise NLP systems require robust explanations for user trust, but pre-deployment validation is challenging with black-box APIs where representation-based explainers are infeasible. There's limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs.
Method: Proposes a unified black-box robustness evaluation framework using leave-one-out occlusion for token-level explanations. Measures explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, back-translation) at multiple severity levels. Conducts systematic cross-architecture comparison across 3 benchmark datasets and 6 models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases).
Result: Decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average). Stability improves with model scale (44% gain from 7B to 70B). Relates robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve for model selection.
Conclusion: Decoder LLMs offer more robust explanations than encoder models, with stability improving with scale. The framework provides practical guidance for model and explanation selection in compliance-sensitive applications, balancing robustness with inference costs.
Abstract: Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.
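The paper's protocol pairs leave-one-out occlusion with a top-token flip rate. The toy sketch below uses a hypothetical word-count "model" as the score function; the real setup scores black-box API outputs:

```python
def occlusion_importance(tokens, score_fn):
    """Leave-one-out occlusion: importance of token i is the score drop
    when token i is removed from the input."""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def top_token(tokens, score_fn):
    """Token with the highest occlusion importance."""
    imp = occlusion_importance(tokens, score_fn)
    return tokens[max(range(len(imp)), key=imp.__getitem__)]

def flip_rate(pairs, score_fn):
    """Fraction of (clean, perturbed) inputs whose top-attributed token changes."""
    flips = sum(top_token(c, score_fn) != top_token(p, score_fn) for c, p in pairs)
    return flips / len(pairs)

# Hypothetical 'model': score = count of positive sentiment words.
POSITIVE = {"great", "good", "excellent"}
score = lambda toks: sum(t in POSITIVE for t in toks)
pairs = [
    (["the", "movie", "was", "great"], ["teh", "movie", "was", "great"]),  # typo elsewhere: no flip
    (["great", "plot"], ["good", "plot"]),                                 # top word swapped: flip
]
fr = flip_rate(pairs, score)
```

A lower flip rate means the explanation's most important token survives the perturbation, which is exactly the stability the paper reports decoder LLMs having more of.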
[10] Representing expertise accelerates learning from pedagogical interaction data
Dhara Yu, Karthikeya Kaushik, Bill D. Thompson
Main category: cs.CL
TL;DR: Transformer models trained on pedagogical interactions between expert and novice agents show better robustness and generalization than those trained only on expert demonstrations, with agent representation enabling expert-like behavior from limited expert data.
Details
Motivation: While prior work shows that interaction data between multiple individuals improves learning agent performance, it's unclear which specific features of interactions contribute to this improvement. The paper aims to systematically examine factors that make interaction data effective compared to single-agent expert demonstrations.
Method: Generated synthetic datasets of simple interactions between expert and novice agents in a spatial navigation task. Used transformer models trained on different datasets (pedagogical interactions vs. expert-only demonstrations). Controlled paradigm allowed precise operationalization of key distinctions between interaction and solo expert behavior.
Result: Models trained on pedagogical interactions were more robust across various scenarios than models trained only on expert demonstrations. The ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.
Conclusion: Interaction data provides benefits beyond simply more training data; pedagogical interactions and agent representation enable better generalization and robustness. The findings bear on how learning systems should be trained: with interaction data rather than expert demonstrations alone.
Abstract: Work in cognitive science and artificial intelligence has suggested that exposing learning agents to traces of interaction between multiple individuals can improve performance in a variety of settings, yet it remains unknown which features of interactions contribute to this improvement. We examined the factors that support the effectiveness of interaction data, using a controlled paradigm that allowed us to precisely operationalize key distinctions between interaction and an expert acting alone. We generated synthetic datasets of simple interactions between an expert and a novice in a spatial navigation task, and then trained transformer models on those datasets, evaluating performance after exposure to different datasets. Our experiments showed that models trained on pedagogical interactions were more robust across a variety of scenarios compared to models trained only on expert demonstrations, and that having the ability to represent epistemically distinct agents led to expert-like behavior even when expert behavior was rarely observed.
[11] Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Syed Rifat Raiyan
Main category: cs.CL
TL;DR: LLMs exhibit the Identifiable Victim Effect (IVE) - a stronger preference for specific victims over statistical groups - with effect sizes larger than humans', modulated by alignment training and exacerbated by standard Chain-of-Thought prompting.
Details
Motivation: As LLMs take on roles in humanitarian triage, grant evaluation, and content moderation, it's critical to understand whether they inherit human moral reasoning biases like the Identifiable Victim Effect, which could impact ethical decision-making in AI systems.
Method: Large-scale empirical investigation across 51,955 API trials with 16 frontier models from 9 organizations, using 10 experiments that port and extend canonical IVE paradigms from human psychology research, analyzing effects of alignment training and prompting strategies.
Result: IVE is prevalent in LLMs with pooled effect size (d=0.223) approximately twice the human single-victim baseline; instruction-tuned models show extreme IVE while reasoning-specialized models invert it; standard CoT prompting nearly triples the effect size while only utilitarian CoT eliminates it.
Conclusion: LLMs exhibit amplified Identifiable Victim Effect compared to humans, with alignment training and prompting strategies significantly modulating this bias, raising concerns for ethical AI deployment in humanitarian decision-making contexts.
Abstract: The Identifiable Victim Effect (IVE) – the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship – is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments – porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005) – we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen's d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d ≈ 0.10) reported by Lee and Feeley (2016) – and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting – contrary to its role as a deliberative corrective – nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.
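Effect sizes throughout the paper are Cohen's d, the standardized mean difference with a pooled standard deviation. A quick reference implementation on hypothetical donation amounts (the numbers below are made up for illustration):

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical donations: identified victim vs. statistical group.
identified = [5.0, 6.0, 7.0, 6.0]
statistical_grp = [4.0, 5.0, 5.0, 6.0]
d = cohens_d(identified, statistical_grp)
```

On this scale, d = 0.2 is conventionally "small" and d = 0.8 "large", which is what makes instruction-tuned models' d up to 1.56 striking.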
[12] Temporal Flattening in LLM-Generated Text: Comparing Human and LLM Writing Trajectories
Zhanwei Cao, YeoJin Go, Yifan Hu, Shanu Sushmita
Main category: cs.CL
TL;DR: LLMs show “temporal flattening” - they fail to reproduce the natural evolution of human writing styles and cognitive states over time, even when given access to historical context.
Details
Motivation: Human writing evolves longitudinally with changing styles and cognitive states over months/years, but current LLMs are typically deployed as stateless systems. The paper investigates whether LLMs can reproduce such temporal structure across extended periods.
Method: Constructed longitudinal dataset of 412 human authors and 6,086 documents (2012-2024) across academic abstracts, blogs, and news. Compared to trajectories generated by three representative LLMs under standard and history-conditioned settings. Used drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations.
Result: LLMs exhibit temporal flattening: greater lexical diversity but substantially reduced semantic and cognitive-emotional drift relative to humans. Temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Flattening persists regardless of whether LLMs generate independently or with incremental history.
Conclusion: Current LLM deployment paradigms fundamentally lack the ability to reproduce authentic temporal structure in human writing, with implications for synthetic training data and longitudinal text modeling applications.
Abstract: Large language models (LLMs) are increasingly used in daily applications, from content generation to code writing, where each interaction treats the model as stateless, generating responses independently without memory. Yet human writing is inherently longitudinal: authors’ styles and cognitive states evolve across months and years. This raises a central question: can LLMs reproduce such temporal structure across extended time periods? We construct and publicly release a longitudinal dataset of 412 human authors and 6,086 documents spanning 2012–2024 across three domains (academic abstracts, blogs, news) and compare them to trajectories generated by three representative LLMs under standard and history-conditioned generation settings. Using drift and variance-based metrics over semantic, lexical, and cognitive-emotional representations, we find temporal flattening in LLM-generated text. LLMs produce greater lexical diversity but exhibit substantially reduced semantic and cognitive-emotional drift relative to humans. These differences are highly predictive: temporal variability patterns alone achieve 94% accuracy and 98% ROC-AUC in distinguishing human from LLM trajectories. Our results demonstrate that temporal flattening persists regardless of whether LLMs generate independently or with access to incremental history, revealing a fundamental property of current deployment paradigms. This gap has direct implications for applications requiring authentic temporal structure, such as synthetic training data and longitudinal text modeling.
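The drift metrics can be pictured as distances between consecutive document embeddings in an author's timeline. A minimal sketch with toy 2-D "embeddings" (the paper's actual metrics operate on semantic, lexical, and cognitive-emotional representations; this only illustrates the drift idea):

```python
import math

def cosine_dist(u, v):
    """Cosine distance between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mean_drift(trajectory):
    """Average drift: mean cosine distance between consecutive documents
    in one author's timeline."""
    steps = [cosine_dist(a, b) for a, b in zip(trajectory, trajectory[1:])]
    return sum(steps) / len(steps)

# A 'human' whose writing keeps moving vs. an 'LLM' that stays put.
human = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
llm = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
```

"Temporal flattening" is then the observation that LLM trajectories look like the second case: near-zero drift where human trajectories show steady movement.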
[13] StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
Main category: cs.CL
TL;DR: StableToken: A robust semantic speech tokenizer that uses a multi-branch architecture with bit-wise voting to produce stable token sequences under acoustic perturbations, improving downstream SpeechLLM robustness.
Details
Motivation: Existing semantic speech tokenizers are fragile to meaning-irrelevant acoustic perturbations, causing drastic token sequence changes even at high SNRs where speech remains intelligible. This instability increases learning burden for downstream LLMs and stems from brittle single-path quantization and distant training signals.
Method: Introduces StableToken with a multi-branch architecture that processes audio in parallel, then merges representations through a powerful bit-wise voting mechanism to form a single, stable token sequence. Uses consensus-driven approach for stability.
Result: Sets new SOTA in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. Foundational stability translates to downstream benefits, significantly improving robustness of SpeechLLMs on various tasks.
Conclusion: StableToken addresses fundamental flaws in semantic speech tokenizers by introducing consensus-driven stability, enabling more robust speech understanding systems and improving downstream LLM performance.
Abstract: Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
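The bit-wise voting idea maps naturally onto token IDs viewed as bit vectors: each branch votes per bit, and a strict majority sets the merged bit. A toy sketch of just the voting step (in StableToken this consensus is part of the learned tokenizer, not a standalone post-process):

```python
def bitwise_vote(branch_tokens, n_bits):
    """Merge token IDs from parallel branches by majority vote on each bit."""
    merged = 0
    for b in range(n_bits):
        votes = sum((t >> b) & 1 for t in branch_tokens)
        if votes * 2 > len(branch_tokens):  # strict majority sets the bit
            merged |= 1 << b
    return merged

# Three branches: two agree, one is corrupted by noise (bit 2 flipped).
clean, noisy = 0b1011, 0b1111
tok = bitwise_vote([clean, clean, noisy], n_bits=4)
```

A single noisy branch is outvoted, so the merged token matches the clean one; this is the sense in which consensus yields stability under perturbation.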
[14] When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models
Ji Ho Bae
Main category: cs.CL
TL;DR: Self-referential prompts alone don’t destabilize LLMs, but non-closing truth recursion (NCTR) prompts cause significant internal matrix dynamics disruption, attention reorganization, and contradictory outputs.
Details
Motivation: To understand how self-referential inputs affect the internal matrix dynamics of large language models, particularly distinguishing between stable and destabilizing forms of self-reference.
Method: Analyzed 4 models (Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, Gemma-2-9B) with 300 prompts in a 14-level hierarchy at 3 temperatures. Measured 106 scalar metrics across up to 7 analysis passes, focusing on attention effective rank, variance kurtosis, and layer-wise SVD analysis.
Result: Self-reference alone isn’t destabilizing; grounded self-reference is stable. NCTR prompts cause significant disruption: elevated attention effective rank (d=3.14-3.52), 281/397 metric-model combinations differentiate NCTR from stable self-reference, per-layer SVD shows disruption at every layer, classifier achieves AUC 0.81-0.90, and NCTR prompts produce 34-56% more contradictory outputs.
Conclusion: Non-closing truth recursion forces transformers into dynamical regimes where classical matrix-semigroup problems concentrate, providing insights into self-referential failure modes with practical relevance for model reliability.
Abstract: We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families – Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B – over 300 prompts in a 14-level hierarchy at three temperatures (T ∈ {0.0, 0.3, 0.7}), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) – truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank – indicating attention reorganization with global dispersion rather than simple concentration collapse – and key metrics reach Cohen's d = 3.14 (attention effective rank) to 3.52 (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction (q < 0.05), 198 with |d| > 0.8. Per-layer SVD confirms disruption at every sampled layer (d > +1.0 in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC 0.81-0.90; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models. We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output (+34-56 percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.
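Attention effective rank, the paper's headline metric, is commonly defined (following Roy and Vetterli) as the exponential of the entropy of the normalized singular-value distribution. A sketch using a diagonal toy case, where the singular values are just the diagonal entries and no SVD routine is needed:

```python
import math

def effective_rank(singular_values):
    """Effective rank: exp of the Shannon entropy of the normalized
    singular-value distribution."""
    total = sum(singular_values)
    p = [s / total for s in singular_values if s > 0]
    return math.exp(-sum(pi * math.log(pi) for pi in p))

# Toy singular spectra of a 4x4 attention-related matrix:
concentrated = [1.0, 1e-9, 1e-9, 1e-9]  # collapsed onto one direction
dispersed = [1.0, 1.0, 1.0, 1.0]        # globally dispersed
```

The dispersed spectrum has effective rank near 4 and the collapsed one near 1, so "anomalously elevated attention effective rank" under NCTR signals dispersion, not a simple concentration collapse.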
[15] Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
Linhao Zhang, Yuhan Song, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou
Main category: cs.CL
TL;DR: UAS-Audio introduces a unified JSON supervision framework that organizes audio into Transcription, Paralinguistics, and Non-linguistic Events to address AudioLLMs’ performance gap between reasoning and acoustic perception.
Details
Motivation: AudioLLMs show a performance inversion: they excel at complex reasoning but underperform on fine-grained acoustic perception due to ASR-centric training that suppresses paralinguistic cues and acoustic events as noise.
Method: Proposes Unified Audio Schema (UAS), a holistic supervision framework that structures audio information into three explicit components (Transcription, Paralinguistics, Non-linguistic Events) in a unified JSON format to achieve comprehensive acoustic coverage while maintaining audio-text alignment.
Result: UAS-Audio boosts fine-grained perception by 10.9% on MMSU over same-size state-of-the-art models while preserving robust reasoning capabilities, validated on MMSU, MMAR, and MMAU benchmarks.
Conclusion: The structured supervision approach effectively addresses AudioLLMs’ perception-reasoning gap by providing explicit acoustic component modeling while maintaining reasoning capabilities through unified audio-text alignment.
Abstract: Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components – Transcription, Paralinguistics, and Non-linguistic Events – within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and continuous AudioLLM architectures. Extensive experiments on MMSU, MMAR, and MMAU demonstrate that UAS-Audio yields consistent improvements, boosting fine-grained perception by 10.9% on MMSU over the same-size state-of-the-art models while preserving robust reasoning capabilities. Our code and model are publicly available at https://github.com/Tencent/Unified_Audio_Schema.
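The abstract describes a unified JSON format with three explicit components; an illustrative record might look like the following. The three top-level components come from the paper, but the exact field names and values here are assumptions, not the released schema:

```python
import json

# Illustrative UAS-style record (field names assumed for illustration).
uas_record = {
    "transcription": "sorry I'm late, traffic was terrible",
    "paralinguistics": {"emotion": "apologetic", "speaking_rate": "fast"},
    "non_linguistic_events": [
        {"event": "car_horn", "start_s": 0.4, "end_s": 0.9},
    ],
}
serialized = json.dumps(uas_record, sort_keys=True)
```

The point of the structure is that transcription supervision no longer crowds out paralinguistic cues and acoustic events: each gets an explicit slot in the same training target.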
[16] AlphaEval: Evaluating Agents in Production
Pengrui Lu, Bingyu Xu, Wenjun Zhang, Shengjia Hua, Xuanjian Gao, Ranxiang Ge, Lyumanshan Ye, Linxuan Wu, Yiran Li, Junfei Fish Yu, Yibo Zhang, Ruixin Li, Manxiang Li, Xiao Han, Xiaocong Zhou, Guangyao Chi, Zisheng Chen, Kaishen Chen, Kun Wang, Qihua Xu, Fengyue Meng, Yuchen Ni, Jiajun Li, Jinxiu Liu, Danfeng Zhang, Jingru Zhao, Pengfei Liu
Main category: cs.CL
TL;DR: AlphaEval is a production-grounded benchmark for evaluating AI agents using real-world tasks from commercial deployments, with a framework for converting production requirements into executable evaluation tasks.
Details
Motivation: Existing AI agent benchmarks are retrospective, well-specified, and deterministic, failing to capture production realities where requirements are implicit, inputs are heterogeneous multi-modal documents, tasks require undeclared domain expertise, and success is judged by evolving expert standards.
Method: Created AlphaEval benchmark with 94 tasks from 7 companies across 6 O*NET domains, evaluating complete agent products (not just models) using multiple paradigms: LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, and automated UI testing.
Result: Developed a production-grounded benchmark that captures performance variations invisible to model-level evaluation, along with a systematic requirement-to-benchmark construction framework for converting authentic production requirements into executable evaluation tasks.
Conclusion: AlphaEval provides a production-grounded evaluation framework that better reflects real-world AI agent deployment, with a reproducible methodology for organizations to create their own domain-specific benchmarks.
Abstract: The rapid deployment of AI agents in commercial settings has outpaced the development of evaluation methodologies that reflect production realities. Existing benchmarks measure agent capabilities through retrospectively curated tasks with well-specified requirements and deterministic metrics – conditions that diverge fundamentally from production environments where requirements contain implicit constraints, inputs are heterogeneous multi-modal documents with information fragmented across sources, tasks demand undeclared domain expertise, outputs are long-horizon professional deliverables, and success is judged by domain experts whose standards evolve over time. We present AlphaEval, a production-grounded benchmark of 94 tasks sourced from seven companies deploying AI agents in their core business, spanning six O*NET (Occupational Information Network) domains. Unlike model-centric benchmarks, AlphaEval evaluates complete agent products – Claude Code, Codex, etc. – as commercial systems, capturing performance variations invisible to model-level evaluation. Our evaluation framework covers multiple paradigms (LLM-as-a-Judge, reference-driven metrics, formal verification, rubric-based assessment, automated UI testing, etc.), with individual domains composing multiple paradigms. Beyond the benchmark itself, we contribute a requirement-to-benchmark construction framework – a systematic methodology that transforms authentic production requirements into executable evaluation tasks in minimal time. This framework standardizes the entire pipeline from requirement to evaluation, providing a reproducible, modular process that any organization can adopt to construct production-grounded benchmarks for their own domains.
[17] AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs
Manoj Madushanka Perera, Adnan Mahmood, Kasun Eranda Wijethilake, Quan Z. Sheng
Main category: cs.CL
TL;DR: AgenticAI-DialogGen is a framework that generates persona-grounded, topic-guided conversations using LLM agents, creating a dataset (TopicGuidedChat) with both short- and long-term memory encoding for improved conversational AI evaluation and fine-tuning.
Details
Motivation: Current LLMs struggle with processing extended conversational contexts due to lack of datasets that properly encode both short- and long-term conversational history. Existing datasets lack memory grounding, overlook topic continuity, or require expensive human annotation.
Method: A modular agent-based framework using LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. Includes a QA module to generate memory-grounded Question Answer pairs from conversational histories.
Result: Created TopicGuidedChat (TGC) dataset with long-term memory encoded as speaker-specific knowledge graphs and short-term memory as topic-guided conversations. Framework yields higher conversational quality, and LLMs fine-tuned on TGC achieve improved performance on memory-grounded QA tasks.
Conclusion: AgenticAI-DialogGen provides an effective unsupervised approach to generate high-quality conversational datasets with proper memory encoding, addressing limitations in current conversational AI evaluation and training.
Abstract: Recent advancements in Large Language Models (LLMs) have improved their ability to process extended conversational contexts, yet fine-tuning and evaluating short- and long-term memories remain difficult due to the absence of datasets that encode both short- and long-term conversational history. Existing conversational datasets lack memory grounding, overlook topic continuity, or rely on costly human annotation. To address these gaps, we introduce AgenticAI-DialogGen, a modular agent-based framework that generates persona-grounded and topic-guided conversations without human supervision. The framework uses LLM agents to extract knowledge graphs, identify topics, build speaker personas, and simulate topic-guided conversations from unstructured conversations. A QA module generates memory-grounded Question Answer (QA) pairs drawn from short- and long-term conversational histories. We also generated a new dataset entitled, TopicGuidedChat (TGC), where long-term memory is encoded as speaker-specific knowledge graphs and short-term memory as newly generated topic-guided conversations. Evaluations depict that AgenticAI-DialogGen yields higher conversational quality and LLMs fine-tuned on TGC dataset achieve improved performance on memory-grounded QA tasks.
[18] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez
Main category: cs.CL
TL;DR: MoshiRAG: A modular full-duplex speech-to-speech language model that combines compact real-time interaction with selective retrieval for improved factuality without sacrificing conversational flow.
Details
Motivation: Full-duplex speech models enable real-time conversational AI with natural interactions (pauses, interruptions, backchannels), but struggle with factuality. Scaling model size for better knowledge would make real-time inference too expensive, creating a need for efficient knowledge integration.
Method: Proposes MoshiRAG - modular approach combining compact full-duplex interface with selective retrieval. Uses asynchronous framework to identify knowledge-demanding queries and ground responses in external information. Leverages natural temporal gap between response onset and core information delivery to complete retrieval while maintaining conversation flow.
Result: Achieves factuality comparable to best publicly released non-duplex speech language models while preserving full-duplex interactivity. Flexible design supports plug-and-play retrieval without retraining and shows strong performance on out-of-domain mathematical reasoning tasks.
Conclusion: MoshiRAG successfully bridges the gap between real-time conversational interactivity and factual accuracy in speech-to-speech models through modular retrieval augmentation, enabling efficient knowledge access without compromising natural conversation flow.
Abstract: Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
[19] Knowledge Is Not Static: Order-Aware Hypergraph RAG for Language Models
Keshu Wu, Chenchen Kuai, Zihao Li, Jiwan Jiang, Shiyu Shen, Shian Wang, Chan-Wei Hu, Zhengzhong Tu, Yang Zhou
Main category: cs.CL
TL;DR: OKH-RAG introduces order-aware retrieval-augmented generation using knowledge hypergraphs with precedence structure to model interaction sequences, outperforming permutation-invariant baselines on order-sensitive reasoning tasks.
Details
Motivation: Existing RAG methods treat retrieved evidence as unordered sets (permutation invariant), which misaligns with real-world reasoning tasks where outcomes depend on interaction order. There's a need to model order as a first-class property for effective reasoning.
Method: Proposes Order-Aware Knowledge Hypergraph RAG (OKH-RAG) that represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure. Reformulates retrieval as sequence inference over hyperedges using a learned transition model that infers precedence from data without explicit temporal supervision.
Result: OKH-RAG consistently outperforms permutation-invariant baselines on order-sensitive question answering and explanation tasks (tropical cyclone and port operation scenarios). Ablations confirm gains specifically arise from modeling interaction order.
Conclusion: Effective reasoning requires not only retrieving relevant evidence but organizing it into structured sequences. Order-aware retrieval addresses a key limitation of set-based approaches and enables more coherent reasoning processes.
Abstract: Retrieval-augmented generation (RAG) enhances large language models by grounding outputs in retrieved knowledge. However, existing RAG methods including graph- and hypergraph-based approaches treat retrieved evidence as an unordered set, implicitly assuming permutation invariance. This assumption is misaligned with many real-world reasoning tasks, where outcomes depend not only on which interactions occur, but also on the order in which they unfold. We propose Order-Aware Knowledge Hypergraph RAG (OKH-RAG), which treats order as a first-class structural property. OKH-RAG represents knowledge as higher-order interactions within a hypergraph augmented with precedence structure, and reformulates retrieval as sequence inference over hyperedges. Instead of selecting independent facts, it recovers coherent interaction trajectories that reflect underlying reasoning processes. A learned transition model infers precedence directly from data without requiring explicit temporal supervision. We evaluate OKH-RAG on order-sensitive question answering and explanation tasks, including tropical cyclone and port operation scenarios. OKH-RAG consistently outperforms permutation-invariant baselines, and ablations show that these gains arise specifically from modeling interaction order. These results highlight a key limitation of set-based retrieval: effective reasoning requires not only retrieving relevant evidence, but organizing it into structured sequences.
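A toy sketch of what "sequence inference over hyperedges" could look like: greedy trajectory recovery that combines a relevance score with a transition score (hand-set here; learned in the paper). All identifiers and numbers are hypothetical:

```python
def recover_trajectory(relevance, transition, length):
    """Greedy sequence inference over hyperedges: start from the most
    relevant hyperedge, then repeatedly append the unvisited hyperedge
    that maximizes relevance + transition score from the current one."""
    start = max(relevance, key=relevance.get)
    path = [start]
    while len(path) < length:
        cur = path[-1]
        candidates = [e for e in relevance if e not in path]
        if not candidates:
            break
        nxt = max(candidates,
                  key=lambda e: relevance[e] + transition.get((cur, e), 0.0))
        path.append(nxt)
    return path

# Toy example: three hyperedges with precedence a -> b -> c.
relevance = {"a": 0.9, "b": 0.5, "c": 0.6}
transition = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 0.1}
path = recover_trajectory(relevance, transition, 3)
```

Note that a set-based top-k retriever would rank "c" above "b" (0.6 vs 0.5); only the transition structure recovers the coherent order a, b, c.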
[20] Gradient boundaries through confidence intervals for forced alignment estimates using model ensembles
Matthew C. Kelley
Main category: cs.CL
Summary unavailable: the arXiv API request for 2506.01256 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[21] Beyond Majority Voting: Efficient Best-Of-N with Radial Consensus Score
Manh Nguyen, Sunil Gupta, Hung Le
Main category: cs.CL
TL;DR: RCS (Radial Consensus Score) is a training-free method for selecting the best response from multiple LLM candidates by modeling semantic consensus through geometric analysis of answer embeddings.
Details
Motivation: Existing methods for selecting the best response from multiple LLM candidates have limitations: self-consistency relies on discrete voting, probability-based methods fail to capture relationships among candidates and underweight high-quality but infrequent responses, and they don't leverage the geometric structure of answer representations.
Method: RCS computes a weighted Fréchet mean (semantic center) of answer embeddings and ranks candidates by their radial distance to this center. It supports multiple weighting schemes (uniform, frequency-based, probability-based) and works in black-box settings.
Result: Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks with five open-weight models show RCS variants consistently outperform strong baselines, with gains increasing as sampling budget grows. RCS also works well as a drop-in replacement for majority voting in multi-agent debate.
Conclusion: Geometric consensus provides a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
Abstract: Large language models (LLMs) frequently generate multiple candidate responses for a given prompt, yet selecting the most reliable one remains challenging, especially when correctness diverges from surface-level majority agreement. Existing approaches, such as self-consistency, rely on discrete voting, while probability-based methods often fail to capture relationships among candidate answers or tend to underweight high-quality but less frequent responses, and do not fully leverage the geometric structure of answer representations. To address these limitations, we introduce Radial Consensus Score (RCS), a simple, efficient, and training-free method for best-of-N selection. RCS models semantic consensus by computing a weighted Fréchet mean (semantic center) of answer embeddings and ranking candidates by their radial distance to this center. Importantly, RCS provides a general framework that supports multiple weighting schemes, including uniform, frequency-based, and probability-based variants, enabling flexible integration of agreement signals and model confidence while remaining fully applicable in black-box settings. Extensive experiments across seven benchmarks covering short-form QA and long-form reasoning tasks, and five open-weight models, demonstrate that RCS variants consistently outperform strong baselines, with gains becoming more pronounced as the sampling budget increases. RCS also serves as an effective drop-in replacement for majority voting in multi-agent debate and exhibits strong robustness in black-box scenarios. Overall, these results highlight geometric consensus as a scalable and broadly applicable principle for reliable answer selection, extending beyond majority voting to more expressive and robust aggregation in LLM inference.
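In Euclidean embedding space the weighted Fréchet mean reduces to a weighted average, so the core of RCS can be sketched in a few lines. The embeddings and helper name below are illustrative, not the paper's code:

```python
import math

def rcs_rank(embeddings, weights=None):
    """Rank candidate answers by radial distance to the weighted
    semantic center (Euclidean weighted Fréchet mean = weighted mean)."""
    n, d = len(embeddings), len(embeddings[0])
    if weights is None:
        weights = [1.0 / n] * n  # uniform weighting scheme
    total = sum(weights)
    center = [sum(w * e[k] for w, e in zip(weights, embeddings)) / total
              for k in range(d)]
    dist = [math.dist(e, center) for e in embeddings]
    return sorted(range(n), key=lambda i: dist[i])  # closest first

# Four candidate answer embeddings; three cluster together, one is an outlier.
embs = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [-2.0, 3.0]]
order = rcs_rank(embs)
best = order[0]
```

Swapping in frequency- or probability-based weights only changes the `weights` argument; the geometry of the selection rule stays the same.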
[22] ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Dong Zhang, Xin Zhang, Xingchen Song, Lingxuan Ye, Long Lin, Daniel Povey
Main category: cs.CL
Summary unavailable: the arXiv API request for 2507.09318 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[23] LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines
Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech
Main category: cs.CL
TL;DR: A framework that transfers LLM knowledge into symbolic Tsetlin Machines for interpretable text classification, achieving BERT-like performance without embeddings or runtime LLM calls.
Details
Motivation: Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models like Tsetlin Machines offer transparency but lack semantic generalization. There's a need to combine interpretability with semantic capacity.
Method: Semantic bootstrapping framework: an LLM generates sub-intents for class labels and guides synthetic data creation through a three-stage curriculum (seed, core, enriched). A Non-Negated TM learns from these examples to extract high-confidence literals as interpretable semantic cues, which are then injected into real data to align TM clause logic with LLM-inferred semantics.
Result: Improves interpretability and accuracy over vanilla TM across multiple text classification tasks, achieving performance comparable to BERT while remaining fully symbolic and efficient. No embeddings or runtime LLM calls required.
Conclusion: Successfully equips symbolic models with pretrained semantic priors, combining interpretability of symbolic models with semantic capacity of LLMs.
Abstract: Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.
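One plausible reading of "extract high-confidence literals as interpretable semantic cues" is a class-conditional token filter over the synthetic examples, with the surviving literals injected into real inputs as symbolic features. The thresholds, helper names, and toy data below are assumptions, not the paper's implementation:

```python
from collections import Counter

def extract_literals(synthetic, label, min_conf=0.8, min_count=2):
    """Keep tokens that appear (almost) exclusively in one class's
    synthetic examples as high-confidence literal cues for that class."""
    in_class, out_class = Counter(), Counter()
    for text, y in synthetic:
        bucket = in_class if y == label else out_class
        bucket.update(set(text.lower().split()))
    cues = set()
    for tok, c in in_class.items():
        conf = c / (c + out_class[tok])
        if c >= min_count and conf >= min_conf:
            cues.add(tok)
    return cues

def inject(text, cues):
    """Append matched cue literals to a real example as extra features."""
    hits = sorted(t for t in set(text.lower().split()) if t in cues)
    return (text + " " + " ".join("CUE_" + h for h in hits)) if hits else text

synthetic = [("refund my payment", "billing"), ("payment failed again", "billing"),
             ("reset my password", "account"), ("password login broken", "account")]
cues = extract_literals(synthetic, "billing")
```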
[24] [b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic
Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen
Main category: cs.CL
Summary unavailable: the arXiv API request for 2602.18899 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[25] Thought-Retriever: Don’t Just Retrieve Raw Data, Retrieve Thoughts for Memory-Augmented Agentic Systems
Tao Feng, Pengrui Han, Guanyu Lin, Ge Liu, Jiaxuan You
Main category: cs.CL
TL;DR: Thought-Retriever is a model-agnostic algorithm that enables LLMs to generate outputs conditioned on arbitrarily long external data by leveraging and organizing intermediate responses from past queries into a self-evolving long-term memory.
Details
Motivation: Current LLMs fail to effectively incorporate massive external knowledge due to context length limitations, and retrieval-augmented approaches are constrained to retrieving only the top-K data chunks from among millions.
Method: The algorithm lets LLMs leverage intermediate responses (thoughts) from past queries, filters meaningless or redundant thoughts, organizes them in a thought memory, and retrieves relevant thoughts for new queries, creating a self-evolving long-term memory system.
Result: Thought-Retriever outperforms state-of-the-art baselines by at least 7.6% in F1 score and 16% in win rate across tasks, demonstrates self-evolution capability, and learns to leverage deeper thoughts for abstract queries.
Conclusion: The approach effectively equips LLM-based agents with scalable long-term memory that grows more capable through continuous interaction, overcoming context length limitations for external knowledge integration.
Abstract: Large language models (LLMs) have transformed AI research thanks to their powerful internal capabilities and knowledge. However, existing LLMs still fail to effectively incorporate the massive external knowledge when interacting with the world. Although retrieval-augmented LLMs are proposed to mitigate the issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve top-K raw data chunks from the external knowledge base which often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage its intermediate responses generated when solving past user queries (thoughts), filtering meaningless and redundant thoughts, organizing them in thought memory, and retrieving the relevant thoughts when addressing new queries. This effectively equips LLM-based agents with a self-evolving long-term memory that grows more capable through continuous interaction. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers. Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.
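A toy version of the thought memory (store thoughts, filter near-duplicates, retrieve by relevance), using Jaccard token overlap in place of the paper's actual filtering and retrieval machinery. All names and thresholds are illustrative:

```python
def _tokens(s):
    return set(s.lower().split())

class ThoughtMemory:
    """Toy thought memory: store intermediate answers, drop near-duplicate
    thoughts, and retrieve the most relevant ones for a new query."""
    def __init__(self, dedup_threshold=0.8):
        self.thoughts = []
        self.threshold = dedup_threshold

    def add(self, thought):
        t = _tokens(thought)
        for kept in self.thoughts:
            k = _tokens(kept)
            if len(t & k) / len(t | k) >= self.threshold:
                return False  # redundant thought: filtered out
        self.thoughts.append(thought)
        return True

    def retrieve(self, query, k=2):
        q = _tokens(query)
        return sorted(self.thoughts,
                      key=lambda th: -len(_tokens(th) & q))[:k]

mem = ThoughtMemory()
mem.add("transformers use attention to mix token information")
mem.add("transformers use attention to mix token information")  # filtered
mem.add("optimizers like adam adapt per-parameter learning rates")
top = mem.retrieve("how does attention work in transformers", k=1)
```

Because retrieval operates over compact thoughts rather than raw chunks, the memory can grow with every solved query without pressing on the context window.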
[26] Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
Peng Wang, Yanqiao Zhu, Zixuan Jiang, Qinyuan Chen, Xingjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen
Main category: cs.CL
Summary unavailable: the arXiv API request for 2604.09121 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[27] OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, James Zou
Main category: cs.CL
Summary unavailable: the arXiv API request for 2502.11271 was rate-limited (HTTP 429), so no abstract or analysis could be generated.
[28] Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature
Jinkai Tao, Yubo Wang, Xiaoyu Liu, Menglin Yang
Main category: cs.CL
TL;DR: CKM framework processes scientific literature through sliding time windows to incrementally update knowledge and generate hypotheses, with incremental processing outperforming batch methods on predictive metrics while reducing computational costs.
Details
Motivation: Scientific hypothesis generation needs to track knowledge evolution over time, not just current knowledge. Existing approaches often process literature in batch mode, missing the dynamic nature of scientific discovery and knowledge accumulation.
Method: Continuous Knowledge Metabolism (CKM) processes literature through sliding time windows, incrementally updating a structured knowledge base. CKM-Lite is an efficient variant for incremental accumulation, while CKM-Full categorizes findings as novel/confirming/contradicting, detects knowledge change signals, and conditions hypothesis generation on evolution trajectories.
Result: CKM-Lite outperforms batch processing on hit rate (+2.8%), hypothesis yield (+3.6), and best-match alignment (+0.43) while reducing token cost by 92%. Analysis of 892 hypotheses shows: 1) incremental processing beats batch baselines; 2) change-aware instrumentation increases novelty but reduces coverage; 3) field trajectory stability correlates with hypothesis success; 4) knowledge convergence signals yield 5x higher hit rate than contradictions.
Conclusion: Hypothesis generation quality depends on both how much literature is processed and how it’s processed. Evaluation frameworks must account for quality-coverage trade-offs rather than optimizing single metrics. The character of generated hypotheses is shaped by knowledge evolution patterns.
Abstract: Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen’s d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field’s trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types. These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.
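The novel/confirming/contradicting bookkeeping over sliding windows can be sketched as follows; the window size, polarity encoding, and function name are assumptions for illustration, not CKM's actual representation:

```python
def metabolize(findings, window=2):
    """Process (time, claim, polarity) findings in sliding time windows,
    incrementally updating a knowledge base and tagging each finding as
    novel, confirming, or contradicting relative to what came before."""
    kb = {}       # claim -> last observed polarity
    tagged = []
    findings = sorted(findings)  # chronological order
    for start in range(0, len(findings), window):
        for t, claim, polarity in findings[start:start + window]:
            if claim not in kb:
                tag = "novel"
            elif kb[claim] == polarity:
                tag = "confirming"
            else:
                tag = "contradicting"
            kb[claim] = polarity          # incremental update
            tagged.append((t, claim, tag))
    return kb, tagged

findings = [(1, "X causes Y", "+"), (2, "X causes Y", "+"),
            (3, "X causes Y", "-"), (4, "Z inhibits Y", "+")]
kb, tagged = metabolize(findings)
```

The tags are exactly the change signals CKM-Full conditions on: a run of "confirming" tags marks convergence, while a "contradicting" tag marks instability.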
[29] SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration
Zhuofan Wen, Yang Feng
Main category: cs.CL
TL;DR: A novel self-draft speculative decoding framework that uses layer-wise temperature annealing and adaptive speculation length to accelerate LLM inference while maintaining exact output equivalence.
Details
Motivation: Self-draft speculative decoding methods for LLM acceleration have limitations: shallow layers produce overconfident incorrect predictions, and difficult tokens in draft sequences force redundant computation through deeper layers, undermining draft acceptance and speedup.
Method: Proposes a self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decisions and adaptively bounds speculation length based on token-wise decoding difficulty. Uses reprocessing of hidden states in a unified parallel pass through deep layers.
Result: Achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures, with no modifications to base LLM parameters.
Conclusion: The proposed method effectively addresses limitations of existing self-draft speculative decoding approaches, providing significant inference acceleration while maintaining exact output equivalence with original models.
Abstract: Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decision and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.
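A schematic of the early-exit decision with layer-wise temperature annealing: shallow layers are scored at a higher temperature, damping spurious shallow confidence so the exit only fires once a layer is genuinely peaked. The schedule, threshold, and toy logits below are assumptions, not the paper's values:

```python
import math

def softmax(logits, temperature):
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_layer(layer_logits, threshold=0.9, t_start=2.0, t_end=1.0):
    """Scan layers shallow-to-deep; temperature anneals linearly from
    t_start (shallow) to t_end (deep). Exit at the first layer whose
    calibrated max probability clears the threshold."""
    n = len(layer_logits)
    for i, logits in enumerate(layer_logits):
        temp = t_start + (t_end - t_start) * i / max(n - 1, 1)
        probs = softmax(logits, temp)
        if max(probs) >= threshold:
            return i, probs.index(max(probs))
    return n - 1, probs.index(max(probs))  # fall through to the last layer

# Toy 3-layer logit stack for one token position (vocabulary of 3).
layers = [[2.0, 1.8, 0.0],   # shallow: mildly peaked, damped by temp 2.0
          [4.0, 1.0, 0.0],   # mid: peaked but still below threshold
          [8.0, 1.0, 0.0]]   # deep: clearly peaked, exit fires here
layer, token = early_exit_layer(layers)
```

In the same spirit, the adaptive bound would shrink the draft length whenever consecutive positions keep falling through to the deepest layer, i.e. when decoding difficulty is high.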
[30] Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System
Taehun Kim, Hyeryun Park, Hyeonhoon Lee, Yushin Lee, Kyungsang Kim, Hyung-Chul Lee
Main category: cs.CL
TL;DR: CARIS is an AI agent system that automates clinical research workflows using LLMs and modular tools via MCP, enabling natural language-driven research without direct data access.
Details
Motivation: Clinical research requires domain expertise, programming skills, and sensitive data access, creating barriers for clinicians and external researchers. Current processes are labor-intensive and limit data-driven studies.
Method: Integrates LLMs with modular tools via the Model Context Protocol (MCP), keeping databases secure on the MCP server. Automates the full pipeline: research planning, literature search, cohort construction, IRB documentation, Vibe ML, and report generation, with human-in-the-loop refinement.
Result: Evaluated on three heterogeneous datasets: research plans and IRB documents finalized in 3-4 iterations; supported Vibe ML by exploring feature-model combinations and ranking top models; reports achieved 96% LLM evaluation coverage and 82% human evaluation coverage based on TRIPOD+AI framework.
Conclusion: CARIS demonstrates agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets, lowering barriers by eliminating coding and direct data access needs.
Abstract: Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.
[31] CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades
Raeyoung Chang, Dongwook Kwon, Jisoo Lee, Nikhil Verma
Main category: cs.CL
TL;DR: CascadeDebate introduces multi-agent deliberation at escalation boundaries in cascaded LLM systems, using confidence-based routing to activate lightweight agent ensembles for uncertain queries, enabling internal resolution without costly upgrades to larger models or human experts.
Details
Motivation: Current cascaded LLM systems with single-model tiers at each stage struggle with ambiguous queries, causing premature escalations to costlier models or human experts due to under-confidence and inefficient compute scaling. There's a need for better handling of uncertain cases without immediately invoking expensive upgrades.
Method: CascadeDebate inserts multi-agent deliberation directly at each tier’s escalation boundary. It uses confidence-based routers to activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally. The unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, with human experts as the final fallback.
Result: Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75%. An online threshold optimizer boosts accuracy by 20.98 to 52.33% relative improvement over fixed policies and enables elastic adaptation to real-world distributions.
Conclusion: CascadeDebate effectively addresses the limitations of traditional cascaded LLM systems by introducing multi-agent deliberation at escalation boundaries, enabling dynamic scaling of test-time compute according to query difficulty while maintaining accuracy and cost efficiency.
Abstract: Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier’s escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.
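The routing logic (answer at a tier, debate only when uncertain, escalate only if the debate still cannot settle it) can be sketched with stand-in models. The thresholds and the majority-vote confidence rule are illustrative assumptions, not the paper's router:

```python
def cascade(query, tiers, debate_threshold=0.7, accept_threshold=0.6):
    """Walk tiers cheap-to-expensive. Each tier answers with a confidence;
    uncertain answers trigger a lightweight agent debate (majority vote,
    confidence averaged over the majority) before paying for the next tier."""
    for name, base_model, agents in tiers:
        answer, conf = base_model(query)
        if conf < debate_threshold and agents:   # router: uncertain case
            votes = [a(query) for a in agents]
            answers = [ans for ans, _ in votes]
            answer = max(set(answers), key=answers.count)
            conf = (sum(c for ans, c in votes if ans == answer)
                    / answers.count(answer))
        if conf >= accept_threshold:
            return name, answer                  # resolved internally
    return "human", None  # final fallback: escalate to a human expert

# Hypothetical tier: a shaky small model rescued by agreeing agents.
small = lambda q: ("A", 0.5)
agents = [lambda q: ("B", 0.8), lambda q: ("B", 0.7), lambda q: ("A", 0.9)]
tier_name, answer = cascade("q", [("small", small, agents)])
```

Here the base model's low-confidence "A" is overturned by the ensemble consensus "B", so the query never reaches a costlier tier; the paper's online optimizer would tune the two thresholds rather than fix them.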
[32] Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning
Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Hongsheng Li
Main category: cs.CL
TL;DR: SpreadsheetAgent: A two-stage multi-agent framework for spreadsheet understanding that incrementally processes large spreadsheets through multimodal analysis (code, images, LaTeX) rather than treating tables as plain text.
Details
Motivation: Existing LLM approaches treat spreadsheets as plain text, missing layout cues and visual semantics, and struggle with massive spreadsheets that exceed LLM input limits. Real-world spreadsheet understanding requires handling large-scale data with structural awareness.
Method: Two-stage multi-agent framework: 1) construct a structural sketch and row/column summaries through incremental interpretation of localized regions using multiple modalities (code execution results, images, LaTeX tables); 2) perform task-driven reasoning over the intermediate representation, with a verification module for error reduction.
Result: Achieves 38.16% on Spreadsheet Bench with GPT-OSS-120B, outperforming ChatGPT Agent baseline (35.27%) by 2.89 absolute points. Demonstrates effectiveness on two spreadsheet datasets.
Conclusion: SpreadsheetAgent advances robust and scalable spreadsheet understanding by addressing layout awareness and scale limitations through multimodal incremental processing and verification mechanisms.
Abstract: Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text, overlooking critical layout cues and visual semantics. Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. Instead of loading the entire spreadsheet at once, SpreadsheetAgent incrementally interprets localized regions through multiple modalities, including code execution results, images, and LaTeX tables. The method first constructs a structural sketch and row/column summaries, and then performs task-driven reasoning over this intermediate representation in the Solving Stage. To further enhance reliability, we design a verification module that validates extracted structures via targeted inspections, reducing error propagation and ensuring trustworthy inputs for downstream reasoning. Extensive experiments on two spreadsheet datasets demonstrate the effectiveness of our approach. With GPT-OSS-120B, SpreadsheetAgent achieves 38.16% on Spreadsheet Bench, outperforming the ChatGPT Agent baseline (35.27%) by 2.89 absolute points. These results highlight the potential of SpreadsheetAgent to advance robust and scalable spreadsheet understanding in real-world applications. Code is available at https://github.com/renhouxing/SpreadsheetAgent.git.
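One way to picture the incremental structural-sketch stage: read the sheet in localized regions and accumulate per-row type signatures and per-column summaries, so downstream reasoning never needs the whole sheet in context. This is a loose illustration under invented conventions, not the paper's pipeline:

```python
def sketch(grid, region_rows=2):
    """Read a sheet region-by-region instead of all at once, building a
    structural sketch: per-row type signatures ('n' numeric, 's' string)
    and per-column summaries (numeric range, or 'text')."""
    row_sigs, col_values = [], {}
    for start in range(0, len(grid), region_rows):     # localized regions
        for row in grid[start:start + region_rows]:
            sig = "".join("n" if isinstance(v, (int, float)) else "s"
                          for v in row)
            row_sigs.append(sig)
            for j, v in enumerate(row):
                col_values.setdefault(j, []).append(v)
    col_summary = {}
    for j, vals in col_values.items():
        nums = [v for v in vals if isinstance(v, (int, float))]
        col_summary[j] = (min(nums), max(nums)) if nums else "text"
    return row_sigs, col_summary

grid = [["item", "qty"], ["apples", 3], ["pears", 5]]
row_sigs, cols = sketch(grid)
```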
[33] ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
Haoran Li, Yulin Chen, Huihao Jing, Wenbin Hu, Tsz Ho Li, Chanhou Lou, Hong Ting Tsang, Sirui Han, Yangqiu Song
Main category: cs.CL
TL;DR: ContextLens: A semi-rule-based framework using LLMs to ground input context in legal domain for compliance assessment, identifying known/unknown factors for GDPR and EU AI Act compliance.
Details
Motivation: Current LLM-based safety and privacy assessments assume complete and clear context, but real-world contexts are ambiguous and incomplete. Better ways of reasoning about context are needed to identify and mitigate privacy and AI-safety risks.
Method: The ContextLens framework leverages LLMs to ground the input context in the legal domain and explicitly identifies known and unknown factors. Instead of direct safety assessment, it answers crafted questions spanning applicability, general principles, and detailed provisions to assess compliance against pre-defined priorities and rules.
Result: Significantly improves LLMs’ compliance assessment on GDPR and EU AI Act benchmarks, surpasses existing baselines without training. Can identify ambiguous and missing factors in contexts.
Conclusion: ContextLens provides effective framework for contextualized safety/privacy compliance assessment by addressing real-world ambiguity and incompleteness through legal grounding and structured questioning.
Abstract: Individuals’ concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, our ContextLens instructs LLMs to answer a set of crafted questions that span over applicability, general principles and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that our ContextLens can significantly improve LLMs’ compliance assessment and surpass existing baselines without any training. Additionally, our ContextLens can further identify the ambiguous and missing factors.
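The structured-question idea can be sketched as a checklist evaluator that separates known from unknown factors and applies a simple priority rule (applicability questions short-circuit the rest). The question set, factor names, and rule below are hypothetical, GDPR-flavored examples, not ContextLens's actual rules:

```python
def assess(context, questions):
    """Answer a fixed question set over the context, splitting factors into
    known (present in context) and unknown (missing), then apply a simple
    priority rule: a failed applicability question short-circuits the rest."""
    known, unknown, failures = {}, [], []
    for category, factor, required in questions:
        if factor not in context:
            unknown.append(factor)       # flagged, not assumed either way
            continue
        known[factor] = context[factor]
        if required and not context[factor]:
            failures.append((category, factor))
    if any(cat == "applicability" for cat, _ in failures):
        verdict = "not applicable"
    else:
        verdict = "non-compliant" if failures else "compliant (known factors)"
    return verdict, known, unknown

# Hypothetical question set: (category, factor, required).
questions = [("applicability", "processes_personal_data", True),
             ("general_principles", "has_lawful_basis", True),
             ("detailed_provisions", "retention_policy_defined", True)]
ctx = {"processes_personal_data": True, "has_lawful_basis": False}
verdict, known, unknown = assess(ctx, questions)
```

The key behavior mirrored here is that missing factors are surfaced as unknowns rather than silently folded into the verdict.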
[34] CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
Jingbo Yang, Guanyu Yao, Bairu Hou, Xinghan Yang, Nikolai Glushnev, Iwona Bialynicka-Birula, Duo Ding, Shiyu Chang
Main category: cs.CL
TL;DR: CompliBench: A benchmark for evaluating LLM judges’ ability to detect policy violations in multi-turn dialogues, with automated data generation pipeline for scalable evaluation.
Details
Motivation: As LLMs are deployed as task-oriented agents in enterprise settings, ensuring adherence to complex operational guidelines is critical. Current LLM-as-a-Judge evaluation methods lack reliability assessment for detecting specific policy violations due to data scarcity from expensive human annotation and difficulty synthesizing realistic agent violations.
Method: Developed CompliBench benchmark with scalable automated data generation pipeline that simulates user-agent interactions. Uses controllable flaw injection for precise ground-truth labels of violated guidelines and exact conversation turns, plus adversarial search to ensure challenging perturbations.
Result: Current state-of-the-art proprietary LLMs struggle significantly with detecting policy violations. A small-scale judge model fine-tuned on synthesized data outperforms leading LLMs and generalizes well to unseen business domains.
Conclusion: The automated pipeline provides effective foundation for training robust generative reward models, addressing the critical need for reliable LLM judges in enterprise policy compliance evaluation.
Abstract: As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.
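The core of the pipeline — controllable flaw injection yielding exact ground-truth labels for the violated guideline and conversation turn — can be illustrated with a toy version. The guideline name and the perturbation text below are invented; CompliBench's actual flaw taxonomy and adversarial search are not reproduced:

```python
# Toy illustration of controllable flaw injection for judge benchmarking:
# a clean dialogue is perturbed at one chosen turn, and the violated guideline
# plus the exact turn index are recorded as ground truth for free.

FLAWS = {
    "no_refunds_promised": lambda turn: turn + " I guarantee you a full refund.",
}

def inject_flaw(dialogue, guideline, turn_idx):
    flawed = list(dialogue)                       # leave the original intact
    flawed[turn_idx] = FLAWS[guideline](flawed[turn_idx])
    label = {"guideline": guideline, "turn": turn_idx}
    return flawed, label

clean = ["Agent: Hello, how can I help?", "Agent: Let me check your order."]
flawed, label = inject_flaw(clean, "no_refunds_promised", 1)
```

Because the perturbation is applied programmatically, the label needs no human annotation — which is exactly what makes the benchmark scalable.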
[35] ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection
Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou
Main category: cs.CL
TL;DR: ToxiTrace is an explainable toxic content detection method for Chinese text that provides readable toxic evidence spans using LLM-guided saliency refinement, gradient-constrained loss, and contrastive reasoning.
Details
Motivation: Existing Chinese toxic content detection methods focus on sentence-level classification but fail to provide readable and contiguous toxic evidence spans, limiting their explainability and practical utility.
Method: Three-component approach: (1) CuSA refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss uses a gradient-constrained objective to concentrate token-level saliency on toxic evidence while suppressing irrelevant activations; (3) ARCL constructs sample-specific contrastive reasoning pairs to sharpen semantic boundaries between toxic and non-toxic content.
Result: ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations.
Conclusion: The proposed method successfully addresses the explainability gap in Chinese toxic content detection by providing readable toxic evidence spans while maintaining classification performance.
Abstract: Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose \textbf{ToxiTrace}, an explainability-oriented method for BERT-style encoders with three components: (1) \textbf{CuSA}, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) \textbf{GCLoss}, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) \textbf{ARCL}, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.
[36] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov
Main category: cs.CL
TL;DR: LLMs have domain-specific privileged knowledge about answer correctness: self-representations outperform peer representations in factual tasks but not in math reasoning, especially on disagreement subsets where models produce conflicting predictions.
Details
Motivation: To investigate whether large language models possess privileged knowledge about answer correctness similar to human introspection, examining if models have internal states that provide information about correctness unavailable through external observation.
Method: Trained correctness classifiers on question representations from both a model’s own hidden states and external models, comparing self-probes vs peer-model probes. Evaluated on standard benchmarks and disagreement subsets where models produce conflicting predictions. Analyzed layer-wise localization of privileged knowledge.
Result: No advantage found in standard evaluation (self-probes comparable to peer-model probes). On disagreement subsets, discovered domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks but show no advantage in math reasoning. Factual advantage emerges progressively from early-to-mid layers onward.
Conclusion: LLMs possess domain-specific privileged knowledge about answer correctness, with factual knowledge showing self-representation advantages consistent with model-specific memory retrieval, while math reasoning lacks such privileged access.
Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model’s own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
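The probing setup behind the paper — train a correctness classifier on question representations, then compare self- vs peer-model probes — follows a standard recipe. A minimal stand-in, using a nearest-centroid rule over invented 3-d "representations" instead of the linear classifiers such probes typically use:

```python
# Toy correctness probe: fit centroids of "correct" and "incorrect" hidden-state
# vectors, then classify a new representation by squared distance. The vectors
# below are fabricated; real probes run on actual model hidden states.

def centroid(vecs):
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def fit_probe(reps, labels):
    pos = centroid([r for r, y in zip(reps, labels) if y == 1])  # correct
    neg = centroid([r for r, y in zip(reps, labels) if y == 0])  # incorrect
    return pos, neg

def predict(probe, rep):
    pos, neg = probe
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 if d(rep, pos) < d(rep, neg) else 0

reps = [[1.0, 0.9, 1.1], [0.9, 1.1, 1.0], [-1.0, -0.9, -1.1], [-1.1, -1.0, -0.9]]
labels = [1, 1, 0, 0]
probe = fit_probe(reps, labels)
```

The paper's comparison then amounts to fitting the same probe twice — once on the model's own hidden states, once on a peer model's — and checking which generalizes better on the disagreement subsets.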
[37] Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
Ziyang Liu
Main category: cs.CL
TL;DR: Cooperative paging uses keyword bookmarks to manage LLM context overflow, enabling on-demand recall of evicted content and outperforming other methods on multi-session conversation benchmarks.
Details
Motivation: LLMs have limited context windows, requiring old content to be evicted during long conversations, but there's no efficient way to recover this information when needed later in the dialogue.
Method: Proposes cooperative paging: replace evicted conversation segments with minimal keyword bookmarks (~8-24 tokens), and provide the model with a recall() tool to retrieve full content when needed. Evaluates different paging strategies (boundary policies, eviction policies) and bookmark generation methods.
Result: Cooperative paging achieves highest answer quality on LoCoMo benchmark (10 real multi-session conversations, 300+ turns), outperforming truncation, BM25, word-overlap retrieval, search-tool baseline, and full context across four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5). Key findings: coarse fixed-size pages work best (96.7%), content-aware topic_shift collapses (56.7%), eviction policy is data-dependent, and bookmark discrimination remains a bottleneck (57% correct page selection).
Conclusion: Cooperative paging is an effective approach for managing long conversations in LLMs, but bookmark specificity is crucial for accurate recall. The method demonstrates practical value for real-world multi-session dialogues.
Abstract: When LLM conversations grow beyond the context window, old content must be evicted – but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods – outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context – on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination – the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
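The bookmark-plus-recall mechanism is simple enough to sketch end to end. This is a minimal illustration, not the paper's implementation: here bookmark keywords are just the first word of each evicted turn, whereas the paper studies learned and heuristic bookmark generators precisely because keyword quality drives recall accuracy:

```python
# Minimal sketch of cooperative paging: evicted conversation segments become
# short [pN:keywords] bookmarks in the visible window, and a recall() tool
# returns the full evicted text on demand.

class Pager:
    def __init__(self, page_size=2):
        self.page_size = page_size
        self.pages = {}      # page id -> full evicted turns
        self.window = []     # what the model actually sees

    def add(self, turn):
        self.window.append(turn)

    def evict_oldest_page(self):
        page, self.window = self.window[:self.page_size], self.window[self.page_size:]
        pid = f"p{len(self.pages)}"
        self.pages[pid] = page
        keywords = " ".join(t.split()[0] for t in page)   # naive keyword choice
        self.window.insert(0, f"[{pid}:{keywords}]")      # compact bookmark

    def recall(self, pid):
        return self.pages[pid]

pager = Pager()
for t in ["dogs are great", "cats are fine", "what pets do I like?"]:
    pager.add(t)
pager.evict_oldest_page()
```

After eviction the window holds `[p0:dogs cats]` plus the live turns; the model must notice the bookmark is relevant and call `recall("p0")` — the 57% page-selection bottleneck the paper reports lives exactly in that step.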
[38] SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models
SungHo Kim, Juhyeong Park, Eda Atalay, SangKeun Lee
Main category: cs.CL
TL;DR: SCRIPT is a model-agnostic module that injects subcharacter compositional knowledge into Korean language models by enhancing subword embeddings with structural granularity from Jamo units, improving performance on Korean NLU and NLG tasks.
Details
Motivation: Korean has a featural writing system where characters are composed of subcharacter units (Jamo) that encode linguistically meaningful morphophonological processes. Current Korean language models use subword tokenization that doesn't capture this internal compositional structure, limiting their ability to represent Korean's linguistic properties.
Method: SCRIPT is a model-agnostic module that enhances subword embeddings with structural granularity from Jamo units. It doesn’t require architectural changes or additional pre-training, but rather injects subcharacter compositional knowledge into existing Korean PLMs by decomposing characters into their constituent Jamo components.
Result: SCRIPT enhances all baselines across various Korean natural language understanding and generation tasks. Linguistic analyses show it reshapes the embedding space to better capture grammatical regularities and semantically cohesive variations.
Conclusion: The proposed SCRIPT module effectively injects subcharacter compositional knowledge into Korean language models, improving performance on Korean NLP tasks while better capturing the linguistic structure of Korean through its featural writing system.
Abstract: Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enriches subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. Our code is available at https://github.com/SungHo3268/SCRIPT.
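The Jamo decomposition SCRIPT builds on is fully deterministic via Unicode arithmetic, independent of the module itself. A minimal decomposer following the standard Hangul composition formula (this shows the subcharacter units only, not SCRIPT's injection mechanism):

```python
# Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into its leading
# consonant, vowel, and optional trailing consonant Jamo, using the Unicode
# Hangul syllable composition formula: S = 0xAC00 + (L*21 + V)*28 + T.

L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

def to_jamo(ch):
    s = ord(ch) - S_BASE
    if not 0 <= s < 19 * V_COUNT * T_COUNT:
        return [ch]                                   # not a composed syllable
    lead = chr(L_BASE + s // (V_COUNT * T_COUNT))
    vowel = chr(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT)
    tail_idx = s % T_COUNT
    jamo = [lead, vowel]
    if tail_idx:                                      # tail index 0 = no tail
        jamo.append(chr(T_BASE + tail_idx))
    return jamo

# "한" (U+D55C) decomposes into ᄒ + ᅡ + ᆫ
```

Because every syllable maps losslessly to at most three Jamo, a module can attach subcharacter features to existing subword embeddings without retokenizing the corpus.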
[39] ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance
Daniil Gurgurov, Tom Röhr, Sebastian von Rohrscheidt, Josef van Genabith, Alexander Löser, Simon Ostermann
Main category: cs.CL
TL;DR: ReasonXL: A parallel corpus for training LLMs to reason in non-English languages, with a two-stage adaptation pipeline (SFT + RLVR) that enables models to reason entirely in target languages while maintaining performance.
Details
Motivation: Most LLMs are English-centric and reason in English even when solving non-English problems, creating a mismatch for non-English usage scenarios. This paper addresses the need for models that can reason directly in target languages.
Method: Three contributions: (1) ReasonXL corpus with over 2M aligned samples per language across 5 European languages; (2) Two-stage adaptation pipeline: supervised fine-tuning followed by reinforcement learning with verifiable rewards; (3) Representational analysis of adaptation across model depth.
Result: Models adapted with this approach match or exceed baseline performance, maintain general knowledge, preserve cross-lingual transfer, and show RLVR achieves greater behavioral divergence with smaller parameter updates than SFT.
Conclusion: LLMs can be effectively adapted to reason entirely in target languages using parallel reasoning traces and a two-stage adaptation pipeline, with RLVR providing more efficient representational rerouting than SFT alone.
Abstract: Despite advances in multilingual capabilities, most large language models (LLMs) remain English-centric in their training and, crucially, in their production of reasoning traces. Even when tasked with non-English problems, these models predominantly reason in English, creating a fundamental mismatch for non-English usage scenarios. We address this disparity directly with three contributions. (i) We introduce ReasonXL, the first large-scale parallel corpus of cross-domain reasoning traces spanning five European languages (English, German, French, Italian, and Spanish), with over two million aligned samples per language, each comprising prompts, reasoning traces, and final outputs, enabling direct supervision of language-specific reasoning. (ii) Using ReasonXL, we demonstrate that LLMs can be adapted to reason entirely in a desired target language, using a simple two-stage pipeline of supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR). The resulting models match or exceed baseline performance, with minimal loss in general knowledge and broadly preserved cross-lingual transfer. (iii) We conduct an extensive representational analysis of the adaptation and find a clear functional division across model depth: early layers contain an activation bottleneck that causally determines language identity, while upper layers concentrate the weight and activation changes driven by adaptation. We further find that RLVR achieves greater behavioral divergence from the base model with smaller parameter updates than SFT, suggesting a more efficient representational rerouting despite much smaller weight updates.
[40] From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue
Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Hang Zeng, Shaojie Tang, Fan Wu, Guihai Chen
Main category: cs.CL
TL;DR: DialRouter: A multi-turn LLM routing system using MCTS exploration and learned lightweight policies for optimal cumulative performance in dialogues
Details
Motivation: Existing LLM routing methods are myopic and fail to maximize cumulative performance in multi-turn dialogues due to interaction dynamics and delayed rewards, necessitating a shift from single-turn to long-horizon sequential routing.
Method: Proposes DialRouter which: 1) uses Monte Carlo Tree Search (MCTS) to explore dialogue branches from different LLM selections and collect high-reward trajectories, 2) learns a lightweight routing policy from search data augmented with retrieval-based future state approximation, enabling multi-turn routing without online search.
Result: Experiments on open-domain and domain-specific dialogue tasks across diverse candidate sets (open-source and closed-source LLMs) show DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, achieving superior performance-cost trade-off with cost-aware rewards.
Conclusion: DialRouter successfully addresses the challenge of multi-turn LLM routing by moving from myopic selection to long-horizon sequential routing, demonstrating effectiveness across diverse dialogue tasks and LLM combinations.
Abstract: Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.
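The intuition behind search-based routing — scoring each candidate model by simulated future turns rather than the current turn alone — can be shown with a toy example. This uses flat Monte Carlo rollouts rather than DialRouter's full MCTS, and the model names, success probabilities, and cost model are all fabricated:

```python
import random

# Toy long-horizon routing: each candidate "LLM" is scored by random rollouts
# of the remaining turns under an invented reward model (successes minus a
# one-off cost). A myopic router would pick per-turn; this scores the horizon.

MODELS = {"small": 0.6, "large": 0.9}   # invented per-turn success probabilities
COST = {"small": 0.1, "large": 1.0}     # invented invocation costs

def rollout(model, horizon, rng):
    successes = sum(rng.random() < MODELS[model] for _ in range(horizon))
    return successes - COST[model]

def route(horizon, n_rollouts=500, rng=None):
    rng = rng or random.Random(0)
    scores = {m: sum(rollout(m, horizon, rng) for _ in range(n_rollouts)) / n_rollouts
              for m in MODELS}
    return max(scores, key=scores.get)
```

With a long horizon the expensive model's quality amortizes its cost, while for a single remaining turn the cheap model wins — the kind of horizon-dependent switch a myopic router cannot make.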
[41] KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates
Yudong Li, Jiawei Cai, Linlin Shen
Main category: cs.CL
TL;DR: KoCo introduces semantic coordinates to give LLMs explicit contextual awareness during pre-training, improving performance, convergence speed, and reducing hallucinations.
Details
Motivation: Standard LLM pre-training flattens documents into token sequences, missing the real-world contextual structure that humans naturally use to understand information. This gap limits models' ability to properly contextualize knowledge.
Method: Knowledge Coordinate Conditioning (KoCo) maps each document into a three-dimensional semantic coordinate that represents its position in knowledge space. These coordinates are prepended as textual prefixes during pre-training to give the model explicit contextual awareness.
Result: KoCo significantly enhances performance across 10 downstream tasks, accelerates pre-training convergence by approximately 30%, and helps models distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.
Conclusion: Explicitly modeling knowledge coordinates provides LLMs with better contextual awareness, leading to improved performance, faster training, and reduced hallucinations by helping models understand the structural relationships between documents.
Abstract: Standard Large Language Model (LLM) pre-training typically treats corpora as flattened token sequences, often overlooking the real-world context that humans naturally rely on to contextualize information. To bridge this gap, we introduce Knowledge Coordinate Conditioning (KoCo), a simple method that maps every document into a three-dimensional semantic coordinate. By prepending these coordinates as textual prefixes for pre-training, we aim to equip the model with explicit contextual awareness to learn the documents within the real-world knowledge structure. Experiment results demonstrate that KoCo significantly enhances performance across 10 downstream tasks and accelerates pre-training convergence by approximately 30%. Furthermore, our analysis indicates that explicitly modeling knowledge coordinates helps the model distinguish stable facts from noise, effectively mitigating hallucination in generated outputs.
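Mechanically, the conditioning step is just prefix construction. The sketch below shows only the data-pipeline shape; the paper derives coordinates semantically, whereas a content hash stands in here, and the `<coord x,y,z>` prefix format is an assumption, not KoCo's actual syntax:

```python
import hashlib

# Sketch of KoCo-style conditioning: map each document to a 3-d "knowledge
# coordinate" and prepend it as a textual prefix before pre-training.
# The hash-based coordinate is a placeholder for the paper's semantic mapping.

def knowledge_coordinate(doc, grid=100):
    h = hashlib.sha256(doc.encode()).digest()
    return tuple(h[i] % grid for i in range(3))   # deterministic per document

def with_prefix(doc):
    x, y, z = knowledge_coordinate(doc)
    return f"<coord {x},{y},{z}> {doc}"

sample = with_prefix("Photosynthesis converts light into chemical energy.")
```

The key property any real coordinate function must share with this placeholder is determinism: the same document always lands at the same coordinate, so the model can associate regions of the prefix space with regions of knowledge.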
[42] Agentic Insight Generation in VSM Simulations
Micha Selak, Dirk Krechel, Adrian Ulges, Sven Spieckermann, Niklas Stoehr, Andreas Loehr
Main category: cs.CL
TL;DR: Two-step agentic architecture for analyzing complex value stream map simulations using LLMs with domain knowledge infusion
Details
Motivation: Extracting insights from value stream map simulations is challenging and error-prone; existing LLM approaches struggle with the subtle situational differences needed to distinguish similar data sources in this domain.
Method: Proposes a decoupled, two-step agentic architecture that separates orchestration from data analysis, leveraging progressive data discovery infused with domain expert knowledge.
Result: Top-tier LLMs achieve up to 86% accuracy with high robustness across evaluation runs, demonstrating framework viability
Conclusion: The proposed architecture effectively addresses limitations of existing approaches for value stream map analysis by enabling intelligent data source selection and multi-hop reasoning
Abstract: Extracting actionable insights from complex value stream map simulations can be challenging, time-consuming, and error-prone. Recent advances in large language models offer new avenues to support users with this task. While existing approaches excel at processing raw data to gain information, they are structurally unfit to pick up on subtle situational differences needed to distinguish similar data sources in this domain. To address this issue, we propose a decoupled, two-step agentic architecture. By separating orchestration from data analysis, the system leverages progressive data discovery infused with domain expert knowledge. This architecture allows the orchestration to intelligently select data sources and perform multi-hop reasoning across data structures while maintaining a slim internal context. Results from multiple state-of-the-art large language models demonstrate the framework’s viability, with top-tier models achieving accuracies of up to 86% and high robustness across evaluation runs.
[43] Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
Sihang Jia, Shuliang Liu, Songbo Yang, Yibo Yan, Xin Zou, Xuming Hu
Main category: cs.CL
TL;DR: DeP is a training-free framework that reduces hallucinations in multimodal LLMs by applying controlled textual perturbations during decoding to counteract language priors dominating visual evidence.
Details
Motivation: Multimodal LLMs suffer from inference hallucinations due to language priors dominating visual evidence. Existing training-free methods either perturb visual representations (deviating from natural image distribution) or use intrusive manipulations that compromise generative fluency.
Method: DeP (Decoding by Perturbation) uses controlled textual interventions during decoding. It applies multi-level textual perturbations to elicit latent language priors, leverages attention variance to enhance stable evidence regions while suppressing suspicious noise, and constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences.
Result: Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.
Conclusion: DeP provides a novel training-free approach to mitigate multimodal hallucinations by addressing the hypersensitivity of visual grounding to textual phrasing during decoding, offering better performance than existing methods.
Abstract: Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model’s inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.
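One way to picture the logits-statistics idea: average the logit shift induced by textual perturbations of the prompt and subtract it as a "prior drift" direction. This is a heavily simplified reading of the abstract — DeP operates on real model logits and attention variance, which the toy numbers below do not reproduce:

```python
# Toy sketch of drift correction over a vocabulary of 3 tokens: perturbing the
# text inflates the logit of token 0 (a language-prior effect), so the average
# perturbation-induced shift is subtracted from the base logits.

def drift_corrected(base_logits, perturbed_logits_list, alpha=1.0):
    k = len(perturbed_logits_list)
    drift = [sum(p[i] - base_logits[i] for p in perturbed_logits_list) / k
             for i in range(len(base_logits))]
    return [b - alpha * d for b, d in zip(base_logits, drift)]

base = [2.0, 1.0, 0.5]
perturbed = [[2.5, 1.0, 0.5], [2.3, 1.0, 0.5]]   # token 0 inflated by priors
corrected = drift_corrected(base, perturbed)      # token 0 pulled back down
```

Tokens whose probability is insensitive to textual phrasing are left untouched; only the phrasing-sensitive (prior-driven) directions are dampened.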
[44] GLeMM: A large-scale multilingual dataset for morphological research
Hathout Nabil, Basilio Calderone, Fiammetta Namer, Franck Sajous
Main category: cs.CL
TL;DR: GLeMM is a large-scale, automated derivational morphology resource covering 7 European languages, created from Wiktionary to enable data-driven research on form-meaning relations in word formation.
Details
Motivation: Current studies in derivational morphology rely on intuition and limited data, making them difficult to replicate and generalize. There's a need for large-scale, automated resources to enable data-driven morphological research.
Method: GLeMM is created automatically from Wiktionary articles using an identical design across all languages. It includes automatic annotation of morphological features and semantic descriptions for a significant subset of entries.
Result: A large-scale derivational resource covering German, English, Spanish, French, Italian, Polish, and Russian, enabling researchers to address questions about form-meaning relations and test computational methods for derivational morphology.
Conclusion: GLeMM provides a valuable resource for data-driven morphological research, addressing reproducibility issues and enabling systematic study of derivational morphology across multiple languages.
Abstract: In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? The answers to this type of questions are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, and Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, as well as (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word-formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.
[45] Latent-Condensed Transformer for Efficient Long Context Modeling
Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan
Main category: cs.CL
TL;DR: Latent-Condensed Attention (LCA) jointly reduces computational cost and KV cache in LLMs by condensing context within a latent space, achieving 2.5× speedup and 90% KV cache reduction at 128K context.
Details
Motivation: LLMs struggle with long contexts due to linear KV cache growth and quadratic self-attention complexity. Existing approaches address these bottlenecks separately, but sparse methods can't operate on compressed latent structures, missing joint optimization opportunities.
Method: Proposes Latent-Condensed Attention (LCA) which directly condenses context within MLA’s latent space. The representation is disentangled into semantic latent vectors and positional keys, with semantic vectors aggregated via query-aware pooling and positional keys preserved via anchor selection.
Result: LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance. The method is architecture-agnostic and extends to other attention mechanisms like GQA.
Conclusion: LCA provides an effective joint optimization approach for reducing both computational cost and KV cache in LLMs for long contexts, with theoretical guarantees and practical efficiency improvements.
Abstract: Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA’s compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA’s latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA’s design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5$\times$ prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
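The "query-aware pooling" ingredient is a standard attention-style aggregation: weight each latent vector by softmax of its dot product with the query, then sum. A generic sketch of that operation only — it does not reproduce LCA's disentangled latent/positional design or anchor selection:

```python
import math

# Generic query-aware pooling: n latent vectors are condensed into one vector
# weighted by softmax(q . k_i). The max-subtraction keeps exp() numerically
# stable, as in standard softmax implementations.

def query_aware_pool(query, latents):
    scores = [sum(q * l for q, l in zip(query, vec)) for vec in latents]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(latents[0])
    return [sum(w * vec[i] for w, vec in zip(weights, latents)) for i in range(dim)]

# Query aligned with the first latent vector pulls the pooled result toward it.
pooled = query_aware_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Condensation comes from running this with several learned queries, replacing many cached latent vectors with a few pooled ones — which is what shrinks both compute and the KV cache at once.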
[46] Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe
Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra
Main category: cs.CL
TL;DR: Strategic prompting extracts usable text data from commercial LLMs for low-resource West African languages Hausa and Fongbe, with GPT-4o Mini outperforming Gemini 2.5 Flash by 6-41x.
Details
Motivation: LLMs are trained on data from low-resource language communities, yet the linguistic knowledge they encode remains accessible only through commercial APIs; this paper explores whether strategic prompting can extract usable text data for underrepresented languages.
Method: Systematically compares six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash) for Hausa and Fongbe, analyzing optimal prompting strategies for each language.
Result: GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini; optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts.
Conclusion: Strategic prompting can effectively extract usable text data from commercial LLMs for low-resource languages, with performance varying significantly between models and requiring language-specific approaches.
Abstract: Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.
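The paper's core metric is usable target-language words per API call. A minimal sketch of how such a count could be computed, under our own assumptions: "usable" is approximated here by membership in a small seed lexicon (the Hausa words and responses below are illustrative stand-ins, not the paper's filtering procedure).

```python
# Illustrative seed lexicon of common Hausa words (a stand-in for the
# paper's actual usability criterion).
HAUSA_SEED_LEXICON = {"ruwa", "gida", "yaro", "littafi", "kasuwa"}

def usable_words_per_call(responses, lexicon):
    """Average count of lexicon-verified words across model responses."""
    counts = []
    for text in responses:
        tokens = text.lower().split()
        counts.append(sum(tok in lexicon for tok in tokens))
    return sum(counts) / len(counts)

# Invented example responses from two hypothetical API calls.
responses = ["Yaro ya tafi kasuwa", "Ruwa yana cikin gida"]
print(usable_words_per_call(responses, HAUSA_SEED_LEXICON))  # 2.0
```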
[47] Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning
Shanyong Wang, Shuhang Lin, Yining Zhao, Xi Zhu, Yongfeng Zhang
Main category: cs.CL
TL;DR: PFT framework aligns LLMs with contradictory and evolving individual preferences using a new Value Conflict Dilemma dataset, outperforming traditional methods in handling conflicting preferences.
Details
Motivation: While LLMs have improved alignment with general human preferences, they struggle with adapting to diverse, dynamic, and contradictory individual preferences, which is crucial for personalized AI systems.
Method: Introduces the Preference-Paired Fine-Tuning (PFT) framework and the Value Conflict Dilemma (VCD) dataset, which contains scenarios with conflicting human preferences for evaluation.
Result: PFT achieves 96.6% accuracy in multi-choice classification, highest open-ended generation score of 8.69, outperforms DPO, SFT and traditional methods, and shows 44.76% improvement in user-specific preference alignment with limited data.
Conclusion: PFT effectively handles contradictory and evolving individual preferences, enabling better personalization of LLMs to diverse user needs.
Abstract: Recent advances in large language models (LLMs) have significantly improved the alignment of models with general human preferences. However, a major challenge remains in adapting LLMs to individual preferences, which are not only diverse but also dynamic. In this paper, we introduce a novel framework, Preference-Paired Fine-Tuning (PFT), designed to align models with contradictory and evolving individual preferences. We present a new dataset, Value Conflict Dilemma (VCD), which includes scenarios that involve conflicting human preferences, facilitating the evaluation of our approach. Our experiments demonstrate that PFT outperforms single-preference training methods, achieving up to 96.6% accuracy in multi-choice classification tasks and the highest open-ended generation score of 8.69. PFT also shows significant improvements over DPO, SFT and some traditional training methods, especially when handling conflicting preferences. Additionally, with limited user history data, models can rapidly infer a preference vector, achieving a 44.76% improvement in user-specific preference alignment in comparison to single-preference models.
[48] KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning
Shuai Wang, Yinan Yu
Main category: cs.CL
TL;DR: KG-Reasoner: An RL-trained LLM framework that integrates multi-step KG reasoning into a unified thinking phase, enabling dynamic path exploration and backtracking for complex knowledge-intensive queries.
Details
Motivation: LLMs struggle with knowledge-intensive reasoning despite strong language abilities. While KGs provide external knowledge representation, existing approaches decompose reasoning into isolated pipeline steps, constraining flexibility and causing information loss between steps.
Method: KG-Reasoner uses Reinforcement Learning to train an LLM to internalize KG traversal, creating a unified “thinking” phase where the model dynamically explores reasoning paths and performs backtracking when needed, rather than using fixed pipeline steps.
Result: Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks show KG-Reasoner achieves competitive or superior performance compared to state-of-the-art methods.
Conclusion: The end-to-end RL framework enables more flexible and coherent reasoning over KGs by integrating multi-step reasoning into a unified thinking process, addressing limitations of pipeline-based approaches.
Abstract: Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified “thinking” phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Codes are available at the repository: https://github.com/Wangshuaiia/KG-Reasoner.
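The traversal-with-backtracking behavior the paper trains the LLM to internalize can be illustrated with a toy graph search. This sketch is our own (the graph, entities, and relations are invented); KG-Reasoner performs this exploration inside the model's thinking phase rather than with explicit code.

```python
# Invented toy knowledge graph: (head, relation) -> list of tails.
KG = {
    ("Ada", "born_in"): ["London"],
    ("Ada", "field"): ["Computing"],
    ("London", "capital_of"): ["UK"],
    ("UK", "continent"): ["Europe"],
}

def multi_hop(entity, relations):
    """Depth-first traversal along a relation path; when a branch
    dead-ends, returning None lets the caller backtrack and try the
    next intermediate entity."""
    if not relations:
        return entity
    for nxt in KG.get((entity, relations[0]), []):
        result = multi_hop(nxt, relations[1:])
        if result is not None:
            return result
    return None  # dead end -> backtrack

print(multi_hop("Ada", ["born_in", "capital_of", "continent"]))  # Europe
```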
[49] Calibrated Confidence Estimation for Tabular Question Answering
Lukas Voss
Main category: cs.CL
TL;DR: Systematic study of LLM calibration for tabular QA shows severe overconfidence, reveals self-evaluation vs perturbation dichotomy, and proposes Multi-Format Agreement method for better confidence estimation.
Details
Motivation: While LLMs are increasingly used for tabular question answering, their calibration on structured data remains largely unstudied, with existing work focusing mainly on textual QA.
Method: Compares five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. Proposes Multi-Format Agreement (MFA), which exploits deterministic serialization variations (Markdown, HTML, JSON, CSV) unique to structured data.
Result: All models show severe overconfidence (smooth ECE 0.35-0.64 vs 0.10-0.15 for textual QA). Self-evaluation methods achieve AUROC 0.42-0.76 while perturbation methods achieve 0.78-0.86. MFA reduces ECE by 44-63% and achieves mean AUROC 0.80.
Conclusion: LLMs are poorly calibrated for tabular QA. Perturbation methods outperform self-evaluation methods. MFA provides effective confidence estimation at lower cost and combines well with sampling methods.
Abstract: Large language models (LLMs) are increasingly deployed for tabular question answering, yet calibration on structured data is largely unstudied. This paper presents the first systematic comparison of five confidence estimation methods across five frontier LLMs and two tabular QA benchmarks. All models are severely overconfident (smooth ECE 0.35-0.64 versus 0.10-0.15 reported for textual QA). A consistent self-evaluation versus perturbation dichotomy replicates across both benchmarks and all four fully-covered models: self-evaluation methods (verbalized, P(True)) achieve AUROC 0.42-0.76, while perturbation methods (semantic entropy, self-consistency, and our Multi-Format Agreement) achieve AUROC 0.78-0.86. Per-model paired bootstrap tests reject the null at p<0.001 after Holm-Bonferroni correction, and a 3-seed check on GPT-4o-mini gives a per-seed standard deviation of only 0.006. The paper proposes Multi-Format Agreement (MFA), which exploits the lossless and deterministic serialization variation unique to structured data (Markdown, HTML, JSON, CSV) to estimate confidence at 20% lower API cost than sampling baselines. MFA reduces ECE by 44-63%, generalizes across all four models on TableBench (mean AUROC 0.80), and combines complementarily with sampling: an MFA + self-consistency ensemble lifts AUROC from 0.74 to 0.82. A secondary contribution, structure-aware recalibration, improves AUROC by +10 percentage points over standard post-hoc methods.
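The Multi-Format Agreement idea lends itself to a short sketch. This is our illustration, not the paper's code: we use a three-format subset (the paper also uses HTML), and `fake_llm` is a stub standing in for an API call that here misreads only the CSV view.

```python
import json
from collections import Counter

def serialize(table, fmt):
    """Lossless serializations of the same table (row-of-lists form)."""
    if fmt == "json":
        return json.dumps(table)
    if fmt == "csv":
        return "\n".join(",".join(map(str, row)) for row in table)
    return "\n".join("| " + " | ".join(map(str, row)) + " |" for row in table)

def multi_format_agreement(answer_fn, table, question,
                           fmts=("json", "csv", "markdown")):
    """Query once per serialization; confidence = majority vote share."""
    answers = [answer_fn(serialize(table, f), question) for f in fmts]
    majority, votes = Counter(answers).most_common(1)[0]
    return majority, votes / len(answers)

# Stub "model" that answers wrongly on the CSV view, for demonstration.
def fake_llm(serialized, question):
    return "42" if serialized.startswith(("[", "|")) else "41"

table = [["year", "sales"], [2023, 42]]
ans, conf = multi_format_agreement(fake_llm, table, "sales in 2023?")
print(ans, round(conf, 2))  # 42 0.67
```

Because serializations are deterministic, each format costs one non-sampled call, which is where the paper's cost advantage over sampling-based baselines comes from.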
[50] Latent Planning Emerges with Scale
Michael Hanna, Emmanuel Ameisen
Main category: cs.CL
TL;DR: LLMs exhibit latent planning abilities that increase with scale, where internal representations cause future token generation and shape preceding context, demonstrated through simple planning tasks and rhyming couplets.
Details
Motivation: To understand the extent to which LLMs implicitly plan during text generation, despite not explicitly verbalizing plans, and to develop a framework for measuring this latent planning capability.
Method: Defines latent planning as internal representations that (1) cause specific future token generation and (2) shape preceding context. Studies the Qwen-3 family (0.6B-14B) on simple planning tasks and rhyming couplets, analyzing feature representations and planning mechanisms.
Result: Latent planning ability increases with model scale. Larger models possess features representing planned words and cause appropriate context generation. On rhyming tasks, models often identify rhymes ahead of time but seldom plan far ahead. Planning can be elicited and increases with scale when steering toward planned words.
Conclusion: LLMs do exhibit latent planning capabilities that scale with model size, providing mechanistic evidence of how planning abilities develop in language models and offering a framework for measuring such planning.
Abstract: LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like “accountant”, and cause them to output “an” rather than “a”; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models’ planning abilities grow with scale.
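The paper's "a"/"an" probe has a simple surface logic worth making concrete. This is our own toy rendering of the behavior, not the authors' probe: a representation of a planned future word must shape the preceding token, and the crude vowel rule below stands in for the article-licensing effect they trace to planning features.

```python
def article_for(planned_word):
    """Pick the article licensed by the planned word (crude vowel rule,
    ignoring exceptions like 'hour' or 'university')."""
    return "an" if planned_word[0].lower() in "aeiou" else "a"

# The planned word determines the *earlier* token -- the signature of
# latent planning the paper measures.
for word in ["accountant", "doctor"]:
    print(article_for(word), word)
```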
[51] Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting
Shuai Wang, Xixi Wang, Yinan Yu
Main category: cs.CL
TL;DR: A graph-based soft prompting framework for multi-hop KBQA that uses GNNs to encode structural subgraphs into soft prompts, enabling LLMs to reason over richer context and be less sensitive to missing edges in knowledge graphs.
Details
Motivation: LLMs are prone to hallucinations in knowledge-intensive scenarios, and while KBQA grounds generation in KGs, most multi-hop KBQA methods rely on explicit edge traversal, which is fragile to KG incompleteness.
Method: Proposes a graph-based soft prompting framework that shifts reasoning from node-level path traversal to subgraph-level reasoning. Uses a GNN to encode extracted structural subgraphs into soft prompts, enabling LLMs to reason over richer structural context. Introduces a two-stage paradigm: a lightweight LLM first uses soft prompts to identify relevant entities/relations, then a more powerful LLM generates evidence-aware answers.
Result: Achieves state-of-the-art performance on three out of four multi-hop KBQA benchmarks, demonstrating effectiveness in reducing sensitivity to missing edges.
Conclusion: The graph-based soft prompting framework effectively addresses KG incompleteness issues in multi-hop KBQA by enabling richer structural reasoning through GNN-encoded soft prompts.
Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we propose a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling the LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.
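The subgraph-to-soft-prompt step can be sketched with plain array math. This is our own minimal illustration, not the paper's GNN: one round of mean-aggregation message passing over invented node features, then pooling into k "soft prompt" vectors that would be prepended to the LLM's input embeddings. All shapes and the projection are assumptions.

```python
import numpy as np

def subgraph_to_soft_prompt(node_feats, adj, proj, k=2):
    """One message-passing round, then pool node states into k soft tokens."""
    deg = adj.sum(1, keepdims=True).clip(min=1)
    h = (adj @ node_feats) / deg          # aggregate neighbor features
    h = np.tanh(h + node_feats)           # combine with self features
    groups = np.array_split(h, k)         # k groups of node states
    prompt = np.stack([g.mean(0) for g in groups]) @ proj
    return prompt                          # (k, llm_dim) soft prompt tokens

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 4))           # 6 nodes, 4-dim features
adj = (rng.random((6, 6)) > 0.5).astype(float)
proj = rng.normal(size=(4, 16))           # project into llm_dim = 16
print(subgraph_to_soft_prompt(feats, adj, proj).shape)  # (2, 16)
```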
[52] Enhance-then-Balance Modality Collaboration for Robust Multimodal Sentiment Analysis
Kang He, Yuzhe Ding, Xinrong Wang, Fei Li, Chong Teng, Donghong Ji
Main category: cs.CL
TL;DR: EBMC framework improves multimodal sentiment analysis by enhancing weaker modalities and balancing modality contributions through semantic disentanglement, energy-guided coordination, and instance-aware trust distillation.
Details
Motivation: Current multimodal sentiment analysis approaches struggle with modality imbalance, where dominant modalities overshadow weaker ones, leading to competition that degrades fusion performance and robustness, especially under noisy or missing-modality conditions.
Method: Proposes the Enhance-then-Balance Modality Collaboration (EBMC) framework with three components: 1) semantic disentanglement and cross-modal enhancement to strengthen weaker modalities, 2) Energy-guided Modality Coordination for implicit gradient rebalancing via a differentiable equilibrium objective, and 3) Instance-aware Modality Trust Distillation for sample-level reliability estimation and adaptive fusion weight modulation.
Result: EBMC achieves state-of-the-art or competitive results on multimodal sentiment analysis benchmarks and maintains strong performance under missing-modality settings, demonstrating improved robustness.
Conclusion: The EBMC framework effectively addresses modality imbalance in multimodal sentiment analysis by enhancing weaker modalities and balancing contributions, leading to improved performance and robustness across various conditions.
Abstract: Multimodal sentiment analysis (MSA) integrates heterogeneous text, audio, and visual signals to infer human emotions. While recent approaches leverage cross-modal complementarity, they often struggle to fully utilize weaker modalities. In practice, dominant modalities tend to overshadow non-verbal ones, inducing modality competition and limiting overall contributions. This imbalance degrades fusion performance and robustness under noisy or missing modalities. To address this, we propose a novel model, Enhance-then-Balance Modality Collaboration framework (EBMC). EBMC improves representation quality via semantic disentanglement and cross-modal enhancement, strengthening weaker modalities. To prevent dominant modalities from overwhelming others, an Energy-guided Modality Coordination mechanism achieves implicit gradient rebalancing via a differentiable equilibrium objective. Furthermore, Instance-aware Modality Trust Distillation estimates sample-level reliability to adaptively modulate fusion weights, ensuring robustness. Extensive experiments demonstrate that EBMC achieves state-of-the-art or competitive results and maintains strong performance under missing-modality settings.
[53] When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP
Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra
Main category: cs.CL
TL;DR: LLM-based data augmentation and back-translation methods were evaluated for Hausa and Fongbe languages on NER and POS tasks, revealing task-specific effectiveness rather than universal benefits.
Details
Motivation: Address data scarcity for low-resource African languages by evaluating data augmentation methods to improve NLP performance, focusing on how augmentation effectiveness varies across tasks and languages.
Method: Evaluates two data augmentation methods: LLM-based generation using Gemini 2.5 Flash and back-translation using NLLB-200. Tests on Hausa and Fongbe using the MasakhaNER 2.0 and MasakhaPOS benchmarks for named entity recognition and part-of-speech tagging tasks.
Result: Augmentation effectiveness was task-dependent rather than language or LLM quality dependent. For NER, neither method improved over baseline (LLM reduced Hausa NER by 0.24% F1, Fongbe by 1.81% F1). For POS tagging, LLM improved Fongbe by 0.33% accuracy, back-translation improved Hausa by 0.17%, while back-translation reduced Fongbe POS by 0.35%.
Conclusion: Data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step, challenging assumptions that LLM generation quality predicts augmentation success.
Abstract: Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods – LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) – for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe – hurting NER while helping POS – suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.
[54] FABLE: Fine-grained Fact Anchoring for Unstructured Model Editing
Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Main category: cs.CL
TL;DR: FABLE is a hierarchical framework for unstructured model editing that decouples fine-grained fact injection from holistic text generation using a two-stage, fact-first approach.
Details
Motivation: Existing unstructured model editing methods often memorize text holistically without reliable fine-grained fact access, creating a mismatch between holistic recall and fine-grained fact retrieval.
Method: A two-stage hierarchical framework: discrete facts are anchored in shallow layers first, followed by minimal updates to deeper layers to produce coherent text, reflecting the unidirectional Transformer flow.
Result: FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance, as evaluated on the UnFine benchmark with fact-level metrics.
Conclusion: The decoupling approach resolves the mismatch between holistic recall and fine-grained fact access, enabling more reliable model editing with better fact retrieval capabilities.
Abstract: Unstructured model editing aims to update models with real-world text, yet existing methods often memorize text holistically without reliable fine-grained fact access. To address this, we propose FABLE, a hierarchical framework that decouples fine-grained fact injection from holistic text generation. FABLE follows a two-stage, fact-first strategy: discrete facts are anchored in shallow layers, followed by minimal updates to deeper layers to produce coherent text. This decoupling resolves the mismatch between holistic recall and fine-grained fact access, reflecting the unidirectional Transformer flow in which surface-form generation amplifies rather than corrects underlying fact representations. We also introduce UnFine, a diagnostic benchmark with fine-grained question-answer pairs and fact-level metrics for systematic evaluation. Experiments show that FABLE substantially improves fine-grained question answering while maintaining state-of-the-art holistic editing performance. Our code is publicly available at https://github.com/caskcsg/FABLE.
[55] Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
Xudong Wang, Chaoning Zhang, Qigan Sun, Zhenzhen Huang, Chang Lu, Sheng Zheng, Zeyu Ma, Caiyan Qin, Yang Yang, Hengtao Shen
Main category: cs.CL
TL;DR: Tri-RAG: A structured triplet-based retrieval framework that transforms knowledge into Condition-Proof-Conclusion triplets for more efficient and precise retrieval-augmented generation.
Details
Motivation: Existing RAG approaches retrieve unstructured text fragments, which introduce redundant or weakly relevant information, leading to excessive context accumulation, reduced semantic alignment, fragmented reasoning chains, degraded generation quality, and increased token consumption.
Method: Transforms external knowledge into standardized structured triplets (Condition, Proof, Conclusion) using lightweight prompt-based adaptation with frozen model parameters. Uses the triplet head Condition as a semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge without concatenating lengthy raw texts.
Result: Significantly improves retrieval quality and reasoning efficiency across multiple benchmark datasets, produces more stable generation behavior, and enables more efficient resource utilization in complex reasoning scenarios.
Conclusion: Tri-RAG achieves favorable balance between retrieval accuracy and context token efficiency through structured triplet-based retrieval and reasoning-aligned context construction.
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in large language models (LLMs) by incorporating external knowledge during generation. However, the effectiveness of RAG depends not only on the design of the retriever and the capacity of the underlying model, but also on how retrieved evidence is structured and aligned with the query. Existing RAG approaches typically retrieve and concatenate unstructured text fragments as context, which often introduces redundant or weakly relevant information. This practice leads to excessive context accumulation, reduced semantic alignment, and fragmented reasoning chains, thereby degrading generation quality while increasing token consumption. To address these challenges, we propose Tri-RAG, a structured triplet-based retrieval framework that improves retrieval efficiency through reasoning-aligned context construction. Tri-RAG automatically transforms external knowledge from natural language into standardized structured triplets consisting of Condition, Proof, and Conclusion, explicitly capturing logical relations among knowledge fragments using lightweight prompt-based adaptation with frozen model parameters. Building on this representation, the triplet head Condition is treated as an explicit semantic anchor for retrieval and matching, enabling precise identification of query-relevant knowledge units without directly concatenating lengthy raw texts. As a result, Tri-RAG achieves a favorable balance between retrieval accuracy and context token efficiency. Experimental results across multiple benchmark datasets demonstrate that Tri-RAG significantly improves retrieval quality and reasoning efficiency, while producing more stable generation behavior and more efficient resource utilization in complex reasoning scenarios.
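The Condition-anchored retrieval step can be sketched directly. This is our own toy illustration, not Tri-RAG's implementation: the triplets are invented, and semantic matching against the Condition field is approximated with simple token overlap instead of learned embeddings.

```python
# Invented (Condition, Proof, Conclusion) triplets.
TRIPLETS = [
    {"condition": "water is heated to 100 C at sea level",
     "proof": "the boiling point of water at 1 atm is 100 C",
     "conclusion": "the water boils"},
    {"condition": "a metal is cooled below its critical temperature",
     "proof": "BCS theory of superconductivity",
     "conclusion": "electrical resistance vanishes"},
]

def retrieve(query, triplets, top_k=1):
    """Rank triplets by token overlap between the query and the
    Condition field, the semantic anchor in Tri-RAG's design."""
    q = set(query.lower().split())
    scored = sorted(
        triplets,
        key=lambda t: len(q & set(t["condition"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

hit = retrieve("what happens when water is heated to 100 C", TRIPLETS)[0]
print(hit["conclusion"])  # the water boils
```

Only the compact matched triplet, rather than a long raw passage, would then be placed in the generation context, which is the source of the token-efficiency claim.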
[56] Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data
Vadim Borisov
Main category: cs.CL
TL;DR: Large-scale synthetic multilingual emotion classification dataset creation and model training across 23 languages with strong performance matching English-only models.
Details
Motivation: Address the scarcity of annotated emotion data in multilingual settings, as existing corpora are predominantly English, single-label, and cover few languages.
Method: Constructs a synthetic training corpus of over 1M multi-label samples across 23 languages using culturally-adapted generation and programmatic quality filtering. Trains and compares six multilingual transformer encoders, from DistilBERT to XLM-R-Large, under identical conditions.
Result: XLM-R-Large achieved 0.868 F1-micro and 0.987 AUC-micro on in-domain test set. Zero-shot evaluation on human-annotated datasets showed XLM-R-Large matches or exceeds English-only specialist models on threshold-free ranking metrics while natively supporting all 23 languages.
Conclusion: Successfully created large-scale multilingual emotion classification system that performs competitively with English-only models while supporting 23 languages, addressing data scarcity in this domain.
Abstract: Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) while surpassing on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification
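For reference, the headline F1-micro metric pools true/false positives across all samples and labels before computing F1, so frequent emotions weigh more than rare ones. A minimal sketch (our illustration; the emotion label sets below are invented):

```python
def f1_micro(gold, pred):
    """Micro-averaged F1 over multi-label predictions given as sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"joy"}, {"anger", "fear"}, {"sadness"}]
pred = [{"joy"}, {"anger"}, {"sadness", "joy"}]
print(f1_micro(gold, pred))  # 0.75
```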
[57] Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs
Alkid Baci, Luke Friedrichs, Caglar Demir, N’Dah Jean Kouagou, Axel-Cyrille Ngonga Ngomo
Main category: cs.CL
TL;DR: RALP reformulates link prediction as a prompt learning problem using LLMs, learning string-based chain-of-thought prompts as scoring functions for knowledge graph triples through Bayesian Optimization, achieving state-of-the-art results on KGE benchmarks and strong performance on OWL reasoning tasks.
Details
Motivation: Knowledge graph embedding models struggle with unseen entities, relations, and literals in dynamic, heterogeneous graphs, while pretrained LLMs generalize well through prompting. The authors aim to leverage LLM reasoning capabilities as a flexible alternative to embedding-based methods for link prediction.
Method: RALP reformulates link prediction as a prompt learning problem, learning string-based chain-of-thought prompts as scoring functions for triples. It uses Bayesian Optimization via the MIPRO algorithm to identify effective prompts from fewer than 30 training examples without gradient access. At inference, it predicts missing entities, relations, or whole triples and assigns confidence scores based on the learned prompts.
Result: RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions, it achieves over 88% Jaccard similarity.
Conclusion: Prompt-based LLM reasoning serves as a flexible alternative to embedding-based methods for knowledge graph completion, demonstrating strong performance on both transductive tasks and complex OWL reasoning with minimal training data.
Abstract: Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through the MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\exists hasChild.Female$, $\geq 5\; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP.
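The "prompt as scoring function" idea can be made concrete with a short sketch. This is our own illustration, not RALP's learned prompt: the template, the stub model, and the known-fact list are invented, and MIPRO-style Bayesian optimization would search over candidate prompt strings rather than use a fixed one.

```python
# Invented CoT prompt template acting as a scoring function for a triple.
PROMPT_TEMPLATE = (
    "Think step by step about whether this fact is plausible.\n"
    "Fact: {head} --{relation}--> {tail}\n"
    "Answer with a confidence between 0 and 1."
)

def score_triple(llm, head, relation, tail):
    """Fill the prompt with the candidate triple; parse the model's
    verbalized confidence as the triple's score."""
    prompt = PROMPT_TEMPLATE.format(head=head, relation=relation, tail=tail)
    return float(llm(prompt))

# Stub model: high confidence only if the whole triple matches a tiny
# known-fact list (a stand-in for a real LLM call).
KNOWN = {("Paris", "capital_of", "France")}
def stub_llm(prompt):
    hit = any(h in prompt and r in prompt and t in prompt for h, r, t in KNOWN)
    return "0.9" if hit else "0.1"

print(score_triple(stub_llm, "Paris", "capital_of", "France"))  # 0.9
print(score_triple(stub_llm, "Paris", "capital_of", "Spain"))   # 0.1
```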
[58] InsightFlow: LLM-Driven Synthesis of Patient Narratives for Mental Health into Causal Models
Shreya Gupta, Prottay Kumar Adhikary, Bhavyaa Dave, Salam Michael Singh, Aniket Deroy, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: InsightFlow: An LLM-based system that automatically generates clinical 5P causal graphs from psychotherapy transcripts, showing structural and semantic similarity to human expert formulations.
Details
Motivation: Clinical case formulation using the 5P framework is time-consuming and varies across clinicians. There's a need for automated tools to generate causal models from therapy transcripts to augment clinical workflows.
Method: LLM-based approach that generates 5P-aligned causal graphs from patient-therapist dialogues. Evaluated using 46 psychotherapy intake transcripts annotated by clinical experts, comparing LLM-generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert-rated clinical criteria.
Result: Generated graphs show structural similarity comparable to inter-annotator agreement and high semantic alignment with human graphs. Expert evaluations rate outputs as moderately complete, consistent, and clinically useful. LLM graphs tend to form more interconnected structures compared to chain-like human patterns, but overall complexity and content coverage are similar.
Conclusion: LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights potential for automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.
Abstract: Clinical case formulation organizes patient symptoms and psychosocial factors into causal models, often using the 5P framework. However, constructing such graphs from therapy transcripts is time consuming and varies across clinicians. We present InsightFlow, an LLM based approach that automatically generates 5P aligned causal graphs from patient-therapist dialogues. Using 46 psychotherapy intake transcripts annotated by clinical experts, we evaluate LLM generated graphs against human formulations using structural (NetSimile), semantic (embedding similarity), and expert rated clinical criteria. The generated graphs show structural similarity comparable to inter annotator agreement and high semantic alignment with human graphs. Expert evaluations rate the outputs as moderately complete, consistent, and clinically useful. While LLM graphs tend to form more interconnected structures compared to the chain like patterns of human graphs, overall complexity and content coverage are similar. These results suggest that LLMs can produce clinically meaningful case formulation graphs within the natural variability of expert practice. InsightFlow highlights the potential of automated causal modeling to augment clinical workflows, with future work needed to improve temporal reasoning and reduce redundancy.
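NetSimile, the structural metric used above, compares graphs by aggregating per-node features into fixed-length signatures and measuring the distance between them. A simplified, degree-only sketch of the idea follows; the real NetSimile uses a richer feature set (clustering coefficients, egonet statistics) and more aggregators, so treat this only as an illustration of the signature-comparison pattern.

```python
from statistics import mean, pstdev

def degree_signature(edges):
    """Degree-only stand-in for a NetSimile-style signature: per-node
    degrees aggregated into a fixed-length vector (mean, stdev, max, min)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    values = list(deg.values())
    return [mean(values), pstdev(values), max(values), min(values)]

def canberra(a, b):
    """Canberra distance between signatures; 0 means identical."""
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(a, b) if x or y)

# Hypothetical 5P-style causal links from an LLM graph vs. a human graph:
# the LLM graph is more interconnected (a triangle vs. a chain).
llm_edges = [("stress", "insomnia"), ("insomnia", "fatigue"), ("stress", "fatigue")]
human_edges = [("stress", "insomnia"), ("insomnia", "fatigue")]
print(canberra(degree_signature(llm_edges), degree_signature(human_edges)))
```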
[59] Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
Xingyu Lin, Yilin Wen, Du Su, Jinchang Hou, En Wang, Wenbin Liu, Chenfu Bao, Zhonghou Lv
Main category: cs.CL
TL;DR: TEPO is a token-level optimization framework that improves mathematical reasoning in LLMs by addressing sparse token rewards through sequence-level likelihood linking and targeted KL-Divergence constraints.
Details
Motivation: Existing methods like GRPO struggle with token-level sparse rewards in chain-of-thought reasoning, often leading to entropy collapse or model degradation due to undifferentiated token-level entropy regularization.
Method: TEPO uses (1) sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) a token-level KL-Divergence mask constraint targeting tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates.
Result: TEPO achieves state-of-the-art performance on mathematical reasoning benchmarks and enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
Conclusion: TEPO effectively addresses token-level sparse reward challenges in CoT reasoning, improving both performance and training efficiency for mathematical reasoning tasks.
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly in their mathematical reasoning performance. However, GRPO and related entropy regularization methods still struggle with token-level sparse rewards, an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-Divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
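The core mechanism, linking a group-level reward to individual tokens through the sequence-level likelihood, can be illustrated roughly as follows. The abstract does not spell out the exact aggregation, so this is only one plausible reading, not the TEPO objective itself.

```python
import math

def token_level_advantages(group_rewards, group_token_logps):
    """Illustrative only (not the exact TEPO formulation): each rollout in
    the group gets a group-relative advantage (reward minus group mean),
    which is assigned to its tokens scaled by the sequence-level
    likelihood exp(sum of token log-probs)."""
    baseline = sum(group_rewards) / len(group_rewards)  # group mean reward
    advantages = []
    for reward, token_logps in zip(group_rewards, group_token_logps):
        seq_likelihood = math.exp(sum(token_logps))   # sequence-level likelihood
        adv = (reward - baseline) * seq_likelihood    # shared by all tokens
        advantages.append([adv] * len(token_logps))
    return advantages

rewards = [1.0, 0.0]                  # one correct rollout, one incorrect
logps = [[-0.1, -0.2], [-0.5, -0.5]]  # per-token log-probs of each rollout
adv = token_level_advantages(rewards, logps)
```

Tokens in the correct rollout receive a positive advantage and tokens in the incorrect one a negative advantage, with more probable sequences carrying a stronger signal.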
[60] Universal NER v2: Towards a Massively Multilingual Named Entity Recognition Benchmark
Terra Blevins, Stephen Mayhew, Marek Šuppa, Hila Gonen, Shachar Mirkin, Vasile Pais, Kaja Dobrovoljc, Voula Giouli, Jun Kevin, Eugene Jang, Eungseo Kim, Jeongyeon Seo, Xenophon Gialis, Yuval Pinter
Main category: cs.CL
TL;DR: UNER project builds gold-standard multilingual NER benchmarks using standardized annotation guidelines across languages to evaluate multilingual language models.
Details
Motivation: Multilingual LLMs need reliable evaluation benchmarks across languages, but gold-standard NER datasets remain scarce for most languages, limiting proper assessment of model capabilities.
Method: Uses general tagset and thorough annotation guidelines to collect standardized, cross-lingual named entity span annotations through community collaboration (organizers, annotators, collaborators).
Result: Released UNER v1 in 2024, with continued expansion and active community involvement in building multilingual NER benchmarks.
Conclusion: The UNER project addresses the critical need for standardized multilingual evaluation benchmarks to properly assess and improve multilingual language models across diverse languages.
Abstract: While multilingual language models promise to bring the benefits of LLMs to speakers of many languages, gold-standard evaluation benchmarks in most languages to interrogate these assumptions remain scarce. The Universal NER project, now entering its fourth year, is dedicated to building gold-standard multilingual Named Entity Recognition (NER) benchmark datasets. Inspired by existing massively multilingual efforts for other core NLP tasks (e.g., Universal Dependencies), the project uses a general tagset and thorough annotation guidelines to collect standardized, cross-lingual annotations of named entity spans. The first installment (UNER v1) was released in 2024, and the project has continued and expanded since then, with various organizers, annotators, and collaborators in an active community.
[61] Generating Effective CoT Traces for Mitigating Causal Hallucination
Yiheng Zhao, Jun Yan
Main category: cs.CL
TL;DR: This paper addresses causal hallucination in smaller LLMs for event causality identification by creating a pipeline to generate effective Chain-of-Thought traces and introducing a new metric to quantify causal hallucination.
Details
Motivation: Smaller LLMs (≤1.5B parameters) suffer from severe causal hallucination in event causality identification tasks, and while fine-tuning with Chain-of-Thought traces is promising, there's currently no CoT dataset available for ECI and no metric to quantify causal hallucination.
Method: 1) Investigate criteria for effective CoT traces to mitigate causal hallucination, 2) Design a pipeline to generate CoT traces meeting these criteria, 3) Introduce Causal Hallucination Rate (CHR) metric to quantify hallucination and guide CoT criteria formulation.
Result: Fine-tuning with generated CoT traces substantially reduces causal hallucination in smaller LLMs while improving mean accuracy. Models show strong cross-dataset/difficulty generalization and robustness under misleading intervention prompts.
Conclusion: The proposed pipeline effectively generates CoT traces that mitigate causal hallucination in smaller LLMs for ECI tasks, with the CHR metric providing valuable quantification and guidance for this problem.
Abstract: Although large language models (LLMs) excel in complex reasoning tasks, they suffer from severe causal hallucination in event causality identification (ECI), particularly in smaller models ($\leq$1.5B parameters). A promising approach to address this issue is to fine-tune them with Chain-of-Thought (CoT) traces. However, there is currently no CoT trace dataset available for ECI. In this paper, we first investigate the essential criteria that effective CoT traces should possess to mitigate causal hallucination in smaller models. We then design a pipeline to generate CoT traces that meet these criteria. Moreover, since there is currently no metric for quantifying causal hallucination, we also introduce a new metric, the Causal Hallucination Rate (CHR), to quantify causal hallucination, guide the formulation of effective CoT trace criteria, and validate the effectiveness of our pipeline. Our experiments show that fine-tuning with the CoT traces generated by our pipeline not only substantially reduces causal hallucination in smaller LLMs but also improves mean accuracy. Moreover, the fine-tuned models exhibit strong cross-dataset and cross-difficulty generalization, as well as robustness under misleading intervention prompts.
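The abstract does not give the CHR formula. One plausible reading, sketched here purely as an illustration and not as the paper's definition, is the fraction of predicted cause-effect pairs that have no support in the gold annotations.

```python
def causal_hallucination_rate(predicted, gold):
    """Hypothetical CHR-style metric (the abstract does not give the
    formula): fraction of predicted cause->effect pairs absent from
    the gold annotation."""
    if not predicted:
        return 0.0
    return sum(1 for pair in predicted if pair not in gold) / len(predicted)

pred = {("earthquake", "tsunami"), ("rain", "drought")}
gold = {("earthquake", "tsunami")}
print(causal_hallucination_rate(pred, gold))  # 0.5
```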
[62] NaviRAG: Towards Active Knowledge Navigation for Retrieval-Augmented Generation
Jihao Dai, Dingjun Wu, Yuxuan Chen, Zheni Zeng, Yukun Yan, Zhenghao Liu, Maosong Sun
Main category: cs.CL
TL;DR: NaviRAG introduces hierarchical knowledge navigation for RAG systems, moving from flat retrieval to active multi-granularity information seeking using LLM agents.
Details
Motivation: Traditional RAG uses flat retrieval that maps queries directly to isolated text segments, struggling with complex tasks requiring conditional retrieval and dynamic synthesis across different granularity levels (broad concepts to specific evidence).
Method: 1) Structures knowledge documents hierarchically preserving semantic relationships from coarse-grained topics to fine-grained details. 2) Uses LLM agent to actively navigate records, iteratively identifying information gaps and retrieving relevant content from appropriate granularity levels.
Result: Extensive experiments on long-document QA benchmarks show NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm gains from multi-granular evidence localization and dynamic retrieval planning.
Conclusion: NaviRAG makes RAG systems more intelligent and autonomous by shifting from passive segment retrieval to active knowledge navigation across hierarchical structures.
Abstract: Retrieval-augmented generation (RAG) typically relies on a flat retrieval paradigm that maps queries directly to static, isolated text segments. This approach struggles with more complex tasks that require the conditional retrieval and dynamic synthesis of information across different levels of granularity (e.g., from broad concepts to specific evidence). To bridge this gap, we introduce NaviRAG, a novel framework that shifts from passive segment retrieval to active knowledge navigation. NaviRAG first structures the knowledge documents into a hierarchical form, preserving semantic relationships from coarse-grained topics to fine-grained details. Leveraging these reorganized knowledge records, a large language model (LLM) agent actively navigates the records, iteratively identifying information gaps and retrieving relevant content from the most appropriate granularity level. Extensive experiments on long-document QA benchmarks show that NaviRAG consistently improves both retrieval recall and end-to-end answer performance over conventional RAG baselines. Ablation studies confirm that performance gains stem from our method’s capacity for multi-granular evidence localization and dynamic retrieval planning. We further discuss efficiency, applicable scenarios, and future directions of our method, hoping to make RAG systems more intelligent and autonomous.
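The coarse-to-fine navigation loop can be sketched as follows. In NaviRAG an LLM agent judges which branch of the hierarchy to descend into; here a crude keyword-overlap stub stands in for that judgment, and the tree contents are invented for illustration.

```python
def navigate(node, query_terms, score):
    """Minimal sketch of coarse-to-fine navigation: starting at the root
    topic, repeatedly descend into the child whose summary the scoring
    function rates most relevant, until a leaf (fine-grained passage) is
    reached. An LLM agent would play the role of `score` in NaviRAG."""
    while node.get("children"):
        node = max(node["children"], key=lambda c: score(c["summary"], query_terms))
    return node["summary"]

def overlap(summary, query_terms):  # stand-in relevance scorer
    return len(set(summary.lower().split()) & query_terms)

tree = {
    "summary": "systems handbook",
    "children": [
        {"summary": "hardware specs for gpu and cpu", "children": [
            {"summary": "gpu memory bandwidth figures"},
            {"summary": "cpu cache hierarchy details"},
        ]},
        {"summary": "software setup guide"},
    ],
}
print(navigate(tree, {"gpu", "memory"}, overlap))
```

The query first selects the coarse "hardware" branch, then the fine-grained GPU passage, without ever scoring unrelated leaves.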
[63] Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning
Timon Ziegenbein, Maja Stahl, Henning Wachsmuth
Main category: cs.CL
TL;DR: A reinforcement learning approach that teaches LLMs human-like editing strategies to improve argument appropriateness through self-contained, meaning-preserving sentence-level edits.
Details
Motivation: There's a mismatch between human and LLM editing strategies: LLMs often make multiple scattered edits that change meaning, while humans make self-contained, meaning-preserving edits. The goal is to teach LLMs human-like editing to improve argument appropriateness.
Method: Reinforcement learning approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, pattern conformity, and argument-level appropriateness.
Result: Outperforms competitive baselines and state-of-the-art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting in both automatic and human evaluations.
Conclusion: The approach successfully teaches LLMs human-like editing strategies, producing self-contained sentence-level edit suggestions that can be accepted or rejected independently, significantly improving argument appropriateness.
Abstract: Editing human-written text has become a standard use case of large language models (LLMs), for example, to make one’s arguments more appropriate for a discussion. Comparing human to LLM-generated edits, however, we observe a mismatch in editing strategies: While LLMs often perform multiple scattered edits and tend to change meaning notably, humans rather encapsulate dependent changes in self-contained, meaning-preserving edits. In this paper, we present a reinforcement learning approach that teaches LLMs human-like editing to improve the appropriateness of arguments. Our approach produces self-contained sentence-level edit suggestions that can be accepted or rejected independently. We train the approach using group relative policy optimization with a multi-component reward function that jointly optimizes edit-level semantic similarity, fluency, and pattern conformity as well as argument-level appropriateness. In automatic and human evaluation, it outperforms competitive baselines and the state of the art in human-like editing, with multi-round editing achieving appropriateness close to full rewriting.
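At its simplest, a multi-component reward of the kind described amounts to combining four scores into one scalar. The sketch below uses a plain weighted sum with uniform weights of our own choosing; the paper's actual component definitions and weighting are not given in the abstract.

```python
def edit_reward(similarity, fluency, conformity, appropriateness,
                weights=(0.25, 0.25, 0.25, 0.25)):
    """Sketch of a multi-component reward: a weighted sum of edit-level
    semantic similarity, fluency, and pattern conformity plus
    argument-level appropriateness, each assumed to lie in [0, 1].
    The uniform weights are our assumption, not the paper's."""
    components = (similarity, fluency, conformity, appropriateness)
    return sum(w * c for w, c in zip(weights, components))

# An edit that preserves meaning and reads fluently but is only
# moderately appropriate as an argument:
print(edit_reward(similarity=0.9, fluency=0.95, conformity=0.8, appropriateness=0.5))
```

Jointly optimizing such a scalar is what lets the policy trade off meaning preservation against appropriateness instead of maximizing either alone.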
[64] EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution
Shiyu He, Minchi Kuang, Mengxian Wang, Bin Hu, Tingxiang Gu
Main category: cs.CL
TL;DR: EvoSpark framework enables coherent long-horizon narratives in LLM-based multi-agent systems by addressing social memory stacking and narrative-spatial dissonance through stratified memory, generative mise-en-scène, and unified narrative operations.
Details
Motivation: Current LLM-based multi-agent systems struggle with maintaining coherent long-horizon narratives due to stochastic generative emergence, leading to social memory stacking (accumulating conflicting relational states) and narrative-spatial dissonance (detached spatial logic from plot).
Method: Proposes EvoSpark framework with three key components: 1) Stratified Narrative Memory using Role Socio-Evolutionary Base for dynamic experience metabolism, 2) Generative Mise-en-Scène mechanism for Role-Location-Plot alignment, and 3) Unified Narrative Operation Engine with Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters.
Result: EvoSpark significantly outperforms baselines across diverse paradigms, enabling sustained generation of expressive and coherent narrative experiences in long-horizon simulations.
Conclusion: EvoSpark successfully bridges the gap in maintaining logically coherent long-horizon narratives within Endogenous Interactive Agent Societies by addressing fundamental challenges of stochastic generative emergence through integrated memory, spatial, and operational mechanisms.
Abstract: Realizing endogenous narrative evolution in LLM-based multi-agent systems is hindered by the inherent stochasticity of generative emergence. In particular, long-horizon simulations suffer from social memory stacking, where conflicting relational states accumulate without resolution, and narrative-spatial dissonance, where spatial logic detaches from the evolving plot. To bridge this gap, we propose EvoSpark, a framework specifically designed to sustain logically coherent long-horizon narratives within Endogenous Interactive Agent Societies. To ensure consistency, the Stratified Narrative Memory employs a Role Socio-Evolutionary Base as living cognition, dynamically metabolizing experiences to resolve historical conflicts. Complementarily, Generative Mise-en-Scène mechanism enforces Role-Location-Plot alignment, synchronizing character presence with the narrative flow. Underpinning these is the Unified Narrative Operation Engine, which integrates an Emergent Character Grounding Protocol to transform stochastic sparking into persistent characters. This engine establishes a substrate that expands a minimal premise into an open-ended, evolving story world. Experiments demonstrate that EvoSpark significantly outperforms baselines across diverse paradigms, enabling the sustained generation of expressive and coherent narrative experiences.
[65] The role of System 1 and System 2 semantic memory structure in human and LLM biases
Katherine Abramski, Giulio Rossetti, Massimo Stella
Main category: cs.CL
TL;DR: The paper investigates implicit gender bias in humans vs LLMs through dual process theory, finding humans have irreducible semantic memory structures that regulate bias while LLMs lack this human-like conceptual knowledge.
Details
Motivation: To understand cognitive mechanisms behind implicit bias in humans and LLMs using dual process theory, and to identify fundamental differences in how bias manifests and is regulated between human and machine cognition.
Method: Model System 1 and System 2 thinking as semantic memory networks with distinct structures built from comparable datasets generated by both humans and LLMs, then use network-based evaluation metrics to analyze relationships between semantic memory structure and implicit gender bias.
Result: Semantic memory structures are irreducible only in humans (not LLMs), and semantic memory structure relates consistently to implicit bias only in humans, with lower bias levels in System 2 structures. LLMs lack certain types of human-like conceptual knowledge that contribute to bias regulation.
Conclusion: Fundamental differences exist between human and machine cognition regarding bias regulation, with humans possessing irreducible semantic structures that help mitigate bias while LLMs lack this capability, highlighting limitations in current LLM architectures.
Abstract: Implicit biases in both humans and large language models (LLMs) pose significant societal risks. Dual process theories propose that biases arise primarily from associative System 1 thinking, while deliberative System 2 thinking mitigates bias, but the cognitive mechanisms that give rise to this phenomenon remain poorly understood. To better understand what underlies this duality in humans, and possibly in LLMs, we model System 1 and System 2 thinking as semantic memory networks with distinct structures, built from comparable datasets generated by both humans and LLMs. We then investigate how these distinct semantic memory structures relate to implicit gender bias using network-based evaluation metrics. We find that semantic memory structures are irreducible only in humans, suggesting that LLMs lack certain types of human-like conceptual knowledge. Moreover, semantic memory structure relates consistently to implicit bias only in humans, with lower levels of bias in System 2 structures. These findings suggest that certain types of conceptual knowledge contribute to bias regulation in humans, but not in LLMs, highlighting fundamental differences between human and machine cognition.
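One common family of network-based bias metrics, which may or may not match the paper's exact evaluation, scores a concept by its relative proximity to gendered nodes in a semantic association network. A minimal sketch with a toy, hand-built network (all node names and edges are illustrative assumptions):

```python
from collections import deque

def distance(adj, src, dst):
    """BFS shortest-path length in an undirected association network."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def gender_bias(adj, concept):
    """Illustrative network-based bias score (not necessarily the paper's
    metric): how much closer `concept` sits to 'man' than to 'woman' in
    the semantic network; positive values indicate a male-leaning concept."""
    return distance(adj, concept, "woman") - distance(adj, concept, "man")

network = {
    "engineer": ["tools", "man"],
    "tools": ["engineer", "man"],
    "man": ["engineer", "tools", "person"],
    "person": ["man", "woman"],
    "woman": ["person"],
}
print(gender_bias(network, "engineer"))  # 3 - 1 = 2
```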
[66] Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration
Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky
Main category: cs.CL
TL;DR: A framework using multidimensional Item Response Theory with anchor items to calibrate new benchmarks to existing evaluation suites, enabling score comparability across different evaluation periods with minimal additional evaluation cost.
Details
Motivation: The increasing number of language models and benchmarks makes comprehensive evaluation expensive and impractical. Models are often evaluated on different samples, making scores difficult to compare across studies. There's a need for a method to extend benchmark suites over time while preserving score comparability.
Method: Proposes a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to existing evaluation suites. The approach holds previously calibrated item parameters fixed while using a fixed anchor set for each dataset, allowing results from different evaluation periods to be compared directly.
Result: In experiments with over 400 models, the framework predicts full-evaluation performance within 2-3 percentage points using only 100 anchor questions per dataset, with Spearman ρ ≥ 0.9 for ranking preservation. This demonstrates the ability to extend benchmark suites over time while preserving score comparability at constant evaluation cost per new dataset.
Conclusion: The proposed IRT-based framework with anchor items enables efficient and comparable evaluation of language models across different time periods and benchmark expansions, solving the growing pains of benchmark evaluation in NLP.
Abstract: The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than $400$ models, our framework predicts full-evaluation performance within 2-3 percentage points using only $100$ anchor questions per dataset, with Spearman $\rho \geq 0.9$ for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains
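The calibration idea rests on standard IRT machinery: item parameters stay fixed, only a model's ability is estimated from its anchor-item responses, and that ability then predicts expected performance on the full benchmark. A minimal two-parameter-logistic (2PL) sketch follows; the paper uses a multidimensional model, so this unidimensional version with made-up item parameters is illustrative only.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability `theta` answers an
    item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(anchor_items, responses):
    """Grid-search MLE of ability on the anchor set, holding the
    previously calibrated item parameters (a, b) fixed."""
    def loglik(theta):
        return sum(
            math.log(p if r else 1 - p)
            for (a, b), r in zip(anchor_items, responses)
            for p in [p_correct(theta, a, b)]
        )
    grid = [x / 100 for x in range(-300, 301)]
    return max(grid, key=loglik)

def predict_score(theta, all_items):
    """Expected full-benchmark accuracy from the calibrated ability."""
    return sum(p_correct(theta, a, b) for a, b in all_items) / len(all_items)

anchors = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]  # fixed (a, b) per anchor item
theta = estimate_ability(anchors, [True, True, False])
full = anchors + [(1.0, 0.5), (1.1, -0.5)]       # anchors plus unseen items
print(predict_score(theta, full))
```

Because the item parameters never change, two models evaluated years apart on different dataset mixes still land on the same ability scale.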
[67] Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss
Ronald Skorobogat, Ameya Prabhu, Matthias Bethge
Main category: cs.CL
TL;DR: Paper proposes round-trip translation as a better multilingual evaluation method than current benchmarks, showing it correlates with real-world performance and introduces LiT benchmark.
Details
Motivation: Current multilingual benchmarks measure mathematical reasoning and factual recall rather than true multilingual proficiency, leading to misleading evaluations of frontier models' multilingual capabilities.
Method: Proposes round-trip translation evaluation: translate text from source to target language and back, then measure semantic gaps between original and result. Introduces Lost in Translation (LiT) benchmark using this method across widely spoken languages.
Result: Round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on real-world multilingual tasks (LMArena), requires no human references, and doesn’t need more capable multilingual judges than tested models.
Conclusion: Round-trip translation is a superior method for evaluating true multilingual generation capabilities of frontier models, and the LiT benchmark provides realistic multilingual evaluation.
Abstract: Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similarly to popular reasoning and knowledge benchmarks, but across many languages. We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly (ρ = 0.94) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.
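The round-trip protocol itself is simple to sketch. Below, toy string-replacement "translators" and a lexical Jaccard score stand in for the model under test and for the paper's semantic similarity measure, which is not specified in the abstract.

```python
def round_trip_score(text, to_target, to_source, similarity):
    """Sketch of the round-trip protocol: translate source -> target ->
    source, then score how much meaning survived. The two translation
    functions would be the model under test; `similarity` stands in for
    a semantic similarity measure."""
    back = to_source(to_target(text))
    return similarity(text, back)

def jaccard(a, b):  # crude lexical stand-in for semantic similarity
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Toy "translators": a lossy pair that drops a word on the way back.
to_de = lambda s: s.replace("quickly", "schnell")
to_en = lambda s: s.replace("schnell", "")  # meaning lost here
print(round_trip_score("the fox runs quickly", to_de, to_en, jaccard))  # 0.75
```

Note that no human reference translation is needed: the original text serves as its own reference, which is what makes the protocol cheap to scale across languages.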
[68] MetFuse: Figurative Fusion between Metonymy and Metaphor
Saptarshi Ghosh, Tianyu Jiang
Main category: cs.CL
TL;DR: MetFuse is a novel dataset and framework for studying figurative language fusion between metonymy and metaphor, containing 4,000 human-verified sentences that show how these two figurative types interact and co-occur in natural language.
Details
Motivation: Current computational work studies metonymy and metaphor in isolation, but they often co-occur in natural language. There's a need to understand their interactions and create resources for studying figurative language fusion.
Method: Created a framework to transform literal sentences into three figurative variants (metonymic, metaphoric, hybrid). Built MetFuse dataset with 1,000 human-verified quadruplets (4,000 sentences total). Conducted extrinsic experiments on eight benchmarks and analyzed human and LLM performance.
Result: Augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding largest gains on metonymy tasks. Both humans and LLMs better identify metonymy in hybrid sentences than in metonymy-only sentences.
Conclusion: MetFuse enables study of figurative language fusion and shows that metaphor presence makes metonymic nouns more explicit. The dataset provides valuable resource for figurative language understanding research.
Abstract: Metonymy and metaphor often co-occur in natural language, yet computational work has studied them largely in isolation. We introduce a framework that transforms a literal sentence into three figurative variants: metonymic, metaphoric, and hybrid. Using this framework, we construct MetFuse, the first dedicated dataset of figurative fusion between metonymy and metaphor, containing 1,000 human-verified meaning-aligned quadruplets totaling 4,000 sentences. Extrinsic experiments on eight existing benchmarks show that augmenting training data with MetFuse consistently improves both metonymy and metaphor classification, with hybrid examples yielding the largest gains on metonymy tasks. Using this dataset, we also analyze how the presence of one figurative type impacts another. Our findings show that both human annotators and large language models better identify metonymy in hybrid sentences than in metonymy-only sentences, demonstrating that the presence of a metaphor makes a metonymic noun more explicit. Our dataset is publicly available at: https://github.com/cincynlp/MetFuse.
[69] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
Main category: cs.CL
TL;DR: GlotOCR Bench is a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts, revealing that even state-of-the-art vision-language models fail to generalize beyond 30 scripts and rely heavily on language model pretraining rather than visual recognition.
Details
Motivation: Current OCR evaluation is concentrated on a small cluster of high- and mid-resource scripts, lacking comprehensive assessment of OCR generalization across diverse scripts. There's a need to understand how well vision-language models handle the full spectrum of Unicode scripts.
Method: Created GlotOCR Bench with clean and degraded image variants rendered from real multilingual texts using fonts from Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType. Supports both LTR and RTL scripts with manual verification of correct rendering. Evaluated a broad suite of open-weight and proprietary vision-language models.
Result: Most models perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance tracks script-level pretraining coverage, indicating reliance on language model pretraining over visual recognition. Models either produce random noise or hallucinate characters from similar known scripts when confronted with unfamiliar scripts.
Conclusion: Current OCR systems rely heavily on language model pretraining rather than visual recognition capabilities. There’s significant room for improvement in cross-script generalization for vision-language models in OCR tasks.
Abstract: Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
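OCR benchmarks like this are typically scored with character error rate (CER): the edit distance from the model output to the reference, normalized by reference length. The abstract does not name its metric, so the standard implementation below is a general illustration rather than the benchmark's exact scorer.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming,
    one row of the DP table kept at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits needed to turn the OCR output into
    the reference, normalized by reference length (can exceed 1.0)."""
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("kitten", "sitting"))  # 3 edits / 6 reference chars = 0.5
```

Because CER operates on Unicode code points, it applies uniformly across scripts, which is exactly what a 100+-script benchmark needs.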
[70] Accelerating Speculative Decoding with Block Diffusion Draft Trees
Liran Ringel, Yaniv Romano
Main category: cs.CL
TL;DR: DDTree improves speculative decoding by constructing draft trees from block diffusion drafter distributions, enabling multiple verification paths in a single forward pass for higher acceptance rates.
Details
Motivation: While DFlash's block diffusion drafter generates entire draft blocks efficiently, it only verifies a single trajectory per round, potentially limiting acceptance length. The authors aim to improve speculative decoding by enabling verification of multiple draft trajectories simultaneously.
Method: DDTree constructs a draft tree directly from per-position distributions of a block diffusion drafter. Under a fixed node budget, it uses a best-first heap algorithm to select continuations most likely to match the target model according to a surrogate metric. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask.
Result: DDTree achieves state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters like EAGLE-3 and improving upon vanilla DFlash by enabling verification of multiple draft trajectories.
Conclusion: DDTree represents a leading approach to speculative decoding by combining the efficiency of block diffusion drafters with tree-based verification, enabling higher acceptance rates and better performance through parallel verification of multiple draft trajectories.
Abstract: Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model’s output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
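The fixed-budget, best-first tree construction can be sketched in a few lines. This is an illustration, not the paper's implementation: it assumes the surrogate score of a continuation is simply the product of the drafter's per-position token probabilities (the paper defines its surrogate from the draft model's output), and `step_probs` is a hypothetical stand-in for the drafter's distributions.

```python
import heapq

def best_first_tree(step_probs, budget):
    """Grow a draft tree by repeatedly expanding the highest-scoring node.

    step_probs[d] maps candidate tokens at draft position d to their
    drafter probabilities. A path's score is the product of its token
    probabilities (illustrative surrogate for likely target acceptance).
    Returns up to `budget` tree nodes as token tuples, best-first.
    """
    heap = [(-1.0, ())]          # max-heap via negated scores; root = empty path
    nodes = []
    while heap and len(nodes) < budget:
        neg_score, path = heapq.heappop(heap)
        if path:                 # the empty root is not a draft node
            nodes.append(path)
        if len(path) < len(step_probs):
            for tok, p in step_probs[len(path)].items():
                # child score = parent score * p, kept negated for heapq
                heapq.heappush(heap, (neg_score * p, path + (tok,)))
    return nodes

tree = best_first_tree([{"a": 0.6, "b": 0.4}, {"x": 0.7, "y": 0.3}], budget=4)
# nodes in score order: ('a',)=0.6, ('a','x')=0.42, ('b',)=0.4, ('b','x')=0.28
```

Verification would then run one target forward pass over all selected nodes with an ancestor-only attention mask, so sibling branches cannot attend to each other.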
[71] PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models
Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru, Rui Su, Yujun Zhou, Xiangqi Wang, Kehan Guo, Nitesh V Chawla, Yanfang Ye, Xiangliang Zhang
Main category: cs.CL
TL;DR: PolicyBench benchmark evaluates LLMs on policy comprehension across US-China systems, with PolicyMoE model showing strengths in application tasks over memorization/understanding.
Details
Motivation: LLMs are increasingly used in real-world policy decision-making, but their ability to comprehend and reason about policy-related content remains underexplored, leaving a gap in our understanding of their capabilities for governance applications.
Method: Created PolicyBench, the first large-scale cross-system (US-China) benchmark, with 21K cases across diverse policy areas; assessed three capabilities following Bloom's taxonomy: Memorization, Understanding, and Application. Proposed PolicyMoE, a domain-specialized Mixture-of-Experts model with expert modules aligned to each cognitive level.
Result: PolicyMoE models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks, revealing key limitations of current LLMs in policy understanding.
Conclusion: The work identifies limitations in current LLMs for policy comprehension and suggests paths toward more reliable, policy-focused models through specialized architectures like MoE.
Abstract: Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present \textbf{\textit{PolicyBench}}, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom’s taxonomy, the benchmark assesses three core capabilities: (1) \textbf{Memorization}: factual recall of policy knowledge, (2) \textbf{Understanding}: conceptual and contextual reasoning, and (3) \textbf{Application}: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose \textbf{\textit{PolicyMoE}}, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.
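The idea of expert modules aligned to cognitive levels can be sketched as top-1 gated dispatch. Everything below (the gate weights, the feature vector, the level names as routing targets) is hypothetical scaffolding for illustration, not PolicyMoE's actual architecture.

```python
import math

LEVELS = ["memorization", "understanding", "application"]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, gate_weights):
    """Top-1 gate: score each expert by a linear gate over the query
    features, then dispatch to the highest-probability cognitive level."""
    scores = [sum(w * f for w, f in zip(ws, features)) for ws in gate_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LEVELS[best], probs

level, probs = route([1.0, 0.0], [[0.1, 0.9], [0.2, 0.5], [0.8, 0.3]])
# the feature vector loads on the first dimension, so the gate with the
# largest first weight ([0.8, 0.3]) wins: 'application'
```

A full MoE would combine several experts' outputs weighted by `probs`; top-1 dispatch keeps the sketch minimal.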
[72] One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Main category: cs.CL
TL;DR: Instruction-tuned LLMs show severe response collapse under simple lexical constraints, losing 14-48% comprehensiveness, while base models remain robust - revealing fragility created by instruction tuning’s coupling of task competence to surface-form templates.
Details
Motivation: To investigate how robust instruction-tuned LLMs are under trivial lexical constraints, and whether their helpfulness collapses under simple restrictions such as banning a single punctuation character or a common word.
Method: Tested three open-weight model families and one closed-weight model (GPT-4o-mini) under simple lexical constraints. Used pairwise evaluation with GPT-4o-mini and GPT-4o as judges. Conducted mechanistic analysis, including two-pass generation (free generation followed by constrained rewriting) and linear probes on prompt representations. Compared instruction-tuned models with base models under identical constraints.
Result: Instruction-tuned LLMs collapse their responses under trivial constraints, losing 14-48% comprehensiveness. GPT-4o-mini suffers 31% loss with 99% baseline win rate. Base models show no systematic collapse. Two-pass generation recovers 59-96% of response length. Linear probes predict response length with R²=0.51-0.93 for instruction-tuned models but negative R² for base models. Standard LLM-as-judge evaluation detects only 3.5% quality drop vs 23% in pairwise evaluation.
Conclusion: Instruction tuning creates fragility by coupling task competence to narrow surface-form templates, making models vulnerable to trivial constraints. This reveals a methodological blind spot in constrained generation assessment and shows that commercially deployed models share this vulnerability.
Abstract: Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14–48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77–100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59–96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$–$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.
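The two-pass intervention from the mechanistic analysis can be sketched with a generic model callable; the `model` function and the rewrite prompt below are hypothetical placeholders, not the paper's exact prompts.

```python
def two_pass(prompt, banned, model):
    """Two-pass constrained generation: draft with no constraint in
    sight, then ask the model to rewrite the draft so it avoids the
    banned string while keeping the content. `model(text) -> str` is
    any completion function (a real LLM in the paper; any callable here).
    """
    draft = model(prompt)  # pass 1: plan and answer freely
    rewritten = model(     # pass 2: enforce the lexical constraint
        f"Rewrite the following answer so it never uses {banned!r}, "
        f"preserving all of its content:\n\n{draft}"
    )
    return draft, rewritten
```

Because the response is planned before the constraint is applied, the rewrite retains most of the draft's content, which is the intuition behind the 59-96% length recovery the paper reports for this protocol.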
[73] Toward Autonomous Long-Horizon Engineering for ML Research
Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia
Main category: cs.CL
TL;DR: AiScientist: A system for autonomous long-horizon ML research engineering using hierarchical orchestration and File-as-Bus workspace for durable state continuity, achieving significant improvements on ML research benchmarks.
Details
Motivation: Long-horizon ML research engineering is challenging as agents must sustain coherent progress across multiple stages (task comprehension, environment setup, implementation, experimentation, debugging) over extended periods, requiring both structured orchestration and durable state continuity.
Method: AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through summaries and workspace maps, while specialized agents repeatedly re-ground on durable artifacts (analyses, plans, code, experimental evidence) rather than relying on conversational handoffs.
Result: AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies show the File-as-Bus protocol is key: removing it reduces the PaperBench score by 6.41 points and the MLE-Bench Lite score by 31.82 points.
Conclusion: Long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem. The File-as-Bus approach with durable state continuity is crucial for sustained performance.
Abstract: Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
[74] Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation
Yuqian Wu, Wei Chen, Zhengjun Huang, Junle Chen, Qingxiang Liu, Kai Wang, Xiaofang Zhou, Yuxuan Liang
Main category: cs.CL
TL;DR: A minimalist conversational memory framework that addresses signal sparsity in long dialogues through turn isolation retrieval and query-driven pruning, achieving robust performance with high efficiency.
Details
Motivation: Existing conversational memory systems suffer from context dilution as conversations grow; the primary bottleneck is the Signal Sparsity Effect in the latent knowledge manifold rather than the memory architecture itself.
Method: Proposes a minimalist framework with two key components: Turn Isolation Retrieval (TIR), which uses a max-activation strategy to capture turn-level signals instead of global aggregation, and Query-Driven Pruning (QDP), which removes redundant sessions and conversational filler to build a compact, high-density evidence set.
Result: Extensive experiments on multiple benchmarks show the method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency.
Conclusion: The framework establishes a new minimalist baseline for conversational memory by addressing signal sparsity through simple yet effective retrieval and pruning mechanisms, demonstrating that complex hierarchical summarization or reinforcement learning may not be necessary for effective long-term dialogue management.
Abstract: Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the \textit{Signal Sparsity Effect} within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: \textit{Decisive Evidence Sparsity}, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and \textit{Dual-Level Redundancy}, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. Extensive experiments on multiple benchmarks demonstrate that \method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.
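The contrast between max-activation and global aggregation is easy to see with a toy scorer; the token-set Jaccard overlap below is a stand-in for the dense retriever similarity the framework would actually use, and the sample session is invented for illustration.

```python
def overlap(a, b):
    """Jaccard overlap of lowercase token sets (illustrative similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta or tb else 0.0

def tir_score(query, turns):
    """Turn Isolation Retrieval: score a session by its single best turn,
    so one decisive turn cannot be diluted by surrounding filler."""
    return max((overlap(query, t) for t in turns), default=0.0)

def pooled_score(query, turns):
    """Aggregation baseline: averaging lets filler drown the signal."""
    return sum(overlap(query, t) for t in turns) / len(turns) if turns else 0.0

session = [
    "I parked the car on level 3 of the green garage",  # decisive turn
    "anyway, how was your weekend?",                    # filler
    "haha yeah, totally",                               # filler
]
query = "where did I park the car"
# tir_score keeps the decisive turn's full score; pooled_score dilutes
# it by the two zero-overlap filler turns
```

QDP then plays the complementary role: by dropping the filler turns before generation, even an aggregate reader sees only the high-density evidence.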
[75] E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, Jun Wang, Wei Zhang
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2409.06679 returned HTTP 429 (rate limited).
[76] GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization
Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Aleksandr Gordeev, Vladimir Dokholyan, Maksim Kuprashevich
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2410.23728 returned HTTP 429 (rate limited).
[77] Speaker effects in language comprehension: An integrative model of language and speaker processing
Hanlin Wu, Zhenguang G. Cai
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2412.07238 returned HTTP 429 (rate limited).
[78] AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought
Weihua Zheng, Xin Huang, Zhengyuan Liu, Tarun Kumar Vangani, Bowei Zou, Xiyan Tao, Yuhao Wu, Ai Ti Aw, Nancy F. Chen, Roy Ka-Wei Lee
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2501.16154 returned HTTP 429 (rate limited).
[79] Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data
Swati Rallapalli, Shannon Gallagher, Andrew O. Mellinger, Jasmine Ratchford, Anusha Sinha, Tyler Brooks, William R. Nichols, Nick Winski, Bryan Brown
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2503.10676 returned HTTP 429 (rate limited).
[80] Joint Flashback Adaptation for Forgetting-Resistant Instruction Tuning
Yukun Zhao, Lingyong Yan, Zhenyang Li, Shuaiqiang Wang, Zhumin Chen, Zhaochun Ren, Dawei Yin
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2505.15467 returned HTTP 429 (rate limited).
[81] Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning
Yihong Wu, Liheng Ma, Muzhi Li, Jiaming Zhou, Lei Ding, Jianye Hao, Ho-fung Leung, Irwin King, Yingxue Zhang, Jian-Yun Nie
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2505.17086 returned HTTP 429 (rate limited).
[82] Perception-Aware Policy Optimization for Multimodal Reasoning
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2507.06448 returned HTTP 429 (rate limited).
[83] Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
Amal Chebbi, Babajide Kolade
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2509.07177 returned HTTP 429 (rate limited).
[84] Variation in Verification: Understanding Verification Dynamics in Large Language Models
Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2509.17995 returned HTTP 429 (rate limited).
[85] DyBBT: Dynamic Balance via Bandit-inspired Targeting for Dialog Policy with Cognitive Dual-Systems
Shuyu Zhang, Yifan Wei, Jialuo Yuan, Xinru Wang, Yanmin Zhu, Bin Li, Yujie Liu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2509.19695 returned HTTP 429 (rate limited).
[86] HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST
Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li, Yujie Liu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2509.19742 returned HTTP 429 (rate limited).
[87] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
Shunfeng Zheng, Yudi Zhang, Meng Fang, Zihan Zhang, Zhitan Wu, Mykola Pechenizkiy, Ling Chen
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.00919 returned HTTP 429 (rate limited).
[88] LLM as Attention-Informed NTM and Topic Modeling as long-input Generation: Interpretability and long-Context Capability
Xuan Xu, Zhongliang Yang, Haolun Li, Beilin Chu, Rui Tian, Yu Li, Shaolin Tan, Linna Zhou
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.03174 returned HTTP 429 (rate limited).
[89] Enhancing Agentic Textual Graph Retrieval with Synthetic Stepwise Supervision
Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.03323 returned HTTP 429 (rate limited).
[90] Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
Raoyuan Zhao, Yihong Liu, Lena Altinger, Hinrich Schütze, Michael A. Hedderich
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.09536 returned HTTP 429 (rate limited).
[91] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
Qingyu Ren, Qianyu He, Powei Chang, Jie Zeng, Zeye Sun, Fei Yu, Jiaqing Liang, Yanghua Xiao
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.14420 returned HTTP 429 (rate limited).
[92] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation
Jinliang Liu, Jiale Bai, Shaoning Zeng
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.15552 returned HTTP 429 (rate limited).
[93] CoRoVA: Compressed Representations for Vector-Augmented Code Completion
Daria Cherniuk, Nikita Sukhorukov, Danil Gusak, Nikita Sushko, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.19644 returned HTTP 429 (rate limited).
[94] Why Did Apple Fall: Evaluating Curiosity in Large Language Models
Haoyu Wang, Sihang Jiang, Yuyan Chen, Xiaojun Meng, Jiansheng Wei, Yitong Wang, Yanghua Xiao
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2510.20635 returned HTTP 429 (rate limited).
[95] Retrieval as a Decision: Training-Free Adaptive Gating for Efficient RAG
Yufeng Wang, Lu wei, Haibin Ling
Main category: cs.CL
TL;DR: Summary unavailable: the arXiv API request for 2511.09803 returned HTTP 429 (rate limited).
[96] Reasoning about Intent for Ambiguous Requests
Irina Saparina, Mirella Lapata
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2511.10453 returned HTTP 429 (rate limited).
[97] Olmo 3
Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alexander Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lester James V. Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.13961 returned HTTP 429 (rate limited).
[98] FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions
Kris W Pan, Yongmin Yoo
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.02589 returned HTTP 429 (rate limited).
[99] Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, Cong Wang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.05524 returned HTTP 429 (rate limited).
[100] Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models
Rongji Li, Jian Xu, Yi Chen, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.08209 returned HTTP 429 (rate limited).
[101] Understanding or Memorizing? A Case Study of German Definite Articles in Language Models
Jonathan Drechsel, Erisa Bytyqi, Steffen Herbold
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.09313 returned HTTP 429 (rate limited).
[102] CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs
Yuanxiang Liu, Songze Li, Xiaoke Guo, Zhaoyan Gong, Qifei Zhang, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.11047 returned HTTP 429 (rate limited).
[103] Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Feijiang Han, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.14004 returned HTTP 429 (rate limited).
[104] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models
Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang, Siliang Tang, Jun Xiao
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2601.19917 returned HTTP 429 (rate limited).
[105] Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space
Felipe D. Toro-Hernández, Jesuino Vieira Filho, Rodrigo M. Cabral-Carvalho
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.05971 returned HTTP 429 (rate limited).
[106] Using Learning Progressions to Guide AI Feedback for Science Learning
Xin Xia, Nejla Yuruk, Yun Wang, Xiaoming Zhai
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.03249 returned HTTP 429 (rate limited).
[107] KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection
Archie Sage, Salvatore Greco
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.06552 returned HTTP 429 (rate limited).
[108] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence
Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.18203 returned HTTP 429 (rate limited).
[109] Hear Both Sides: Efficient Multi-Agent Debate via Diversity-Aware Message Retention
Manh Nguyen, Anh Nguyen, Dung Nguyen, Svetha Venkatesh, Hung Le
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.20640 returned HTTP 429 (rate limited).
[110] League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Xiaobing Sun, Tian Xia, Kai Chen, Xiaofeng Wang, Baosheng Wang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2507.22359 returned HTTP 429 (rate limited).
[111] KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
Shuai Wang, Yinan Yu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.21440 returned HTTP 429 (rate limited).
[112] Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping
Yao Chen, Yilong Chen, Yinqi Yang, Junyuan Shang, Zhenyu Zhang, Zefeng Zhang, Shuaiyi Nie, Shuohuan Wang, Yu Sun, Hua Wu, HaiFeng Wang, Tingwen Liu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.23998 returned HTTP 429 (rate limited).
[113] GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics
Yujing Wang, Yuanbang Liang, Yukun Lai, Hainan Zhang, Hanqi Yan
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.02830 returned HTTP 429 (rate limited).
[114] StoryScope: Investigating idiosyncrasies in AI fiction
Jenna Russell, Rishanth Rajendhran, Chau Minh Pham, Mohit Iyyer, John Wieting
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.03136 returned HTTP 429 (rate limited).
[115] Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
Jun Zhang, Yicheng Ji, Feiyang Ren, Yihang Li, Bowen Zeng, Zonghao Chen, Ke Chen, Lidan Shou, Gang Chen, Huan Li
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05546 returned HTTP 429 (rate limited).
[116] Graph-Based Chain-of-Thought Pruning for Reducing Redundant Reflections in Reasoning LLMs
Hongyuan Yuan, Xinran He, Run Shao, Bolei He, Xianwei Xue, Mengke Chen, Qiutong Pan, Haiwei Wang, Haifeng Li
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05643 returned HTTP 429 (rate limited).
[117] Measuring What Matters!! Assessing Therapeutic Principles in Mental-Health Conversation
Abdullah Mazhar, Het Riteshkumar Shah, Aseem Srivastava, Smriti Joshi, Md Shad Akhtar
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05795 returned HTTP 429 (rate limited).
[118] JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2510.23538 returned HTTP 429 (rate limited).
[119] CLEAR: Cross-Lingual Enhancement in Alignment via Reverse-training
Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05821 returned HTTP 429 (rate limited).
[120] AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
Guanran Luo, Wentao Qiu, Wanru Zhao, Wenhan Lv, Zhongquan Jian, Meihong Wang, Qingqiang Wu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.06812 returned HTTP 429 (rate limited).
[121] Many-Tier Instruction Hierarchy in LLM Agents
Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi
Main category: cs.CL
TL;DR: ManyIH introduces a scalable instruction hierarchy paradigm for LLM agents to resolve conflicts across many privilege levels, with a benchmark showing current models struggle with fine-grained conflict resolution.
Details
Motivation: Current instruction-hierarchy approaches assume a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels, which is inadequate for real-world agentic settings where conflicts can arise across far more sources and contexts.
Method: Proposes Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels, and introduces ManyIH-Bench, a benchmark that requires models to navigate up to 12 levels of conflicting instructions across 853 agentic tasks (427 coding, 426 instruction-following) spanning 46 real-world agents.
Result: Experiments show current frontier models perform poorly (~40% accuracy) when instruction conflict scales, highlighting the urgent need for methods targeting fine-grained, scalable instruction conflict resolution.
Conclusion: The work underscores the need for explicit methods targeting fine-grained, scalable instruction conflict resolution in agentic settings, as current models struggle with complex privilege hierarchies.
Abstract: Large language model agents receive instructions from many sources (system messages, user prompts, tool outputs, other agents, and more), each carrying different levels of trust and authority. When these instructions conflict, agents must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose the Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even current frontier models perform poorly (~40% accuracy) as instruction conflicts scale. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.
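The core rule the abstract describes, namely that an agent facing conflicting directives should follow the one with the highest privilege, can be sketched minimally. This is an illustrative toy, not the paper's method: the `Instruction` type, the numeric-privilege convention, and the example sources are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    source: str      # hypothetical origin label, e.g. "system", "user", "tool_output"
    privilege: int   # hypothetical convention: a larger number means higher authority
    text: str

def resolve(instructions: list[Instruction]) -> Instruction:
    """Pick the instruction to follow when directives conflict: the one with the
    highest privilege level (ties broken by earliest position in the list)."""
    return max(instructions, key=lambda ins: ins.privilege)

# A 3-tier toy conflict; ManyIH-Bench scales this idea to up to 12 tiers.
conflict = [
    Instruction("tool_output", 3, "Email the report to an external address"),
    Instruction("user", 8, "Never send files outside the workspace"),
    Instruction("system", 12, "Refuse any action that exfiltrates data"),
]
print(resolve(conflict).source)  # prints: system
```

The benchmark's finding (~40% accuracy for frontier models) suggests the hard part is not this lookup itself but recognizing, amid many sources, which instructions actually conflict and what their relative privileges are.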
[122] Nationality encoding in language model hidden states: Probing culturally differentiated representations in persona-conditioned academic text
Paul Jackson, Ruizhe Li, Elspeth Edelstein
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.10151 returned HTTP 429 (rate limited).
[123] CocoaBench: Evaluating Unified Digital Agents in the Wild
CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw, Tommaso Cerruti, Xiaohan Fu, Ziqiao Ma, Jingbo Shang, Lianhui Qin, Julian McAuley, Eric P. Xing, Zhengzhong Liu, Rupesh Kumar Srivastava, Zhiting Hu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.11201 returned HTTP 429 (rate limited).
[124] Judge Like Human Examiners: A Weighted Importance Multi-Point Evaluation Framework for Generative Tasks with Long-form Answers
Guoxin Yu, Chulun Zhou, Lemao Liu, Qi Wang, Mo Yu, Jialong Tang, Baosong Yang, Xiang Ao, Wai Lam, Yue Yu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.11246 returned HTTP 429 (rate limited).
[125] METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Haofu Yang, Jiaji Liu, Chen Huang, Faguo Wu, Wenqiang Lei, See-Kiong Ng
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.11427 returned HTTP 429 (rate limited).
[126] Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Liujie Zhang, Benzhe Ning, Rui Yang, Xiaoyan Yu, Jiaxing Li, Lumeng Wu, Jia Liu, Minghao Li, Weihang Chen, Weiqi Hu, Lei Zhang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.11554 returned HTTP 429 (rate limited).
[127] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.11748 returned HTTP 429 (rate limited).
[128] Large Language Models are Powerful Electronic Health Record Encoders
Stefan Hegselmann, Georg von Arnim, Tillmann Rheude, Noel Kronenberg, David Sontag, Gerhard Hindricks, Roland Eils, Benjamin Wild
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2502.17403 returned HTTP 429 (rate limited).
[129] On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions
Felix Stollenwerk
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2503.21708 returned HTTP 429 (rate limited).
[130] AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin
Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2505.14264 returned HTTP 429 (rate limited).
[131] SEW: Self-Evolving Agentic Workflows for Automated Code Generation
Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2505.18646 returned HTTP 429 (rate limited).
[132] Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework
Jiaqi Weng, Han Zheng, Hanyu Zhang, Ej Zhou, Qinqin He, Jialing Tao, Hui Xue, Zhixuan Chu, Xiting Wang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.18127 returned HTTP 429 (rate limited).
[133] SeedPrints: Fingerprints Can Even Tell Which Seed Your Large Language Model Was Trained From
Yao Tong, Haonan Wang, Siquan Li, Kenji Kawaguchi, Tianyang Hu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2509.26404 returned HTTP 429 (rate limited).
[134] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Miguel Carvalho, Helder Dias, Bruno Martins
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2511.19820 returned HTTP 429 (rate limited).
[135] Revisiting the Reliability of Language Models in Instruction-Following
Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2512.14754 returned HTTP 429 (rate limited).
[136] Reasoning Graphs: Self-Improving, Deterministic RAG through Evidence-Centric Feedback
Matthew Penaroza
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.07595 returned HTTP 429 (rate limited).
[137] ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.11236 returned HTTP 429 (rate limited).
[138] MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2602.23061 returned HTTP 429 (rate limited).
[139] Public Profile Matters: A Scalable Integrated Approach to Recommend Citations in the Wild
Karan Goyal, Dikshant Kukreja, Vikram Goyal, Mukesh Mohania
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2603.17361 returned HTTP 429 (rate limited).
[140] ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Annette Taberner-Miller
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.00136 returned HTTP 429 (rate limited).
[141] WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang
Main category: cs.CL
TL;DR: Summary unavailable; the arXiv API request for 2604.05818 returned HTTP 429 (rate limited).
cs.CV
[142] UniMark: Unified Adaptive Multi-bit Watermarking for Autoregressive Image Generators
Yigit Yilmaz, Elena Petrova, Mehmet Kaya, Lucia Rossi, Amir Rahman
Main category: cs.CV
TL;DR: A training-free watermarking framework for autoregressive image generators that enables multi-bit message embedding with adaptive semantic grouping and unified token replacement.
Details
Motivation: Existing AR image generation watermarking methods have three key limitations: they support only zero-bit watermarks for binary verification, rely on static codebook partitioning that is vulnerable to attack once the partition is exposed, and lack generalization across diverse AR paradigms.
Method: Proposes a framework with three components: Adaptive Semantic Grouping (ASG) for dynamic codebook partitioning based on semantic similarity and a secret key; Block-wise Multi-bit Encoding (BME) for reliable message transmission with error correction; and a Unified Token-Replacement Interface (UTRI) to support different AR paradigms (next-token and next-scale prediction).
Result: Achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction. Maintains robustness against various attacks including cropping, JPEG compression, noise, blur, color jitter, and random erasing.
Conclusion: The proposed framework addresses key limitations of existing AR watermarking methods by enabling multi-bit message embedding, improving security through adaptive partitioning, and providing generalization across different AR architectures.
Abstract: Invisible watermarking for autoregressive (AR) image generation has recently gained attention as a means of protecting image ownership and tracing AI-generated content. However, existing approaches suffer from three key limitations: (1) they embed only zero-bit watermarks for binary verification, lacking the ability to convey multi-bit messages; (2) they rely on static codebook partitioning strategies that are vulnerable to security attacks once the partition is exposed; and (3) they are designed for specific AR architectures, failing to generalize across diverse AR paradigms. We propose UniMark, a training-free, unified watermarking framework for autoregressive image generators that addresses all three limitations. UniMark introduces three core components: Adaptive Semantic Grouping (ASG), which dynamically partitions codebook entries based on semantic similarity and a secret key, ensuring both image quality preservation and security; Block-wise Multi-bit Encoding (BME), which divides the token sequence into blocks and encodes different bits across blocks with error-correcting codes for reliable message transmission; and a Unified Token-Replacement Interface (UTRI) that abstracts the watermark embedding process to support both next-token prediction (e.g., LlamaGen) and next-scale prediction (e.g., VAR) paradigms. We provide theoretical analysis on detection error rates and embedding capacity. Extensive experiments on three AR models demonstrate that UniMark achieves state-of-the-art performance in image quality (FID), watermark detection accuracy, and multi-bit message extraction, while maintaining robustness against cropping, JPEG compression, Gaussian noise, blur, color jitter, and random erasing attacks.
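The keyed partition and block-wise encoding described above can be sketched in a few lines. This is an illustrative approximation, not the paper's implementation: the real ASG clusters codebook entries by semantic similarity before assigning groups, whereas `keyed_partition` below splits purely by a keyed hash, BME's error-correcting code is omitted, and the `substitute` callback stands in for nearest-codeword replacement. All function names here are hypothetical.

```python
import hashlib
import hmac

def keyed_partition(codebook_ids, key):
    """Split codebook token ids into two groups with a secret key
    (hash-only stand-in for Adaptive Semantic Grouping)."""
    g0, g1 = set(), set()
    for tid in codebook_ids:
        digest = hmac.new(key, str(tid).encode(), hashlib.sha256).digest()
        (g1 if digest[0] & 1 else g0).add(tid)
    return g0, g1

def embed_bits(tokens, bits, block_size, groups, substitute):
    """Encode one message bit per block: every token in a block is moved
    into the group that encodes the block's bit."""
    out = list(tokens)
    for b, bit in enumerate(bits):
        target = groups[bit]
        for i in range(b * block_size, min((b + 1) * block_size, len(out))):
            if out[i] not in target:
                out[i] = substitute(out[i], target)
    return out

def extract_bits(tokens, block_size, groups):
    """Recover each block's bit by majority vote over group membership."""
    bits = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        votes = sum(1 for t in block if t in groups[1])
        bits.append(1 if 2 * votes > len(block) else 0)
    return bits
```

Because detection only needs the key and the group test, a zero-bit check falls out as the special case of a single block, which is why the multi-bit scheme strictly generalizes prior zero-bit approaches.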
[143] GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality
Zhiwei Zhang, Xingyuan Zeng, Xinkai Kong, Kunquan Zhang, Haoyuan Liang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu
Main category: cs.CV
TL;DR: GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating optical imagery, text descriptions, and DEM data to address challenges in mountainous agricultural monitoring.
Details
Motivation: Existing benchmarks focus on regular, flat farmland but lack support for complex terraced parcels in mountainous regions that require joint visual recognition, semantic discrimination, and terrain-aware geometric understanding.
Method: Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, supporting evaluation under Image-only, Image+Text, and Image+Text+DEM settings. Also proposes ETTerra, a multimodal baseline network for terraced parcel delineation.
Result: Extensive experiments show that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.
Conclusion: The multimodal approach combining visual, textual, and elevation data significantly improves terraced parcel extraction in challenging mountainous environments, addressing limitations of existing unimodal methods.
Abstract: Agricultural parcel extraction plays an important role in remote sensing-based agricultural monitoring, supporting parcel surveying, precision management, and ecological assessment. However, existing public benchmarks mainly focus on regular and relatively flat farmland scenes. In contrast, terraced parcels in mountainous regions exhibit stepped terrain, pronounced elevation variation, irregular boundaries, and strong cross-regional heterogeneity, making parcel extraction a more challenging problem that jointly requires visual recognition, semantic discrimination, and terrain-aware geometric understanding. Although recent studies have advanced visual parcel benchmarks and image-text farmland understanding, a unified benchmark for complex terraced parcel extraction under aligned image-text-DEM settings remains absent. To fill this gap, we present GTPBD-MM, the first multimodal benchmark for global terraced parcel extraction. Built upon GTPBD, GTPBD-MM integrates high-resolution optical imagery, structured text descriptions, and DEM data, and supports systematic evaluation under Image-only, Image+Text, and Image+Text+DEM settings. We further propose Elevation and Text guided Terraced parcel network (ETTerra), a multimodal baseline for terraced parcel delineation. Extensive experiments demonstrate that textual semantics and terrain geometry provide complementary cues beyond visual appearance alone, yielding more accurate, coherent, and structurally consistent delineation results in complex terraced scenes.
[144] MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs
Md Rakibul Haque, KM Arefeen Sultan, Tushar Kataria, Shireen Elhabian
Main category: cs.CV
TL;DR: MedConcept is an unsupervised framework that discovers latent medical concepts in pretrained vision-language models and grounds them in clinically verifiable textual semantics for interpretability.
Details
Motivation: Medical VLMs achieve strong performance but have opaque latent representations that limit clinical trust and explainability. Current interpretability methods are task-specific and do not provide reusable concept-level explanations from shared pretrained representations.
Method: Identifies sparse neuron-level concept activations from pretrained VLM representations, translates them into pseudo-report-style summaries, and uses a quantitative semantic verification protocol with an independent medical LLM as evaluator to assess concept alignment with radiology reports.
Result: Introduces three concept scores (Aligned, Unaligned, Uncertain) to quantify semantic support, contradiction, or ambiguity relative to radiology reports, providing a quantitative baseline for assessing interpretability in medical VLMs.
Conclusion: MedConcept enables physician-level inspection of internal model reasoning and provides a framework for trustworthy clinical deployment of pretrained medical VLMs through concept-based interpretability.
Abstract: While medical Vision-Language models (VLMs) achieve strong performance on tasks such as tumor or organ segmentation and diagnosis prediction, their opaque latent representations limit clinical trust and the ability to explain predictions. Interpretability of these multimodal representations is therefore essential for the trustworthy clinical deployment of pretrained medical VLMs. However, current interpretability methods, such as gradient- or attention-based visualizations, are often limited to specific tasks such as classification. Moreover, they do not provide concept-level explanations derived from shared pretrained representations that can be reused across downstream tasks. We introduce MedConcept, a framework that uncovers latent medical concepts in a fully unsupervised manner and grounds them in clinically verifiable textual semantics. MedConcept identifies sparse neuron-level concept activations from pretrained VLM representations and translates them into pseudo-report-style summaries, enabling physician-level inspection of internal model reasoning. To address the lack of quantitative evaluation in concept-based interpretability, we introduce a quantitative semantic verification protocol that leverages an independent pretrained medical LLM as a frozen external evaluator to assess concept alignment with radiology reports. We define three concept scores, Aligned, Unaligned, and Uncertain, to quantify semantic support, contradiction, or ambiguity relative to radiology reports and use them exclusively for post hoc evaluation. These scores provide a quantitative baseline for assessing interpretability in medical VLMs. All code, prompts, and data will be released on acceptance.
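The three-score evaluation reduces to a simple aggregation once the external evaluator has rendered a verdict per concept-report pair. A minimal sketch, assuming the evaluator's outputs are already mapped to three label strings; MedConcept uses a frozen medical LLM as that evaluator, which this snippet replaces with precomputed verdicts, and the names here are hypothetical.

```python
from collections import Counter

# Hypothetical label set mirroring the paper's Aligned/Unaligned/Uncertain scores.
LABELS = ("aligned", "unaligned", "uncertain")

def concept_scores(verdicts):
    """Aggregate per-report evaluator verdicts for one concept into the
    fraction of reports that support, contradict, or are ambiguous
    about that concept."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return {label: counts.get(label, 0) / len(verdicts) for label in LABELS}
```

The scores are used strictly post hoc, so a routine like this sits outside the model and never feeds back into training.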
[145] EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
Jianzhe Ma, Zhonghao Cao, Shangkui Chen, Yichen Xu, Wenxuan Wang, Qin Jin
Main category: cs.CV
TL;DR: EgoEsportsQA: A video QA benchmark for evaluating Video-LLMs in fast-paced esports environments, revealing gaps in tactical reasoning and micro-operation understanding.
Details
Motivation: Current Video-LLMs excel on slow-paced real-world videos but lack evaluation in high-velocity virtual environments. Existing benchmarks focus on daily activities, missing rigorous tests of fast, rule-bound reasoning in esports scenarios.
Method: Created the EgoEsportsQA benchmark with 1,745 QA pairs from professional esports matches across 3 FPS games using a six-stage pipeline. Questions are structured into a two-dimensional taxonomy: 11 cognitive-capability sub-tasks (perception/reasoning) and 6 esports-knowledge sub-tasks.
Result: State-of-the-art Video-LLMs achieve only 71.58% accuracy, showing significant gaps: better at basic visual perception than deep tactical reasoning, and better at macro-progression than micro-operations. Ablation experiments reveal architectural weaknesses.
Conclusion: The benchmark reveals limitations of current Video-LLMs in virtual environments, connects real-world and virtual domains, and provides guidance for optimizing esports applications and advancing Video-LLMs in egocentric environments.
Abstract: While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58% accuracy. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.
[146] V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
Chengkun Yue, Chuanzhi Xu, Jiangpeng He
Main category: cs.CV
TL;DR: V-Nutri: A video-based nutrition estimation framework that uses cooking process information from egocentric videos to improve dish-level nutrition estimation beyond single image approaches.
Details
Motivation: Existing nutrition estimation methods rely on single images of completed dishes, which is fundamentally limited because many nutritionally relevant ingredients (oils, sauces, mixed components) become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult.
Method: Proposes V-Nutri, a staged framework combining Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking-process keyframes. Includes a cooking-keyframe selection module, a VideoMamba-based event-detection model targeting ingredient-addition moments.
Result: Experiments on HD-EPIC dataset show process cues provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Benefit of process keyframes depends strongly on backbone representation capacity and event detection quality.
Conclusion: Cooking process information from egocentric videos can contribute to dish-level nutrition estimation, establishing first benchmark for video-based nutrition estimation with V-Nutri framework.
Abstract: Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on single images of the finally completed dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether the cooking process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we further manually annotated the HD-EPIC dataset and established the first benchmark for video-based nutrition estimation. Most importantly, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and cooking process keyframes extracted from the egocentric videos. V-Nutri also includes a cooking keyframes selection module, a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.
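The aggregation the fusion module performs can be illustrated with a plain mean-pool-and-blend over feature vectors. This is only a stand-in for V-Nutri's learned, lightweight fusion module: the real module is trained end-to-end, and the `alpha` weight and function names below are illustrative assumptions.

```python
def mean_pool(feats):
    """Average a list of equal-length feature vectors elementwise."""
    d = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(d)]

def fuse_features(final_frame_feat, keyframe_feats, alpha=0.5):
    """Blend the final-dish feature with mean-pooled cooking-process
    keyframe features; alpha weights the final frame."""
    pooled = mean_pool(keyframe_feats)
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(final_frame_feat, pooled)]
```

The blend makes the paper's failure mode concrete: if the keyframe selector fires on uninformative frames, the pooled term adds noise rather than evidence, which is consistent with the reported dependence on event-detection quality.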
[147] A Workflow to Efficiently Generate Dense Tissue Ground Truth Masks for Digital Breast Tomosynthesis
Tamerlan Mustafaev, Oleg Kruglov, Margarita Zuley, Luana de Mero Omena, Guilherme Muniz de Oliveira, Vitor de Sousa Franca, Bruno Barufaldi, Robert Nishikawa, Juhun Lee
Main category: cs.CV
TL;DR: A framework for semi-automatic segmentation of dense fibroglandular tissue in digital breast tomosynthesis (DBT) images that reduces annotation time by requiring only central slice annotation with iterative threshold adjustment across volume slices.
Details
Motivation: Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized breast cancer risk estimation, but algorithm development is limited by scarce human-delineated training data; fully manual segmentation is time-consuming and labor-intensive.
Method: The proposed framework lets users outline a rough ROI enclosing dense tissue on the central slice and select a segmentation threshold. The algorithm projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to maintain consistent dense-tissue delineation across the DBT volume.
Result: Evaluation on 44 DBT volumes showed high inter-reader agreement (median Dice = 0.84) and method accuracy (median Dice = 0.83) compared to manual segmentations on 176 slices from 20th and 80th percentile slices.
Conclusion: The framework substantially reduces annotation time and labor while maintaining high segmentation accuracy, addressing the critical need for annotated DBT data for algorithm development in breast cancer screening.
Abstract: Digital breast tomosynthesis (DBT) is now the standard of care for breast cancer screening in the USA. Accurate segmentation of fibroglandular tissue in DBT images is essential for personalized risk estimation, but algorithm development is limited by scarce human-delineated training data. In this study we introduce a time- and labor-saving framework to generate a human-annotated binary segmentation mask for dense tissue in DBT. Our framework enables a user to outline a rough region of interest (ROI) enclosing dense tissue on the central reconstructed slice of a DBT volume and select a segmentation threshold to generate the dense tissue mask. The algorithm then projects the ROI to the remaining slices and iteratively adjusts slice-specific thresholds to maintain consistent dense tissue delineation across the DBT volume. By requiring annotation only on the central slice, the framework substantially reduces annotation time and labor. We used 44 DBT volumes from the DBTex dataset for evaluation. Inter-reader agreement was assessed by computing patient-wise Dice similarity coefficients between segmentation masks produced by two radiologists, yielding a median of 0.84. Accuracy of the proposed method was evaluated by having a radiologist manually segment the 20th and 80th percentile slices from each volume (CC and MLO views; 176 slices total) and calculate Dice scores between the manual and proposed segmentations, yielding a median of 0.83.
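The slice-to-slice threshold propagation can be sketched as follows. This is a minimal illustration assuming the consistency criterion is matching the central slice's dense-area fraction inside the projected ROI; the paper's actual criterion, step size, and stopping rule may differ, and all names here are hypothetical.

```python
def dense_fraction(slice_px, roi, thresh):
    """Fraction of ROI pixels at or above the intensity threshold."""
    vals = [slice_px[r][c] for r, c in roi]
    return sum(v >= thresh for v in vals) / len(vals)

def propagate_threshold(volume, roi, central_idx, central_thresh,
                        tol=0.02, step=1, max_iter=200):
    """Project the central-slice ROI to every slice and nudge each slice's
    threshold until its dense-tissue fraction matches the central one."""
    target = dense_fraction(volume[central_idx], roi, central_thresh)
    thresholds = {central_idx: central_thresh}
    for idx, sl in enumerate(volume):
        if idx == central_idx:
            continue
        t = central_thresh
        for _ in range(max_iter):
            f = dense_fraction(sl, roi, t)
            if abs(f - target) <= tol:
                break
            # too much area above threshold -> raise it, and vice versa
            t += step if f > target else -step
        thresholds[idx] = t
    return thresholds
```

Because only the central slice is annotated by hand, the per-volume annotation cost collapses from hundreds of slices to one, which is the time saving the framework claims.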
[148] Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis
Miao Liu, Fangda Wei, Jing Wang, Xinyuan Qian
Main category: cs.CV
TL;DR: Paper introduces Listening Deepfake Detection (LDD) task, creates ListenForge dataset using 5 Listening Head Generation methods, and proposes MANet network that captures motion inconsistencies in listener videos guided by speaker audio.
Details
Motivation: Current deepfake detection focuses only on speaking scenarios, but realistic attacks involve both speaking and listening states. Listening deepfakes remain unexplored due to dataset scarcity, yet the lower quality of synthesized listening reactions offers a breakthrough opportunity.
Method: Proposes MANet (Motion-aware and Audio-guided Network), which captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Creates the ListenForge dataset using 5 Listening Head Generation methods.
Result: Existing Speaking Deepfake Detection models perform poorly in listening scenarios. MANet achieves significantly superior performance on ListenForge dataset, demonstrating effectiveness of the proposed approach.
Conclusion: Highlights need to rethink deepfake detection beyond traditional speaking-centric paradigm, opens new directions for multimodal forgery analysis in interactive communication settings.
Abstract: Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., generating fabricated content by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, thereby enhancing the realism and persuasiveness of the scenario. Although the detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methodologies, the relatively limited quality of synthesized listening reactions presents an excellent breakthrough opportunity for current deepfake detection efforts. In this paper, we present the task of Listening Deepfake Detection (LDD). We introduce ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios. In contrast, MANet achieves significantly superior performance on ListenForge. Our work highlights the necessity of rethinking deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.
[149] Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography
Manognya Lokesh Reddy, Zheng Liu
Main category: cs.CV
TL;DR: A framework using US license plates as fiducial markers for metric ranging in vehicles, combining plate detection, state identification, and hybrid depth fusion to estimate distance, velocity, and time-to-collision without training data.
Details
Motivation: Accurate, low-cost inter-vehicle distance estimation is needed for ADAS and autonomous driving. LiDAR/radar are expensive, while monocular camera methods suffer from scale ambiguity, require supervised training, and lack safety certification.
Method: 1) A four-method parallel plate detector for robust plate reading across lighting conditions. 2) A three-stage state identification engine combining OCR text matching, color scoring, and a lightweight neural network. 3) Hybrid depth fusion with inverse-variance weighting, online scale alignment, and a 1D constant-velocity Kalman filter for smoothed distance, velocity, and time-to-collision estimation.
Result: 2.3% coefficient of variation in character height measurements, 36% reduction in distance-estimate variance compared to prior plate-width methods. Mean absolute error of 2.3% at 10m, continuous output during brief occlusions, outperforming deep learning baselines by 5x in relative error.
Conclusion: The framework successfully resolves scale ambiguity in monocular distance estimation using license plates as passive fiducial markers, providing accurate, certifiable distance measurements without training data or active illumination.
Abstract: Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing OCR text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work. Extensive outdoor experiments confirm a mean absolute error of 2.3% at 10 m and continuous distance output during brief plate occlusions, outperforming deep learning baselines by a factor of five in relative error.
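The geometric core of the pipeline, pinhole-projection ranging from a known character height, inverse-variance fusion of independent depth cues, and a 1-D constant-velocity Kalman filter, can be sketched from the standard formulas. This is not the authors' code: the focal length, character height, and noise parameters below are illustrative assumptions.

```python
def plate_distance_m(focal_px, char_height_m, char_height_px):
    """Pinhole projection: range = focal_length * real_height / pixel_height."""
    return focal_px * char_height_m / char_height_px

def fuse_inverse_variance(estimates):
    """Fuse (value, variance) depth estimates, weighting each by 1/variance."""
    weights = [1.0 / var for _, var in estimates]
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / sum(weights)
    return value, 1.0 / sum(weights)

class ConstantVelocityKF:
    """1-D constant-velocity Kalman filter over [distance, velocity]."""

    def __init__(self, z0, dt, q=0.5, r=0.25):
        self.x = [z0, 0.0]                    # distance (m), velocity (m/s)
        self.P = [[1.0, 0.0], [0.0, 1.0]]     # state covariance
        self.dt, self.q, self.r = dt, q, r    # process / measurement noise

    def step(self, z):
        dt, P = self.dt, self.P
        # predict with F = [[1, dt], [0, 1]]
        x0 = self.x[0] + dt * self.x[1]
        x1 = self.x[1]
        p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        # update with range measurement z (H = [1, 0])
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        y = z - x0
        self.x = [x0 + k0 * y, x1 + k1 * y]
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x[0]
```

Time-to-collision then follows as the filtered distance divided by the closing speed (the negated velocity estimate) whenever that speed is positive.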
[150] EigenCoin: Sassanid coins classification based on Bhattacharyya distance
Rahele Allahverdi, Mohammad Mahdi Dehshibi, Azam Bastanfard, Daryoosh Akbarzadeh
Main category: cs.CV
TL;DR: EigenCoin method for imbalanced coin classification using manifold learning with Bhattacharyya distance, outperforming the compared algorithms with accuracy improvements of 9.45% to 21.75% while handling overfitting.
Details
Motivation: Addressing pattern recognition challenges with imbalanced databases, specifically applied to Sassanid coins classification, testing holistic vs feature-based approaches.
Method: EigenCoin manifold with Bhattacharyya distance consisting of three steps: manifold construction, mapping test data, and classification.
Result: EigenCoin outperformed other observed algorithms with accuracy improvements from 9.45% up to 21.75%, while demonstrating capability to handle over-fitting problems.
Conclusion: The EigenCoin approach is effective for imbalanced coin classification tasks, showing superiority over other methods in both accuracy and overfitting resistance.
Abstract: Solving pattern recognition problems on imbalanced databases is an active research topic that continues to attract attention. We therefore consider this problem in the application of Sassanid coin classification. Our focus is not only on proposing the EigenCoin manifold with Bhattacharyya distance for the classification task, but also on testing the influence of holistic and feature-based approaches. EigenCoin consists of three main steps, namely manifold construction, mapping of test data, and classification. Experiments show that EigenCoin outperformed the other evaluated algorithms, with accuracy improvements from 9.45% up to 21.75%, while also handling the over-fitting problem.
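The Bhattacharyya distance at the core of the classification step has a simple closed form for discrete distributions: D(p, q) = -ln Σᵢ √(pᵢ qᵢ). A minimal sketch of nearest-class assignment with this distance follows; the two-bin histograms and class labels are illustrative, not data from the paper.

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance D(p, q) = -ln(sum_i sqrt(p_i * q_i)) for discrete distributions."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

def classify(test_hist, class_hists):
    """Nearest-class assignment: pick the class whose reference histogram is closest."""
    return min(class_hists, key=lambda c: bhattacharyya_distance(test_hist, class_hists[c]))

# Illustrative two-bin class histograms (invented, not from the paper)
classes = {"Khosrow_II": [0.9, 0.1], "Ardashir_I": [0.1, 0.9]}
label = classify([0.8, 0.2], classes)
```

In the paper's pipeline the test sample would first be mapped onto the EigenCoin manifold before this distance-based classification step.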
[151] DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
Xinyue Li, Shubo Xu, Zhichao Zhang, Zhaolin Cai, Yitong Chen, Guangtao Zhai
Main category: cs.CV
TL;DR: DPC-VQA is a decoupling framework that uses frozen MLLMs for perceptual priors and lightweight calibration for video quality assessment, reducing training costs while maintaining performance.
Details
Motivation: Current MLLMs show promise for VQA but adapting them to new scenarios is expensive due to large-scale retraining and costly MOS annotations. The authors argue that pretrained MLLMs already provide useful perceptual priors, and the main challenge is efficiently calibrating these priors to target MOS spaces.
Method: DPC-VQA decouples perception and calibration: uses a frozen MLLM to provide base quality estimates and perceptual priors, and employs a lightweight calibration branch to predict residual corrections for target-scenario adaptation. This avoids costly end-to-end retraining.
Result: DPC-VQA achieves competitive performance on both UGC and AIGC benchmarks, using less than 2% of trainable parameters compared to conventional MLLM-based VQA methods, and remains effective with only 20% of MOS labels.
Conclusion: The decoupling approach enables efficient adaptation of MLLMs for VQA tasks, significantly reducing training costs and data requirements while maintaining reliable performance across different content types.
Abstract: Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
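The decoupling idea (a frozen base score plus a learned residual correction) can be illustrated with a toy calibration. Everything below is a hedged sketch: the affine relation between base scores and MOS, the feature set, and the linear residual model are assumptions standing in for the paper's MLLM and calibration branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a frozen scorer emits base quality scores in [1, 5], and the
# target MOS space is an (unknown) affine shift of them plus annotation noise.
n = 200
q_base = rng.uniform(1.0, 5.0, n)                      # frozen-model prior (never retrained)
mos = 0.8 * q_base + 0.7 + rng.normal(0.0, 0.1, n)     # target-scenario MOS labels

# Lightweight calibration branch: fit only the residual (mos - q_base) with a
# tiny linear model, leaving the frozen scorer untouched.
feats = np.stack([q_base, np.ones(n)], axis=1)
w, *_ = np.linalg.lstsq(feats, mos - q_base, rcond=None)

def calibrated_score(q):
    return q + (w[0] * q + w[1])                       # base estimate + predicted residual

rmse_raw = np.sqrt(np.mean((q_base - mos) ** 2))       # frozen prior alone
rmse_cal = np.sqrt(np.mean((calibrated_score(q_base) - mos) ** 2))
```

Because only the residual model is trained, the label and parameter budget stays tiny, which mirrors the paper's claim of effectiveness with a fraction of MOS labels and trainable parameters.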
[152] Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery
Chitra Banarjee, Patrick Kwon, Ania Lipat, Rui Xie, Chen Chen, Ladda Thiamwong
Main category: cs.CV
TL;DR: A pipeline using 3D Human Mesh Recovery to extract gait parameters from videos of older adults doing the Timed Up and Go test, showing correlations with IMU measurements and associations with fall risk metrics.
Details
Motivation: Current clinical gait assessment for older adults is limited to basic stopwatch measurements, lacking detailed spatiotemporal analysis. There's a need for accessible, ecologically valid gait analysis in community settings to better assess fall risk and overall health.
Method: Uses a 3D Human Mesh Recovery model to extract gait parameters from video recordings of older adults completing the Timed Up and Go test. Parameters include step time, sit-to-stand duration, and step length. Validation against IMU-based insole measurements and analysis using linear mixed effects models.
Result: Video-derived step time significantly correlated with IMU measurements. Shorter, more variable step lengths and longer sit-to-stand durations were associated with higher self-rated fall risk and fear of falling.
Conclusion: The pipeline enables accessible, ecologically valid gait analysis in community settings, providing clinically relevant insights into fall risk through video-based assessment.
Abstract: Gait assessment is a key clinical indicator of fall risk and overall health in older adults. However, standard clinical practice is largely limited to stopwatch-measured gait speed. We present a pipeline that leverages a 3D Human Mesh Recovery (HMR) model to extract gait parameters from recordings of older adults completing the Timed Up and Go (TUG) test. From videos recorded across different community centers, we extract and analyze spatiotemporal gait parameters, including step time, sit-to-stand duration, and step length. We found that video-derived step time was significantly correlated with IMU-based insole measurements. Using linear mixed effects models, we confirmed that shorter, more variable step lengths and longer sit-to-stand durations were predicted by higher self-rated fall risk and fear of falling. These findings demonstrate that our pipeline can enable accessible and ecologically valid gait analysis in community settings.
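Step time, one of the extracted parameters, is just the interval between consecutive heel strikes, so the video-vs-IMU validation reduces to correlating the two interval streams. The timestamps below are invented for illustration, not data from the study.

```python
import numpy as np

def step_times(strike_times):
    """Step time = interval between consecutive heel strikes (seconds)."""
    return np.diff(np.asarray(strike_times, dtype=float))

# Invented heel-strike timestamps (s) for one walk segment: the video pipeline
# and the IMU insoles time the same strikes slightly differently.
video_strikes = [0.00, 0.56, 1.15, 1.81, 2.38, 3.01, 3.70]
imu_strikes   = [0.00, 0.55, 1.15, 1.80, 2.38, 3.00, 3.70]

video_steps = step_times(video_strikes)
imu_steps = step_times(imu_strikes)

# Pearson correlation between the two step-time streams, plus a simple
# variability statistic of the kind used in gait analysis
r = np.corrcoef(video_steps, imu_steps)[0, 1]
step_cv = video_steps.std() / video_steps.mean()   # coefficient of variation
```

A high correlation between the video-derived and insole-derived step times is what supports using the camera pipeline as a stand-in for wearable sensors.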
[153] EDGE-Shield: Efficient Denoising-staGE Shield for Violative Content Filtering via Scalable Reference-Based Matching
Takara Taniguchi, Ryohei Shimizu, Duc Minh Vo, Kota Izumi, Shiqi Yang, Teppei Suzuki
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.06063 returned HTTP 406.
[154] Turbo-DDCM: Fast and Flexible Zero-Shot Diffusion-Based Image Compression
Amit Vaisman, Guy Ohayon, Hila Manor, Michael Elad, Tomer Michaeli
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.06424 returned HTTP 406.
[155] INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents
Somraj Gautam, Anathapindika Dravichi, Gaurav Harit
Main category: cs.CV
TL;DR: INDOTABVQA is a cross-lingual Table Visual Question Answering benchmark for Bahasa Indonesia document images, featuring 1,593 images with tables and multilingual questions to evaluate VLMs on document understanding tasks.
Details
Motivation: There's a need for language-diverse, domain-specific datasets to evaluate Vision-Language Models on real-world document understanding tasks, particularly for underrepresented languages like Bahasa Indonesia and cross-lingual settings.
Method: Created a dataset of 1,593 document images with tables in three visual styles, paired with question-answer sets in four languages (Bahasa Indonesia, English, Hindi, Arabic). Benchmarked leading VLMs and fine-tuned models with LoRA, testing spatial priors via table region coordinates.
Result: Found substantial performance gaps in VLMs, especially on complex tables and low-resource languages. Fine-tuning improved accuracy by 11.6-17.8%, and adding spatial priors boosted performance by 4-7%.
Conclusion: Targeted fine-tuning and spatial priors significantly enhance VLM performance on specialized document understanding. INDOTABVQA provides valuable resources for advancing cross-lingual, structure-aware document understanding research.
Abstract: We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and a LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset can be accessed on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA
[156] LoViF 2026: The First Challenge on Weather Removal in Videos
Chenghao Qian, Xin Li, Yeying Jin, Shangguan Sun, Yilian Zhong, Yuxiang Chen, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Ying Fu, Jianan Tian, Jifan Zhang, Chen Zhou, Junyang Jiang, Yuping Sun, Zhuohang Shi, Xiaojing Liu, Jiao Liu, Yatong Zhou, Shuai Liu, Qiang Deng, Jiajia Mi, Qianhao Luo, Weiling Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10655 returned HTTP 406.
[157] Ultra-low-light computer vision using trained photon correlations
Mandar M. Sohoni, Jérémie Laydevant, Mathieu Ouellet, Shi-Yuan Ma, Ryotatsu Yanagimoto, Benjamin A. Ash, Tatsuhiro Onodera, Tianyu Wang, Logan G. Wright, Peter L. McMahon
Main category: cs.CV
TL;DR: Correlation-aware training (CAT) optimizes correlated-photon illumination with Transformer backend for object recognition in ultra-low-light conditions, achieving 15% accuracy improvement over conventional methods.
Details
Motivation: Traditional correlated-photon illumination focuses on image reconstruction, but computer vision tasks like object recognition don't require perfect images. The goal is to develop a hybrid optical-electronic pipeline that leverages photon correlations specifically for inference tasks rather than reconstruction.
Method: Correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend. The system learns to benefit from photon correlations using ≤100 shots, with the Transformer backend processing the correlated illumination patterns for object recognition.
Result: Achieved up to 15 percentage points improvement in classification accuracy over conventional uncorrelated-illumination methods in ultra-low-light and noisy conditions. Also outperformed untrained correlated-photon illumination approaches.
Conclusion: Specializing photon correlation patterns for specific computer vision tasks (object recognition) and jointly training illumination with digital backends enables superior accuracy in photon-budget-constrained scenarios beyond traditional image reconstruction methods.
Abstract: Illumination using correlated photon sources has been established as an approach to allowing high-fidelity images to be reconstructed from noisy camera frames by taking advantage of the knowledge that signal photons are spatially correlated whereas detector clicks due to noise are uncorrelated. However, in computer-vision tasks, the goal is often not ultimately to reconstruct an image, but to make inferences about a scene – such as what object is present. Here we show how correlated-photon illumination can be used to gain an advantage in a hybrid optical-electronic computer-vision pipeline for object recognition. We demonstrate correlation-aware training (CAT): end-to-end optimization of a trainable correlated-photon illumination source and a Transformer backend in a way that the Transformer can learn to benefit from the correlations, using a small number (≤ 100) of shots. We show a classification accuracy enhancement of up to 15 percentage points over conventional, uncorrelated-illumination-based computer vision in ultra-low-light and noisy imaging conditions, as well as an improvement over using untrained correlated-photon illumination. Our work illustrates how specializing to a computer-vision task – object recognition – and training the pattern of photon correlations in conjunction with a digital backend allows us to push the limits of accuracy in highly photon-budget-constrained scenarios beyond existing methods focused on image reconstruction.
[158] ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon
Main category: cs.CV
TL;DR: ARGOS is a benchmark for multi-camera person search reformulated as interactive reasoning with information asymmetry, requiring agents to plan, question, and eliminate candidates using spatial-temporal tools.
Details
Motivation: Traditional person search methods assume complete information, but real-world scenarios involve vague witness statements and information asymmetry. The authors aim to create a more realistic benchmark that requires interactive reasoning with limited information.
Method: ARGOS reformulates person search as interactive reasoning where agents receive vague witness statements and must decide what to ask, when to invoke spatial/temporal tools, and interpret ambiguous responses. It uses a Spatio-Temporal Topology Graph encoding camera connectivity and transition times. The benchmark includes 2,691 tasks across 14 real-world scenarios in three tracks: semantic perception, spatial reasoning, and temporal reasoning.
Result: Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3). Ablations confirm removing domain-specific tools drops accuracy by up to 49.6 percentage points.
Conclusion: ARGOS presents a challenging benchmark for interactive reasoning in multi-camera person search, highlighting the importance of domain-specific tools and the current limitations of LLMs in handling information asymmetry and spatial-temporal reasoning.
Abstract: We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
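The Spatio-Temporal Topology Graph can be thought of as an edge map carrying validated transition-time windows, which lets an agent eliminate candidate sightings that are physically implausible. A minimal sketch follows; the camera names, time windows, and sightings are hypothetical, not taken from the benchmark.

```python
# Hypothetical Spatio-Temporal Topology Graph (STTG): directed edges carry
# empirically validated (min, max) transition times in seconds between cameras.
STTG = {
    ("cam_A", "cam_B"): (30, 90),
    ("cam_B", "cam_C"): (60, 180),
    ("cam_A", "cam_C"): (120, 300),
}

def transition_feasible(cam_from, cam_to, elapsed):
    """Could a person last seen at cam_from plausibly appear at cam_to after `elapsed` s?"""
    window = STTG.get((cam_from, cam_to))
    if window is None:
        return False            # no validated path between the two cameras
    lo, hi = window
    return lo <= elapsed <= hi

def prune_candidates(sightings, cam_from, t_from):
    """Eliminate candidate sightings inconsistent with the topology graph."""
    return [(cam, t) for cam, t in sightings
            if transition_feasible(cam_from, cam, t - t_from)]
```

For example, a person last seen at cam_A at t=50 cannot reach cam_C only 80 seconds later, so that candidate is eliminated without spending a question from the turn budget.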
[159] The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results
Xingyu Qiu, Yuqian Fu, Jiawei Geng, Bin Ren, Jiancheng Pan, Zongwei Wu, Hao Tang, Yanwei Fu, Radu Timofte, Nicu Sebe, Mohamed Elhoseiny, Lingyi Hong, Mingxi Cheng, Xingqi He, Runze Li, Xingdong Sheng, Wenqiang Zhang, Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou, Zhe Zhang, Yang Yang, Kaiyu Li, Bowen Fu, Zixuan Jiang, Ke Li, Hui Qiao, Xiangyong Cao, Xuanlong Yu, Youyang Sha, Longfei Liu, Di Yang, Xi Shen, Kyeongryeol Go, Taewoong Jang, Saiprasad Meesiyawar, Ravi Kirasur, Rakshita Kulkarni, Bhoomi Deshpande, Harsh Patil, Uma Mudenagudi, Shuming Hu, Chao Chen, Tao Wang, Wei Zhou, Qi Xu, Zhenzhao Xing, Dandan Zhao, Hanzhe Xia, Dongdong Lu, Zhe Zhang, Jingru Wang, Guangwei Huang, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Liwei Zhou, Bei Dou, Tao Wu, Zekang Fan, Junjie Liu, Adhémar de Senneville, Flavien Armangeon, Mengbers, Yazhe Lyu, Zhimeng Xin, Zijian Zhuang, Hongchun Zhu, Li Wang
Main category: cs.CV
TL;DR: NTIRE 2026 CD-FSOD Challenge report on cross-domain few-shot object detection with 128 registered participants and 696 submissions, analyzing innovative methods for detecting objects in unseen domains with limited annotations.
Details
Motivation: Cross-domain few-shot object detection (CD-FSOD) is challenging for existing detectors and few-shot learning approaches, especially when generalizing across distinct domains. The challenge aims to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions.
Method: The paper describes a challenge framework with open-source and closed-source tracks where participants submitted various approaches for CD-FSOD. The report analyzes submitted methods including innovative strategies that push performance frontiers in cross-domain few-shot detection.
Result: Strong community interest with 128 registered participants, 696 submissions, 31 active teams, and 19 teams submitting valid final results. The challenge successfully evaluated diverse approaches and identified methods that advance CD-FSOD performance.
Conclusion: The NTIRE 2026 CD-FSOD Challenge successfully promoted progress in cross-domain few-shot object detection, with community participation demonstrating innovative approaches to address the challenging problem of detecting objects in unseen domains with limited annotations.
Abstract: Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.
[160] A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery
Madeline Anderson, Mikhail Klassen, Ash Hoover, Kerri Cahoy
Main category: cs.CV
TL;DR: SkyScraper: An iterative multi-agent workflow that geocodes news articles and synthesizes captions for satellite image sequences, creating a multi-temporal event captioning dataset for remote sensing.
Details
Motivation: There's a lack of multi-temporal event captioning datasets in remote sensing (at least two images per sequence) due to the significant time and labor required to search for visible events in satellite imagery and label multi-temporal sequences.
Method: Developed SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences, using agentic feedback to surface new multi-temporal events.
Result: SkyScraper successfully finds 5x more events than traditional geocoding methods and was used to curate a new multi-temporal captioning dataset with 5,000 sequences from a large database of global news articles.
Conclusion: Agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery, and the framework supports journalism and reporting efforts by automatically identifying imagery related to news events.
Abstract: Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.
[161] SynthPix: A lightspeed PIV image generator
Antonio Terpin, Alan Bonomi, Francesco Banelli, Raffaello D’Andrea
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.09664 returned HTTP 429 (rate limited).
[162] TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur, Ye Xia, Sahil Dua, Tanmaya Dabral, Guangxing Han, Bohyung Han, Joshua Ainslie, Alex Bewley, Mithun Jacob, René Wagner, Washington Ramos, Krzysztof Choromanski, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
Main category: cs.CV
TL;DR: TIPSv2 improves vision-language models by enhancing patch-text alignment through distillation techniques, modified pretraining objectives, and better training strategies.
Details
Motivation: Current vision-language models struggle with aligning dense patch representations with corresponding text embeddings, limiting their effectiveness in downstream applications that require fine-grained understanding.
Method: Proposes iBOT++ objective where unmasked tokens also contribute to loss, patch-level distillation that surprisingly improves alignment beyond teacher models, modified exponential moving average setup, and caption sampling strategy for synthetic captions.
Result: TIPSv2 achieves strong performance on 9 tasks across 20 datasets, generally on par with or better than recent vision encoder models, with improved patch-text alignment capabilities.
Conclusion: The proposed techniques significantly enhance vision-language pretraining, particularly for dense patch-text alignment, resulting in versatile models suitable for diverse downstream applications.
Abstract: Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment – surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page at https://gdm-tipsv2.github.io/ .
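The abstract's description of iBOT++ (unmasked tokens also contribute directly to the loss) can be contrasted with standard iBOT in a schematic numpy sketch. This is a simplification under assumptions: soft teacher distributions per patch token and plain cross-entropy, without the centering/sharpening machinery of the real iBOT objective.

```python
import numpy as np

def token_cross_entropy(student_logits, teacher_probs):
    """Per-token cross-entropy between student log-softmax and soft teacher targets."""
    log_p = student_logits - np.log(np.sum(np.exp(student_logits), axis=-1, keepdims=True))
    return -np.sum(teacher_probs * log_p, axis=-1)       # shape: (num_tokens,)

def ibot_loss(student_logits, teacher_probs, mask):
    """Standard iBOT: the masked-image loss is averaged over masked tokens only."""
    ce = token_cross_entropy(student_logits, teacher_probs)
    return ce[mask].mean()

def ibot_pp_loss(student_logits, teacher_probs, mask, unmasked_weight=1.0):
    """iBOT++ as described in the abstract: unmasked tokens also contribute directly."""
    ce = token_cross_entropy(student_logits, teacher_probs)
    return (ce[mask].sum() + unmasked_weight * ce[~mask].sum()) / len(ce)
```

With `unmasked_weight=1.0` the loss simply averages over all patch tokens, which is the sense in which unmasked tokens "contribute directly"; a smaller weight would interpolate back toward standard iBOT.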
[163] INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT
Idan Tankel, Nir Mazor, Rafi Brada, Christina LeBedis, Guy ben-Yosef
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.14732 returned HTTP 429 (rate limited).
[164] Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection
Salar Adel Sabri, Ramadhan J. Mstafa
Main category: cs.CV
TL;DR: Novel deepfake detection method using Curvelet Transform with wedge-level attention and scale-aware spatial masking, achieving high accuracy on compressed facial content.
Details
Motivation: Deepfake detection methods relying on spatial-domain features degrade under compression, prompting need for frequency-domain approaches. Curvelet Transform offers superior directional and multiscale properties but remains unexplored for deepfake detection.
Method: Introduces Curvelet-based detection with wedge-level attention and scale-aware spatial masking to emphasize discriminative frequency components. Refined frequency cues are reconstructed and passed to modified pretrained Xception network for classification.
Result: Achieves 98.48% accuracy and 99.96% AUC on FaceForensics++ low compression, maintaining strong performance under high compression.
Conclusion: Curvelet Transform is effective for deepfake detection, offering robustness to compression and interpretability through frequency-domain analysis.
Abstract: The proliferation of sophisticated generative models has significantly advanced the realism of synthetic facial content, known as deepfakes, raising serious concerns about digital trust. Although modern deep learning-based detectors perform well, many rely on spatial-domain features that degrade under compression. This limitation has prompted a shift toward integrating frequency-domain representations with deep learning to improve robustness. Prior research has explored frequency transforms such as Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), and Wavelet Transform, among others. However, to the best of our knowledge, the Curvelet Transform, despite its superior directional and multiscale properties, remains entirely unexplored in the context of deepfake detection. In this work, we introduce a novel Curvelet-based detection approach that enhances feature quality through wedge-level attention and scale-aware spatial masking, both trained to selectively emphasize discriminative frequency components. The refined frequency cues are reconstructed and passed to a modified pretrained Xception network for classification. Evaluated on two compression qualities in the challenging FaceForensics++ dataset, our method achieves 98.48% accuracy and 99.96% AUC on FF++ low compression, while maintaining strong performance under high compression, demonstrating the efficacy and interpretability of Curvelet-informed forgery detection.
[165] Does Visual Token Pruning Improve Calibration? An Empirical Study on Confidence in MLLMs
Kaizhen Tan
Main category: cs.CV
TL;DR: Visual token pruning in MLLMs affects model calibration, not just accuracy; coverage-based pruning can improve calibration while maintaining accuracy, unlike saliency-based methods.
Details
Motivation: Existing work on visual token pruning for efficient inference in multimodal large language models mainly evaluates task accuracy, but overlooks how pruning affects model calibration (whether predicted confidence matches actual correctness).
Method: Evaluated LLaVA-1.5-7B on POPE and ScienceQA-IMG using Expected Calibration Error (ECE), Brier score, and AURC under various pruning strategies (SCOPE with different saliency weights, saliency-only pruning, FastV, random pruning) across multiple token budgets.
Result: On POPE, coverage-based SCOPE achieved substantially lower ECE than unpruned model while maintaining similar accuracy; reducing saliency weight improved calibration at all token budgets with minimal accuracy impact. Saliency-based pruning worsened calibration, and FastV caused severe degradation. On ScienceQA-IMG, pruning reduced ECE with stable/slightly improved accuracy.
Conclusion: Visual token pruning should be evaluated for confidence quality, not just accuracy, especially for multimodal systems needing reliable decisions; coverage-based pruning can improve calibration while maintaining efficiency.
Abstract: Visual token pruning is a widely used strategy for efficient inference in multimodal large language models (MLLMs), but existing work mainly evaluates it with task accuracy. In this paper, we study how visual token pruning affects model calibration, that is, whether predicted confidence matches actual correctness. Using LLaVA-1.5-7B on POPE and ScienceQA-IMG, we evaluate Expected Calibration Error (ECE), Brier score, and AURC under several pruning strategies, including SCOPE with different saliency weights, saliency-only pruning, FastV, and random pruning, across multiple token budgets. Our results show that pruning does not simply trade reliability for efficiency. On POPE, a pure-coverage setting in SCOPE achieves substantially lower ECE than the full unpruned model while maintaining similar accuracy. An internal alpha-sweep further shows a consistent trend: reducing the saliency weight improves calibration at all tested token budgets, while accuracy changes only slightly. In contrast, saliency-based pruning leads to worse calibration, and real FastV causes severe performance degradation in our setting. On ScienceQA-IMG, pruning also reduces ECE, with accuracy remaining stable or slightly improving. We additionally study the gap power exponent in coverage-based selection and find that its default setting is not always optimal. Overall, our results suggest that visual token pruning should be evaluated not only by accuracy, but also by confidence quality, especially for multimodal systems that need reliable decisions.
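For readers unfamiliar with the metrics, ECE and the Brier score are straightforward to compute. A minimal pure-Python sketch (the 10-bin equal-width binning is a common convention, not necessarily the paper's exact setup):

```python
# Expected Calibration Error (ECE): bin predictions by confidence, then
# average the |accuracy - confidence| gap per bin, weighted by bin size.
# Brier score: mean squared error between confidence and 0/1 correctness.

def ece(confidences, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(o for _, o in b) / len(b)
        err += (len(b) / total) * abs(acc - avg_conf)
    return err

def brier(confidences, correct):
    return sum((c - o) ** 2 for c, o in zip(confidences, correct)) / len(confidences)
```

A well-calibrated model that answers with 90% confidence should be right about 90% of the time; ECE measures the deviation from that ideal.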
[166] Privacy-Preserving Structureless Visual Localization via Image Obfuscation
Vojtech Panek, Patrik Beliansky, Zuzana Kukelova, Torsten Sattler
Main category: cs.CV
TL;DR: Privacy-preserving visual localization using simple image obfuscation (e.g., semantic segmentations) with existing structureless pipelines, achieving state-of-the-art accuracy while protecting query images and scene representations.
Details
Motivation: Cloud-based visual localization systems raise privacy concerns as they reveal private details through images sent to servers or representations stored on servers. Existing privacy-preserving approaches are complex, slow, and less accurate than non-privacy-preserving methods.
Method: Uses structureless localization methods with simple image obfuscation based on common image operations, such as replacing RGB images with semantic segmentations. Shows that existing structureless pipelines require no special adjustments as modern feature matchers can match obfuscated images directly.
Result: Achieves state-of-the-art pose accuracy for privacy-preserving approaches on multiple datasets. The method is easy to implement and ensures privacy of both query images and scene representations.
Conclusion: Simple image obfuscation combined with structureless localization provides an effective privacy-preserving solution that maintains high accuracy while being straightforward to implement.
Abstract: Visual localization is the task of estimating the camera pose of an image relative to a scene representation. In practice, visual localization systems are often cloud-based. Naturally, this raises privacy concerns in terms of revealing private details through the images sent to the server or through the representations stored on the server. Privacy-preserving localization aims to avoid such leakage of private details. However, the resulting localization approaches are significantly more complex, slower, and less accurate than their non-privacy-preserving counterparts. In this paper, we consider structureless localization methods in the context of privacy preservation. Structureless methods represent the scene through a set of reference images with known camera poses and intrinsics. In contrast to existing methods proposing representations that are as privacy-preserving as possible, we study a simple image obfuscation approach based on common image operations, e.g., replacing RGB images with (semantic) segmentations. We show that existing structureless pipelines do not need any special adjustments, as modern feature matchers can match obfuscated images out of the box. The results are easy-to-implement pipelines that can ensure both the privacy of the query images and the scene representations. Detailed experiments on multiple datasets show that the resulting methods achieve state-of-the-art pose accuracy for privacy-preserving approaches.
[167] OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA
Maaike Galama, Nina Kozar-Gillan, Christina Embacher, Todd Dembo, Cornelius Böhm, Evelyn Ramberger, Julika Ribbat-Idel, Rosemarie Krupar, Verena Aumiller, Miriam Hägele, Kai Standvoss, Gerrit Erdmann, Blanca Pablos, Ari Angelo, Simon Schallenberg, Andrew Norgan, Viktor Matyas, Klaus-Robert Müller, Maximilian Alber, Lukas Ruff, Frederick Klauschen
Main category: cs.CV
TL;DR: OpenTME is an open-access dataset of tumor microenvironment profiles from 3,634 H&E-stained whole-slide images across five cancer types, generated using AI-powered pathology foundation models with over 4,500 quantitative readouts per slide.
Details
Motivation: Large-scale, consistent, and quantitative characterization of the tumor microenvironment from routine H&E-stained histopathology is scarce, limiting research in cancer progression, treatment response, and biomarker discovery.
Method: Used Atlas H&E-TME, an AI-powered application built on pathology foundation models, to perform tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis on 3,634 whole-slide images from TCGA across five cancer types.
Result: Created OpenTME dataset with over 4,500 quantitative readouts per slide at cell-level resolution, available for non-commercial academic research on Hugging Face, covering bladder, breast, colorectal, liver, and lung cancers.
Conclusion: OpenTME serves as a valuable resource for biomarker discovery, spatial biology research, and computational method development for tumor microenvironment analysis, with plans for future expansion.
Abstract: The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.
[168] INST-Align: Implicit Neural Alignment for Spatial Transcriptomics via Canonical Expression Fields
Bonian Han, Cong Qi, Przemyslaw Musialski, Zhi Wei
Main category: cs.CV
TL;DR: INST-Align: An unsupervised framework for joint alignment and reconstruction of spatial transcriptomics data using coordinate-based deformation networks and shared canonical expression fields.
Details
Motivation: Spatial transcriptomics faces challenges with large non-rigid deformations across tissue slices and inter-slice batch effects when alignment and integration are treated as separate problems, limiting accurate 3D reconstruction and analysis.
Method: Proposes INST-Align with a coordinate-based deformation network coupled with a shared Canonical Expression Field (implicit neural representation mapping spatial coordinates to expression embeddings). Uses two-phase training: first establishes stable canonical embedding space, then jointly optimizes deformation and spatial-feature matching with cross-slice parameter sharing.
Result: Achieves state-of-the-art performance across nine datasets with mean OT Accuracy (0.702), NN Accuracy (0.719), and Chamfer distance reductions up to 94.9% on large-deformation sections compared to baselines. Produces biologically meaningful spatial embeddings and coherent 3D tissue reconstruction.
Conclusion: INST-Align successfully addresses coupled alignment and integration challenges in spatial transcriptomics through joint optimization, enabling accurate 3D reconstruction and meaningful spatial analysis of tissue organization.
Abstract: Spatial transcriptomics (ST) measures mRNA expression while preserving spatial organization, but multi-slice analysis faces two coupled difficulties: large non-rigid deformations across slices and inter-slice batch effects when alignment and integration are treated independently. We present INST-Align, an unsupervised pairwise framework that couples a coordinate-based deformation network with a shared Canonical Expression Field, an implicit neural representation mapping spatial coordinates to expression embeddings, for joint alignment and reconstruction. A two-phase training strategy first establishes a stable canonical embedding space and then jointly optimizes deformation and spatial-feature matching, enabling mutually constrained alignment and representation learning. Cross-slice parameter sharing of the canonical field regularizes ambiguous correspondences and absorbs batch variation. Across nine datasets, INST-Align achieves state-of-the-art mean OT Accuracy (0.702), NN Accuracy (0.719), and Chamfer distance, with Chamfer reductions of up to 94.9% on large-deformation sections relative to the strongest baseline. The framework also yields biologically meaningful spatial embeddings and coherent 3D tissue reconstruction. The code will be released after the review phase.
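The core idea of a shared canonical field queried through per-slice deformations can be sketched in a few lines. This toy version (random, untrained weights; layer sizes are arbitrary) only illustrates the data flow, not the paper's architecture:

```python
import math
import random

random.seed(0)

def mlp_init(sizes):
    # One weight matrix per layer pair, rows = output units.
    return [[[random.gauss(0, 1 / math.sqrt(m)) for _ in range(m)] for _ in range(n)]
            for m, n in zip(sizes, sizes[1:])]

def mlp(params, x):
    for layer in params[:-1]:
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in layer]
    return [sum(w * v for w, v in zip(row, x)) for row in params[-1]]

# Shared canonical field: canonical coordinates -> expression embedding.
canonical = mlp_init([2, 16, 8])
# Per-slice deformation network: slice coordinates -> coordinate offset.
deform = mlp_init([2, 16, 2])

def query(xy):
    dx = mlp(deform, xy)                       # predicted non-rigid offset
    warped = [c + d for c, d in zip(xy, dx)]   # warp into canonical space
    return mlp(canonical, warped)              # shared expression embedding
```

Because all slices decode through the same `canonical` field, batch variation has to be absorbed by the deformation side, which is the regularizing effect the paper exploits.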
[169] PC-MIL: Decoupling Feature Resolution from Supervision Scale in Whole-Slide Learning
Syed Fahim Ahmed, Gnanesh Rasineni, Florian Koehler, Abu Zahid Bin Aziz, Mei Wang, Attila Gyulassy, Brian Summa, J. Quincy Brown, Valerio Pascucci, Shireen Y. Elhabian
Main category: cs.CV
TL;DR: PC-MIL introduces progressive multi-context supervision in WSI classification to preserve anatomical structure, decoupling feature resolution from supervision scale to improve cross-context generalization.
Details
Motivation: Standard slide-level MIL for WSI classification is underconstrained - it only optimizes global labels, encouraging feature aggregation without learning anatomically meaningful localization. This creates a mismatch between supervision scale and clinical reasoning where clinicians assess millimeter-scale regions.
Method: PC-MIL treats spatial extent of supervision as a design dimension, using fixed 20x features while varying MIL bag extent in millimeter units. It anchors supervision at clinically motivated 2mm scale and progressively mixes slide- and region-level supervision in controlled proportions for explicit train-context x test-context analysis.
Result: On 1,476 prostate WSIs from five public datasets for binary cancer detection, modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance.
Conclusion: Anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution. Supervision extent shapes MIL inductive bias and supports anatomically grounded WSI generalization.
Abstract: Whole-slide image (WSI) classification in computational pathology is commonly formulated as slide-level Multiple Instance Learning (MIL) with a single global bag representation. However, slide-level MIL is fundamentally underconstrained: optimizing only global labels encourages models to aggregate features without learning anatomically meaningful localization. This creates a mismatch between the scale of supervision and the scale of clinical reasoning. Clinicians assess tumor burden, focal lesions, and architectural patterns within millimeter-scale regions, whereas standard MIL is trained only to predict whether “somewhere in the slide there is cancer.” As a result, the model’s inductive bias effectively erases anatomical structure. We propose Progressive-Context MIL (PC-MIL), a framework that treats the spatial extent of supervision as a first-class design dimension. Rather than altering magnification, patch size, or introducing pixel-level segmentation, we decouple feature resolution from supervision scale. Using fixed 20x features, we vary MIL bag extent in millimeter units and anchor supervision at a clinically motivated 2mm scale to preserve comparable tumor burden and avoid confounding scale with lesion density. PC-MIL progressively mixes slide- and region-level supervision in controlled proportions, enabling explicit train-context x test-context analysis. On 1,476 prostate WSIs from five public datasets for binary cancer detection, we show that anatomical context is an independent axis of generalization in MIL, orthogonal to feature resolution: modest regional supervision improves cross-context performance, and balanced multi-context training stabilizes accuracy across slide and regional evaluation without sacrificing global performance. These results demonstrate that supervision extent shapes MIL inductive bias and support anatomically grounded WSI generalization.
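The key move, regrouping fixed-resolution patch features into bags of varying millimeter extent, can be illustrated with a toy sketch (mean pooling stands in for whatever MIL aggregator is used; the grid keying and `extent_mm` parameter are illustrative assumptions, not the paper's exact scheme):

```python
# Decouple feature resolution from supervision scale: the patch features
# are fixed (extracted at 20x), but the bags they are grouped into vary
# in spatial extent, measured in millimeters.

def make_bags(patches, extent_mm):
    """patches: list of ((x_mm, y_mm), feature_vector) at fixed resolution."""
    bags = {}
    for (x, y), feat in patches:
        key = (int(x // extent_mm), int(y // extent_mm))  # grid cell in mm
        bags.setdefault(key, []).append(feat)
    return bags

def mean_pool(feats):
    # Simple stand-in for a MIL aggregator (e.g. attention pooling).
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]
```

With `extent_mm` large enough to cover the slide this reduces to standard slide-level MIL; at a 2 mm extent each bag corresponds to a clinically sized region.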
[170] PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation
Minjae Lee, Sungwoo Hur, Soojin Hwang, Won Hwa Kim
Main category: cs.CV
TL;DR: PR-MaGIC is a training-free test-time framework that refines prompts for in-context segmentation using gradient flow from SAM’s mask decoder, improving segmentation quality without additional training.
Details
Motivation: Current in-context segmentation methods using Visual Foundation Models like SAM still generate sub-optimal prompts due to visual inconsistencies between support and query images, degrading segmentation quality.
Method: PR-MaGIC refines prompts via gradient flow derived from SAM’s mask decoder, integrating seamlessly into in-context segmentation frameworks with a simple top-1 selection strategy for stabilization.
Result: Extensive evaluations show PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.
Conclusion: PR-MaGIC provides a theoretically grounded yet practical solution for prompt refinement in in-context segmentation, enhancing SAM-based segmentation without training overhead.
Abstract: Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM’s mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.
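To make the gradient-flow idea concrete, here is a toy test-time refinement loop. It treats a point prompt as a continuous variable and ascends a differentiable score by finite differences; the quadratic `score_fn` used in the usage note is a stand-in for the mask decoder's confidence, not the paper's actual objective, and `lr`, `steps`, and `eps` are illustrative hyperparameters:

```python
# Refine a continuous prompt (e.g. a 2D point) by following the gradient
# of a differentiable mask score. Finite differences stand in for
# backpropagation through the decoder.

def refine_prompt(score_fn, prompt, lr=0.1, steps=50, eps=1e-4):
    p = list(prompt)
    for _ in range(steps):
        grad = []
        for i in range(len(p)):            # finite-difference gradient
            q = list(p)
            q[i] += eps
            grad.append((score_fn(q) - score_fn(p)) / eps)
        p = [v + lr * g for v, g in zip(p, grad)]  # ascend the score
    return p
```

For example, with `score_fn = lambda p: -((p[0] - 3) ** 2 + (p[1] - 1) ** 2)`, refinement starting at the origin converges toward the maximum at (3, 1); in PR-MaGIC the analogous gradient comes from SAM's mask decoder, so no extra training is needed.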
[171] HTDC: Hesitation-Triggered Differential Calibration for Mitigating Hallucination in Large Vision-Language Models
Xinyun Liu
Main category: cs.CV
TL;DR: HTDC is a training-free decoding framework that reduces hallucinations in large vision-language models by activating calibration only at hesitation-prone steps, using visual and semantic nullification probes to suppress hallucination-prone candidates.
Details
Motivation: Large vision-language models suffer from hallucinations due to unstable visual grounding and over-reliance on language priors. Existing training-free methods apply calibration at every decoding step, causing unnecessary computation and potentially disrupting stable predictions.
Method: Proposes Hesitation-Triggered Differential Calibration (HTDC), which identifies layer-wise hesitation (fluctuations in token preference across intermediate layers) as a signal of grounding instability. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates.
Result: Experiments on hallucination benchmarks show HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving favorable trade-off between effectiveness and computational overhead.
Conclusion: HTDC provides an efficient training-free decoding framework that selectively activates calibration only when needed, effectively reducing hallucinations in vision-language models without disrupting stable predictions.
Abstract: Large vision-language models (LVLMs) achieve strong multimodal performance, but still suffer from hallucinations caused by unstable visual grounding and over-reliance on language priors. Existing training-free decoding methods typically apply calibration at every decoding step, introducing unnecessary computation and potentially disrupting stable predictions. We address this problem by identifying layer-wise hesitation, a simple signal of grounding instability reflected by fluctuations in token preference across intermediate layers. Based on this observation, we propose Hesitation-Triggered Differential Calibration (HTDC), a training-free decoding framework that preserves standard full-branch inference and activates calibration only at hesitation-prone steps. When triggered, HTDC contrasts the full branch with two lightweight probes, a visual-nullification probe and a semantic-nullification probe, to suppress hallucination-prone candidates while avoiding unnecessary intervention on stable steps. Experiments on representative hallucination benchmarks show that HTDC consistently reduces hallucinations while maintaining strong task accuracy, achieving a favorable trade-off between effectiveness and computational overhead.
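A schematic of the trigger-then-calibrate logic, in pure Python. The hesitation test (top candidate flips across intermediate-layer logits) follows the paper's description, but the single subtracted probe with weight `alpha` is a simplification of the two-probe contrast:

```python
# Decode with the full branch; only when intermediate layers disagree on
# the top token (hesitation) do we subtract scaled probe logits that
# capture prior-driven shortcuts. alpha is a hypothetical weight.

def hesitates(layer_logits):
    tops = [max(range(len(l)), key=l.__getitem__) for l in layer_logits]
    return len(set(tops)) > 1          # top candidate flipped across layers

def calibrate(full, probe, alpha=0.5):
    return [f - alpha * p for f, p in zip(full, probe)]

def decode_step(layer_logits, probe_logits, alpha=0.5):
    full = layer_logits[-1]            # final-layer logits
    if hesitates(layer_logits):
        full = calibrate(full, probe_logits, alpha)
    return max(range(len(full)), key=full.__getitem__)
```

On stable steps the probe is never evaluated against the output, which is where the computational savings over always-on contrastive decoding come from.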
[172] Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models
Md Tanvirul Alam
Main category: cs.CV
TL;DR: VLMs exhibit semantic fixation - they preserve default interpretations even when prompts specify alternative valid mappings, revealed through controlled game benchmarks showing consistent accuracy gaps between standard and inverse rule formulations.
Details
Motivation: Large vision-language models often rely on familiar semantic priors, but existing evaluations don't cleanly separate perception failures from rule-mapping failures. The authors want to study this behavior as semantic fixation - preserving default interpretations even when prompts specify alternative valid mappings.
Method: Introduce VLM-Fix, a controlled benchmark over four abstract strategy games evaluating identical terminal board states under paired standard and inverse rule formulations. Test 14 open and closed VLMs, use prompt interventions (neutral vs. semantically loaded aliases), examine post-training effects, and apply late-layer activation steering.
Result: Accuracy consistently favors standard rules across all models, revealing a robust semantic-fixation gap. Neutral alias prompts narrow the inverse-rule gap while semantically loaded aliases reopen it. Post-training shows strong rule alignment, and late-layer activation steering can partially recover degraded performance.
Conclusion: VLMs exhibit semantic fixation that persists across models and interventions. This behavior is partly editable in late representations, suggesting opportunities for targeted interventions to improve model flexibility in rule interpretation.
Abstract: Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.
[173] ViLL-E: Video LLM Embeddings for Retrieval
Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, Sheng Liu, Son Tran, Mubarak Shah
Main category: cs.CV
TL;DR: ViLL-E is a unified VideoLLM architecture with adaptive embedding generation that improves video retrieval and temporal localization while maintaining VideoQA performance, achieving SotA results in composed video retrieval and retrieval from long text.
Details
Motivation: Current VideoLLMs excel at text-based video understanding tasks but underperform specialized embedding models in retrieval tasks like Text-to-Video Retrieval and Moment Retrieval. There's a need for a unified architecture that can handle both generative and retrieval tasks effectively.
Method: ViLL-E introduces a novel embedding generation mechanism with adaptive computation (thinking longer for complex videos, stopping early for easy ones). Uses three-stage training: 1) large-scale pre-training with video-caption pairs, 2) continual training on detailed-caption dataset, 3) task-specific fine-tuning on multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching.
Result: Significantly improves temporal localization (avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models). Achieves performance comparable to SotA specialized embedding models while remaining competitive on VideoQA. Unlocks new zero-shot capabilities: +5% over SotA in composed video retrieval and +2% over SotA in retrieval from long text.
Conclusion: ViLL-E demonstrates that a unified VideoLLM architecture with joint contrastive-generative training can effectively handle both generative and retrieval tasks, bridging the gap between VideoLLMs and specialized embedding models while enabling new zero-shot capabilities.
Abstract: Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in Retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to “think longer” for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video retrieval (up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
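The "think longer / stop early" behavior can be sketched as an early-exit readout loop: emit an embedding after each block and stop once successive readouts stabilize. The stopping rule and `tol` threshold here are illustrative assumptions, not the paper's actual mechanism:

```python
# Adaptive embedding generation: run the transformer block stack, reading
# out an embedding at each depth, and exit early once the readout stops
# changing. Easy inputs exit sooner; complex ones use the full depth.

def adaptive_embed(blocks, x, tol=1e-3):
    prev = None
    for depth, block in enumerate(blocks, 1):
        x = block(x)
        if prev is not None and max(abs(a - b) for a, b in zip(x, prev)) < tol:
            return x, depth        # readout stabilized: stop early
        prev = list(x)
    return x, len(blocks)          # used the full stack
```

The returned depth makes the cost adaptive per input, which is what lets a unified model stay competitive with fixed-cost dual encoders on retrieval.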
[174] Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution
Sebastian Cajas, Ashaba Judith, Rahul Gorijavolu, Sahil Kapadia, Hillary Clinton Kasimbazi, Leo Kinyera, Emmanuel Paul Kwesiga, Sri Sri Jaithra Varma Manthena, Luis Filipe Nakayama, Ninsiima Doreen, Leo Anthony Celi
Main category: cs.CV
TL;DR: Medical image super-resolution using latent diffusion models benefits significantly from domain-specific VAEs rather than generic ones, with MedVAE improving PSNR by 2.91-3.29 dB across medical imaging modalities.
Details
Motivation: Current latent diffusion models for medical image super-resolution use VAEs designed for natural photographs, which may not be optimal for medical imaging. The paper investigates whether this default choice constrains reconstruction quality.
Method: Replaced the generic Stable Diffusion VAE with MedVAE (domain-specific autoencoder pretrained on 1.6M medical images) while keeping all other pipeline components fixed. Conducted controlled experiments across knee MRI, brain MRI, and chest X-ray datasets (n=1,820). Used wavelet decomposition to analyze frequency bands and performed ablations across inference schedules, prediction targets, and generative architectures.
Result: MedVAE yielded +2.91 to +3.29 dB PSNR improvement across all medical imaging modalities with large effect sizes (Cohen’s d = 1.37 to 1.86). Wavelet decomposition showed advantages in finest spatial frequency bands encoding anatomical fine structure. The performance gap remained stable (±0.15 dB) across ablations while hallucination rates remained comparable between methods.
Conclusion: The VAE choice, not the diffusion architecture, is the dominant constraint on reconstruction quality in medical image super-resolution. Autoencoder reconstruction quality predicts downstream SR performance (R²=0.67), suggesting domain-specific VAE selection should precede diffusion architecture search.
Abstract: Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen’s d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus 0.15 dB, while hallucination rates remain comparable between methods (Cohen’s h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.
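For context on the reported gains: PSNR is 10 * log10(MAX^2 / MSE), so a +3 dB improvement corresponds to roughly halving the mean squared error. A minimal implementation over flattened pixel arrays:

```python
import math

# Peak Signal-to-Noise Ratio between a reference image and a
# reconstruction, with pixel values in [0, max_val].

def psnr(ref, rec, max_val=1.0):
    mse = sum((r - x) ** 2 for r, x in zip(ref, rec)) / len(ref)
    return 10 * math.log10(max_val ** 2 / mse)
```

Halving the MSE raises PSNR by exactly 10 * log10(2) ≈ 3.01 dB, so the paper's +2.91 to +3.29 dB gains amount to cutting reconstruction error roughly in half.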
[175] VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
Parth Parag Kulkarni, Rohit Gupta, Prakash Chandra Chhipa, Mubarak Shah
Main category: cs.CV
TL;DR: VidTAG is a dual-encoder framework for fine-grained video geolocalization that performs frame-to-GPS retrieval using self-supervised and language-aligned features, with temporal consistency modules for trajectory generation.
Details
Motivation: Existing video geolocalization methods have limitations: classification-based approaches only provide coarse city-level granularity, while image retrieval methods require extensive image galleries that are infeasible to compile globally. GPS coordinate galleries are easier to construct, creating an opportunity for more practical fine-grained video geolocalization.
Method: Proposes VidTAG with a dual-encoder framework for frame-to-GPS retrieval using both self-supervised and language-aligned features. Introduces TempGeo module to align frame embeddings for temporal consistency, and GeoRefiner module (encoder-decoder architecture) to refine GPS features using aligned frame embeddings.
Result: Outperforms baselines with 20% improvement at 1 km threshold over GeoCLIP, and beats current State-of-the-Art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Demonstrates ability to generate temporally consistent trajectories on Mapillary (MSLS) and GAMa datasets.
Conclusion: The approach enables fine-grained video geolocalization and lays a strong foundation for future research in this area, addressing limitations of existing methods through a practical GPS-based retrieval framework with temporal consistency mechanisms.
Abstract: The task of video geolocalization aims to determine the precise GPS coordinates of a video’s origin and map its trajectory, with applications in forensics, social media, and exploration. Existing classification-based approaches operate at a coarse city-level granularity and fail to capture fine-grained details, while image retrieval methods are impractical on a global scale due to the need for extensive image galleries which are infeasible to compile. Comparatively, constructing a gallery of GPS coordinates is straightforward and inexpensive. We propose VidTAG, a dual-encoder framework that performs frame-to-GPS retrieval using both self-supervised and language-aligned features. To address temporal inconsistencies in video predictions, we introduce the TempGeo module, which aligns frame embeddings, and the GeoRefiner module, an encoder-decoder architecture that refines GPS features using the aligned frame embeddings. Evaluations on Mapillary (MSLS) and GAMa datasets demonstrate our model’s ability to generate temporally consistent trajectories and outperform baselines, achieving a 20% improvement at the 1 km threshold over GeoCLIP. We also beat current State-of-the-Art by 25% on global coarse-grained video geolocalization (CityGuessr68k). Our approach enables fine-grained video geolocalization and lays a strong foundation for future research. More details on the project webpage: https://parthpk.github.io/vidtag_webpage/
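The frame-to-GPS retrieval step reduces to nearest-neighbor search over a gallery of GPS-coordinate embeddings, followed by temporal smoothing. In this sketch all embeddings are assumed precomputed, and the moving average is a crude stand-in for the TempGeo/GeoRefiner modules:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def frame_to_gps(frame_embs, gps_gallery):
    """gps_gallery: list of ((lat, lon), embedding). Returns one GPS fix
    per frame by cosine-similarity retrieval."""
    return [max(gps_gallery, key=lambda g: cosine(f, g[1]))[0] for f in frame_embs]

def smooth(traj, k=3):
    # Moving average over the last k fixes, for temporal consistency.
    out = []
    for i in range(len(traj)):
        w = traj[max(0, i - k + 1):i + 1]
        out.append(tuple(sum(c) / len(w) for c in zip(*w)))
    return out
```

The appeal of this formulation is that the gallery is just a grid of GPS coordinates pushed through a location encoder, which is cheap to build at global scale, unlike an image gallery.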
[176] Nucleus-Image: Sparse MoE for Image Generation
Chandan Akiti, Ajay Modukuri, Murali Nandan Nagarapu, Gunavardhan Akiti, Haozhe Liu
Main category: cs.CV
TL;DR: Nucleus-Image is a text-to-image generation model using sparse mixture-of-experts diffusion transformers that achieves state-of-the-art performance with only ~2B active parameters per forward pass through efficient architecture and training techniques.
Details
Motivation: To create a high-quality text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency, matching or exceeding leading models while significantly reducing computational costs during inference.
Method: Uses sparse MoE diffusion transformer architecture with Expert-Choice Routing (17B total parameters, 64 experts per layer), streamlined architecture excluding text tokens from transformer backbone, joint attention for text KV sharing, decoupled routing for stability, large-scale training corpus (1.5B pairs), progressive resolution curriculum, and Muon optimizer.
Result: Matches or exceeds leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only ~2B parameters per forward pass, achieving performance of larger models at fraction of inference cost without post-training optimization.
Conclusion: Sparse MoE scaling is highly effective for high-quality image generation, enabling state-of-the-art performance with efficient inference, and the model is released as first fully open-source MoE diffusion model at this quality level.
Abstract: We present Nucleus-Image, a text-to-image generation model that establishes a new Pareto frontier in quality-versus-efficiency by matching or exceeding leading models on GenEval, DPG-Bench, and OneIG-Bench while activating only approximately 2B parameters per forward pass. Nucleus-Image employs a sparse mixture-of-experts (MoE) diffusion transformer architecture with Expert-Choice Routing that scales total model capacity to 17B parameters across 64 routed experts per layer. We adopt a streamlined architecture optimized for inference efficiency by excluding text tokens from the transformer backbone entirely and using joint attention that enables text KV sharing across timesteps. To improve routing stability when using timestep modulation, we introduce a decoupled routing design that separates timestep-aware expert assignment from timestep-conditioned expert computation. We construct a large-scale training corpus of 1.5B high-quality training pairs spanning 700M unique images through multi-stage filtering, deduplication, aesthetic tiering, and caption curation. Training follows a progressive resolution curriculum (256 to 512 to 1024) with multi-aspect-ratio bucketing at every stage, coupled with progressive sparsification of the expert capacity factor. We adopt the Muon optimizer and share our parameter grouping recipe tailored for diffusion models with timestep modulation. Nucleus-Image demonstrates that sparse MoE scaling is a highly effective path to high-quality image generation, reaching the performance of models with significantly larger active parameter budgets at a fraction of the inference cost. These results are achieved without post-training optimization of any kind: no reinforcement learning, no direct preference optimization, and no human preference tuning. We release the training recipe, making Nucleus-Image the first fully open-source MoE diffusion model at this quality.
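Expert-Choice Routing, which Nucleus-Image builds on, inverts the usual token-choice scheme: each expert selects its top-capacity tokens rather than each token picking experts. A minimal sketch, with toy affinity scores and capacity (the real model routes 64 experts per layer with a learned router):

```python
def expert_choice_route(affinity, capacity):
    """Each expert independently keeps its `capacity` highest-affinity tokens.

    affinity: affinity[e][t] = router score of token t for expert e.
    Returns assignments[e] = token indices chosen by expert e.
    A token may be selected by several experts, or by none -- the
    defining property of expert-choice routing, which guarantees
    perfect load balance across experts by construction.
    """
    assignments = []
    for scores in affinity:
        ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
        assignments.append(sorted(ranked[:capacity]))
    return assignments
```

The "progressive sparsification of the expert capacity factor" in the abstract corresponds to shrinking `capacity` over training so fewer tokens are routed per expert.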
[177] Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment
Xinjie Zhang, Qiang Li, Xiaowen Ma, Axi Niu, Li Yan, Qingsen Yan
Main category: cs.CV
TL;DR: DS-IEQA: A unified framework for Image Editing Quality Assessment that jointly learns evaluation criteria and score representations through automated metric prompt optimization and distance-aware score modeling.
Details
Motivation: Existing MLLM-based approaches for Image Editing Quality Assessment rely on human heuristic prompting, leading to rigid metric definitions and distance-agnostic score modeling that fails to align with implicit human criteria and capture continuous score spaces.
Method: Proposes DS-IEQA with two key components: 1) Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback, and 2) Token-Decoupled Distance Regression Loss (TDRL) which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization.
Result: The method achieves superior performance, ranking 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.
Conclusion: DS-IEQA provides an effective unified framework for IEQA that addresses limitations of existing MLLM-based approaches by jointly learning evaluation criteria and score representations, enabling better alignment with human assessment.
Abstract: Recent advances in image editing have heightened the need for reliable Image Editing Quality Assessment (IEQA). Unlike traditional methods, IEQA requires complex reasoning over multimodal inputs and multi-dimensional assessments. Existing MLLM-based approaches often rely on human heuristic prompting, leading to two key limitations: rigid metric prompting and distance-agnostic score modeling. These issues hinder alignment with implicit human criteria and fail to capture the continuous structure of score spaces. To address this, we propose Define-and-Score Image Editing Quality Assessment (DS-IEQA), a unified framework that jointly learns evaluation criteria and score representations. Specifically, we introduce Feedback-Driven Metric Prompt Optimization (FDMPO) to automatically refine metric definitions via probabilistic feedback. Furthermore, we propose Token-Decoupled Distance Regression Loss (TDRL), which decouples numerical tokens from language modeling to explicitly model score continuity through expected distance minimization. Extensive experiments show our method’s superior performance; it ranks 4th in the 2026 NTIRE X-AIGC Quality Assessment Track 2 without any additional training data.
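The expected-distance idea behind TDRL can be sketched in a few lines: instead of cross-entropy on the correct score token, the loss is the expectation of |score - target| under the model's distribution over numeric tokens. The token values and probabilities below are illustrative only:

```python
def expected_distance_loss(token_probs, token_values, target):
    """Expected |score - target| under a distribution over numeric score tokens.

    Unlike cross-entropy, which penalizes all wrong tokens equally, mass
    placed on a score of 4 is penalized less than mass on a score of 1
    when the target is 5, so the continuous structure of the score space
    shows up in the loss.
    """
    return sum(p * abs(v - target) for p, v in zip(token_probs, token_values))
```

This is why the abstract describes decoupling numerical tokens from language modeling: the distance term only makes sense once score tokens carry numeric values.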
[178] Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
Wentai Zhang, Ronghui Xi, Shiyao Peng, Jiayu Huang, Haoran Luo, Zichen Tang, Haihong E
Main category: cs.CV
TL;DR: PASA is a training-free framework that uses precision-allocated sparse attention to accelerate video diffusion transformers while eliminating temporal flickering through curvature-aware dynamic budgeting and stochastic routing.
Details
Motivation: Video Diffusion Transformers achieve high-fidelity video generation but suffer from massive computational burden due to self-attention. Existing sparse attention methods cause severe visual flickering from static sparsity patterns and deterministic block routing.
Method: 1) Curvature-aware dynamic budgeting that profiles generation trajectory acceleration to allocate computation budget during critical semantic transitions. 2) Hardware-aligned grouped approximations instead of global homogenizing estimations. 3) Stochastic selection bias in attention routing to soften rigid boundaries and eliminate selection oscillation.
Result: PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences in evaluations on leading video diffusion models.
Conclusion: PASA provides an effective training-free framework for efficient and temporally smooth video generation by addressing the limitations of existing sparse attention methods through dynamic budgeting and stochastic routing.
Abstract: Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.
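The curvature-aware budgeting idea can be sketched with finite differences: take the second difference of a per-timestep trajectory summary as an acceleration proxy, and allocate the exact-computation budget in proportion. The scalar trajectory and the proportional rule are simplifications of the paper's mechanism:

```python
def curvature_budget(trajectory, total_budget):
    """Allocate budget to interior timesteps in proportion to |second difference|.

    trajectory: per-timestep scalar summaries of the generation trajectory.
    Timesteps where the trajectory bends sharply (candidate semantic
    transitions) receive more of the exact-computation budget.
    """
    curv = [abs(trajectory[i - 1] - 2 * trajectory[i] + trajectory[i + 1])
            for i in range(1, len(trajectory) - 1)]
    total = sum(curv)
    if total == 0:                       # flat trajectory: spread budget evenly
        return [total_budget / len(curv)] * len(curv)
    return [total_budget * c / total for c in curv]
```

The key contrast with static sparsity is that the budget here is elastic per run, concentrating precision where the trajectory accelerates.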
[179] BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
Qingyuan Cai, Saihui Hou, Xuecai Hu, Yongzhen Huang
Main category: cs.CV
TL;DR: BarbieGait: A synthetic gait dataset with virtual clothing changes for cross-clothing gait recognition research, plus GaitCLIF model for cloth-invariant feature learning.
Details
Motivation: Real-world gait recognition faces challenges from diverse clothing styles that obscure gait patterns. Existing datasets lack controlled clothing variations needed to study cross-clothing issues systematically.
Method: 1) Created BarbieGait synthetic dataset by mapping real subjects to virtual engine for controlled clothing changes while preserving gait identity. 2) Proposed GaitCLIF (Gait-oriented CLoth-Invariant Feature) model to learn cloth-invariant features despite increased intra-class variance from clothing diversity.
Result: Method significantly improves cross-clothing performance on BarbieGait and existing gait benchmarks. BarbieGait enables validation of cross-clothing issues difficult to verify with real-world data.
Conclusion: BarbieGait advances cross-clothing gait recognition capabilities and promotes related research through extensive synthetic cross-clothing data. GaitCLIF provides robust baseline for cloth-invariant feature learning.
Abstract: Gait recognition, as a reliable biometric technology, has seen rapid development in recent years, yet it faces significant challenges caused by diverse clothing styles in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, the diversity of clothing increases intra-class variance, making learning cloth-invariant features under varying clothing conditions one of the biggest challenges. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and the existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research.
[180] ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models
Xinliang Wang, Yifeng Shi, Zhenyu Wu
Main category: cs.CV
TL;DR: ArtifactWorld: A framework for repairing 3D Gaussian Splatting artifacts using systematic data expansion and a dual-model video diffusion approach with artifact heatmap guidance.
Details
Motivation: 3D Gaussian Splatting (3DGS) suffers from geometric and photometric degradations under sparse-view constraints, and current generative restoration approaches have limitations including insufficient temporal coherence, lack of explicit spatial constraints, and limited training data, leading to multi-view inconsistencies and erroneous geometric hallucinations.
Method: 1) Creates a fine-grained taxonomy of 3DGS artifacts and builds a large training set of 107.5K paired video clips; 2) Uses a homogeneous dual-model paradigm with video diffusion backbone; 3) Employs an isomorphic predictor to localize structural defects via artifact heatmap; 4) Guides restoration through Artifact-Aware Triplet Fusion mechanism for intensity-guided spatio-temporal repair.
Result: Extensive experiments show ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction.
Conclusion: ArtifactWorld effectively resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm, demonstrating superior performance in 3D reconstruction and novel view synthesis tasks.
Abstract: 3D Gaussian Splatting (3DGS) delivers high-fidelity real-time rendering but suffers from geometric and photometric degradations under sparse-view constraints. Current generative restoration approaches are often limited by insufficient temporal coherence, a lack of explicit spatial constraints, and a lack of large-scale training data, resulting in multi-view inconsistencies, erroneous geometric hallucinations, and limited generalization to diverse real-world artifact distributions. In this paper, we present ArtifactWorld, a framework that resolves 3DGS artifact repair through systematic data expansion and a homogeneous dual-model paradigm. To address the data bottleneck, we establish a fine-grained phenomenological taxonomy of 3DGS artifacts and construct a comprehensive training set of 107.5K diverse paired video clips to enhance model robustness. Architecturally, we unify the restoration process within a video diffusion backbone, utilizing an isomorphic predictor to localize structural defects via an artifact heatmap. This heatmap then guides the restoration through an Artifact-Aware Triplet Fusion mechanism, enabling precise, intensity-guided spatio-temporal repair within native self-attention. Extensive experiments demonstrate that ArtifactWorld achieves state-of-the-art performance in sparse novel view synthesis and robust 3D reconstruction. Code and dataset will be made public.
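At its simplest, heatmap-guided restoration is a per-pixel blend: where the predicted artifact heatmap is high, trust the restoration model; where it is low, keep the original render. This linear blend is a deliberate simplification (the paper's Artifact-Aware Triplet Fusion operates inside native self-attention), but it conveys the intensity-guided idea:

```python
def heatmap_blend(rendered, restored, heatmap):
    """Blend a rendered frame with its restoration, weighted by artifact intensity.

    All three inputs are flat lists of pixel values; heatmap entries in
    [0, 1] give the predicted artifact intensity per pixel. h = 0 keeps
    the original render, h = 1 takes the restored value.
    """
    return [h * g + (1.0 - h) * r
            for r, g, h in zip(rendered, restored, heatmap)]
```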
[181] ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception
Huanzhen Wang, Ziheng Zhou, Jiaqi Song, Li He, Yunshi Lan, Yan Wang, Wenqiang Zhang
Main category: cs.CV
TL;DR: ARGen is a two-stage framework for generating dynamic facial expressions to address data scarcity in emotion recognition, using affective semantic injection and adaptive reinforcement diffusion for robust emotion perception.
Details
Motivation: Dynamic facial expression recognition in the wild faces challenges due to data scarcity and long-tail distributions, which prevent models from effectively learning temporal dynamics of scarce emotions.
Method: Two-stage approach: 1) Affective Semantic Injection (ASI) establishes affective knowledge alignment using facial Action Units and retrieval-augmented prompt generation via large-scale visual-language models; 2) Adaptive Reinforcement Diffusion (ARD) integrates text-conditioned image-to-video diffusion with reinforcement learning, using inter-frame conditional guidance and multi-objective reward functions.
Result: Extensive experiments show ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm.
Conclusion: ARGen provides an effective framework for dynamic facial expression generation that addresses data scarcity issues in emotion recognition through interpretable generative augmentation.
Abstract: Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.
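The multi-objective reward in the ARD stage can be sketched as a weighted combination of the three objectives named in the abstract. The weights and the assumption that each score lies in [0, 1] are hypothetical, not taken from the paper:

```python
def affect_reward(naturalness, integrity, efficiency,
                  weights=(0.5, 0.3, 0.2)):
    """Combine per-objective scores (each assumed in [0, 1]) into one RL reward.

    naturalness: expression naturalness; integrity: facial integrity;
    efficiency: generative efficiency. The weights are illustrative only.
    """
    w_n, w_i, w_e = weights
    return w_n * naturalness + w_i * integrity + w_e * efficiency
```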
[182] Style-Decoupled Adaptive Routing Network for Underwater Image Enhancement
Hang Xu, Chen Long, Bing Wang, Hao Chen, Zhen Dong
Main category: cs.CV
TL;DR: SDAR-Net is an adaptive underwater image enhancement framework that decouples degradation styles from scene structure and uses adaptive routing for personalized restoration, achieving SOTA performance on real-world benchmarks.
Details
Motivation: Existing underwater image enhancement methods use uniform mapping based on average dataset distributions, causing over-processing of mildly degraded images or insufficient recovery for severely degraded ones. There's a need for adaptive enhancement that handles varying degradation levels.
Method: SDAR-Net decouples image features into dynamic degradation style embeddings and static scene structural representations. It introduces an adaptive routing mechanism that evaluates style features and predicts soft weights at different enhancement states to guide weighted fusion of representations for personalized restoration.
Result: SDAR-Net achieves state-of-the-art performance with PSNR of 25.72 dB on real-world benchmarks and demonstrates utility in downstream vision tasks.
Conclusion: The proposed adaptive enhancement framework effectively addresses the limitations of uniform mapping approaches by providing personalized restoration based on specific degradation styles, leading to superior performance on underwater image enhancement.
Abstract: Underwater Image Enhancement (UIE) is essential for robust visual perception in marine applications. However, existing methods predominantly rely on uniform mapping tailored to average dataset distributions, leading to over-processing mildly degraded images or insufficient recovery for severe ones. To address this challenge, we propose a novel adaptive enhancement framework, SDAR-Net. Unlike existing uniform paradigms, it first decouples specific degradation styles from the input and subsequently modulates the enhancement process adaptively. Specifically, since underwater degradation primarily shifts the appearance while keeping the scene structure intact, SDAR-Net formulates image features into dynamic degradation style embeddings and static scene structural representations through a carefully designed training framework. Subsequently, we introduce an adaptive routing mechanism. By evaluating style features and adaptively predicting soft weights at different enhancement states, it guides the weighted fusion of the corresponding image representations, accurately satisfying the adaptive restoration demands of each image. Extensive experiments show that SDAR-Net achieves a new state-of-the-art (SOTA) performance with a PSNR of 25.72 dB on a real-world benchmark, and demonstrates its utility in downstream vision tasks. Our code is available at https://github.com/WHU-USI3DV/SDAR-Net.
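The adaptive routing mechanism can be sketched as a softmax over style-derived logits producing soft weights that fuse the outputs of several enhancement states. The two-state setup and the logit values below are toy assumptions:

```python
import math

def soft_route(style_logits, state_outputs):
    """Fuse enhancement-state outputs with softmax weights from style features.

    style_logits: one logit per enhancement state, derived from the
    degradation style embedding. state_outputs: one feature vector per
    state. A mildly degraded input should put most of its weight on a
    gentle enhancement state, a severe one on an aggressive state.
    """
    m = max(style_logits)                       # subtract max for stability
    exps = [math.exp(l - m) for l in style_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(state_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, state_outputs))
            for d in range(dim)]
```

Because the weights are soft rather than a hard argmax, each image receives a restoration interpolated to its specific degradation style, which is the paper's answer to one-size-fits-all mapping.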
[183] DreamStereo: Towards Real-Time Stereo Inpainting for HD Videos
Yuan Huang, Sijie Zhao, Jing Cheng, Hao Xu, Shaohui Jiao
Main category: cs.CV
TL;DR: A novel stereo video inpainting framework with three components: Gradient-Aware Parallax Warping for continuous edges, Parallax-Based Dual Projection for generating training data without stereo videos, and Sparsity-Aware Stereo Inpainting for 10.7x speedup via token reduction.
Details
Motivation: Stereo video inpainting faces challenges due to scattered occlusion regions along object boundaries and lack of high-quality datasets, while existing methods waste computation on unchanged pixels.
Method: Three interconnected components: 1) Gradient-Aware Parallax Warping (GAPW) for smooth occlusion regions, 2) Parallax-Based Dual Projection (PBDP) to generate stereo inpainting pairs without stereo videos, 3) Sparsity-Aware Stereo Inpainting (SASI) that reduces redundant tokens for 10.7x speedup.
Result: Achieves real-time processing of HD videos (768x1280) at 25 FPS on single A100 GPU with over 70% token reduction and comparable quality to full-computation methods.
Conclusion: The proposed framework addresses stereo video inpainting challenges by combining geometric consistency, data generation without stereo videos, and efficient computation through sparsity awareness.
Abstract: Stereo video inpainting, which aims to fill the occluded regions of warped videos with visually coherent content while maintaining temporal consistency, remains a challenging open problem. The regions to be filled are scattered along object boundaries and occupy only a small fraction of each frame, leading to two key challenges. First, existing approaches perform poorly on such tasks due to the scarcity of high-quality stereo inpainting datasets, which limits their ability to learn effective inpainting priors. Second, these methods apply equal processing to all regions of the frame, even though most pixels require no modification, resulting in substantial redundant computation. To address these issues, we introduce three interconnected components. We first propose Gradient-Aware Parallax Warping (GAPW), which leverages backward warping and the gradient of the coordinate mapping function to obtain continuous edges and smooth occlusion regions. Then, a Parallax-Based Dual Projection (PBDP) strategy is introduced, which incorporates GAPW to produce geometrically consistent stereo inpainting pairs and accurate occlusion masks without requiring stereo videos. Finally, we present Sparsity-Aware Stereo Inpainting (SASI), which reduces over 70% of redundant tokens, achieving a 10.7x speedup during diffusion inference and delivering results comparable to its full-computation counterpart, enabling real-time processing of HD (768 x 1280) videos at 25 FPS on a single A100 GPU.
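The token-reduction idea behind SASI can be sketched as gathering only the tokens that fall inside the occlusion mask, running the expensive model on that small subset, and scattering results back into the frame. The flat token list and binary mask are simplifications of the real latent layout:

```python
def gather_masked_tokens(tokens, mask):
    """Keep only tokens flagged by the occlusion mask, remembering positions."""
    return [(i, t) for i, (t, m) in enumerate(zip(tokens, mask)) if m]

def scatter_back(frame, processed):
    """Write processed tokens back into a copy of the original frame tokens."""
    out = list(frame)
    for i, t in processed:
        out[i] = t
    return out
```

When the mask covers well under a third of each frame, skipping the unmasked tokens mirrors the paper's reported removal of over 70% of redundant tokens.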
[184] MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer
Dongkyung Kang, Jaeyeon Hwang, Junseo Park, Minji Kang, Yeryeong Lee, Beomseok Ko, Hanyoung Roh, Jeongmin Shin, Hyeryung Jang
Main category: cs.CV
TL;DR: MAST is a training-free diffusion-based framework for multi-style image transfer that uses mask-guided attention allocation to prevent boundary artifacts and maintain structural consistency when applying multiple styles.
Details
Motivation: Current diffusion-based style transfer models work well with single styles but struggle with multi-style scenarios, leading to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations.
Method: Proposes MAST with four modules: 1) Layout-preserving Query Anchoring to maintain semantic structure, 2) Logit-level Attention Mass Allocation to distribute attention across spatial regions, 3) Sharpness-aware Temperature Scaling to restore attention sharpness, and 4) Discrepancy-aware Detail Injection to compensate for high-frequency detail losses.
Result: Extensive experiments show MAST effectively mitigates boundary artifacts and maintains structural consistency while preserving texture fidelity and spatial coherence, even as the number of applied styles increases.
Conclusion: MAST provides an effective training-free solution for multi-style transfer that overcomes limitations of existing diffusion-based methods by explicitly controlling content-style interactions through attention mechanism modifications.
Abstract: Style transfer aims to render a content image with the visual characteristics of a reference style while preserving its underlying semantic layout and structural geometry. While recent diffusion-based models demonstrate strong stylization capabilities by leveraging powerful generative priors and controllable internal representations, they typically assume a single global style. Extending them to multi-style scenarios often leads to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations. To overcome these limitations, we propose MAST (Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer), a novel training-free framework that explicitly controls content-style interactions within the diffusion attention mechanism. To achieve artifact-free and structure-preserving stylization, MAST integrates four connected modules. First, Layout-preserving Query Anchoring prevents global layout collapse by firmly anchoring the semantic structure using content queries. Second, Logit-level Attention Mass Allocation deterministically distributes attention probability mass across spatial regions, seamlessly fusing multiple styles without boundary artifacts. Third, Sharpness-aware Temperature Scaling restores the attention sharpness degraded by multi-style expansion. Finally, Discrepancy-aware Detail Injection adaptively compensates for localized high-frequency detail losses by measuring structural discrepancies. Extensive experiments demonstrate that MAST effectively mitigates boundary artifacts and maintains structural consistency, preserving texture fidelity and spatial coherence even as the number of applied styles increases.
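Logit-level attention mass allocation can be sketched as masking each query's attention logits so that only keys belonging to the style assigned to that query's region survive the softmax. The region and style indices below are toy labels; the real module operates on diffusion attention maps and also rescales sharpness:

```python
import math

def allocate_attention(logits, query_region, key_style):
    """Per-query softmax over keys, restricted to the query's assigned style.

    logits[q][k]: raw attention logit. query_region[q] and key_style[k]
    are style indices; mismatched keys receive zero probability mass, so
    styles cannot bleed across region boundaries. Each query is assumed
    to have at least one key of its own style.
    """
    rows = []
    for q, row in enumerate(logits):
        masked = [l if key_style[k] == query_region[q] else float('-inf')
                  for k, l in enumerate(row)]
        m = max(masked)
        exps = [math.exp(l - m) for l in masked]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows
```

Deterministically zeroing cross-style mass at the logit level, rather than blending outputs afterwards, is what lets multiple styles coexist without boundary artifacts.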
[185] LiveMoments: Reselected Key Photo Restoration in Live Photos via Reference-guided Diffusion
Clara Xue, Zizheng Yan, Zhenning Shi, Yuhang Yu, Jingyu Zhuang, Qi Zhang, Jinwei Chen, Qingnan Fan
Main category: cs.CV
TL;DR: LiveMoments is a reference-guided image restoration framework that enhances reselected key photos in Live Photos by leveraging the original high-quality key photo as reference guidance.
Details
Motivation: When users reselect alternative frames as key photos in Live Photos, these frames suffer from quality degradation due to the video pipeline's lower quality compared to the photo capture ISP pipeline. There's a need for dedicated restoration techniques to bridge this quality gap.
Method: A two-branch neural network with a reference branch extracting structural/textural information from the original high-quality key photo, and a main branch restoring the reselected frame using reference guidance. Includes a unified Motion Alignment module for spatial alignment at both latent and image levels.
Result: Experiments on real and synthetic Live Photos show significant improvements in perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures.
Conclusion: LiveMoments effectively addresses the quality gap in reselected Live Photo frames through reference-guided restoration with motion alignment, providing a practical solution for enhancing user-selected alternative frames.
Abstract: Live Photo captures both a high-quality key photo and a short video clip to preserve the precious dynamics around the captured moment. While users may choose alternative frames as the key photo to capture better expressions or timing, these frames often exhibit noticeable quality degradation, as the photo capture ISP pipeline delivers significantly higher image quality than the video pipeline. This quality gap highlights the need for dedicated restoration techniques to enhance the reselected key photo. To this end, we propose LiveMoments, a reference-guided image restoration framework tailored for the reselected key photo in Live Photos. Our method employs a two-branch neural network: a reference branch that extracts structural and textural information from the original high-quality key photo, and a main branch that restores the reselected frame using the guidance provided by the reference branch. Furthermore, we introduce a unified Motion Alignment module that incorporates motion guidance for spatial alignment at both the latent and image levels. Experiments on real and synthetic Live Photos demonstrate that LiveMoments significantly improves perceptual quality and fidelity over existing solutions, especially in scenes with fast motion or complex structures. Our code is available at https://github.com/OpenVeraTeam/LiveMoments.
[186] Boosting Robust AIGI Detection with LoRA-based Pairwise Training
Ruiyang Xia, Qi Zhang, Yaowen Xu, Zhaofan Zou, Hao Sun, Zhongjiang He, Xuelong Li
Main category: cs.CV
TL;DR: A novel LoRA-based Pairwise Training strategy for robust AI-generated image detection under severe distortions, achieving 3rd place in NTIRE challenge
Details
Motivation: Current AI-generated image detectors work well on clean datasets but fail under real-world distortions; need robust detection methods for practical deployment.
Method: LoRA-based Pairwise Training with targeted finetuning of visual foundation model, distortion/size simulations to match validation/test distributions, and pairwise training to decouple generalization and robustness optimization
Result: Achieved 3rd place in NTIRE Robust AI-Generated Image Detection in the Wild challenge, demonstrating superior robustness to distortions
Conclusion: The proposed LPT strategy effectively addresses the vulnerability of AIGI detectors to real-world distortions through targeted finetuning and distribution simulation
Abstract: The proliferation of highly realistic AI-Generated Image (AIGI) has necessitated the development of practical detection methods. While current AIGI detectors perform admirably on clean datasets, their detection performance frequently decreases when deployed “in the wild”, where images are subjected to unpredictable, complex distortions. To resolve this critical vulnerability, we propose a novel LoRA-based Pairwise Training (LPT) strategy designed specifically to achieve robust detection for AIGI under severe distortions. The core of our strategy involves the targeted finetuning of a visual foundation model, the deliberate simulation of data distribution during the training phase, and a unique pairwise training process. Specifically, we introduce distortion and size simulations to better fit the distribution from the validation and test sets. Based on the strong visual representation capability of the visual foundation model, we finetune the model to achieve AIGI detection. The pairwise training is utilized to improve detection by decoupling the generalization and robustness optimization. Experiments show that our approach secured 3rd place in the NTIRE Robust AI-Generated Image Detection in the Wild challenge.
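The pairwise training idea can be sketched as batching each image together with a distorted copy under the same real/fake label, so generalization (clean view) and robustness (distorted view) are optimized jointly. The distortion callable here is a placeholder for the paper's distortion and size simulations:

```python
def pairwise_batch(images, labels, distort):
    """Interleave each image with a distorted copy sharing its label.

    images: list of image-like objects; labels: real/fake labels;
    distort: any callable simulating the wild-distortion distribution
    (compression, blur, resizing, ...). The resulting batch pairs each
    clean sample with its degraded counterpart.
    """
    batch, batch_labels = [], []
    for img, y in zip(images, labels):
        batch.append(img)
        batch_labels.append(y)
        batch.append(distort(img))
        batch_labels.append(y)
    return batch, batch_labels
```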
[187] Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
Rong Wang, Ruyi Zha, Ziang Cheng, Jiayu Yang, Pulak Purkait, Hongdong Li
Main category: cs.CV
TL;DR: A method for generating realistic orbital videos from single images using 3D shape priors from foundational generative models to ensure geometric consistency.
Details
Motivation: Existing video generation methods rely on pixel-wise attention which fails for long-range extrapolation like rear-view synthesis, lacking geometric consistency and plausible structure.
Method: Uses 3D foundational generative model’s latent features at two scales: denoised global latent vector for structural guidance and latent images from volumetric features for view-dependent geometry. Multi-scale 3D adapter injects features via cross-attention into base video model.
Result: Superior visual quality, shape realism, and multi-view consistency compared to SOTA methods, robust generalization to complex camera trajectories and in-the-wild images.
Conclusion: 3D shape priors from foundational models effectively address geometric consistency in orbital video generation, enabling realistic long-range extrapolation from single images.
Abstract: We present a novel method for generating geometrically realistic and consistent orbital videos from a single image of an object. Existing video generation works mostly rely on pixel-wise attention to enforce view consistency across frames. However, such a mechanism does not impose sufficient constraints for long-range extrapolation, e.g. rear-view synthesis, in which pixel correspondences to the input image are limited. Consequently, these works often fail to produce results with a plausible and coherent structure. To tackle this issue, we propose to leverage rich shape priors from a 3D foundational generative model as an auxiliary constraint, motivated by its capability of modeling realistic object shape distributions learned from large 3D asset corpora. Specifically, we prompt the video generation with two scales of latent features encoded by the 3D foundation model: (i) a denoised global latent vector as an overall structural guidance, and (ii) a set of latent images projected from volumetric features to provide view-dependent and fine-grained geometry details. In contrast to commonly used 2.5D representations such as depth or normal maps, these compact features can model complete object shapes, and help to improve inference efficiency by avoiding explicit mesh extraction. To achieve effective shape conditioning, we introduce a multi-scale 3D adapter to inject feature tokens into the base video model via cross-attention, which retains its capabilities from general video pretraining and enables a simple and model-agnostic fine-tuning process. Extensive experiments on multiple benchmarks show that our method achieves superior visual quality, shape realism and multi-view consistency compared to state-of-the-art methods, and robustly generalizes to complex camera trajectories and in-the-wild images.
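A minimal numpy sketch of residual cross-attention injection in the spirit of the multi-scale 3D adapter (single head, no learned projections; shapes and the residual scale are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def cross_attention(queries, context):
    # queries: video tokens (n, d); context: 3D shape-prior tokens (m, d)
    d = queries.shape[1]
    scores = queries @ context.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)   # row-wise softmax over prior tokens
    return w @ context                  # (n, d): attended shape features

def adapter_inject(video_tokens, prior_tokens, scale=0.1):
    # residual injection: the pretrained video pathway is left intact and
    # only nudged by the attended 3D features
    return video_tokens + scale * cross_attention(video_tokens, prior_tokens)
```

With `scale=0.0` the base model is recovered exactly, which is what makes this kind of adapter safe to fine-tune on top of a general video prior.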
[188] Cell Instance Segmentation via Multi-Task Image-to-Image Schrödinger Bridge
Hayato Inoue, Shota Harada, Shumpei Takezaki, Ryoma Bise
Main category: cs.CV
TL;DR: Proposes a Schrödinger Bridge-based image-to-image generation framework for cell instance segmentation, integrating boundary-aware supervision and achieving competitive performance without SAM pre-training or post-processing.
Details
Motivation: Existing cell instance segmentation methods rely on deterministic predictions with post-processing, lacking explicit constraints on global structure of instance masks. The authors aim to develop a more principled approach using distribution-based generation.
Method: Multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as distribution-based image-to-image generation. Uses boundary-aware supervision through reverse distance maps and deterministic inference for stable predictions.
Result: Achieves competitive or superior performance on PanNuke dataset without SAM pre-training or additional post-processing. Shows robustness on MoNuSeg dataset with limited training data.
Conclusion: Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation, offering a principled alternative to traditional deterministic approaches.
Abstract: Existing cell instance segmentation pipelines typically combine deterministic predictions with post-processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi-task image-to-image Schrödinger Bridge framework that formulates instance segmentation as a distribution-based image-to-image generation problem. Boundary-aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre-training or additional post-processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge-based image-to-image generation provides an effective framework for cell instance segmentation.
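The reverse distance map behind the boundary-aware supervision can be approximated as below. This brute-force version is only a hedged reading of the idea (distance of each foreground pixel to the nearest background pixel, inverted so boundaries score high), not the paper's exact construction, and it assumes the mask contains at least one background pixel:

```python
def reverse_distance_map(mask):
    # mask: 2D list of 0/1 (1 = cell). For each foreground pixel, compute
    # the Euclidean distance to the nearest background pixel, then invert
    # so boundary pixels carry the largest supervision signal.
    h, w = len(mask), len(mask[0])
    bg = [(i, j) for i in range(h) for j in range(w) if mask[i][j] == 0]
    dist = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                dist[i][j] = min(((i - a) ** 2 + (j - b) ** 2) ** 0.5
                                 for a, b in bg)
    dmax = max(max(row) for row in dist) or 1.0
    # "reverse": small distance to background (a boundary) maps near dmax
    return [[(dmax - d) if mask[i][j] else 0.0
             for j, d in enumerate(row)] for i, row in enumerate(dist)]
```

In practice a linear-time distance transform would replace the brute-force minimum; the point is only that boundary pixels receive the strongest target values.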
[189] RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation
Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu, Wenjing Jia, Guo-Jun Qi
Main category: cs.CV
TL;DR: RSGMamba: A reliability-aware multimodal fusion framework using self-gated state space models for semantic segmentation, addressing modality reliability issues in RGB-D and RGB-T data.
Details
Motivation: Existing multimodal fusion methods assume all modalities are equally reliable, leading to degraded performance when auxiliary modalities (depth, thermal) are noisy, misaligned, or incomplete. Need to explicitly model modality reliability.
Method: Proposes Reliability-aware Self-Gated State Space Model (RSGMamba) with RSGMB blocks that explicitly model modality reliability and dynamically regulate cross-modal interactions via self-gating. Includes Local Cross-Gated Modulation (LCGM) for fine-grained spatial detail refinement.
Result: Achieves SOTA on RGB-D (NYUDepth V2: 58.8% mIoU, SUN-RGBD: 54.0% mIoU) and RGB-T (MFNet: 61.1% mIoU, PST900: 88.9% mIoU) benchmarks with only 48.6M parameters, outperforming prior methods by up to +1.6%.
Conclusion: Explicit modeling of modality reliability through self-gated state space models significantly improves multimodal semantic segmentation performance, validating the effectiveness of reliability-aware fusion strategies.
Abstract: Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state-of-the-art performance on both RGB-D and RGB-T semantic segmentation benchmarks, achieving 58.8% / 54.0% mIoU on NYUDepth V2 and SUN-RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.
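A toy version of the self-gating idea, with the reliability estimate reduced to a single scalar logit. The real RSGMB operates on feature maps inside a state space model; everything here is an illustrative assumption:

```python
import math

def reliability_gated_fusion(rgb_feat, aux_feat, reliability_logit):
    # self-gate in (0, 1) estimated from the auxiliary stream itself; a
    # noisy depth/thermal channel is attenuated before cross-modal fusion
    g = 1.0 / (1.0 + math.exp(-reliability_logit))
    return [r + g * a for r, a in zip(rgb_feat, aux_feat)]
```

When the gate saturates low, fusion degrades gracefully to the RGB-only features instead of mixing in corrupted auxiliary signals.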
[190] Self-Adversarial One Step Generation via Condition Shifting
Deyuan Liu, Peng Sun, Yansen Han, Zhenglin Cheng, Chuyan Chen, Tao Lin
Main category: cs.CV
TL;DR: APEX: A discriminator-free framework for efficient one-step text-to-image synthesis using endogenous adversarial correction from flow models via condition shifting.
Details
Motivation: Existing one-step text-to-image methods face a three-way tradeoff among fidelity, inference speed, and training efficiency. Discriminator-based approaches cause training instability and high overhead, while regression-based methods lose fine details.
Method: APEX extracts adversarial correction signals endogenously from flow models through condition shifting. A transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the generation distribution, providing GAN-aligned gradients without external discriminators.
Result: APEX’s 0.6B model surpasses FLUX-Schnell 12B in one-step quality. With LoRA tuning on Qwen-Image 20B, it achieves GenEval score of 0.89 at NFE=1 in 6 hours, beating the original 50-step teacher (0.87) with 15.33× inference speedup.
Conclusion: APEX provides a plug-and-play discriminator-free framework for efficient one-step text-to-image synthesis that maintains high fidelity while being architecture-preserving and compatible with parameter-efficient tuning methods like LoRA.
Abstract: The push for efficient text-to-image synthesis has moved the field toward one-step sampling, yet existing methods still face a three-way tradeoff among fidelity, inference speed, and training efficiency. Approaches that rely on external discriminators can sharpen one-step performance, but they often introduce training instability, high GPU memory overhead, and slow convergence, which complicates scaling and parameter-efficient tuning. In contrast, regression-based distillation and consistency objectives are easier to optimize, but they typically lose fine details when constrained to a single step. We present APEX, built on a key theoretical insight: adversarial correction signals can be extracted endogenously from a flow model through condition shifting. A transformation creates a shifted condition branch whose velocity field serves as an independent estimator of the model’s current generation distribution, yielding a gradient that is provably GAN-aligned, replacing the sample-dependent discriminator terms that cause gradient vanishing. This discriminator-free design is architecture-preserving, making APEX a plug-and-play framework compatible with both full-parameter and LoRA-based tuning. Empirically, our 0.6B model surpasses FLUX-Schnell 12B (20$\times$ more parameters) in one-step quality. With LoRA tuning on Qwen-Image 20B, APEX reaches a GenEval score of 0.89 at NFE=1 in 6 hours, surpassing the original 50-step teacher (0.87) and providing a 15.33$\times$ inference speedup. Code is available at https://github.com/LINs-lab/APEX.
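The condition-shifting intuition can be sketched abstractly: the same velocity network evaluated under a shifted condition acts as an estimator of the model's own generation distribution, and the difference supplies the correction signal. All function names here are hypothetical placeholders, not APEX's actual code:

```python
def apex_style_correction(v_theta, x, t, cond, shift):
    # v_theta(x, t, c): velocity field of a conditional flow model.
    # The shifted-condition branch plays the role of an internal
    # "discriminator" estimating the current generation distribution;
    # the difference of the two velocities is the correction signal.
    v_target = v_theta(x, t, cond)
    v_model = v_theta(x, t, shift(cond))
    return [a - b for a, b in zip(v_target, v_model)]
```

No second network is introduced, which is what removes the discriminator's memory and stability costs.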
[191] HyperLiDAR: Adaptive Post-Deployment LiDAR Segmentation via Hyperdimensional Computing
Ivannia Gomez Moreno, Yi Yao, Ye Tian, Xiaofan Yu, Flavio Ponzina, Michael Sullivan, Jingyi Zhang, Mingyu Yang, Hun Seok Kim, Tajana Rosing
Main category: cs.CV
TL;DR: HyperLiDAR: A lightweight LiDAR semantic segmentation framework using Hyperdimensional Computing for efficient on-device post-deployment adaptation in edge applications like autonomous driving.
Details
Motivation: Real-world LiDAR segmentation systems face performance degradation when environments shift, but edge devices have strict computational constraints that prevent conventional large neural networks from adapting on-device. There's a need for lightweight, efficient adaptation methods for post-deployment scenarios.
Method: HyperLiDAR uses Hyperdimensional Computing (HDC) for fast learning and high efficiency, inspired by human brain processing. It includes a buffer selection strategy to focus learning on the most informative points, addressing the high data volume bottleneck in LiDAR scans.
Result: HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods while achieving up to 13.8x speedup in retraining on two LiDAR segmentation benchmarks and two representative devices.
Conclusion: HyperLiDAR demonstrates that HDC-based approaches can provide efficient, lightweight solutions for on-device LiDAR segmentation adaptation, making real-world deployment more feasible under edge computing constraints.
Abstract: LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real-world deployments, particularly for on-device post-deployment adaptation. Real-world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on-device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post-deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state-of-the-art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state-of-the-art segmentation methods, while achieving up to a 13.8x speedup in retraining.
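A self-contained toy of the HDC recipe the paper builds on: classes are bipolar hypervector prototypes formed by bundling (elementwise majority vote), and classification is a dot-product similarity. Dimensions, noise levels, and class names are all illustrative:

```python
import random

random.seed(0)
D = 2048  # hypervector dimensionality

def rand_hv():
    return [random.choice((-1, 1)) for _ in range(D)]

def bundle(hvs):
    # elementwise majority vote: "training" in HDC is just addition,
    # which is why post-deployment adaptation can be so cheap
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / D

def noisy(hv, flips):
    # simulate encoded observations of a class by flipping a few bits
    out = hv[:]
    for i in random.sample(range(D), flips):
        out[i] = -out[i]
    return out

# one prototype per semantic class, bundled from a few noisy observations
base = {c: rand_hv() for c in ("road", "car", "vegetation")}
prototypes = {c: bundle([noisy(base[c], 200) for _ in range(5)]) for c in base}

def classify(query_hv):
    return max(prototypes, key=lambda c: similarity(query_hv, prototypes[c]))
```

Adapting to a shifted environment amounts to re-bundling prototypes with new encoded points, with no gradient descent involved.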
[192] All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
Tanzila Rahman, Renjie Liao, Leonid Sigal
Main category: cs.CV
TL;DR: A unified synthetic data generation pipeline for multimodal video understanding that automatically produces unlimited video data with diverse supervision across multiple tasks, combined with VQA-based fine-tuning to enhance reasoning abilities.
Details
Motivation: Collecting and annotating real-world multimodal video data is costly, slow, and limited in diversity, creating a bottleneck for training multimodal large language models for video understanding tasks.
Method: Proposes a unified synthetic data generation pipeline that automatically produces unlimited multimodal video data with rich supervision across multiple tasks (object counting, QA, segmentation). Also introduces VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than using simple captions.
Result: Models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts across three challenging tasks: video object counting, video-based visual question answering, and video object segmentation.
Conclusion: Unified synthetic data pipelines offer a scalable alternative to expensive real-world annotation for multimodal video understanding, demonstrating strong generalization capabilities to real-world data.
Abstract: Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA-based fine-tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach on three challenging tasks: video object counting, video-based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real-world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real-world annotation for multimodal video understanding.
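One task format of such a pipeline is easy to sketch: because the synthetic renderer knows exactly what it placed, counting QA pairs come for free. The function name and question phrasing are hypothetical:

```python
def make_count_qa(scene_objects):
    # scene_objects: labels of objects placed by the synthetic renderer.
    # Ground-truth answers are known by construction -- no annotation cost.
    qa = []
    for label in sorted(set(scene_objects)):
        qa.append((f"How many {label}s appear in the video?",
                   scene_objects.count(label)))
    return qa
```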
[193] Bridging the Micro–Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization
Xiaojie Liang, Zhimin Chen, Ziqi Sheng, Wei Lu
Main category: cs.CV
TL;DR: FASA is a unified framework for image manipulation localization that bridges the gap between low-level forensic cues and high-level semantics to detect both traditional manipulations and diffusion-generated edits.
Details
Motivation: Current image manipulation localization methods struggle with the micro-macro gap - they either rely on low-level forensic artifacts (effective for traditional manipulations) or high-level semantics (better for diffusion-generated edits), but not both. As generative editing advances, a unified approach is needed to handle both types of manipulations.
Method: FASA extracts manipulation-sensitive frequency cues via adaptive dual-band DCT, learns manipulation-aware semantic priors through patch-level contrastive alignment on frozen CLIP representations, injects these priors into a hierarchical frequency pathway using semantic-frequency side adapters, and employs a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization.
Result: Extensive experiments on OpenSDI and traditional manipulation benchmarks show state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robustness under common image degradations.
Conclusion: FASA successfully bridges the micro-macro gap in image manipulation localization by unifying frequency-based forensic analysis with semantic understanding, providing a robust solution for detecting both traditional and diffusion-generated manipulations.
Abstract: As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro–macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.
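A minimal stand-in for the dual-band DCT split, using a plain 2D DCT-II and a fixed diagonal cutoff in place of the paper's learned adaptive bands:

```python
import math

def dct1d(x):
    # unnormalized DCT-II of a 1D sequence
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

def dct2d(block):
    # separable 2D DCT: transform rows, then columns
    rows = [dct1d(list(r)) for r in block]
    return [list(r) for r in zip(*[dct1d(list(c)) for c in zip(*rows)])]

def dual_band_split(block, cutoff):
    # low band (u + v < cutoff): smooth content; high band: edges and the
    # high-frequency residue manipulation detectors tend to key on
    coeffs = dct2d(block)
    low = [[c if u + v < cutoff else 0.0 for v, c in enumerate(row)]
           for u, row in enumerate(coeffs)]
    high = [[c if u + v >= cutoff else 0.0 for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]
    return low, high
```

A perfectly flat patch lands entirely in the low band, which is the sanity check that the split separates content from artifact-bearing frequencies.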
[194] Detecting Precise Hand Touch Moments in Egocentric Video
Huy Anh Nguyen, Feras Dayoub, Minh Hoai
Main category: cs.CV
TL;DR: A method for detecting precise hand-object contact moments in egocentric videos using hand-informed context enhancement and grasp-aware loss, achieving state-of-the-art performance on a new dataset.
Details
Motivation: Frame-level detection of hand-object contact moments is crucial for AR, HCI, assistive tech, and robot learning, but challenging due to subtle motion variations, occlusions, and fine-grained manipulation patterns in egocentric videos.
Method: Proposes HiCE (Hand-informed Context Enhanced) module using cross-attention to leverage spatiotemporal features from hand regions and surrounding context, plus grasp-aware loss with soft labels emphasizing hand pose patterns and movement dynamics.
Result: Outperforms state-of-the-art event-spotting baselines by 16.91% average precision on TouchMoment dataset (4,021 videos, 8,456 contact moments) under strict 2-frame tolerance evaluation.
Conclusion: The approach effectively addresses challenges in precise contact moment detection through hand-informed context modeling and grasp-aware training, enabling accurate frame-level contact detection in egocentric videos.
Abstract: We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. Experiments on TouchMoment show that, under a strict evaluation criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment, our method achieves substantial gains and outperforms state-of-the-art event-spotting baselines by 16.91% average precision.
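The soft-label idea can be approximated by a Gaussian centred on the annotated contact frame, so near-contact frames receive partial credit instead of a hard 0/1 target; the Gaussian form and sigma are assumptions, not the paper's exact definition:

```python
import math

def soft_contact_labels(num_frames, contact_frame, sigma=1.0):
    # Gaussian soft target over time: exactly 1.0 at the annotated contact
    # frame, decaying smoothly for the surrounding near-contact frames
    return [math.exp(-((t - contact_frame) ** 2) / (2 * sigma ** 2))
            for t in range(num_frames)]
```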
[195] Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization
Zanyi Wang, Fan Li, Dengyang Jiang, Liuzhuozheng Li, Yunhua Zhong, Guang Dai, Mengmeng Wang
Main category: cs.CV
TL;DR: ST-GD is a data-efficient framework that adapts pre-trained 2D visual-language models to spatio-temporal video grounding tasks using lightweight adapters and a temporal decoder, addressing data scarcity in video understanding.
Details
Motivation: Spatio-temporal video grounding requires large-scale annotated data which is expensive to collect, especially for specialized domains. Existing approaches either overfit on limited datasets or lack temporal awareness needed for precise localization in videos.
Method: ST-GD keeps pre-trained 2D visual-language models (like Grounding DINO) frozen and injects lightweight adapters (~10M parameters) to add spatio-temporal awareness. It includes a novel temporal decoder for boundary prediction while preserving pre-trained priors.
Result: ST-GD achieves competitive performance on limited-scale HC-STVG v1/v2 benchmarks and maintains robust generalization on the VidSTG dataset, demonstrating effectiveness in data-scarce scenarios.
Conclusion: ST-GD provides a powerful paradigm for complex video understanding under strict small-data constraints by efficiently adapting pre-trained 2D models to video tasks without destroying their valuable priors.
Abstract: Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.
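A numpy sketch of the adapter principle ST-GD relies on: a zero-initialised bottleneck adapter added residually to a frozen backbone starts as the identity, so pre-trained priors are untouched at initialisation. Sizes are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 256, 16   # illustrative sizes (~10M trainable in the paper)

W_down = rng.standard_normal((d_model, bottleneck)) * 0.02
W_up = np.zeros((bottleneck, d_model))  # zero-init: adapter starts as identity

def adapter(h):
    # bottleneck adapter on a frozen backbone: down-project, ReLU,
    # up-project, residual. Before any training step it leaves features
    # untouched, one way to avoid destroying pre-trained priors on small data.
    return h + np.maximum(h @ W_down, 0.0) @ W_up
```

Only `W_down`/`W_up` would receive gradients; the backbone weights stay frozen.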
[196] Fundus Image-based Glaucoma Screening via Retinal Knowledge-Oriented Dynamic Multi-Level Feature Integration
Yuzhuo Zhou, Chi Liu, Sheng Shen, Zongyuan Ge, Fengshi Jing, Shiran Zhang, Yu Jiang, Anli Wang, Wenjian Liu, Feilong Yang, Tianqing Zhu, Xiaotong Han
Main category: cs.CV
TL;DR: A retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors, achieving state-of-the-art performance on large-scale datasets.
Details
Motivation: Existing deep learning models for glaucoma screening lack explicit integration of retinal anatomical knowledge, limiting robustness across heterogeneous clinical datasets. Fixed-region feature extraction is insufficient as pathological cues may appear beyond predefined anatomical regions.
Method: Proposes a tri-branch framework capturing global retinal context, structural features of optic disc/cup, and dynamically localized pathological regions. Uses Dynamic Window Mechanism to adaptively identify diagnostically informative regions and Knowledge-Enhanced Convolutional Attention Module incorporating retinal priors from a pre-trained foundation model.
Result: Achieves AUC of 98.5% and accuracy of 94.6% on AIROGS dataset, outperforming diverse baselines. Shows strong cross-domain generalization on SMDG-19 benchmark datasets.
Conclusion: Knowledge-guided attention combined with adaptive lesion localization significantly improves robustness of automated glaucoma screening systems, demonstrating the value of integrating domain-specific anatomical knowledge with dynamic feature learning.
Abstract: Automated diagnosis based on color fundus photography is essential for large-scale glaucoma screening. However, existing deep learning models are typically data-driven and lack explicit integration of retinal anatomical knowledge, which limits their robustness across heterogeneous clinical datasets. Moreover, pathological cues in fundus images may appear beyond predefined anatomical regions, making fixed-region feature extraction insufficient for reliable diagnosis. To address these challenges, we propose a retinal knowledge-oriented glaucoma screening framework that integrates dynamic multi-scale feature learning with domain-specific retinal priors. The framework adopts a tri-branch structure to capture complementary retinal representations, including global retinal context, structural features of the optic disc/cup, and dynamically localized pathological regions. A Dynamic Window Mechanism is devised to adaptively identify diagnostically informative regions, while a Knowledge-Enhanced Convolutional Attention Module incorporates retinal priors extracted from a pre-trained foundation model to guide attention learning. Extensive experiments on the large-scale AIROGS dataset demonstrate that the proposed method outperforms diverse baselines, achieving an AUC of 98.5% and an accuracy of 94.6%. Additional evaluations on multiple datasets from the SMDG-19 benchmark further confirm its strong cross-domain generalization capability, indicating that knowledge-guided attention combined with adaptive lesion localization can significantly improve the robustness of automated glaucoma screening systems.
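A 1D toy of the Dynamic Window Mechanism's core step, reduced to picking the highest-scoring span of a saliency profile (the actual module operates on 2D fundus feature maps, so this is only the underlying selection idea):

```python
def dynamic_window(saliency, win):
    # slide a fixed-size window over a 1D saliency profile and return the
    # start of the highest-scoring span -- a stand-in for adaptively
    # locating the most diagnostically informative region
    best_start, best_score = 0, float("-inf")
    for s in range(len(saliency) - win + 1):
        score = sum(saliency[s:s + win])
        if score > best_score:
            best_start, best_score = s, score
    return best_start
```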
[197] Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection
Haifeng Zhang, Qinghui He, Xiuli Bi, Bo Liu, Chi-Man Pun, Bin Xiao
Main category: cs.CV
TL;DR: A Multi-dimensional Adversarial Feature Learning (MAFL) framework for detecting AI-generated images that addresses asymmetric bias learning by suppressing generation-pattern and content biases to improve cross-model generalization.
Details
Motivation: The rapid development of generative AI has lowered barriers to creating high-quality fake images, threatening information authenticity. Existing detection methods suffer from data bias where models learn specific generative patterns rather than common features across different models, limiting generalization.
Method: Proposes MAFL framework using pretrained multimodal image encoder as backbone, constructs real-fake feature learning network with adversarial bias-learning branch using multi-dimensional adversarial loss. Creates adversarial training between authenticity-discriminative feature learning and bias feature learning to suppress generation-pattern and content biases.
Result: Outperforms state-of-the-art methods by 10.89% in accuracy and 8.57% in Average Precision. Achieves over 80% detection accuracy even when trained with only 320 images on public datasets.
Conclusion: MAFL effectively captures fundamental differences between real and generated images by focusing on shared generative features across models, enhancing cross-model generalization while reducing reliance on large-scale training data.
Abstract: In recent years, the rapid development of generative artificial intelligence technology has significantly lowered the barrier to creating high-quality fake images, posing a serious challenge to information authenticity and credibility. Existing generated image detection methods typically enhance generalization through model architecture or network design. However, their generalization performance remains susceptible to data bias, as the training data may drive models to fit specific generative patterns and content rather than the common features shared by images from different generative models (asymmetric bias learning). To address this issue, we propose a Multi-dimensional Adversarial Feature Learning (MAFL) framework. The framework adopts a pretrained multimodal image encoder as the feature extraction backbone, constructs a real-fake feature learning network, and designs an adversarial bias-learning branch equipped with a multi-dimensional adversarial loss, forming an adversarial training mechanism between authenticity-discriminative feature learning and bias feature learning. By suppressing generation-pattern and content biases, MAFL guides the model to focus on the generative features shared across different generative models, thereby effectively capturing the fundamental differences between real and generated images, enhancing cross-model generalization, and substantially reducing the reliance on large-scale training data. Through extensive experimental validation, our method outperforms existing state-of-the-art approaches by 10.89% in accuracy and 8.57% in Average Precision (AP). Notably, even when trained with only 320 images, it can still achieve over 80% detection accuracy on public datasets.
[198] OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
Dongjian Yu, Weiqing Min, Qian Jiang, Xing Lin, Xin Jin, Shuqiang Jiang
Main category: cs.CV
TL;DR: OmniFood8K dataset with 8,036 Chinese food samples and NutritionSynth-115K synthetic dataset for nutrition prediction from single RGB images using depth estimation and frequency-aligned fusion.
Details
Motivation: Existing food nutrition datasets focus on Western cuisines and lack Chinese dishes, while state-of-the-art methods rely on depth sensors limiting daily applicability. Need for comprehensive Chinese food dataset and RGB-only nutrition prediction.
Method: 1) Predict depth from single RGB image using Scale-Shift Residual Adapter for refinement; 2) Frequency-Aligned Fusion Module to hierarchically align RGB and depth features in frequency domain; 3) Mask-based Prediction Head for dynamic channel selection emphasizing key ingredient regions.
Result: Extensive experiments on multiple datasets demonstrate superiority over existing approaches. OmniFood8K provides 8,036 food samples with nutritional annotations and multi-view images, plus NutritionSynth-115K synthetic dataset.
Conclusion: The proposed framework enables accurate nutritional prediction from single RGB images without depth sensors, addressing limitations of existing methods and datasets for Chinese cuisine.
Abstract: Accurate estimation of food nutrition plays a vital role in promoting healthy dietary habits and personalized diet management. Most existing food datasets primarily focus on Western cuisines and lack sufficient coverage of Chinese dishes, which restricts accurate nutritional estimation for Chinese meals. Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food samples, each with detailed nutritional annotations and multi-view images. In addition, to enhance models’ capability in nutritional prediction, we construct NutritionSynth-115K, a large-scale synthetic dataset that introduces compositional variations while preserving precise nutritional labels. Moreover, we propose an end-to-end framework for nutritional prediction from a single RGB image. First, we predict a depth map from a single RGB image and design the Scale-Shift Residual Adapter (SSRA) to refine it for global scale consistency and local structural preservation. Second, we propose the Frequency-Aligned Fusion Module (FAFM) to hierarchically align and fuse RGB and depth features in the frequency domain. Finally, we design a Mask-based Prediction Head (MPH) to emphasize key ingredient regions via dynamic channel selection for more accurate prediction. Extensive experiments on multiple datasets demonstrate the superiority of our method over existing approaches. Project homepage: https://yudongjian.github.io/OmniFood8K-food/
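The paper only names its Frequency-Aligned Fusion Module, so here is a deliberately minimal sketch of the general idea of fusing two modalities in the frequency domain: transform RGB and depth feature vectors with a naive 1-D DFT, blend the spectra, and transform back. The blend weight `alpha` and the 1-D toy features are illustrative assumptions, not the paper's design.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def fuse_frequency(rgb_feat, depth_feat, alpha=0.5):
    # Blend the two spectra; alpha weights the RGB branch,
    # (1 - alpha) the depth branch.
    R, D = dft(rgb_feat), dft(depth_feat)
    fused = [alpha * r + (1 - alpha) * d for r, d in zip(R, D)]
    return idft(fused)

rgb = [1.0, 2.0, 3.0, 4.0]    # toy 1-D "RGB feature"
depth = [4.0, 3.0, 2.0, 1.0]  # toy 1-D "depth feature"
fused = fuse_frequency(rgb, depth, alpha=0.5)
```

Because the DFT is linear, an equal-weight blend reduces to an elementwise average; a real module would instead learn frequency-dependent weights.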
[199] Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park
Main category: cs.CV
TL;DR: DSTP is a training-free framework that improves visual token pruning for multimodal LLMs by addressing Relevant Visual Information Shift during decoding, enabling better performance on complex visual reasoning tasks.
Details
Motivation: Existing visual token pruning methods work well for simple visual understanding but fail to generalize to complex visual reasoning tasks due to Relevant Visual Information Shift (RVIS) during decoding, which previous studies have underexplored.
Method: Proposes Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that aligns visual tokens with shifting reasoning requirements during the decoding stage to address RVIS.
Result: DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, yields performance gains across visual understanding benchmarks, works across diverse state-of-the-art architectures with minimal computational overhead.
Conclusion: DSTP effectively addresses the RVIS problem in visual token pruning, enabling better generalization to complex visual reasoning tasks while maintaining efficiency across various multimodal LLM architectures.
Abstract: Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
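The abstract doesn't spell out DSTP's mechanism, but the underlying idea — the set of relevant visual tokens shifts as decoding proceeds — can be sketched by re-scoring tokens against the current decoding query at each step and keeping a fresh top-k. The dot-product scoring and the function names here are illustrative assumptions.

```python
def rescore_visual_tokens(visual_tokens, query, k=2):
    # Score every visual token against the current decoding query
    # (plain dot product here) and keep the k most relevant.
    # Re-running this each step lets the retained set shift.
    scores = [(sum(q * v for q, v in zip(query, tok)), i)
              for i, tok in enumerate(visual_tokens)]
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])

tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
step1 = rescore_visual_tokens(tokens, query=[1.0, 0.0], k=2)
step2 = rescore_visual_tokens(tokens, query=[0.0, 1.0], k=2)
```

A static pruner would fix one token subset up front; the point of the sketch is that `step1` and `step2` retain different tokens as the query changes.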
[200] Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg
Main category: cs.CV
TL;DR: Typographic prompt injection attacks on VLMs use adversarial text rendered as images to bypass safety mechanisms, with effectiveness varying by font size, visual conditions, and model vulnerability.
Details
Motivation: As VLMs become perceptual backbones for autonomous agents, typographic prompt injection attacks pose a growing threat where adversarial text bypasses safety mechanisms by being rendered as images, requiring understanding of heterogeneous attack surfaces across varying visual conditions and model vulnerabilities.
Method: Evaluated 1,000 prompts from SALAD-Bench across four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, Qwen3-VL-4B-Instruct) under varying font sizes (6-28px) and visual transformations (rotation, blur, noise, contrast changes), analyzing attack success rates and embedding distances from multimodal embedding models.
Result: Font size significantly affects ASR with small fonts (6px) yielding near-zero ASR; text attacks more effective than image attacks for GPT-4o and Claude; embedding distance shows strong negative correlation with ASR; heavy degradations increase embedding distance by 10-12% and reduce ASR by 34-96%; rotation affects models asymmetrically.
Conclusion: Model-specific robustness patterns preclude universal defenses, offering empirical guidance for selecting VLM backbones in adversarial environments, highlighting the need for tailored security approaches rather than one-size-fits-all solutions.
Abstract: We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6–28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10–12% and reduce ASR by 34–96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
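The paper's headline correlation (embedding distance vs. ASR, r = -0.71 to -0.93) boils down to two standard computations: cosine distance between text and image embeddings, and Pearson's r over (distance, ASR) pairs. The sketch below uses synthetic pairs shaped like the reported trend, not the paper's data.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic (embedding distance, attack success rate) pairs mirroring
# the paper's finding: larger text-image distance, lower ASR.
distances = [0.10, 0.15, 0.22, 0.30, 0.41, 0.55]
asr = [0.47, 0.40, 0.33, 0.21, 0.12, 0.05]
r = pearson_r(distances, asr)
```

On this toy data `r` comes out strongly negative, matching the direction (though of course not the value) of the paper's result.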
[201] Modality-Agnostic Prompt Learning for Multi-Modal Camouflaged Object Detection
Hao Wang, Jiqing Zhang, Xin Yang, Baocai Yin, Lu Jiang, Zetian Mi, Huibing Wang
Main category: cs.CV
TL;DR: A modality-agnostic framework that generates multi-modal prompts for SAM to improve camouflaged object detection across various auxiliary modalities (RGB-Depth, RGB-Thermal, RGB-Polarization).
Details
Motivation: Existing camouflaged object detection approaches rely on modality-specific architectures or customized fusion strategies, limiting scalability and cross-modal generalization. There's a need for a unified framework that can efficiently adapt to arbitrary auxiliary modalities.
Method: Proposes a framework that generates modality-agnostic multi-modal prompts for Segment Anything Model (SAM). Uses interactions between data-driven content domain and knowledge-driven prompt domain to distill task-relevant cues into unified prompts. Includes a lightweight Mask Refine Module to calibrate coarse predictions using fine-grained prompt cues for better boundary accuracy.
Result: Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the framework’s effectiveness and generalization capabilities, showing significant performance improvements on COD tasks.
Conclusion: The proposed modality-agnostic framework enables parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improves camouflaged object detection performance across diverse multi-modal scenarios.
Abstract: Camouflaged Object Detection (COD) aims to segment objects that blend seamlessly into complex backgrounds, with growing interest in exploiting additional visual modalities to enhance robustness through complementary information. However, most existing approaches generally rely on modality-specific architectures or customized fusion strategies, which limit scalability and cross-modal generalization. To address this, we propose a novel framework that generates modality-agnostic multi-modal prompts for the Segment Anything Model (SAM), enabling parameter-efficient adaptation to arbitrary auxiliary modalities and significantly improving overall performance on COD tasks. Specifically, we model multi-modal learning through interactions between a data-driven content domain and a knowledge-driven prompt domain, distilling task-relevant cues into unified prompts for SAM decoding. We further introduce a lightweight Mask Refine Module to calibrate coarse predictions by incorporating fine-grained prompt cues, leading to more accurate camouflaged object boundaries. Extensive experiments on RGB-Depth, RGB-Thermal, and RGB-Polarization benchmarks validate the effectiveness and generalization of our modality-agnostic framework.
[202] Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
Jiawei Fan, Shigeng Wang, Chao Li, Xiaolong Liu, Anbang Yao
Main category: cs.CV
TL;DR: CoM-PT accelerates vision foundation model training by creating a model chain where only the smallest model gets standard pre-training, and larger models learn from smaller predecessors through parameter and feature space knowledge transfer.
Details
Motivation: Current acceleration methods optimize individual models, but training vision foundation models is computationally expensive. The authors aim to accelerate training at the model family level, achieving better scaling efficiency as model families expand.
Method: Creates a pre-training sequence (model chain) arranged by ascending model size. Only the smallest model undergoes standard pre-training. Larger models are trained through sequential inverse knowledge transfer from smaller predecessors, reusing both parameter space and feature space knowledge.
Result: Achieves performance mostly superior to standard individual training while significantly reducing training costs. Validated across 45 datasets. When pre-training on CC3M: 72% computational complexity reduction for ViT-L; acceleration ratios improve from 4.13X to 5.68X to 7.09X as model family scales from 3 to 4 to 7 models.
Conclusion: CoM-PT provides performance-lossless training acceleration for vision foundation models that scales efficiently with model family expansion. The method is agnostic to pre-training paradigms and could extend to other computationally intensive scenarios like large language model pre-training.
Abstract: In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.
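CoM-PT's two transfer channels — parameter space and feature space — can be caricatured in a few lines: initialize the next model in the chain by reusing the predecessor's weights, and add a feature-matching penalty against the predecessor's activations. The scalar "layers" and both function names below are toy assumptions; the real method operates on full ViT checkpoints.

```python
def grow_model(small_params, new_layer_count, init=0.0):
    # Parameter-space transfer: reuse the trained predecessor's
    # parameters and append freshly initialized layers.
    return list(small_params) + [init] * (new_layer_count - len(small_params))

def feature_match_loss(student_feats, teacher_feats):
    # Feature-space transfer: mean squared error between the two
    # models' intermediate features on the same input.
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

small = [0.3, -0.1]           # "trained" 2-layer predecessor (toy scalars)
large = grow_model(small, 4)  # 4-layer successor reusing the first 2 layers
loss0 = feature_match_loss([0.5, 0.2], [0.5, 0.2])  # perfectly matched
loss1 = feature_match_loss([1.0, 0.2], [0.5, 0.2])  # mismatched first feature
```

Chaining this step model-by-model is what lets only the smallest model pay the full pre-training cost.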
[203] Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Jungwon Choi, Eunwoo Kim
Main category: cs.CV
TL;DR: TPT-Anchor: A dual-modality anchor-guided framework for test-time prompt tuning that uses text and image anchors to filter informative views and provide stable supervision for vision-language model adaptation.
Details
Motivation: Standard test-time prompt tuning suffers from miscalibrated confidence scores under distribution shift, causing irrelevant views to guide adaptation. The paper aims to ground view selection in semantic evidence rather than unreliable internal confidence.
Method: Proposes a dual-modality anchor-guided framework with: 1) Text anchors from attribute-rich descriptions for fine-grained class semantics, 2) Adaptive image anchors capturing test-time statistics, 3) Anchor-based view filtering using alignment and confidence metrics, and 4) Confidence-weighted ensemble combining anchor predictions with original outputs for stable prompt updates.
Result: Extensive experiments on 15 benchmark datasets demonstrate state-of-the-art performance, showing significant improvements over existing test-time adaptation methods for vision-language models.
Conclusion: Anchor-guided supervision provides a robust foundation for prompt updates in vision-language models, effectively addressing miscalibration issues and improving adaptation performance under distribution shift.
Abstract: Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
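The core filtering step — keep only views that are both confident and semantically aligned with an anchor — is easy to sketch. The thresholds, the dot-product alignment measure, and the toy features below are assumptions for illustration, not the paper's actual criteria.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def filter_views(view_feats, view_logits, anchor, sim_thresh=0.8, ent_thresh=0.3):
    # Keep views that are confident (low predictive entropy) AND
    # aligned with the anchor embedding; thresholds are arbitrary here.
    kept = []
    for i, (f, lg) in enumerate(zip(view_feats, view_logits)):
        if cosine_sim(f, anchor) >= sim_thresh and entropy(softmax(lg)) <= ent_thresh:
            kept.append(i)
    return kept

anchor = [1.0, 0.0]
feats = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]   # aligned / misaligned / aligned
logits = [[4.0, 0.0], [3.0, 0.0], [0.1, 0.0]]  # confident / confident / uncertain
kept = filter_views(feats, logits, anchor)
```

Only the first view survives: the second is confident but misaligned (the failure mode entropy-only filtering misses), the third is aligned but too uncertain.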
[204] DeferredSeg: A Multi-Expert Deferral Framework for Trustworthy Medical Image Segmentation
Qiuyu Tian, Haoliang Sun, Yunshan Wang, Yinghuan Shi, Yilong Yin
Main category: cs.CV
TL;DR: DeferredSeg is a human-AI collaboration framework for medical image segmentation that learns when to defer uncertain pixel predictions to human experts, improving reliability and trustworthiness.
Details
Motivation: Deep learning segmentation models often produce unreliable confidence scores, especially in ambiguous regions, which undermines clinical trustworthiness. The paper addresses this by creating a system that can identify when to defer decisions to human experts.
Method: Extends base segmentor with aggregated deferral predictor and routing channels that dynamically route pixels to either AI or human experts. Uses pixel-wise surrogate collaboration loss for training deferral decisions and spatial-coherence loss for smooth deferral masks. Also extends to multi-expert setting with load-balancing penalty.
Result: DeferredSeg consistently outperforms baselines on three challenging medical datasets using MedSAM and CENet as base segmentors, demonstrating effectiveness for trustworthy dense medical segmentation.
Conclusion: The framework provides a model-agnostic approach for trustworthy medical segmentation through human-AI collaboration, with extensions to multi-expert settings and practical deployment considerations.
Abstract: Segmentation models based on deep neural networks demonstrate strong generalization for medical image segmentation. However, they often exhibit overconfidence or underconfidence, leading to unreliable confidence scores for segmentation masks, especially in ambiguous regions. This undermines the trustworthiness required for clinical deployment. Motivated by the learning-to-defer (L2D) paradigm, we introduce DeferredSeg, a deferral-aware segmentation framework, i.e., a Human–AI collaboration system that determines whether to defer predictions to human experts in specific regions. DeferredSeg extends the base segmentor with an aggregated deferral predictor and additional routing channels that dynamically route each pixel to either the base segmentor or a human expert. To train this routing efficiently, we introduce a pixel-wise surrogate collaboration loss that supervises deferral decisions. In addition, to preserve spatial coherence within deferral regions, we propose a spatial-coherence loss that enforces smooth deferral masks, thereby enhancing reliability. Beyond single-expert deferral, we further extend the framework to a multi-expert setting by introducing multiple discrepancy experts for collaborative decision-making. To prevent overloading or underutilizing individual experts, we further design a load-balancing penalty that evenly distributes workload across expert branches. We evaluate DeferredSeg on three challenging medical datasets using MedSAM and CENet as the base segmentor for fair comparison. Experimental results show that DeferredSeg consistently outperforms the baseline, demonstrating its effectiveness for trustworthy dense medical segmentation. Moreover, the proposed framework is model-agnostic and can be readily applied to other segmentation architectures.
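At inference time, learning-to-defer segmentation reduces to per-pixel routing: pixels flagged by the deferral predictor take the human expert's label, all others keep the model's prediction. A minimal sketch on flattened toy masks (the names and the binary-mask setup are assumptions; the paper's deferral predictor is learned, not given):

```python
def route_pixels(model_pred, defer_map, expert_label):
    # Per-pixel routing: deferral flag 1 takes the expert's label,
    # flag 0 keeps the model's prediction.
    return [e if d else m
            for m, d, e in zip(model_pred, defer_map, expert_label)]

def deferral_rate(defer_map):
    # Fraction of pixels sent to the human expert (the "cost" side
    # of the human-AI collaboration trade-off).
    return sum(defer_map) / len(defer_map)

pred = [1, 0, 1, 1]    # model's per-pixel mask (flattened)
defer = [0, 0, 1, 0]   # 1 = route this pixel to the human expert
expert = [1, 0, 0, 1]  # expert's label where consulted
final = route_pixels(pred, defer, expert)
rate = deferral_rate(defer)
```

The paper's spatial-coherence loss exists precisely because an unconstrained `defer_map` like this one can be arbitrarily scattered, which is impractical to hand to a clinician.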
[205] A Hybrid Architecture for Benign-Malignant Classification of Mammography ROIs
Mohammed Asad, Mohit Bajpai, Sudhir Singh, Rahul Katarya
Main category: cs.CV
TL;DR: Hybrid CNN-Vision Mamba model for breast lesion classification in mammography, combining EfficientNetV2-M for local features with Vision Mamba (State Space Model) for efficient global context modeling.
Details
Motivation: CNNs are effective for local patterns but poor at long-range dependencies in mammography, while Vision Transformers have quadratic computational costs. Need efficient global context modeling for accurate breast lesion classification.
Method: Proposes hybrid architecture: EfficientNetV2-M extracts local features, Vision Mamba (State Space Model) handles global context with linear complexity. Binary classification of mammography ROIs from CBIS-DDSM dataset into benign/malignant classes.
Result: Achieves strong lesion-level classification performance in an ROI-based setting by combining a CNN backbone with a linear-complexity sequence model.
Conclusion: Hybrid CNN-Vision Mamba approach effectively addresses limitations of both CNNs and Transformers for mammography analysis, providing efficient global context modeling for breast lesion classification.
Abstract: Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.
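The linear-complexity claim behind Mamba-style models comes from replacing quadratic self-attention with a state-space scan. The scalar recurrence below is only an illustration of that scan (real selective SSMs use learned, input-dependent matrices), showing how one O(n) pass gives every position a decayed view of all earlier positions.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    # Minimal linear state-space recurrence:
    #   h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    # One left-to-right pass; the state h carries (exponentially
    # decayed) information from the whole prefix.
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

# An impulse at position 0 is still visible (decayed by a^t) at every
# later position -- global context in linear time.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```

Contrast with self-attention, where relating all n positions costs O(n²) pairwise scores.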
[206] IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation
Haoyu Zheng, Tianwei Lin, Wei Wang, Zhuonan Wang, Wenqiao Zhang, Jiaqi Zhu, Feifei Shao
Main category: cs.CV
TL;DR: IAD-Unify: A unified framework for industrial anomaly detection that jointly supports anomaly segmentation, region-grounded understanding, and mask-guided generation using a dual-encoder approach with frozen DINOv2 region expert and Qwen3.5-4B vision-language backbone.
Details
Motivation: Real-world industrial inspection requires three capabilities: localizing defects, explaining them in natural language, and generating controlled defect edits. Existing approaches fail to support all three capabilities within a unified framework and evaluation protocol.
Method: Proposes IAD-Unify, a dual-encoder unified framework where a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection. This enables joint anomaly segmentation, region-grounded understanding, and mask-guided generation. Also constructs Anomaly-56K, a comprehensive unified multi-task evaluation platform with 59,916 images across 24 categories and 104 defect variants.
Result: Four key findings: (1) region grounding is crucial for understanding (removing it degrades location accuracy by >76 percentage points); (2) predicted-region performance closely matches oracle, confirming deployment viability; (3) region-grounded generation achieves best full-image fidelity and masked-region perceptual quality; (4) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). Achieves strong performance on MMAD benchmark with robust cross-category generalization.
Conclusion: IAD-Unify successfully unifies three key industrial inspection capabilities in a single framework, demonstrating the importance of region grounding for both understanding and generation tasks. The framework shows strong generalization to unseen categories and provides a viable solution for real-world deployment.
Abstract: Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by >76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.
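"Lightweight token injection" generally means projecting the region expert's features into the language model's embedding width and splicing them into the token sequence. A toy sketch with made-up shapes (the 3-dim region feature, 2-dim token width, and function names are all illustrative assumptions):

```python
def project(feat, weight):
    # Toy linear projection: `weight` holds the rows of an
    # (out_dim x in_dim) matrix mapping a region feature into the
    # language model's embedding width.
    return [sum(f * w for f, w in zip(feat, row)) for row in weight]

def inject_region_tokens(text_tokens, region_feats, weight):
    # Prepend projected region-evidence tokens to the text sequence;
    # the frozen region expert's output enters the LLM only here.
    region_tokens = [project(f, weight) for f in region_feats]
    return region_tokens + text_tokens

text = [[0.1, 0.2], [0.3, 0.4]]         # 2 text tokens, width 2
regions = [[1.0, 0.0, 0.0]]             # 1 region feature, width 3
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3 -> 2 projection rows
seq = inject_region_tokens(text, regions, W)
```

Keeping the region expert frozen and training only a projection like `W` is what makes this kind of injection cheap relative to fine-tuning the whole backbone.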
[207] DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization
Paschalis Giakoumoglou, Symeon Papadopoulos
Main category: cs.CV
TL;DR: DiffusionPrint is a patch-level contrastive learning framework for detecting diffusion-based image forgeries by learning generative fingerprints that survive latent decoding distortions.
Details
Motivation: Modern diffusion-based inpainting models disrupt camera-level noise patterns that existing forensic methods rely on, making image forgery localization challenging. These models use full regeneration pipelines with latent decoders that introduce spectral distortions.
Method: Patch-level contrastive learning framework using MoCo-style objective with cross-category hard negative mining and generator-aware classification head. Learns forensic signals robust to latent decoding distortions by exploiting consistent generative fingerprints in inpainted regions from the same model.
Result: DiffusionPrint consistently improves localization across multiple generative models, with gains up to +28% on mask types unseen during fine-tuning. Shows generalization to unseen generative architectures when integrated into TruFor, MMFusion, and lightweight fusion baselines.
Conclusion: The framework successfully addresses the challenge of detecting diffusion-based forgeries by learning robust forensic signals that survive latent decoding distortions, providing a discriminative secondary modality for fusion-based image forgery localization.
Abstract: Modern diffusion-based inpainting models pose significant challenges for image forgery localization (IFL), as their full regeneration pipelines reconstruct the entire image via a latent decoder, disrupting the camera-level noise patterns that existing forensic methods rely on. We propose DiffusionPrint, a patch-level contrastive learning framework that learns a forensic signal robust to the spectral distortions introduced by latent decoding. It exploits the fact that inpainted regions generated by the same model share a consistent generative fingerprint, using this as a self-supervisory signal. DiffusionPrint trains a convolutional backbone via a MoCo-style objective with cross-category hard negative mining and a generator-aware classification head, producing a forensic feature map that serves as a highly discriminative secondary modality in fusion-based IFL frameworks. Integrated into TruFor, MMFusion, and a lightweight fusion baseline, DiffusionPrint consistently improves localization across multiple generative models, with gains of up to +28% on mask types unseen during fine-tuning and confirmed generalization to unseen generative architectures. Code is available at https://github.com/mever-team/diffusionprint
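The MoCo-style objective is the standard InfoNCE loss: for a query patch, pull it toward a positive (here, a patch from the same generator) and push it from negatives (patches from other generators). A numerically stable sketch for a single query, with made-up similarity values:

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.07):
    # InfoNCE for one query: -log( exp(s+/tau) / sum_k exp(s_k/tau) ),
    # computed via log-sum-exp for numerical stability.
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / tau - log_sum)

easy = info_nce(0.9, [0.1, 0.0])   # negatives well separated from positive
hard = info_nce(0.9, [0.85, 0.8])  # hard negatives with similar fingerprints
```

Hard negatives drive the loss up, which is why the paper's cross-category hard negative mining matters: it forces the backbone to separate fingerprints that look alike.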
[208] Euler-inspired Decoupling Neural Operator for Efficient Pansharpening
Anqi Zhu, Mengting Ma, Yizhen Jiang, Xiangdong Li, Kai Zheng, Jiaxin Li, Wei Zhang
Main category: cs.CV
TL;DR: EDNO is a physics-inspired pansharpening framework that uses Euler’s formula in frequency domain to decouple spatial and spectral fusion, achieving better efficiency-performance balance than diffusion-based methods.
Details
Motivation: Current deep learning methods for pansharpening, especially diffusion-based approaches, suffer from spectral-spatial blurring and high computational costs due to their stochastic nature and iterative sampling. There's a need for more efficient methods that maintain both spatial and spectral quality.
Method: Proposes Euler-inspired Decoupling Neural Operator (EDNO) that redefines pansharpening as continuous functional mapping in frequency domain. Uses Euler's formula to transform features to polar coordinates, with Euler Feature Interaction Layer (EFIL) containing: 1) Explicit Feature Interaction Module for geometric alignment via linear weighting simulating phase rotation, and 2) Implicit Feature Interaction Module using feed-forward network for spectral distribution modeling.
Result: EDNO demonstrates a superior efficiency-performance balance compared to heavyweight architectures on three datasets, offering better computational efficiency while maintaining high-quality results.
Conclusion: EDNO provides an effective physics-inspired framework for pansharpening that addresses computational efficiency and quality issues of current methods through frequency domain processing and explicit-implicit feature decoupling.
Abstract: Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler’s formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on the three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.
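The polar decoupling EDNO builds on is just Euler's formula, z = r·e^{iθ}: rotating the phase moves geometry without touching magnitude, and scaling the magnitude changes energy without touching phase. A sketch on a single complex frequency coefficient (function names are illustrative; the paper applies learned versions of these operations to whole feature maps):

```python
import cmath

def to_polar(z):
    # Euler's formula: z = r * e^{i*theta}
    return abs(z), cmath.phase(z)

def rotate_phase(z, delta):
    # "Explicit" interaction: multiply by e^{i*delta}; shifts phase
    # (geometric alignment) while leaving magnitude unchanged.
    return z * cmath.exp(1j * delta)

def scale_magnitude(z, gain):
    # "Implicit" interaction: real gain rescales magnitude (spectral
    # energy) while leaving phase unchanged.
    return z * gain

z = complex(1.0, 1.0)
r, theta = to_polar(z)               # r = sqrt(2), theta = pi/4
z_rot = rotate_phase(z, cmath.pi / 2)
z_amp = scale_magnitude(z, 2.0)
```

This orthogonality is the point of the decoupling: the two modules can specialize in geometry and spectrum without interfering with each other.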
[209] T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models
Nihal Jaiswal, Siddhartha Arjaria, Gyanendra Chaubey, Ankush Kumar, Aditya Singh, Anchal Chaurasiya
Main category: cs.CV
TL;DR: T2I-BiasBench: A comprehensive evaluation framework with 13 metrics to measure demographic bias, element omission, and cultural collapse in text-to-image diffusion models.
Details
Motivation: Text-to-image models inherit and amplify demographic imbalances and cultural biases from training data, but existing evaluation frameworks don't comprehensively address all three key dimensions of bias simultaneously.
Method: Developed T2I-BiasBench with 13 complementary metrics (6 established + 7 new/adapted) to evaluate three open-source diffusion models against Gemini 2.5 Flash baseline, generating 1,574 images across five structured prompt categories.
Result: Key findings: (1) Stable Diffusion and BK-SDM show bias amplification in beauty prompts; (2) contextual constraints reduce professional-role gender bias; (3) all models collapse to narrow cultural representations, including RLHF-aligned Gemini.
Conclusion: T2I-BiasBench provides standardized, fine-grained bias evaluation for generative models, revealing that current alignment techniques don’t resolve cultural coverage gaps in text-to-image generation.
Abstract: Text-to-image (T2I) generative models achieve impressive visual fidelity but inherit and amplify demographic imbalances and cultural biases embedded in training data. We introduce T2I-BiasBench, a unified evaluation framework of thirteen complementary metrics that jointly captures demographic bias, element omission, and cultural collapse in diffusion models - the first framework to address all three dimensions simultaneously. We evaluate three open-source models - Stable Diffusion v1.5, BK-SDM Base, and Koala Lightning - against Gemini 2.5 Flash (RLHF-aligned) as a reference baseline. The benchmark comprises 1,574 generated images across five structured prompt categories. T2I-BiasBench integrates six established metrics with seven additional measures: four newly proposed (Composite Bias Score, Grounded Missing Rate, Implicit Element Missing Rate, Cultural Accuracy Ratio) and three adapted (Hallucination Score, Vendi Score, CLIP Proxy Score). Three key findings emerge: (1) Stable Diffusion v1.5 and BK-SDM exhibit bias amplification (>1.0) in beauty-related prompts; (2) contextual constraints such as surgical PPE substantially attenuate professional-role gender bias (Doctor CBS = 0.06 for SD v1.5); and (3) all models, including RLHF-aligned Gemini, collapse to a narrow set of cultural representations (CAS: 0.54-1.00), confirming that alignment techniques do not resolve cultural coverage gaps. T2I-BiasBench is publicly released to support standardized, fine-grained bias evaluation of generative models. The project page is available at: https://gyanendrachaubey.github.io/T2I-BiasBench/
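The ">1.0 means amplification" criterion from the findings can be illustrated with a toy ratio: the majority-group share among generated images divided by that group's share in a reference distribution. This is a hypothetical stand-in for intuition only, not the benchmark's Composite Bias Score.

```python
def bias_amplification(gen_counts, ref_counts):
    """Toy bias-amplification ratio: > 1.0 means the model exaggerates
    the imbalance already present in the reference distribution."""
    gen_share = max(gen_counts.values()) / sum(gen_counts.values())
    ref_share = max(ref_counts.values()) / sum(ref_counts.values())
    return gen_share / ref_share

# Hypothetical counts for one prompt: a 60/40 reference skew becomes 90/10.
amp = bias_amplification({'a': 90, 'b': 10}, {'a': 60, 'b': 40})  # 1.5 > 1.0
```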
[210] SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
Junbin Su, Ziteng Xue, Shihui Zhang, Kun Chen, Weiming Hu, Zhipeng Zhang
Main category: cs.CV
TL;DR: SEATrack is a parameter-efficient multimodal tracker that addresses the performance-efficiency trade-off through cross-modal alignment and global fusion, achieving state-of-the-art results across RGB-T, RGB-D, and RGB-E tracking tasks.
Details
Motivation: Recent PEFT methods in multimodal tracking have sacrificed efficiency for performance gains, undermining the core promise of parameter-efficient fine-tuning. The paper aims to resolve this performance-efficiency dilemma by focusing on cross-modal alignment and efficient fusion.
Method: Proposes SEATrack with two key innovations: 1) AMG-LoRA for cross-modal alignment, combining Low-Rank Adaptation for domain adaptation with Adaptive Mutual Guidance to refine attention maps across modalities, and 2) Hierarchical Mixture of Experts (HMoE) for efficient global relation modeling instead of conventional local fusion.
Result: SEATrack achieves notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks.
Conclusion: The paper demonstrates that prioritizing cross-modal alignment and using global fusion mechanisms can effectively break the performance-efficiency trade-off in multimodal tracking, advancing parameter-efficient fine-tuning in vision-based multimodal applications.
Abstract: Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT’s efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.
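The LoRA half of AMG-LoRA follows the standard low-rank update, where a frozen weight W is adapted by a trainable product B @ A; a minimal sketch with toy shapes (the Adaptive Mutual Guidance part, which refines attention maps, is omitted here):

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Standard LoRA: the frozen W gets a low-rank additive update B @ A,
    so only the small A and B matrices are trained per modality branch."""
    return x @ (W + scale * (B @ A)).T

rng = np.random.default_rng(0)
d, r = 16, 2                      # hidden size and LoRA rank
W = rng.standard_normal((d, d))   # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))              # zero-init: the adapter starts as a no-op
x = rng.standard_normal((4, d))
out = lora_forward(x, W, A, B)    # equals x @ W.T until B is trained
```

The zero-initialized B is the usual LoRA trick that keeps the pretrained behavior intact at the start of fine-tuning.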
[211] From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
Jilong Zhu, Yang Feng
Main category: cs.CV
TL;DR: VIF framework uses Conditional Variational Autoencoder to model visual saliency as latent distribution, addressing visual attenuation in MLLMs for improved fine-grained perception.
Details
Motivation: MLLMs struggle with fine-grained perception tasks due to Visual Attenuation - where sparse fine-grained visual signals get suppressed by dominant textual tokens during network propagation, causing loss of focus in decision-making. Existing input-centric solutions don't address this intrinsic information loss mechanism.
Method: Proposes Variational Information Flow (VIF) framework using a probabilistic approach with Conditional Variational Autoencoder (CVAE) to model visual saliency relevant to question-answer pairs as a latent distribution. Designed as a plug-and-play module that can be integrated into existing MLLM architectures.
Result: Extensive evaluations across diverse benchmarks including General VQA, fine-grained perception, and visual grounding demonstrate competitive improvements over previous methods, validating VIF’s effectiveness in enhancing MLLMs’ fine-grained perception capabilities.
Conclusion: VIF successfully addresses the visual attenuation problem in MLLMs by modeling visual saliency probabilistically, leading to improved performance on fine-grained perception tasks without requiring architectural overhauls.
Abstract: While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a “loss of focus” during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.
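The CVAE at the core of VIF optimizes the usual conditional ELBO: a reconstruction term plus a KL term pulling the approximate posterior toward a standard normal prior. A minimal sketch of that generic objective (the conditioning on the question-answer pair is abstracted away, and this is the textbook loss, not necessarily the paper's exact formulation):

```python
import numpy as np

def cvae_elbo(x, x_recon, mu, logvar):
    """Negative ELBO sketch: MSE reconstruction + closed-form KL between
    N(mu, exp(logvar)) and the standard normal prior N(0, I)."""
    recon = np.mean((x - x_recon) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + kl

x = np.ones(4)
# Perfect reconstruction with a prior-matched posterior gives zero loss.
loss_perfect = cvae_elbo(x, x, np.zeros(4), np.zeros(4))
```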
[212] NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)
Guanyi Qin, Jie Liang, Bingbing Zhang, Lishen Qu, Ya-nan Guan, Hui Zeng, Lei Zhang, Radu Timofte, Jianhui Sun, Xinli Yue, Tao Shao, Huan Hou, Wenjie Liao, Shuhao Han, Jieyu Yuan, Chunle Guo, Chongyi Li, Zewen Chen, Yunze Liu, Jian Guo, Juan Wang, Yun Zeng, Bing Li, Weiming Hu, Hesong Li, Dehua Liu, Xinjie Zhang, Qiang Li, Li Yan, Wei Dong, Qingsen Yan, Xingcan Li, Shenglong Zhou, Manjiang Yin, Yinxiang Zhang, Hongbo Wang, Jikai Xu, Zhaohui Fan, Dandan Zhu, Wei Sun, Weixia Zhang, Kun Zhu, Nana Zhang, Kaiwei Zhang, Qianqian Zhang, Zhihan Zhang, William Gordon, Linwei Wu, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Cici Liu, Yaokun Shi
Main category: cs.CV
TL;DR: NTIRE 2026 challenge on professional image quality assessment using MLLMs to evaluate high-quality image pairs with comparative selection and interpretative reasoning.
Details
Motivation: Conventional IQA methods use scalar scores that struggle with subtle differences among high-quality images and lack reasoning capabilities to explain why one image is superior, limiting their usefulness for professional guidance.
Method: Established a benchmark challenge exploring MLLMs’ ability to mimic human expert cognition in evaluating high-quality image pairs. Participants addressed two objectives: comparative quality selection (identifying superior image) and interpretative reasoning (generating expert-level explanations).
Result: The challenge attracted nearly 200 registrations and over 2,500 submissions. Top-performing methods significantly advanced the state of the art in professional IQA.
Conclusion: MLLMs offer promising paradigm for professional image quality assessment, moving beyond scalar scores to provide comparative evaluation and reasoning capabilities that better mimic human expert judgment.
Abstract: In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.
[213] CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
Zhaoyang Jia, Naifu Xue, Zihan Zheng, Jiahao Li, Bin Li, Xiaoyi Zhang, Zongyu Guo, Yuan Zhang, Houqiang Li, Yan Lu
Main category: cs.CV
TL;DR: Lightweight diffusion codec for real-time compression using compression-oriented pre-training and convolutional architecture instead of transformers, achieving 60 FPS encoding and 85% bitrate reduction.
Details
Motivation: Scaling diffusion transformers for generative priors fails in real-time compression scenarios requiring lightweight models. Need to design efficient diffusion codecs that can operate in real-time while maintaining compression quality.
Method: Systematically analyzes two key questions: 1) whether diffusion pre-training benefits lightweight codecs (finds compression-oriented pre-training works better than generation-oriented), and 2) whether transformers are essential (finds lightweight convolutions suffice when paired with distillation). Builds one-step lightweight convolution diffusion codec with distillation and adversarial learning.
Result: Achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Reduces bitrate by 85% at comparable FID to MS-ILLM, bridging gap between generative compression and practical real-time deployment.
Conclusion: Lightweight diffusion codecs can achieve real-time performance with compression-oriented pre-training and convolutional architectures, making generative compression practical for deployment.
Abstract: Recent advanced diffusion methods typically derive strong generative priors by scaling diffusion transformers. However, scaling fails to generalize when adapted for real-time compression scenarios that demand lightweight models. In this paper, we explore the design of real-time and lightweight diffusion codecs by addressing two pivotal questions. First, does diffusion pre-training benefit lightweight diffusion codecs? Through systematic analysis, we find that generation-oriented pre-training is less effective at small model scales whereas compression-oriented pre-training yields consistently better performance. Second, are transformers essential? We find that while global attention is crucial for standard generation, lightweight convolutions suffice for compression-oriented diffusion when paired with distillation. Guided by these findings, we establish a one-step lightweight convolution diffusion codec that achieves real-time 60 FPS encoding and 42 FPS decoding at 1080p. Further enhanced by distillation and adversarial learning, the proposed codec reduces bitrate by 85% at a comparable FID to MS-ILLM, bridging the gap between generative compression and practical real-time deployment. Codes are released at https://github.com/microsoft/GenCodec/CoD_Lite
[214] MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Ruoxiang Huang, Zhen Yuan
Main category: cs.CV
TL;DR: MODIX is a training-free framework that dynamically adapts positional encoding granularity based on modality-specific information density to improve multimodal attention allocation in Vision-Language Models.
Details
Motivation: Current VLMs use uniform positional encoding for all tokens, ignoring variations in information density across modalities, leading to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented.
Method: MODIX dynamically adapts positional strides based on modality-specific contributions by jointly modeling intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores that rescale positional indices.
Result: Experiments across diverse architectures and benchmarks show MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions.
Conclusion: Positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling, with MODIX demonstrating that dynamic granularity allocation based on information density improves multimodal understanding.
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
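The index-rescaling idea can be sketched in a few lines: each token advances the positional index by a stride tied to its modality's information score, so informative modalities get finer positional granularity while redundant ones are compressed. The scoring and stride rule below are simplified assumptions for illustration, not MODIX's covariance-entropy formulation.

```python
import numpy as np

def rescale_positions(token_modality, scores):
    """Toy positional-index scaling: per-token strides come from a
    (hypothetical) per-modality information score, normalized so the
    overall index range stays comparable to uniform indexing."""
    strides = np.array([scores[m] for m in token_modality], dtype=float)
    strides = strides / strides.mean()        # keep the total span stable
    return np.cumsum(strides) - strides[0]    # indices start at 0

# 'v' = vision tokens, 't' = text tokens; the density scores are made up.
pos = rescale_positions(['v', 'v', 't', 't'], {'v': 2.0, 't': 1.0})
```

In this toy, higher-scoring vision tokens are spaced further apart in index space than the text tokens, which is the "positional granularity as a resource" intuition.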
[215] Cross-Attentive Multiview Fusion of Vision-Language Embeddings
Tomas Berriel Martins, Martin R. Oswald, Javier Civera
Main category: cs.CV
TL;DR: CAMFusion: A multiview transformer that cross-attends across vision-language descriptors from multiple viewpoints to create unified 3D instance embeddings, leveraging multiview consistency as self-supervision.
Details
Motivation: Existing approaches for lifting 2D vision-language models to 3D scenes are suboptimal, typically using simple averaging or heuristic single-view selection, resulting in poor 3D representations.
Method: Introduces a multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into unified per-3D-instance embeddings. Uses multiview consistency as self-supervision signal alongside standard supervised target-class loss.
Result: Consistently outperforms naive averaging or single-view descriptor selection, achieving state-of-the-art results on 3D semantic and instance classification benchmarks, including strong zero-shot performance on out-of-domain datasets.
Conclusion: CAMFusion effectively bridges 2D vision-language models to 3D understanding through cross-attentive multiview fusion and self-supervised multiview consistency, enabling better 3D scene understanding.
Abstract: Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
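The fusion step can be sketched as single-query attention pooling over the per-view descriptors, which reduces to a learned weighted average rather than the uniform average the paper improves upon. This is a simplification of CAMFusion's multiview transformer; the query and shapes here are our assumptions.

```python
import numpy as np

def fuse_multiview(descs, query):
    """Cross-attention pooling sketch: one per-instance query attends over
    the view descriptors and pools them into a single fused embedding."""
    scores = descs @ query / np.sqrt(len(query))   # scaled dot-product
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                # attention weights over views
    return w @ descs                               # convex combination of views

rng = np.random.default_rng(0)
descs = rng.standard_normal((5, 16))   # 5 views, 16-dim VL descriptors
query = rng.standard_normal(16)        # stand-in for a learned instance query
fused = fuse_multiview(descs, query)
```

Naive averaging is the special case of uniform weights; the attention weights are what let the model down-weight occluded or uninformative views.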
[216] Evolution-Inspired Sample Competition for Deep Neural Network Optimization
Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau
Main category: cs.CV
TL;DR: Natural Selection (NS) is an evolution-inspired optimization method that introduces competitive interactions into deep network training by grouping samples, computing competitive scores, and adaptively reweighting losses to address issues like class imbalance, hard sample learning, and noisy samples.
Details
Motivation: Conventional deep learning treats all samples uniformly, leading to problems like bias under class imbalance, insufficient learning of hard samples, and reinforcement of noisy samples. The paper aims to move beyond this oversimplified treatment by explicitly modeling competitive interactions among samples.
Method: NS assembles multiple samples into a composite image, rescales it to original input size for model inference, computes natural selection scores based on predictions to characterize each sample’s competitive variation within the group, then uses these scores to dynamically reweight sample-wise losses.
Result: Extensive experiments on 12 public datasets across four image classification tasks demonstrate effectiveness. NS is compatible with diverse network architectures and doesn’t depend on task-specific assumptions, showing strong generality and practical potential.
Conclusion: NS provides a simple yet effective evolution-inspired optimization method that introduces explicit competition-driven mechanisms into training, enabling more adaptive and balanced model optimization beyond uniform sample treatment.
Abstract: Conventional deep network training generally optimizes all samples under a largely uniform learning paradigm, without explicitly modeling the heterogeneous competition among them. Such an oversimplified treatment can lead to several well-known issues, including bias under class imbalance, insufficient learning of hard samples, and the erroneous reinforcement of noisy samples. In this work, we present Natural Selection (NS), a novel evolution-inspired optimization method that explicitly incorporates competitive interactions into deep network training. Unlike conventional sample reweighting strategies that rely mainly on predefined heuristics or static criteria, NS estimates the competitive status of each sample in a group-wise context and uses it to adaptively regulate its training contribution. Specifically, NS first assembles multiple samples into a composite image and rescales it to the original input size for model inference. Based on the resulting predictions, a natural selection score is computed for each sample to characterize its relative competitive variation within the constructed group. These scores are then used to dynamically reweight the sample-wise loss, thereby introducing an explicit competition-driven mechanism into the optimization process. In this way, NS provides a simple yet effective means of moving beyond uniform sample treatment and enables more adaptive and balanced model optimization. Extensive experiments on 12 public datasets across four image classification tasks demonstrate the effectiveness of the proposed method. Moreover, NS is compatible with diverse network architectures and does not depend on task-specific assumptions, indicating its strong generality and practical potential. The code will be made publicly available.
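The score-based reweighting can be illustrated with a toy sketch. The composite-image inference is abstracted into given per-sample correct-class confidences, and the scoring rule (softmax share of confidence within the group, with "losing" samples up-weighted) is our invention for illustration, not the paper's natural selection score.

```python
import numpy as np

def ns_reweighted_loss(losses, group_probs):
    """Toy competition-driven reweighting: a sample's score is its softmax
    share of confidence within the group; low-scoring (losing) samples get
    larger loss weights, normalized to mean 1."""
    score = np.exp(group_probs) / np.exp(group_probs).sum()
    weights = 1.0 - score                  # boost the competitive losers
    weights = weights / weights.mean()     # keep the loss scale comparable
    return float((weights * losses).mean())

losses = np.array([0.5, 2.0, 1.0])   # per-sample losses in one group
probs = np.array([0.9, 0.1, 0.5])    # correct-class confidence per sample
total = ns_reweighted_loss(losses, probs)
```

Because the hard sample (high loss, low confidence) gets the largest weight, the reweighted loss exceeds the plain mean, shifting optimization pressure toward under-learned samples.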
[217] Cross-Modal Knowledge Distillation for PET-Free Amyloid-Beta Detection from MRI
Francesco Chiumento, Julia Dietlmeier, Ronan P. Killeen, Kathleen M. Curran, Noel E. O’Connor, Mingming Liu
Main category: cs.CV
TL;DR: PET-guided knowledge distillation framework enables amyloid-β prediction from MRI alone without PET or clinical covariates at inference, using BiomedCLIP-based teacher model with cross-modal attention and contrastive learning.
Details
Motivation: Current amyloid-β detection requires costly and invasive PET imaging, limiting accessibility for population-level Alzheimer's screening. Need for PET-free methods using widely available MRI.
Method: Two-stage knowledge distillation: 1) Teacher model (BiomedCLIP-based) learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed negative sampling. 2) MRI-only student mimics teacher via feature-level and logit-level distillation.
Result: Achieved best AUC of 0.74 on OASIS-3 and 0.68 on ADNI across four MRI contrasts (T1w, T2w, FLAIR, T2*). Saliency analysis shows predictions focus on anatomically relevant cortical regions.
Conclusion: PET-guided knowledge distillation enables effective PET-free amyloid-β screening from MRI alone, maintaining interpretability and eliminating need for clinical variables or PET at inference.
Abstract: Detecting amyloid-β (Aβ) positivity is crucial for early diagnosis of Alzheimer’s disease but typically requires PET imaging, which is costly, invasive, and not widely accessible, limiting its use for population-level screening. We address this gap by proposing a PET-guided knowledge distillation framework that enables Aβ prediction from MRI alone, without requiring non-imaging clinical covariates or PET at inference. Our approach employs a BiomedCLIP-based teacher model that learns PET-MRI alignment via cross-modal attention and triplet contrastive learning with PET-informed (Centiloid-aware) online negative sampling. An MRI-only student then mimics the teacher via feature-level and logit-level distillation. Evaluated across four MRI contrasts (T1w, T2w, FLAIR, T2*) and two independent datasets, our approach demonstrates effective knowledge transfer (best AUC: 0.74 on OASIS-3, 0.68 on ADNI) while maintaining interpretability and eliminating the need for clinical variables. Saliency analysis confirms that predictions focus on anatomically relevant cortical regions, supporting the clinical viability of PET-free Aβ screening. Code is available at https://github.com/FrancescoChiumento/pet-guided-mri-amyloid-detection.
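The two-level distillation the student trains with is a standard combination: an MSE feature-matching term plus a temperature-softened KL term on logits. A generic sketch of that objective (the weighting, temperature, and exact losses here are conventional defaults, not necessarily the paper's):

```python
import numpy as np

def softmax(x, T=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return z / z.sum(axis=-1, keepdims=True)

def distill_loss(s_feat, t_feat, s_logit, t_logit, T=2.0, alpha=0.5):
    """Feature-level MSE + logit-level KL(teacher || student) at
    temperature T, with the usual T^2 gradient-scale correction."""
    feat_term = np.mean((s_feat - t_feat) ** 2)
    p_t, p_s = softmax(t_logit, T), softmax(s_logit, T)
    kl_term = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T
    return alpha * feat_term + (1 - alpha) * kl_term

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 8))   # batch of intermediate features
l = rng.standard_normal((4, 3))   # batch of logits
zero = distill_loss(f, f, l, l)   # a perfect student incurs zero loss
```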
[218] StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation
Yinxi He, Kang Liao, Chunyu Lin, Tianyi Wei, Yao Zhao
Main category: cs.CV
TL;DR: StructDiff is a single-scale diffusion framework for single-image generation that preserves structural layout and enables spatial control via adaptive receptive fields and 3D positional encoding, with LLM-based evaluation.
Details
Motivation: Existing single-image generation methods struggle to preserve structural layout for images with rigid objects or spatial constraints, and lack spatial controllability for guiding structure/placement of generated content.
Method: Uses adaptive receptive field module to maintain global/local distributions, incorporates 3D positional encoding as spatial prior for controlling positions/scale/local details, and proposes LLM-based evaluation criterion.
Result: Outperforms existing methods in structural consistency, visual quality, and spatial controllability; demonstrates broad applicability across text-guided generation, editing, outpainting, and paint-to-image tasks.
Conclusion: StructDiff effectively addresses structural preservation and spatial control challenges in single-image generation while introducing novel evaluation methods and showing strong performance across various applications.
Abstract: This paper introduces StructDiff, a generative framework based on a single-scale diffusion model for single-image generation. Single-image generation aims to synthesize diverse samples with similar visual content to the source image by capturing its internal statistics, without relying on external data. However, existing methods often struggle to preserve the structural layout, especially for images with large rigid objects or strict spatial constraints. Moreover, most approaches lack spatial controllability, making it difficult to guide the structure or placement of generated content. To address these challenges, StructDiff introduces an adaptive receptive field module to maintain both global and local distributions. Building on this foundation, StructDiff incorporates 3D positional encoding (PE) as a spatial prior, allowing flexible control over positions, scale, and local details of generated objects. To our knowledge, this spatial control capability represents the first exploration of PE-based manipulation in single-image generation. Furthermore, we propose a novel evaluation criterion for single-image generation based on large language models (LLMs). This criterion specifically addresses the limitations of existing objective metrics and the high labor costs associated with user studies. StructDiff also demonstrates broad applicability across downstream tasks, such as text-guided image generation, image editing, outpainting, and paint-to-image synthesis. Extensive experiments demonstrate that StructDiff outperforms existing methods in structural consistency, visual quality, and spatial controllability. The project page is available at https://butter-crab.github.io/StructDiff/.
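A 3D positional encoding over horizontal position, vertical position, and scale can be sketched with the familiar sinusoidal recipe, concatenated per location so the generator can be conditioned on where and how large an object should appear. The paper's exact PE formulation may differ; the dimensions and frequency base below are conventional assumptions.

```python
import numpy as np

def sinusoidal_pe_3d(x, y, s, dim=12):
    """Sinusoidal encoding of a (x, y, scale) triple: dim/3 channels per
    coordinate, alternating sin/cos over geometrically spaced frequencies."""
    d = dim // 3
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    enc = []
    for v in (x, y, s):
        enc += [np.sin(v * freqs), np.cos(v * freqs)]
    return np.concatenate(enc)

# Encode a hypothetical object at position (3, 5) with scale 1.5.
pe = sinusoidal_pe_3d(x=3.0, y=5.0, s=1.5, dim=12)
```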
[219] PDF-GS: Progressive Distractor Filtering for Robust 3D Gaussian Splatting
Kangmin Seo, MinKyu Lee, Tae-Young Kim, ByeongCheol Lee, JoonSeoung An, Jae-Pil Heo
Main category: cs.CV
TL;DR: PDF-GS enhances 3D Gaussian Splatting’s inherent ability to filter inconsistent signals through progressive multi-phase optimization, achieving robust, distractor-free 3D reconstructions without architectural changes.
Details
Motivation: Current 3D Gaussian Splatting methods assume full multi-view consistency in input images, making them sensitive to distractors (objects that violate this assumption) which cause visual artifacts in reconstructions.
Method: Progressive Distractor Filtering (PDF-GS) framework with multi-phase optimization: progressive filtering phases gradually remove distractors using discrepancy cues, followed by reconstruction phases that restore fine-grained, view-consistent details from purified Gaussian representations.
Result: Achieves robust, high-fidelity, distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Lightweight and easily adaptable to existing 3DGS frameworks with no architectural changes or additional inference overhead.
Conclusion: PDF-GS demonstrates that amplifying 3DGS’s inherent self-filtering property through progressive optimization enables state-of-the-art performance for robust 3D reconstruction in the presence of distractors.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled impressive real-time photorealistic rendering. However, conventional training pipelines inherently assume full multi-view consistency among input images, which makes them sensitive to distractors that violate this assumption and cause visual artifacts. In this work, we revisit an underexplored aspect of 3DGS: its inherent ability to suppress inconsistent signals. Building on this insight, we propose PDF-GS (Progressive Distractor Filtering for Robust 3D Gaussian Splatting), a framework that amplifies this self-filtering property through a progressive multi-phase optimization. The progressive filtering phases gradually remove distractors by exploiting discrepancy cues, while the following reconstruction phase restores fine-grained, view-consistent details from the purified Gaussian representation. Through this iterative refinement, PDF-GS achieves robust, high-fidelity, and distractor-free reconstructions, consistently outperforming baselines across diverse datasets and challenging real-world conditions. Moreover, our approach is lightweight and easily adaptable to existing 3DGS frameworks, requiring no architectural changes or additional inference overhead, leading to a new state-of-the-art performance. The code is publicly available at https://github.com/kangrnin/PDF-GS.
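The progressive filtering loop can be sketched as a per-pixel photometric-discrepancy mask that is tightened phase by phase, so the most view-inconsistent pixels drop out of supervision gradually. The quantile schedule and the choice of cue below are our assumptions, not the paper's exact rule.

```python
import numpy as np

def progressive_distractor_mask(renders, images, n_phases=3, q0=0.95):
    """Toy progressive filtering: in each phase, pixels whose render-vs-input
    discrepancy exceeds a (tightening) quantile threshold are excluded."""
    mask = np.ones_like(images, dtype=bool)
    for phase in range(n_phases):
        err = np.abs(renders - images)
        thr = np.quantile(err[mask], q0 - 0.05 * phase)  # tighten each phase
        mask &= err <= thr          # drop the most inconsistent pixels
    return mask

renders = np.zeros((10, 10))        # idealized view-consistent rendering
images = np.zeros((10, 10))
images[0, 0] = 10.0                 # one transient "distractor" pixel
mask = progressive_distractor_mask(renders, images)
```

The gradual schedule mirrors the paper's multi-phase idea: an aggressive one-shot threshold risks discarding fine static detail along with the distractors.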
[220] Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
Zijian Liu, Sihan Cao, Pengcheng Zheng, Kuien Liu, Caiyan Qin, Xiaolin Qin, Jiwei Wei, Chaoning Zhang
Main category: cs.CV
TL;DR: DTR is a training-free inference method that rebalances temporal attention in Video-LLMs to reduce hallucinations by addressing decoder-side temporal imbalance in evidence aggregation.
Details
Motivation: Video-LLMs suffer from hallucinations due to temporally imbalanced evidence aggregation, where models over-rely on limited portions of video frames. This temporal bias appears to be a persistent, model-specific structural issue rather than an input-dependent one.
Method: Decoder-side Temporal Rebalancing (DTR) - a training-free, layer-selective inference method that adaptively calibrates decoder-side visual attention in middle-to-late decoder layers. It rebalances temporal evidence allocation without altering visual encoding or requiring auxiliary models.
Result: DTR consistently improves hallucination robustness across diverse Video-LLM families while preserving competitive video understanding performance and maintaining high inference efficiency.
Conclusion: Addressing decoder-side temporal imbalance through attention rebalancing is an effective training-free approach to reduce hallucinations in Video-LLMs while maintaining understanding capabilities.
Abstract: Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
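The rebalancing idea can be sketched in a few lines: temper the frame-level attention distribution so the anchor frame loses relative mass and under-attended frames gain it. The power-tempering rule and the names here are illustrative assumptions, not DTR's actual calibration.

```python
import numpy as np

def rebalance_frame_attention(attn, tau=0.5):
    """Temper a frame-level attention distribution to curb anchor-frame
    dominance: raise each mass to the power tau (0 < tau < 1 flattens
    the distribution) and renormalize. Illustrative sketch only."""
    attn = np.asarray(attn, dtype=float)
    flat = attn ** tau
    return flat / flat.sum()

# The anchor frame (0.70) is softened; under-attended frames gain weight.
attn = np.array([0.70, 0.10, 0.10, 0.10])
out = rebalance_frame_attention(attn, tau=0.5)
```

In DTR this kind of calibration would be applied only in middle-to-late decoder layers, leaving visual encoding untouched.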
[221] ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On
Junseo Park, Hyeryung Jang
Main category: cs.CV
TL;DR: ART-VITON: A measurement-guided diffusion framework for virtual try-on that reformulates the problem as a linear inverse problem to preserve identity/background while eliminating boundary artifacts.
Details
Motivation: Current virtual try-on methods using latent diffusion models struggle with preserving non-try-on regions (identity and background) without creating boundary artifacts when using post-hoc replacement strategies.
Method: Reformulates VITON as a linear inverse problem using trajectory-aligned solvers with residual prior-based initialization and artifact-free measurement-guided sampling combining data consistency, frequency-level correction, and periodic standard denoising.
Result: Demonstrates effective preservation of identity and background, elimination of boundary artifacts, and improved visual fidelity and robustness on VITON-HD, DressCode, and SHHQ-1.0 datasets.
Conclusion: ART-VITON provides a robust solution for virtual try-on that maintains measurement consistency while achieving artifact-free synthesis, outperforming state-of-the-art baselines.
Abstract: Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.
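For a masked linear inverse problem, the data-consistency step has a simple closed form, sketched below. Variable names are illustrative, and ART-VITON enforces consistency progressively along the diffusion trajectory rather than as a single post-hoc swap.

```python
import numpy as np

def data_consistency(x_gen, y_obs, mask):
    """Measurement-consistency sketch for VITON as a linear inverse
    problem: the measurement keeps non-try-on pixels (mask == 1), so
    consistency means restoring those pixels from the observation while
    leaving the generated try-on region (mask == 0) untouched."""
    return mask * y_obs + (1.0 - mask) * x_gen

y = np.array([1.0, 2.0, 3.0, 4.0])   # observed person image
x = np.array([9.0, 9.0, 9.0, 9.0])   # current generated sample
m = np.array([1.0, 1.0, 0.0, 0.0])   # 1 = non-try-on region (preserve)
out = data_consistency(x, y, m)       # → [1., 2., 9., 9.]
```

Applying this abruptly at the end of sampling is exactly the post-hoc replacement that causes boundary artifacts; interleaving it with denoising steps is what smooths the transition.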
[222] ELoG-GS: Dual-Branch Gaussian Splatting with Luminance-Guided Enhancement for Extreme Low-light 3D Reconstruction
Yuhao Liu, Dingju Wang, Ziyang Zheng
Main category: cs.CV
TL;DR: ELoG-GS: A robust 3D reconstruction pipeline for extreme low-light environments using Gaussian Splatting with learning-based point cloud initialization and luminance-guided color enhancement.
Details
Motivation: Address the challenge of reconstructing high-quality 3D representations from degraded multi-view inputs in extreme low-light environments, which is crucial for real-world applications where lighting conditions are poor.
Method: Proposes Extreme Low-light Optimized Gaussian Splatting (ELoG-GS) that integrates learning-based point cloud initialization and luminance-guided color enhancement. Uses geometry-aware initialization and photometric adaptation strategies for stable Gaussian Splatting in challenging conditions.
Result: Achieved PSNR of 18.6626 and SSIM of 0.6855 on NTIRE Track 1 benchmark, significantly improving reconstruction quality over baselines with superior visual fidelity and geometric consistency.
Conclusion: ELoG-GS provides a practical solution for robust 3D reconstruction in real-world degraded scenarios, demonstrating effectiveness in extreme low-light conditions through integrated geometry and photometric adaptation strategies.
Abstract: This paper presents our approach to the NTIRE 2026 3D Restoration and Reconstruction Challenge (Track 1), which focuses on reconstructing high-quality 3D representations from degraded multi-view inputs. The challenge involves recovering geometrically consistent and photorealistic 3D scenes in extreme low-light environments. To address this task, we propose Extreme Low-light Optimized Gaussian Splatting (ELoG-GS), a robust low-light 3D reconstruction pipeline that integrates learning-based point cloud initialization and luminance-guided color enhancement for stable and photorealistic Gaussian Splatting. Our method incorporates both geometry-aware initialization and photometric adaptation strategies to improve reconstruction fidelity under challenging conditions. Extensive experiments on the NTIRE Track 1 benchmark demonstrate that our approach significantly improves reconstruction quality over the baselines, achieving superior visual fidelity and geometric consistency. The proposed method provides a practical solution for robust 3D reconstruction in real-world degraded scenarios. In the final testing phase, our method achieved a PSNR of 18.6626 and an SSIM of 0.6855 on the official platform leaderboard. Code is available at https://github.com/lyh120/FSGS_EAPGS.
[223] Spatial-Spectral Adaptive Fidelity and Noise Prior Reduction Guided Hyperspectral Image Denoising
Xuelin Xie, Xiliang Lu, Zhengshan Wang, Yang Zhang, Long Chen
Main category: cs.CV
TL;DR: A hyperspectral image denoising framework that balances data fidelity and noise prior modeling through adaptive weighting and comprehensive noise priors with fewer parameters.
Details
Motivation: Existing hyperspectral image denoising methods overemphasize intrinsic image priors while neglecting diverse noise assumptions and dynamic trade-offs between fidelity and priors, leading to suboptimal denoising performance.
Method: Proposes a denoising framework integrating noise prior reduction and spatial-spectral adaptive fidelity term, with an adaptive weight tensor to dynamically balance fidelity and prior regularization. Uses a fast pixel-wise model with representative coefficient total variation regularizer for mixed noise removal.
Result: Extensive experiments on simulated and real-world datasets show superior denoising performance while maintaining competitive computational efficiency compared to existing methods.
Conclusion: The proposed framework effectively handles various noise types while accurately capturing spectral low-rank structure and local smoothness of hyperspectral images through adaptive balancing of fidelity and prior terms.
Abstract: The core challenge of hyperspectral image denoising is striking the right balance between data fidelity and noise prior modeling. Most existing methods place too much emphasis on the intrinsic priors of the image while overlooking diverse noise assumptions and the dynamic trade-off between fidelity and priors. To address these issues, we propose a denoising framework that integrates noise prior reduction and a spatial-spectral adaptive fidelity term. This framework considers comprehensive noise priors with fewer parameters and introduces an adaptive weight tensor to dynamically balance the fidelity and prior regularization terms. Within this framework, we further develop a fast and robust pixel-wise model combined with the representative coefficient total variation regularizer to accurately remove mixed noise in HSIs. The proposed method not only efficiently handles various types of noise but also accurately captures the spectral low-rank structure and local smoothness of HSIs. An efficient optimization algorithm based on the alternating direction method of multipliers is designed to ensure stable and fast convergence. Extensive experiments on simulated and real-world datasets demonstrate that the proposed model achieves superior denoising performance while maintaining competitive computational efficiency.
[224] Efficient Semantic Image Communication for Traffic Monitoring at the Edge
Damir Assylbek, Nurmukhammed Aitymbetov, Marko Ristin, Dimitrios Zorbas
Main category: cs.CV
TL;DR: Two semantic image communication pipelines (MMSD and SAMR) for traffic monitoring that achieve 99%+ data reduction by transmitting semantic representations instead of raw pixels, with trade-offs between confidentiality/compression and visual quality.
Details
Motivation: Visual monitoring systems face strict communication constraints where transmitting full-resolution images is impractical. Visual data is often used for semantic understanding (object presence, spatial relationships, scene context) rather than pixel fidelity, creating opportunity for semantic compression.
Method: MMSD: Replaces original image with compact semantic representations (segmentation maps, edge maps, textual descriptions) and reconstructs scene at receiver using diffusion-based generative model. SAMR: Selectively suppresses non-critical image regions based on semantic importance before JPEG encoding, then restores missing content through generative inpainting. Both use asymmetric sender-receiver architecture with lightweight edge processing and server-side reconstruction.
Result: Achieves 99% (MMSD) and 99.1% (SAMR) average transmitted-data reduction. MMSD has lower payload than SPIC baseline with strong semantic consistency. SAMR provides better quality-compression trade-off than standard JPEG and SQ-GAN. Edge processing time: 15s for MMSD, 9s for SAMR on Raspberry Pi 5.
Conclusion: Semantic image communication pipelines enable efficient visual monitoring under communication constraints by transmitting semantic representations instead of raw pixels, with different approaches balancing confidentiality/compression versus visual quality.
Abstract: Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi 5, the edge-side processing time is about 15s for MMSD and 9s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.
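SAMR's suppress-then-encode step can be sketched as flattening low-importance regions to a constant before JPEG encoding, so the encoder spends almost no bits on them. The threshold and fill rule below are illustrative assumptions.

```python
import numpy as np

def suppress_noncritical(img, importance, thresh=0.5):
    """SAMR-style pre-encoding sketch: regions whose semantic importance
    falls below a threshold are flattened to a constant, making them
    near-free to JPEG-encode; the receiver later inpaints them."""
    keep = importance >= thresh
    out = img.astype(float).copy()
    fill = out[keep].mean() if keep.any() else 0.0
    out[~keep] = fill
    return out

img = np.array([[10.0, 20.0], [30.0, 40.0]])
imp = np.array([[0.9, 0.1], [0.2, 0.8]])     # e.g., vehicle vs. sky pixels
out = suppress_noncritical(img, imp)          # keeps 10 and 40, fills rest with 25
```

The constant-filled regions compress to almost nothing under JPEG's DCT coding, which is where the bulk of the bandwidth saving comes from.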
[225] GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning
Zhaochen Liu, Limeng Qiao, Guanglu Wan, Tingting Jiang
Main category: cs.CV
TL;DR: GeoAlign: A framework that dynamically aggregates multi-layer geometric features from 3D foundation models to improve spatial reasoning in multimodal LLMs, addressing task misalignment bias in static single-layer approaches.
Details
Motivation: Current MLLMs struggle with spatial reasoning despite good performance on other visual tasks. Existing approaches that inject geometric features from 3D foundation models use static single-layer extractions, which suffer from task misalignment bias - the geometric features evolve toward 3D pretraining objectives that may contradict MLLMs' heterogeneous spatial demands.
Method: Proposes GeoAlign framework that dynamically aggregates multi-layer geometric features. Constructs a hierarchical geometric feature bank and uses MLLM’s original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching suitable geometric features for each patch.
Result: Extensive experiments on VSI-Bench, ScanQA, and SQA3D show that the compact 4B model achieves state-of-the-art performance, even outperforming larger existing MLLMs.
Conclusion: Dynamic multi-layer geometric feature aggregation effectively addresses task misalignment bias and improves spatial reasoning in MLLMs, enabling compact models to outperform larger ones on spatial reasoning benchmarks.
Abstract: Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM’s original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model effectively achieves state-of-the-art performance, even outperforming larger existing MLLMs.
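The content-aware routing step can be sketched as top-k softmax mixing over a per-layer feature bank. The dot-product scorer, shapes, and names are assumptions for illustration, not GeoAlign's actual router.

```python
import numpy as np

def route_geometric_features(query, layer_bank, k=2):
    """Layer-wise sparse routing sketch: score each layer's geometric
    feature against a patch's visual-token query, keep the top-k layers,
    and mix them with softmax weights. layer_bank: (L, D), query: (D,)."""
    scores = layer_bank @ query                  # (L,) one score per layer
    top = np.argsort(scores)[-k:]                # indices of the top-k layers
    w = np.exp(scores[top] - scores[top].max())  # stable softmax over top-k
    w /= w.sum()
    return w @ layer_bank[top]                   # (D,) aggregated feature

bank = np.array([[1.0, 0.0],   # shallow layer: fine-grained structure
                 [0.0, 1.0],   # middle layer
                 [1.0, 1.0]])  # deep layer: global semantics
agg = route_geometric_features(np.array([1.0, 0.0]), bank, k=2)
```

Because routing is sparse and query-dependent, different patches can draw on different depths of the 3D foundation model, which is the point of the multi-layer design.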
[226] PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning
Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang, Pipei Huang
Main category: cs.CV
TL;DR: PromptEcho is a novel reward construction method for RL-based text-to-image models that uses frozen VLMs to compute token-level cross-entropy loss as reward, requiring no human annotation or reward model training.
Details
Motivation: Existing RL methods for improving text-to-image models face challenges: CLIP Score is too coarse-grained, while VLM-based reward models require costly human-annotated preference data and additional fine-tuning. There's a need for efficient, annotation-free reward signals that leverage existing VLM knowledge.
Method: PromptEcho computes token-level cross-entropy loss of a frozen VLM using the original prompt as label, directly extracting image-text alignment knowledge encoded during VLM pretraining. It’s deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available.
Result: PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp/+16.2pp net win rate) on two state-of-the-art T2I models (Z-Image and QwenImage-2512), with consistent gains on GenEval, DPG-Bench, and TIIFBench without task-specific training. Reward quality scales with VLM size.
Conclusion: PromptEcho provides an effective, annotation-free reward construction method for RL-based text-to-image models that leverages frozen VLMs’ pretrained knowledge, outperforming existing methods and scaling with VLM capabilities.
Abstract: Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
[227] Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang, Xinchao Wang
Main category: cs.CV
TL;DR: HyperSSM: A collaborative motion reasoning framework using hypergraph computation and state space models for robust multi-object tracking, addressing instability from noisy predictions and vulnerability under occlusion.
Details
Motivation: Existing motion estimation approaches for multi-object tracking suffer from instability due to noisy/probabilistic predictions and vulnerability under occlusion where trajectories fragment when visual cues disappear. There's a need for more robust motion reasoning that can handle these challenges.
Method: Proposes HyperSSM, integrating Hypergraph computation and State Space Model (SSM) for unified spatial-temporal reasoning. Hypergraph module captures spatial motion correlations through dynamic hyperedges, while SSM enforces temporal smoothness via structured state transitions. Objects with similar motion states mutually constrain and refine each other.
Result: Achieves state-of-the-art performance on four mainstream benchmarks (MOT17, MOT20, DanceTrack, SportsMOT) covering various motion patterns and scene complexities. Demonstrates robust and stable motion estimation across diverse tracking scenarios.
Conclusion: The collaborative reasoning framework effectively addresses limitations of existing motion estimation approaches by enabling joint inference among correlated objects, stabilizing noisy trajectories, and inferring plausible motion continuity during occlusion.
Abstract: Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when the target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robust and stable motion estimation. Extensive experiments on four mainstream and diverse benchmarks (MOT17, MOT20, DanceTrack, and SportsMOT) covering various motion patterns and scene complexities demonstrate that our approach achieves state-of-the-art performance across a wide range of tracking scenarios.
[228] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla
Main category: cs.CV
TL;DR: SceneCritic: A symbolic evaluator for floor-plan-level indoor scene layouts using a spatial ontology (SceneOnto) to verify semantic, orientation, and geometric coherence, outperforming VLM-based evaluators and enabling iterative refinement with different critic modalities.
Details
Motivation: Current evaluation of indoor scene generation by LLMs/VLMs relies on rendered view scoring by other LLMs/VLMs, making judgments unstable due to viewpoint sensitivity, prompt phrasing issues, and hallucination problems. There's a need for stable, objective evaluation of spatial plausibility in scene layouts.
Method: Developed SceneCritic, a symbolic evaluator based on SceneOnto - a structured spatial ontology constructed by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses the ontology to verify semantic, orientation, and geometric coherence across object relationships. Paired with an iterative refinement test bed using three critic modalities: rule-based (collision constraints), LLM (text layout), and VLM (rendered observations).
Result: SceneCritic aligns substantially better with human judgments than VLM-based evaluators; text-only LLMs outperform VLMs on semantic layout quality; image-based VLM refinement is most effective for semantic and orientation correction.
Conclusion: Symbolic evaluation via SceneCritic provides more stable and objective assessment of indoor scene layouts than LLM/VLM judges, with different critic modalities offering complementary strengths for iterative refinement of spatial structure generation.
Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic’s constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneOnto traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
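A toy version of the symbolic check: rules as (subject, relation, object) triples verified against object placements, returning the specific violations. Only one relation is implemented here; SceneOnto's actual constraint set (semantic, orientation, and geometric) is far richer, and the names are illustrative.

```python
def violated_rules(layout, rules):
    """Mini symbolic-evaluator sketch: `layout` maps object name to an
    (x, y, z) position; each rule is a (subject, relation, object) triple
    checked deterministically, with the failing triples returned so the
    generator knows exactly what to fix."""
    checks = {"above": lambda a, b: layout[a][2] > layout[b][2]}
    return [r for r in rules if not checks[r[1]](r[0], r[2])]

layout = {"lamp": (0, 0, 1.8), "table": (0, 0, 0.7), "rug": (1, 1, 0.0)}
rules = [("lamp", "above", "table"), ("rug", "above", "table")]
bad = violated_rules(layout, rules)   # only the rug rule fails
```

Because the check is deterministic over the layout itself, the verdict cannot vary with viewpoint or prompt phrasing the way a VLM judge's does.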
[229] OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner
Haoyang Jiang, Zekun Wang, Mingyang Yi, Xiuyu Li, Lanqing Hu, Junxian Cai, Qingbin Liu, Xi Chen, Ju Fan
Main category: cs.CV
TL;DR: A once-for-all compression framework for Diffusion Probabilistic Models that generates multiple subnetworks of different computational sizes in one training session, eliminating repeated compression for different devices.
Details
Motivation: DPMs have high computational costs that hinder practical deployment across diverse devices with varying resource constraints. Current approaches require repeated compression for each device, incurring significant training overhead.
Method: Proposes an OFA compression framework that yields different subnetworks in one-shot training. Uses channel importance-based allocation to construct subnetworks of specific sizes and a reweighting strategy to balance optimization across subnetworks.
Result: The approach produces compressed DPMs for various sizes with significantly lower training overhead while maintaining satisfactory performance compared to repeated individual compression.
Conclusion: The proposed OFA framework efficiently addresses the deployment challenge of DPMs across diverse devices by enabling one-shot compression that yields multiple subnetworks with different computational requirements.
Abstract: The Diffusion Probabilistic Model (DPM) achieves remarkable performance in image generation, while its increasing parameter size and computational overhead hinder its deployment in practical applications. To improve this, the existing literature focuses on obtaining a smaller model with a fixed architecture through model compression. However, in practice, DPMs usually need to be deployed on various devices with different resource constraints, which leads to multiple compression processes, incurring significant overhead for repeated training. To obviate this, we propose a once-for-all (OFA) compression framework for DPMs that yields different subnetworks with various computations in a one-shot training manner. The existing OFA framework typically involves massive subnetworks with different parameter sizes, while such a huge candidate space slows the optimization. Thus, we propose to restrict the candidate subnetworks with a certain set of parameter sizes, where each size corresponds to a specific subnetwork. Specifically, to construct each subnetwork with a given size, we gradually allocate the maintained channels by their importance. Furthermore, we propose a reweighting strategy to balance the optimization process of different subnetworks. Experimental results show that our approach can produce compressed DPMs for various sizes with significantly lower training overhead while achieving satisfactory performance.
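The importance-based allocation yields naturally nested subnetworks, sketched below in simplified form. Ranking channels once and truncating per budget is an assumption about the mechanism; the paper's gradual allocation and reweighting are more involved.

```python
import numpy as np

def nested_subnetworks(importance, sizes):
    """Channel-allocation sketch: rank channels by importance once, then
    carve a subnetwork per target size by keeping the top-k channels.
    Every smaller budget thus uses a subset of every larger one, so one
    trained supernet serves all deployment sizes."""
    order = np.argsort(importance)[::-1]           # most important first
    return {k: set(order[:k].tolist()) for k in sizes}

imp = np.array([0.1, 0.9, 0.4, 0.7])   # per-channel importance scores
subs = nested_subnetworks(imp, sizes=[1, 2, 4])
# subs[1] ⊂ subs[2] ⊂ subs[4]: deploy whichever fits the device budget
```

Restricting candidates to one subnetwork per size is what keeps the candidate space small enough for the one-shot optimization to converge.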
[230] Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining
Junfeng Xia, Wenhao Ye, Xuanye Pan, Xinke Shen, Mo Wang, Quanying Liu
Main category: cs.CV
TL;DR: Brain-DiT is a universal fMRI foundation model using diffusion transformers with metadata conditioning, pretrained on diverse brain states to learn multi-scale representations for various downstream tasks.
Details
Motivation: Current fMRI foundation models are limited by narrow brain state coverage and mismatched pretraining tasks, failing to learn generalized representations across diverse brain states like resting, task, naturalistic, disease, and sleep states.
Method: Brain-DiT uses metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT) on 349,898 fMRI sessions from 24 datasets spanning multiple brain states, enabling learning of multi-scale representations capturing both fine-grained functional structure and global semantics.
Result: Extensive evaluations on 7 downstream tasks show diffusion-based generative pretraining outperforms reconstruction or alignment methods, with metadata conditioning improving performance by disentangling neural dynamics from population variability. Different tasks prefer different representational scales.
Conclusion: Diffusion-based generative pretraining with metadata conditioning is an effective approach for building universal fMRI foundation models that can learn generalized representations across diverse brain states and tasks.
Abstract: Current fMRI foundation models primarily rely on a limited range of brain states and mismatched pretraining tasks, restricting their ability to learn generalized representations across diverse brain states. We present Brain-DiT, a universal multi-state fMRI foundation model pretrained on 349,898 sessions from 24 datasets spanning resting, task, naturalistic, disease, and sleep states. Unlike prior fMRI foundation models that rely on masked reconstruction in the raw-signal space or a latent space, Brain-DiT adopts metadata-conditioned diffusion pretraining with a Diffusion Transformer (DiT), enabling the model to learn multi-scale representations that capture both fine-grained functional structure and global semantics. Across extensive evaluations and ablations on 7 downstream tasks, we find consistent evidence that diffusion-based generative pretraining is a stronger proxy than reconstruction or alignment, with metadata-conditioned pretraining further improving downstream performance by disentangling intrinsic neural dynamics from population-level variability. We also observe that downstream tasks exhibit distinct preferences for representational scale: ADNI classification benefits more from global semantic representations, whereas age/sex prediction comparatively relies more on fine-grained local structure. Code and parameters of Brain-DiT are available at https://github.com/REDMAO4869/Brain-DiT.
[231] Risk-Calibrated Learning: Minimizing Fatal Errors in Medical AI
Abolfazl Mohammadi-Seif, Ricardo Baeza-Yates
Main category: cs.CV
TL;DR: Risk-Calibrated Learning addresses semantic incoherence in medical image classification by distinguishing between visually ambiguous errors and catastrophic structural errors using a clinical severity matrix, reducing critical error rates across multiple imaging modalities.
Details
Motivation: Deep learning models in medical imaging often make high-confidence but semantically incoherent errors (e.g., classifying malignant tumors as benign), which fundamentally differ from acceptable errors due to visual ambiguity. These catastrophic failures erode clinical trust and pose serious safety risks.
Method: Proposes Risk-Calibrated Learning that embeds a confusion-aware clinical severity matrix M into the optimization landscape to explicitly distinguish between visual ambiguity (fine-grained errors) and catastrophic structural errors. The method suppresses critical errors like false negatives without requiring complex architectural changes.
Result: Validated across four imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Achieved relative safety improvements ranging from 20.0% (breast histopathology) to 92.4% (prostate histopathology) compared to state-of-the-art baselines like Focal Loss, consistently reducing Critical Error Rate (CER).
Conclusion: Risk-Calibrated Learning offers a superior safety-accuracy trade-off across both CNN and Transformer architectures by explicitly addressing semantic incoherence in medical image classification, making deep learning models more clinically trustworthy.
Abstract: Deep learning models often achieve expert-level accuracy in medical image classification but suffer from a critical flaw: semantic incoherence. These high-confidence, semantically incoherent mistakes (e.g., classifying a malignant tumor as benign) fundamentally differ from acceptable errors that stem from visual ambiguity. Unlike safe, fine-grained disagreements, these fatal failures erode clinical trust. To address this, we propose Risk-Calibrated Learning, a technique that explicitly distinguishes between visual ambiguity (fine-grained errors) and catastrophic structural errors. By embedding a confusion-aware clinical severity matrix M into the optimization landscape, our method suppresses critical errors (false negatives) without requiring complex architectural changes. We validate our approach on four different imaging modalities: Brain Tumor MRI, ISIC 2018 (Dermoscopy), BreaKHis (Breast Histopathology), and SICAPv2 (Prostate Histopathology). Extensive experiments demonstrate that our Risk-Calibrated Loss consistently reduces the Critical Error Rate (CER) for all four datasets, achieving relative safety improvements ranging from 20.0% (on breast histopathology) to 92.4% (on prostate histopathology) compared to state-of-the-art baselines such as Focal Loss. These results confirm that our method offers a superior safety-accuracy trade-off across both CNN and Transformer architectures.
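The severity-matrix idea can be made concrete with a toy severity-weighted loss and the Critical Error Rate metric. The weighting scheme below is a hypothetical sketch, not the paper's exact formulation; `M[i, j]` is assumed to encode the clinical cost of predicting class `j` when the true class is `i`.

```python
import numpy as np

def risk_calibrated_loss(probs, labels, M):
    """Cross-entropy up-weighted by expected clinical severity.
    M[i, j] = cost of predicting j when the true class is i, so e.g.
    M[malignant, benign] >> M[benign, malignant]. (Illustrative sketch.)"""
    n = len(labels)
    ce = -np.log(probs[np.arange(n), labels] + 1e-12)
    expected_severity = (M[labels] * probs).sum(axis=1)  # per-sample risk
    return float(np.mean(ce * (1.0 + expected_severity)))

def critical_error_rate(preds, labels, M, threshold=1.0):
    """Fraction of samples whose confusion severity exceeds a threshold."""
    return float(np.mean(M[labels, preds] > threshold))
```

Under this weighting, probability mass placed on fatal confusions (e.g., malignant-to-benign) inflates the loss even when the top-1 prediction is correct.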
[232] AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
Zeheng Wang, Zitong Yu, Yijie Zhu, Bo Zhao, Haochen Liang, Taorui Wang, Wei Xia, Jiayu Zhang, Zhishu Liu, Hui Ma, Fei Ma, Qi Tian
Main category: cs.CV
TL;DR: AffectAgent: A multi-agent RAG framework for multimodal emotion recognition using specialized agents (query planner, evidence filter, emotion generator) optimized with MAPPO and enhanced with modality-balancing MoE and retrieval-augmented fusion.
Details
Motivation: Traditional LLM-based multimodal emotion recognition suffers from static parametric memory and hallucinations when interpreting nuanced affective states. Single-round retrieval-augmented generation struggles with modal ambiguity and capturing complex affective dependencies across modalities.
Method: Proposes AffectAgent with three specialized agents: query planner (retrieves cross-modal samples), evidence filter (assesses evidence), and emotion generator (generates predictions). Uses Multi-Agent Proximal Policy Optimization (MAPPO) with shared affective reward for end-to-end optimization. Introduces Modality-Balancing Mixture of Experts (MB-MoE) to regulate modality contributions and Retrieval-Augmented Adaptive Fusion (RAAF) for semantic completion under missing-modality conditions.
Result: Extensive experiments on MER-UniBench demonstrate superior performance across complex scenarios compared to existing methods.
Conclusion: AffectAgent effectively addresses modal ambiguity and affective dependency challenges in multimodal emotion recognition through collaborative multi-agent decision-making and advanced fusion techniques.
Abstract: LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.
[233] Scaling In-Context Segmentation with Hierarchical Supervision
T. Camaret Ndir, Marco Reisert, Robin T. Schirrmeister
Main category: cs.CV
TL;DR: PatchICL: A hierarchical framework for medical image segmentation that uses selective image patching with multi-level supervision to reduce computation by focusing only on informative anatomical regions.
Details
Motivation: Standard in-context learning methods for medical image segmentation rely on dense global cross-attention that scales poorly with image resolution. Recent localized attention approaches lack explicit supervision on region selection, leading to redundant computation in non-informative areas.
Method: Proposes PatchICL, a hierarchical framework combining selective image patching with multi-level supervision. The approach learns to actively identify and attend only to the most informative anatomical regions through a structured patching mechanism.
Result: Achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at 512×512 resolution compared to UniverSeg baseline. On 35 out-of-domain datasets spanning diverse imaging modalities, outperforms baseline on 6 of 13 modality categories, particularly strong on modalities with localized pathology like OCT and dermoscopy.
Conclusion: PatchICL provides an efficient framework for medical image segmentation that reduces computational burden while maintaining accuracy by focusing attention on relevant anatomical regions through selective patching and supervision.
Abstract: In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at $512\times512$ resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at https://github.com/tidiane-camaret/ic_segmentation
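Selective patching can be caricatured as a score-and-keep step: partition the feature map into patches, score each, and attend only within the top-k. PatchICL trains its selector with explicit supervision; the L2-energy score below is merely a hypothetical stand-in to show the mechanics.

```python
import numpy as np

def select_patches(feat, patch, k):
    """Keep the top-k non-overlapping patches of a 2D feature map by
    feature energy (a placeholder for PatchICL's learned, supervised
    selection). Returns (row, col) patch-grid coordinates."""
    ph, pw = feat.shape[0] // patch, feat.shape[1] // patch
    # Fold into (ph, patch, pw, patch) and sum energy inside each patch.
    scores = (feat ** 2).reshape(ph, patch, pw, patch).sum(axis=(1, 3))
    flat = np.argsort(-scores.ravel())[:k]
    return [tuple(int(v) for v in np.unravel_index(i, scores.shape)) for i in flat]
```

Cross-attention restricted to the selected patches is what yields the quadratic-to-local compute saving the paper reports at high resolution.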
[234] A Dataset and Evaluation for Complex 4D Markerless Human Motion Capture
Yeeun Park, Miqdad Naduthodi, Suryansh Kumar
Main category: cs.CV
TL;DR: A new dataset and benchmark for complex 4D markerless human motion capture addressing limitations of existing datasets by capturing realistic multi-person interactions, severe occlusions, and challenging dynamics with synchronized multi-view RGB/depth data and Vicon ground truth.
Details
Motivation: Marker-based MoCap systems have hardware limitations that restrict scalability, while existing markerless MoCap datasets lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, creating a domain gap for real-world deployment.
Method: Created a comprehensive MoCap dataset capturing single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges, and varying subject distances. Includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion from Vicon system, and corresponding SMPL/SMPL-X parameters.
Result: Benchmarking state-of-the-art markerless MoCap models shows substantial performance degradation under realistic conditions, exposing limitations of current approaches. Targeted fine-tuning improves generalization, validating the dataset’s realism and value for model development.
Conclusion: The dataset provides a rigorous foundation for advancing robust markerless 4D human motion capture by exposing critical gaps in existing models and enabling better evaluation and improvement of markerless MoCap systems for real-world applications.
Abstract: Marker-based motion capture (MoCap) systems have long been the gold standard for accurate 4D human modeling, yet their reliance on specialized hardware and markers limits scalability and real-world deployment. Advancing reliable markerless 4D human motion capture requires datasets that reflect the complexity of real-world human interactions. Yet, existing benchmarks often lack realistic multi-person dynamics, severe occlusions, and challenging interaction patterns, leading to a persistent domain gap. In this work, we present a new dataset and evaluation for complex 4D markerless human motion capture. Our proposed MoCap dataset captures both single and multi-person scenarios with intricate motions, frequent inter-person occlusions, rapid position exchanges between similarly dressed subjects, and varying subject distances. It includes synchronized multi-view RGB and depth sequences, accurate camera calibration, ground-truth 3D motion capture from a Vicon system, and corresponding SMPL/SMPL-X parameters. This setup ensures precise alignment between visual observations and motion ground truth. Benchmarking state-of-the-art markerless MoCap models reveals substantial performance degradation under these realistic conditions, highlighting limitations of current approaches. We further demonstrate that targeted fine-tuning improves generalization, validating the dataset’s realism and value for model development. Our evaluation exposes critical gaps in existing models and provides a rigorous foundation for advancing robust markerless 4D human motion capture.
[235] CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models
Yunkai Dang, Yizhu Jiang, Yifan Jiang, Qi Fan, Yinghuan Shi, Wenbin Li, Yang Gao
Main category: cs.CV
TL;DR: CLASP is a plug-and-play token reduction framework for Multimodal Large Language Models that uses class-adaptive layer fusion and dual-stage pruning to efficiently reduce visual token redundancy while maintaining performance.
Details
Motivation: MLLMs suffer from high computational overhead due to redundant visual token sequences. Existing approaches use single-layer ViT features and static pruning strategies that are brittle under diverse instructions and lack adaptability.
Method: CLASP uses class-adaptive layer fusion to construct category-specific visual representations from multi-layer vision features, then performs dual-stage pruning with attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage, enabling prompt-conditioned feature fusion and budget allocation.
Result: Extensive experiments show CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures, enabling aggressive yet robust visual token reduction.
Conclusion: CLASP provides an effective plug-and-play solution for reducing computational overhead in MLLMs through adaptive token reduction that maintains performance across diverse instructions and scenarios.
Abstract: Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.
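The pivot/completion split can be sketched as a two-stage greedy selection: keep the most attention-salient tokens first, then spend the remaining budget on tokens least similar to anything already kept. This is a loose illustration of the described dual-stage pruning, not CLASP's code; the cosine-redundancy criterion is an assumption.

```python
import numpy as np

def dual_stage_prune(tokens, saliency, budget, pivot_frac=0.5):
    """Stage 1: attention-salient pivot tokens (relevance).
    Stage 2: redundancy-aware completion tokens (coverage),
    greedily picking the token farthest (cosine) from the kept set."""
    n_pivot = max(1, int(budget * pivot_frac))
    order = np.argsort(-saliency)
    keep = list(order[:n_pivot])
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    candidates = list(order[n_pivot:])
    while len(keep) < budget and candidates:
        sims = normed[candidates] @ normed[keep].T    # (candidates, kept)
        redundancy = sims.max(axis=1)                 # closeness to kept set
        keep.append(candidates.pop(int(np.argmin(redundancy))))
    return sorted(int(i) for i in keep)
```

In the full method the budget split itself is prompt-conditioned; here `pivot_frac` is fixed for simplicity.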
[236] Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling
Huanzhen Wang, Ziheng Zhou, Zeng Tao, Aoxing Li, Yingkai Zhao, Yuxuan Lin, Yan Wang, Wenqiang Zhang
Main category: cs.CV
TL;DR: DuSE: A cognition-inspired dual-stream model for dynamic facial expression recognition that integrates semantic knowledge with visual processing, mimicking human brain mechanisms for emotion perception.
Details
Motivation: Existing vision-based dynamic emotion models neglect cognitive theories and emotion perception mechanisms. The paper aims to bridge the gap between machine and human emotion perception by incorporating neuro-cognitive principles.
Method: Proposes DuSE with two streams: 1) Hierarchical Temporal Prompt Cluster (HTPC) simulates cognitive priming effect using linguistic cues to modulate visual processing, 2) Latent Semantic Emotion Aggregator (LSEA) models knowledge integration process akin to Conceptual Act Theory, aggregating sensory inputs with learned conceptual knowledge.
Result: Extensive experiments on challenging in-the-wild benchmarks demonstrate state-of-the-art performance in dynamic facial expression recognition and enhanced model interpretability.
Conclusion: Emulating the brain’s strategies for emotion processing yields superior performance and interpretability, validating the cognition-centric approach for dynamic emotion modeling.
Abstract: The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain’s strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.
[237] Efficient Adversarial Training via Criticality-Aware Fine-Tuning
Wenyun Li, Zheng Zhang, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan
Main category: cs.CV
TL;DR: CAAT introduces criticality-aware adversarial training for Vision Transformers that selectively fine-tunes only robustness-critical parameters, achieving near-standard AT performance with dramatically reduced computational cost.
Details
Motivation: Vision Transformers scale well but their adversarial robustness doesn't scale proportionally with parameters. Standard adversarial training requires full model fine-tuning which is computationally prohibitive for large ViTs.
Method: CAAT identifies parameters most critical for adversarial robustness, then uses parameter-efficient fine-tuning (PEFT) to adjust only selected modules where critical parameters exceed a threshold, avoiding full model fine-tuning.
Result: CAAT incurs only a 4.3% decrease in adversarial robustness compared to plain adversarial training while tuning only ~6% of parameters, outperforming state-of-the-art lightweight AT methods on three adversarial learning datasets.
Conclusion: CAAT enables scalable adversarial training for large vision transformers by efficiently identifying and fine-tuning only robustness-critical parameters, paving the way for adversarial training at scale.
Abstract: Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale: e.g., compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.
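The module-selection step can be sketched as a thresholding rule over per-parameter criticality scores. The abstract does not specify the score, so adversarial-gradient magnitude is used here as a plausible stand-in, and both thresholds are placeholders.

```python
import numpy as np

def select_critical_modules(adv_grads, top_frac=0.06, module_min=0.1):
    """Pick modules to robustly fine-tune, in the spirit of CAAT.
    A parameter is 'critical' if its adversarial-gradient magnitude is in
    the global top `top_frac`; a module is selected when the share of its
    parameters that are critical exceeds `module_min`. (Hypothetical
    criticality score; the paper's measure may differ.)"""
    all_mags = np.abs(np.concatenate([g.ravel() for g in adv_grads.values()]))
    cutoff = np.quantile(all_mags, 1.0 - top_frac)
    return [name for name, g in adv_grads.items()
            if np.mean(np.abs(g) >= cutoff) > module_min]
```

Everything outside the selected modules stays frozen, which is where the roughly 6%-of-parameters figure comes from.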
[238] Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images
Haoyang Jiang, Mingyang Yi, Shaolei Zhang, Junxian Cai, Qingbin Liu, Xi Chen, Ju Fan
Main category: cs.CV
TL;DR: Reconstruction-based AI-generated image detectors are highly vulnerable to adversarial attacks that can reduce detection accuracy to near zero, revealing fundamental security limitations in current detection approaches.
Details
Motivation: The paper addresses the security vulnerabilities of reconstruction-based methods for detecting AI-generated images from diffusion models, which pose potential safety threats. The authors investigate how these detectors can be easily bypassed through adversarial perturbations.
Method: The authors systematically evaluate adversarial robustness of three representative reconstruction-based detectors across four diverse generative backbone models. They construct white-box adversarial attacks, test transferability between detectors, and assess common countermeasures against adversarial attacks.
Result: Adversarial perturbations cause detection accuracy to collapse to near zero. Attacks demonstrate strong transferability between different detectors, enabling black-box attacks. Standard defense methods provide limited mitigation due to low signal-to-noise ratio in attacked samples.
Conclusion: Reconstruction-based detectors have fundamental security limitations that make them vulnerable to adversarial attacks. The findings highlight the need to rethink existing detection strategies for AI-generated content.
Abstract: Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks demonstrate transferability; specifically, attacks crafted against one detector can be transferred to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defense methods against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.
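The white-box attacks in question belong to the standard L-infinity PGD family, which can be sketched generically. `grad_fn` is a placeholder for the gradient of a detector's "generated" score with respect to the input; descending that score pushes a generated image toward a "real" verdict.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10):
    """Generic L-infinity projected gradient descent (sketch of the
    attack family, not the paper's exact configuration)."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv - alpha * np.sign(g)        # descend the detector score
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # valid pixel range
    return x_adv
```

The eps-ball projection is what keeps the perturbation imperceptible while the detector's accuracy collapses.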
[239] Generative Anonymization in Event Streams
Adam T. Müller, Mihai Kocsis, Nicolaj C. Stache
Main category: cs.CV
TL;DR: First generative anonymization framework for event streams that preserves data utility while protecting privacy by synthesizing non-existent identities
Details
Motivation: Neuromorphic vision sensors in public spaces raise privacy concerns as Event-to-Video models can reconstruct identifiable human faces from event streams. Current obfuscation methods degrade data utility for downstream tasks.
Method: Bridges modality gap between asynchronous events and spatial generative models by projecting events into intermediate intensity representation, using pretrained models to synthesize realistic non-existent identities, and re-encoding features back into the neuromorphic domain.
Result: Method reliably prevents identity recovery from E2V reconstructions while preserving structural data integrity required for downstream vision tasks. Introduces novel synchronized real-world event and RGB dataset for evaluation.
Conclusion: First framework to resolve utility-privacy trade-off in neuromorphic vision through generative anonymization, enabling privacy-preserving deployment of event-based sensors in public spaces.
Abstract: Neuromorphic vision sensors offer low latency and high dynamic range, but their deployment in public spaces raises severe data protection concerns. Recent Event-to-Video (E2V) models can reconstruct high-fidelity intensity images from sparse event streams, inadvertently exposing human identities. Current obfuscation methods, such as masking or scrambling, corrupt the spatio-temporal structure, severely degrading data utility for downstream perception tasks. In this paper, to the best of our knowledge, we present the first generative anonymization framework for event streams to resolve this utility-privacy trade-off. By bridging the modality gap between asynchronous events and standard spatial generative models, our pipeline projects events into an intermediate intensity representation, leverages pretrained models to synthesize realistic, non-existent identities, and re-encodes the features back into the neuromorphic domain. Experiments demonstrate that our method reliably prevents identity recovery from E2V reconstructions while preserving the structural data integrity required for downstream vision tasks. Finally, to facilitate rigorous evaluation, we introduce a novel, synchronized real-world event and RGB dataset captured via precise robotic trajectories, providing a robust benchmark for future research in privacy-preserving neuromorphic vision.
[240] Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors
Feiyu Tan, Heran Yang, Qihong Duan, Kai Ye, Qi Xie, Deyu Meng
Main category: cs.CV
TL;DR: Proposes rotation group equivariant convolutions for image-to-image translation to preserve rotation symmetry, with a novel transformation learnable equivariant convolution that adaptively learns transformation groups.
Details
Motivation: Image-to-image translation lacks paired data and effective unsupervised learning frameworks; preserving intrinsic symmetry properties like rotation invariance can improve translation quality and generalization.
Method: Introduces rotation group equivariant convolutions for rotation equivariant I2I framework, plus transformation learnable equivariant convolutions (TL-Conv) that adaptively learn transformation groups with theoretical analysis of equivariance error.
Result: Extensive experiments across various I2I tasks demonstrate superior performance and effectiveness of the approach in enhancing generation quality.
Conclusion: Equivariant networks with symmetry priors significantly improve image-to-image translation, with TL-Conv providing adaptive transformation learning and theoretical guarantees.
Abstract: Image-to-image translation (I2I) is a fundamental task in computer vision, focused on mapping an input image from a source domain to a corresponding image in a target domain while preserving domain-invariant features and adapting domain-specific attributes. Despite the remarkable success of deep learning-based I2I approaches, the lack of paired data and of effective unsupervised learning frameworks still hinders their effectiveness. In this work, we address the challenge by incorporating transformation symmetry priors into image-to-image translation networks. Specifically, we introduce rotation group equivariant convolutions to achieve a rotation equivariant I2I framework, a novel contribution, to the best of our knowledge, along this research direction. This design ensures the preservation of rotation symmetry, one of the most intrinsic and domain-invariant properties of natural and scientific images, throughout the network. Furthermore, we conduct a systematic study of image symmetry priors on real datasets and propose novel transformation-learnable equivariant convolutions (TL-Conv) that adaptively learn transformation groups, enhancing symmetry preservation across diverse datasets. We also provide a theoretical analysis of the equivariance error of TL-Conv, proving that it maintains exact equivariance in continuous domains and providing a bound for the error in discrete cases. Through extensive experiments across a range of I2I tasks, we validate the effectiveness and superior performance of our approach, highlighting the potential of equivariant networks in enhancing generation quality and its broad applicability. Our code is available at https://github.com/tanfy929/Equivariant-I2I
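The simplest rotation group equivariant convolution is the lifting convolution over the cyclic group C4: correlate the image with the kernel in all four 90-degree orientations, so that rotating the input merely permutes and rotates the output channels. A minimal numpy sketch of that base case (TL-Conv's learnable groups go beyond it):

```python
import numpy as np

def conv2d(img, k):
    """Valid-mode 2D cross-correlation (no padding), for illustration."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def c4_lifting_conv(img, k):
    """Lift an image to the rotation group C4: one output channel per
    90-degree orientation of the same kernel. Rotating the input rotates
    and cyclically shifts these channels (the equivariance property)."""
    return [conv2d(img, np.rot90(k, r)) for r in range(4)]
```

The equivariance is exact on the pixel grid: channel r of the rotated input equals the 90-degree rotation of channel r-1 of the original input.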
[241] Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach
Adrien Dorise, Marjorie Bellizzi, Omar Hlimi
Main category: cs.CV
TL;DR: Lightweight convolutional network (ConvBEERS) for satellite image restoration outperforms traditional physical models with better quality and 41x faster processing, enabling onboard AI applications.
Details
Motivation: Satellite image restoration is crucial for both ground-based products and onboard AI applications, but traditional physical model pipelines are computationally intensive and too slow for onboard environments.
Method: Proposes ConvBEERS - a Convolutional Board-ready Embedded and Efficient Restoration model using a lightweight non-generative residual convolutional network trained on simulated satellite data.
Result: Achieves +6.9dB PSNR improvement on simulated datasets and real Pleiades-HR imagery, improves downstream object detection by up to +5.1% mAP@50, and reduces latency by ~41x on FPGA deployment.
Conclusion: Lightweight CNNs can achieve competitive restoration quality while addressing real-world constraints in spaceborne systems, enabling practical onboard processing.
Abstract: Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions. Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.
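The +6.9dB figure refers to peak signal-to-noise ratio, whose standard definition is a one-liner (this is the textbook metric, not the paper's evaluation code):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in MSE, a +6.9dB gain corresponds to roughly a 5x reduction in mean squared error.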
[242] Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models
Iman Islam, Bram Ruijsink, Andrew J. Reader, Andrew P. King
Main category: cs.CV
TL;DR: A study on deep learning robustness to labeling errors in echocardiography segmentation, comparing loss-based vs. VOG error detection methods and proposing pseudo-labeling refurbishment.
Details
Motivation: Medical image segmentation relies on ground truth labels that can contain random errors or systematic biases, which may affect model performance. The study aims to evaluate model robustness to such errors and develop strategies for detecting and correcting erroneous labels during training.
Method: Using the CAMUS echocardiography dataset, the researchers simulated three types of labeling errors. They compared two error detection methods: loss-based detection vs. Variance of Gradients (VOG). They also proposed a pseudo-labeling approach to refurbish suspected erroneous ground truth labels. Performance was assessed under varying error levels.
Result: VOG proved highly effective in flagging erroneous ground truth labels during training. Standard U-Net maintained strong performance under random label errors and moderate systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
Conclusion: The study demonstrates that while standard models show some robustness to labeling errors, systematic error detection methods like VOG combined with pseudo-labeling refurbishment can significantly improve performance, especially in high-error scenarios common in medical imaging.
Abstract: Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.
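As a rough illustration of the VOG idea (not the paper's implementation, which scores pixel-wise input gradients), one can track a per-sample gradient magnitude across training checkpoints, score each sample by its variance, and flag the highest-variance samples as label-error candidates for refurbishment:

```python
import numpy as np

def vog_scores(grad_history):
    """Per-sample Variance of Gradients across training checkpoints.
    grad_history: (checkpoints, n_samples) array of per-sample gradient
    magnitudes - a 1D proxy for the pixel-wise gradients used by real VOG."""
    return np.var(grad_history, axis=0)

def flag_suspects(grad_history, fraction=0.25):
    """Flag the highest-variance samples as label-error candidates, which
    could then be refurbished with pseudo-labels from the current model."""
    scores = vog_scores(grad_history)
    k = max(1, int(round(fraction * scores.size)))
    return set(np.argsort(scores)[-k:].tolist())

# 8 samples over 5 checkpoints; sample 3's gradients oscillate wildly -
# the signature of a label the model keeps disagreeing with.
history = np.full((5, 8), 0.1)
history[:, 3] = [0.1, 0.9, 0.05, 0.95, 0.1]
suspects = flag_suspects(history, fraction=0.125)
```

The intuition: a sample whose label contradicts the data distribution keeps pulling the model back and forth, so its gradients stay volatile long after clean samples have settled.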
[243] Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang, Yiwei Wei, Jiujiang Guo, Jiahuan Long, Tingsong Jiang, Wen Yao
Main category: cs.CV
TL;DR: First physically deployable adversarial attack framework against Vision-Language Models using controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes.
Details
Motivation: Existing adversarial studies focus almost exclusively on digital settings, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, investigating physical attacks is essential for assessing real-world security risks.
Method: Proposes Multimodal Semantic Lighting Attacks (MSLA) - a physically deployable adversarial attack framework that uses controllable adversarial lighting to disrupt multimodal semantic understanding. Attacks semantic alignment rather than only task-specific outputs.
Result: MSLA degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and VQA. Effective, transferable, and practically realizable in both digital and physical domains.
Conclusion: VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.
Abstract: Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.
[244] PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
Xuan Wang, Kai Ruan, Jiayi Han, Kaiyue Zhou, Gaoang Wang
Main category: cs.CV
TL;DR: PianoFlow: A flow-matching framework for audio-driven bimanual piano motion generation using MIDI priors during training, role-gated interaction for hand coordination, and autoregressive flow continuation for real-time long-sequence generation.
Details
Motivation: Existing methods for audio-driven piano motion generation lack symbolic priors, have inflexible interaction mechanisms, and are limited to short sequences due to computational constraints.
Method: Uses flow-matching framework with MIDI as privileged modality during training, asymmetric role-gated interaction module for cross-hand coordination, and autoregressive flow continuation for streaming generation of long sequences.
Result: Achieves superior quantitative and qualitative performance on PianoMotion10M dataset while accelerating inference by over 9× compared to previous methods.
Conclusion: PianoFlow enables precise, coordinated bimanual piano motion synthesis with real-time streaming capability for arbitrarily long sequences.
Abstract: Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9x compared to previous methods.
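The autoregressive chunk-continuation idea can be caricatured in a few lines: each new chunk is conditioned on the tail of the previous one so the stream joins without a seam. Everything here (chunk sizes, the random-walk "motion") is a toy stand-in, not the paper's flow-matching model:

```python
import numpy as np

def stream_motion(total_len, chunk=16, overlap=4, seed=0):
    """Toy streaming generation: produce motion in chunks, conditioning each
    new chunk on the previous chunk's tail for cross-chunk coherence."""
    rng = np.random.default_rng(seed)
    motion = rng.random(chunk)                 # first chunk (random stand-in)
    while motion.size < total_len:
        context = motion[-overlap:]            # tail of the previous chunk
        # the continuation starts from the context and drifts smoothly,
        # standing in for a flow model conditioned on that context
        steps = rng.normal(0.0, 0.01, chunk - overlap).cumsum()
        motion = np.concatenate([motion, context[-1] + steps])
    return motion[:total_len]

m = stream_motion(64)
```

Because every chunk boundary is anchored to the preceding context, the generated sequence has no discontinuity where chunks meet, which is the property the continuation scheme is designed to guarantee.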
[245] VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
Andrei Atanov, Jesse Allardice, Roman Bachmann, Oğuzhan Fatih Kar, R Devon Hjelm, David Griffiths, Peter Fu, Afshin Dehghan, Amir Zamir
Main category: cs.CV
TL;DR: VideoFlexTok introduces a coarse-to-fine video tokenizer that represents videos with variable-length token sequences, enabling more efficient video generation and longer video modeling compared to standard 3D grid tokenization.
Details
Motivation: Standard 3D grid video tokenization forces models to predict all low-level details regardless of video complexity, leading to high learning complexity. There's a need for more efficient video representations that adapt to downstream needs and enable longer video generation.
Method: VideoFlexTok uses a variable-length sequence of tokens structured coarse-to-fine: early tokens capture abstract semantics and motion, later tokens add fine details. A generative flow decoder enables reconstruction from any token count, allowing adaptive token usage.
Result: Achieves comparable generation quality (gFVD and ViCLIP Score) with 5x smaller models (1.1B vs 5.2B). Enables 10-second 81-frame video generation with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
Conclusion: VideoFlexTok provides more efficient video representation for generation tasks, enabling longer video modeling with reduced computational cost while maintaining quality.
Abstract: Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal. This requires the downstream model that consumes the tokens, e.g., a text-to-video model, to learn to predict all low-level details “pixel-by-pixel” irrespective of the video’s inherent complexity, leading to high learning complexity. We present VideoFlexTok, which represents videos with a variable-length sequence of tokens structured in a coarse-to-fine manner – where the first tokens (emergently) capture abstract information, such as semantics and motion, and later tokens add fine-grained details. The generative flow decoder enables realistic video reconstructions from any token count. This representation structure allows adapting the token count according to downstream needs and encoding videos longer than the baselines with the same budget. We evaluate VideoFlexTok on class- and text-to-video generative tasks and show that it leads to more efficient training compared to 3D grid tokens, e.g., achieving comparable generation quality (gFVD and ViCLIP Score) with a 5x smaller model (1.1B vs 5.2B). Finally, we demonstrate how VideoFlexTok can enable long video generation without prohibitive computational cost by training a text-to-video model on 10-second 81-frame videos with only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.
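Decoding from a variable-length token prefix, where early tokens carry coarse structure and later ones add detail, can be mimicked with truncated SVD; this is only an analogy for the coarse-to-fine property, not the paper's learned tokenizer or flow decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
video = rng.random((16, 64))  # 16 frames x 64-dim features, stand-in for a clip

# Singular components play the role of ordered tokens: the first few capture
# the dominant (coarse) structure, later ones add fine-grained residue.
U, S, Vt = np.linalg.svd(video, full_matrices=False)

def decode_from_prefix(k):
    """Reconstruct the 'video' using only the first k tokens/components."""
    return (U[:, :k] * S[:k]) @ Vt[:k]

errors = [np.linalg.norm(video - decode_from_prefix(k)) for k in (1, 4, 16)]
# error shrinks monotonically as more tokens are used; full count is lossless
```

The downstream benefit mirrors this ordering: a generative model can spend its budget on the informative early tokens and stop once the reconstruction is good enough for the task.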
[246] Towards Long-horizon Agentic Multimodal Search
Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen
Main category: cs.CV
TL;DR: LMM-Searcher: A novel multimodal deep search framework using file-based visual representation to handle long-horizon multimodal tasks efficiently by offloading visual assets to external files and using lightweight textual identifiers, enabling 100-turn search horizons with state-of-the-art performance.
Details
Motivation: Existing multimodal deep search agents struggle with managing heterogeneous information and high token costs over long horizons, suffering from context explosion or loss of crucial visual signals when handling multimodal inputs.
Method: Proposes a file-based visual representation mechanism that offloads visual assets to external file system and maps them to lightweight textual identifiers (UIDs), reducing context overhead while preserving multimodal information. Includes a fetch-image tool for progressive on-demand visual loading and a data synthesis pipeline for generating complex cross-modal multi-hop reasoning queries.
Result: Achieves state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, successfully scales to 100-turn search horizons, and shows strong generalizability across different base models.
Conclusion: LMM-Searcher effectively addresses long-horizon multimodal deep search challenges through file-based visual representation, enabling efficient management of multimodal information over extended interactions while maintaining strong performance.
Abstract: Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.
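The file-based visual representation reduces to a simple registry: images are offloaded and the agent's context carries only a short UID, with a fetch tool for on-demand loading. A toy sketch, where the class, the UID format, and the in-memory "file system" are all illustrative rather than the paper's design:

```python
import hashlib

class VisualStore:
    """Offload visual assets and hand the agent a lightweight UID instead
    of raw pixels, mirroring the file-based mechanism described above."""
    def __init__(self):
        self._files = {}

    def offload(self, image_bytes: bytes) -> str:
        # A short content hash serves as the UID; real systems would also
        # persist the bytes to disk rather than keep them in memory.
        uid = "img_" + hashlib.sha1(image_bytes).hexdigest()[:8]
        self._files[uid] = image_bytes
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """The agent calls this tool only when a step needs the pixels."""
        return self._files[uid]

store = VisualStore()
uid = store.offload(b"\x89PNG...fake image bytes")
# the agent's context now carries only the short UID string, not the image
```

The payoff is that a 100-turn trajectory accumulates a few bytes of identifier per image instead of thousands of vision tokens, while every image remains retrievable when reasoning demands it.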
[247] Representing 3D Faces with Learnable B-Spline Volumes
Prashanth Chandran, Daoye Wang, Timo Bolkart
Main category: cs.CV
TL;DR: CUBE is a novel geometric representation for human faces using B-spline volumes with learned high-dimensional control features, enabling continuous mapping from parametric domain to 3D space with local editing capabilities.
Details
Motivation: Existing B-spline representations with 3D control points lack expressivity for complex facial geometry. The authors aim to create a more expressive representation that maintains local support properties while enabling better 3D face reconstruction and registration.
Method: CUBE uses a lattice of high-dimensional control features instead of 3D control points. It employs a two-stage mapping: first blending control features using B-spline bases to create a feature vector (first three values define base mesh), then using an MLP to predict residual displacements for refined 3D coordinates. For reconstruction, CUBE is queried at coordinates from a fixed template mesh.
Result: CUBE achieves state-of-the-art scan registration results compared to recent baselines when trained with transformer-based encoders to predict control features from point clouds and monocular images.
Conclusion: CUBE provides an expressive geometric representation for human faces that combines B-spline volume benefits with learned features, enabling high-quality 3D reconstruction while maintaining local editing capabilities through its local support property.
Abstract: We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 x 8 x 8) of high-dimensional control features, increasing the model’s expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE’s control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.
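A 1D, degree-1 analogue of CUBE's first stage: high-dimensional control features are blended by B-spline bases (hat functions here, rather than the cubic bases a real volume would use), and the first values of the blended feature would give a base position before an MLP adds a residual. Purely illustrative:

```python
import numpy as np

def linear_bspline_blend(control_feats, t):
    """Blend a 1D lattice of high-dimensional control features at parameter
    t in [0, 1] with degree-1 (hat) bases - a 1D stand-in for CUBE's 3D
    volume of control features. Local support: only two features are touched."""
    n = control_feats.shape[0]
    x = t * (n - 1)
    i = min(int(np.floor(x)), n - 2)   # index of the left control feature
    w = x - i                          # blending weight for the right one
    return (1 - w) * control_feats[i] + w * control_feats[i + 1]

feats = np.arange(12.0).reshape(4, 3)  # 4 control features of dimension 3
blended = linear_bspline_blend(feats, 0.5)
# in CUBE, the first three values of the blended feature would define the
# base-mesh position and a small MLP would predict a residual displacement
```

The local-support property is visible in the sketch: editing one control feature only moves the surface in its neighborhood, which is what enables local face edits.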
[248] Don’t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi
Main category: cs.CV
TL;DR: P² (Perception Programs) is a training-free method that converts dense vision tool outputs into structured language summaries, enabling MLLMs to better utilize visual cues for perception tasks without model modifications.
Details
Motivation: Current MLLMs fail to effectively use vision tool outputs (depth, flow, correspondence) because dense pixel-level representations don't align with LLMs' language-native reasoning, leading to weak perception and reliance on language priors.
Method: P² rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. It’s training-free and model-agnostic, working without any model modifications or additional training.
Result: Across six perception tasks in BLINK, P² achieved large improvements: GPT-5 Mini accuracy increased from 41.35% to 86.47% on multi-view reasoning, 52.42% to 81.45% on relative depth, with 22% average gain. Smaller MLLMs (InternVL3.5-4B, Qwen3VL-4B) showed 15-40% absolute gains, surpassing prior tool-use methods.
Conclusion: The bottleneck in vision tool-augmented MLLMs is representation alignment, not more tools or larger models. P² demonstrates that converting tool outputs to language-native summaries enables MLLMs to effectively leverage visual cues for perception tasks.
Abstract: Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
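The core move, rewriting a dense tool output into a compact language-native summary, can be sketched for a depth map. The function name, the object masks, and the summary format below are invented for illustration and are not the paper's actual programs:

```python
import numpy as np

def summarize_depth(depth, masks):
    """Rewrite a dense depth map plus per-object masks into a short,
    structured text summary an LLM can parse directly - the spirit of P²,
    with a made-up output format."""
    lines = []
    for name, mask in masks.items():
        lines.append(f"{name}: mean depth {float(depth[mask].mean()):.2f} m")
    # an explicit ordinal relation is far easier for an LLM than raw pixels
    order = sorted(masks, key=lambda n: depth[masks[n]].mean())
    lines.append("nearest-to-farthest: " + " < ".join(order))
    return "\n".join(lines)

depth = np.array([[1.0, 1.0],
                  [3.0, 3.0]])                       # toy 2x2 depth map
masks = {"cup":   np.array([[True, True], [False, False]]),
         "chair": np.array([[False, False], [True, True]])}
summary = summarize_depth(depth, masks)
```

A few lines of text like this replace thousands of depth pixels, and the relative-depth judgment the model needs is stated outright instead of left implicit in a grayscale image.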
[249] A Sanity Check on Composed Image Retrieval
Yikun Liu, Jiangchao Yao, Weidi Xie, Yanfeng Wang
Main category: cs.CV
TL;DR: FISD: A new benchmark and evaluation framework for Composed Image Retrieval that addresses query ambiguity and enables multi-round interactive assessment.
Details
Motivation: Existing CIR benchmarks have limitations: they contain indeterminate queries (multiple candidate images meet criteria) and don't evaluate models in multi-round interactive scenarios, leading to inaccurate performance characterization.
Method: 1) FISD benchmark: Uses generative models to precisely control reference-target image pair variables across six dimensions without query ambiguity. 2) Multi-round agentic evaluation framework: Automatically probes models’ adaptation in interactive scenarios over successive query rounds.
Result: Extensive experiments show the value of the novel evaluation approach for typical CIR methods, providing more accurate performance characterization and realistic appraisal of practical efficacy.
Conclusion: The proposed FISD benchmark and multi-round evaluation framework address critical limitations in CIR assessment, enabling more accurate evaluation and better understanding of model capabilities in practical interactive applications.
Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image, and a relative caption that specifies the desired modification. Despite the rapid development of CIR models, their performance is not well characterized by existing benchmarks, which inherently contain indeterminate queries degrading the evaluation (i.e., multiple candidate images, rather than solely the target image, meet the query criteria), and have not considered their effectiveness in the context of the multi-round system. Motivated by this, we consider improving the evaluation procedure from two aspects: 1) we introduce FISD, a Fully-Informed Semantically-Diverse benchmark, which employs generative models to precisely control the variables of reference-target image pairs, enabling a more accurate evaluation of CIR methods across six dimensions, without query ambiguity; 2) we propose an automatic multi-round agentic evaluation framework to probe the potential of the existing models in the interactive scenarios. By observing how models adapt and refine their choices over successive rounds of queries, this framework provides a more realistic appraisal of their efficacy in practical applications. Extensive experiments and comparisons prove the value of our novel evaluation on typical CIR methods.
[250] M3D-Stereo: A Multiple-Medium and Multiple-Degradation Dataset for Stereo Image Restoration
Deqing Yang, Yingying Liu, Qicong Wang, Zhi Zeng, Dajiang Lu, Yibin Tian
Main category: cs.CV
TL;DR: M3D-Stereo is a stereo dataset for image restoration research with 7904 high-resolution image pairs across four degradation scenarios (underwater scatter, haze/fog, underwater low-light, haze low-light) with six progressive degradation levels each, providing aligned stereo pairs with pixel-wise consistent ground truths.
Details
Motivation: Existing image restoration datasets are limited to single degradation types or rely on synthetic data without stereo consistency, restricting applicability to real-world scenarios with complex physical degradations and severe information loss.
Method: Created a laboratory setup to collect stereo image pairs with controlled degradation levels across four scenarios, providing aligned stereo images with pixel-wise consistent clear ground truths for each degradation level.
Result: The dataset enables fine-grained evaluation of restoration methods with increasing degradation severity and supports two restoration tasks: single-level and mixed-level degradation, establishing a better controlled and more realistic benchmark.
Conclusion: M3D-Stereo provides a comprehensive stereo dataset for evaluating image restoration and stereo matching methods in complex degradation environments, addressing limitations of existing datasets.
Abstract: Image restoration under adverse conditions, such as underwater, haze or fog, and low-light environments, remains a highly challenging problem due to complex physical degradations and severe information loss. Existing datasets are predominantly limited to a single degradation type or heavily rely on synthetic data without stereo consistency, inherently restricting their applicability in real-world scenarios. To address this, we introduce M3D-Stereo, a stereo dataset with 7904 high-resolution image pairs for image restoration research acquired in multiple media with multiple controlled degradation levels. It encompasses four degradation scenarios: underwater scatter, haze/fog, underwater low-light, and haze low-light. Each scenario forms a subset, and is divided into six levels of progressive degradation, allowing fine-grained evaluations of restoration methods with increasing severity of degradation. Collected via a laboratory setup, the dataset provides aligned stereo image pairs along with their pixel-wise consistent clear ground truths. Two restoration tasks, single-level and mixed-level degradation, were performed to verify its validity. M3D-Stereo establishes a better controlled and more realistic benchmark to evaluate image restoration and stereo matching methods in complex degradation environments. It is made public under LGPLv3 license.
[251] Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation
Ahmet İnanç, Özgür Erkent
Main category: cs.CV
TL;DR: CTAB is a bidirectional module for radar-camera fusion that enables cross-task feature sharing between 3D detection and BEV segmentation in autonomous driving, improving segmentation performance while maintaining detection quality.
Details
Motivation: Current radar-camera fusion methods treat 3D detection and BEV segmentation as isolated tasks, missing opportunities for complementary information sharing. Detection features encode object-level geometry that could sharpen segmentation boundaries, while segmentation features provide dense semantic context that could anchor detection.
Method: Proposes CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. Integrated into a multi-task framework with Instance Normalization-based segmentation decoder and learnable BEV upsampling.
Result: On nuScenes dataset, CTAB improves segmentation on 7 classes over joint multi-task baseline while maintaining essentially neutral detection performance. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), achieves comparable mIoU while simultaneously providing 3D detection.
Conclusion: CTAB enables effective cross-task feature sharing between detection and segmentation in BEV space, providing a more detailed BEV representation for autonomous driving perception without compromising detection performance.
Abstract: Bird’s-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.
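The bidirectional exchange can be approximated with plain single-head cross-attention between the two branches' BEV tokens; the actual module uses multi-scale deformable attention, so this NumPy sketch only illustrates the feature-sharing pattern, with made-up token counts and dimensions:

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention: one task's BEV tokens
    attend to the other's (a plain stand-in for deformable attention)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key tokens
    return weights @ keys_values

rng = np.random.default_rng(0)
det_feats = rng.random((5, 8))   # detection-branch BEV tokens
seg_feats = rng.random((7, 8))   # segmentation-branch BEV tokens

# bidirectional bridge: each branch is enriched with the other's features
d2s = cross_attention(det_feats, seg_feats)   # detection attends to segmentation
s2d = cross_attention(seg_feats, det_feats)   # segmentation attends to detection
det_enriched = det_feats + d2s
seg_enriched = seg_feats + s2d
```

The residual form (`feats + attended`) keeps each branch's own features intact, so the bridge can only add cross-task context, which is consistent with the reported "essentially neutral" detection performance.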
[252] Pi-HOC: Pairwise 3D Human-Object Contact Estimation
Sravan Chittupalli, Ayush Jain, Dong Huang
Main category: cs.CV
TL;DR: Pi-HOC is a single-pass framework for dense 3D semantic contact prediction between all human-object pairs in images, improving accuracy and speed over existing methods while enabling downstream applications like mesh reconstruction and language-guided contact prediction.
Details
Motivation: Existing methods for human-object contact estimation are limited to single-human settings, require additional object geometry inputs, or scale poorly in multi-human scenarios. There's a need for an efficient, instance-aware framework that can handle multiple humans and objects simultaneously.
Method: Pi-HOC uses a single-pass framework that: 1) detects instances, 2) creates dedicated human-object tokens for each pair, 3) refines them with an InteractionFormer, and 4) uses a SAM-based decoder to predict dense contact on SMPL human meshes for each human-object pair.
Result: On MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. It also enables improved SAM-3D image-to-mesh reconstruction via test-time optimization and referential contact prediction from language queries without additional training.
Conclusion: Pi-HOC provides an efficient, accurate solution for dense 3D semantic contact prediction in multi-human scenarios, with strong performance improvements and practical applications in downstream tasks like mesh reconstruction and language-guided interaction understanding.
Abstract: Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.
[253] Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions
Ayce Idil Aytekin, Xu Chen, Zhengyang Shen, Thabo Beeler, Helge Rhodin, Rishabh Dabral, Christian Theobalt
Main category: cs.CV
TL;DR: GraG is a fast method for reconstructing dynamic 3D hand-object interactions from monocular video using a compact Sum-of-Gaussians representation, achieving 6.4x speedup over prior work with improved accuracy.
Details
Motivation: Current methods for reconstructing dynamic 3D hand-object interactions from monocular video use heavy neural representations that are computationally expensive and slow. There's a need for a faster, more efficient approach that maintains reconstruction quality.
Method: Uses a compact Sum-of-Gaussians (SoG) representation for efficient tracking. Initializes object pose/geometry using a video-adapted SAM3D pipeline, then converts to a lightweight SoG via subsampling. For hands, starts with off-the-shelf monocular pose initialization and refines with 2D joint and depth alignment losses, avoiding detailed 3D appearance models.
Result: Achieves 6.4x faster reconstruction than prior work on long sequences, improves object reconstruction by 13.4%, and reduces hand’s per-joint position error by over 65% on public benchmarks.
Conclusion: GraG demonstrates that efficient tracking of dynamic hand-object interactions can be achieved using compact representations like Sum-of-Gaussians, offering significant speed improvements while maintaining or improving reconstruction quality.
Abstract: We present Grasp in Gaussians (GraG), a fast and robust method for reconstructing dynamic 3D hand-object interactions from a single monocular video. Unlike recent approaches that optimize heavy neural representations, our method focuses on tracking the hand and the object efficiently, once initialized from pretrained large models. Our key insight is that accurate and temporally stable hand-object motion can be recovered using a compact Sum-of-Gaussians (SoG) representation, revived from classical tracking literature and integrated with generative Gaussian-based initializations. We initialize object pose and geometry using a video-adapted SAM3D pipeline, then convert the resulting dense Gaussian representation into a lightweight SoG via subsampling. This compact representation enables efficient and fast tracking while preserving geometric fidelity. For the hand, we adopt a complementary strategy: starting from off-the-shelf monocular hand pose initialization, we refine hand motion using simple yet effective 2D joint and depth alignment losses, avoiding per-frame refinement of a detailed 3D hand appearance model while maintaining stable articulation. Extensive experiments on public benchmarks demonstrate that GraG reconstructs temporally coherent hand-object interactions on long sequences 6.4x faster than prior work while improving object reconstruction by 13.4% and reducing hand’s per-joint position error by over 65%.
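To make the Sum-of-Gaussians idea concrete, here is a minimal sketch (not the paper's code) of an SoG shape model: a dense Gaussian cloud is subsampled into a handful of isotropic Gaussians whose summed density can be evaluated cheaply during tracking. The centers, sigmas, weights, and subsampling rate below are all illustrative.

```python
import numpy as np

def sog_density(points, centers, sigmas, weights):
    """Evaluate a Sum-of-Gaussians (SoG) shape model at query points.

    Each Gaussian is isotropic: weight * exp(-||x - c||^2 / (2 * sigma^2)).
    points:  (N, 3) query locations
    centers: (K, 3) Gaussian centers (e.g., subsampled from a dense model)
    """
    # (N, K) squared distances from every query point to every center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return (weights * np.exp(-d2 / (2.0 * sigmas**2))).sum(axis=1)

# Toy example: a "dense" Gaussian cloud subsampled into a compact SoG (K=10).
rng = np.random.default_rng(0)
dense_centers = rng.normal(size=(200, 3)) * 0.1   # stands in for a dense model
sog_centers = dense_centers[::20]                 # keep every 20th center
sigmas = np.full(len(sog_centers), 0.15)
weights = np.full(len(sog_centers), 1.0)

query = np.zeros((1, 3))                          # density near the object center
print(sog_density(query, sog_centers, sigmas, weights))
```

Because K is tiny, the per-frame cost of evaluating this density (and its gradients for pose optimization) is far lower than rendering a full dense Gaussian representation, which is the efficiency argument the abstract makes.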
[254] Task Alignment: A simple and effective proxy for model merging in computer vision
Pau de Jorge, César Roberto de Souza, Björn Michele, Mert Bülent Sarıyıldız, Philippe Weinzaepfel, Florent Perronnin, Diane Larlus, Yannis Kalantidis
Main category: cs.CV
TL;DR: This paper introduces a task alignment proxy to efficiently merge multi-task vision models with heterogeneous decoders, extending model merging beyond CLIP-based classification.
Details
Motivation: Current model merging evaluations are mostly limited to CLIP-based image classification with frozen decoders, but practical vision applications involve heterogeneous trainable decoders. The high cost of decoder training makes hyperparameter selection based on downstream performance impractical.
Method: The authors introduce a task alignment proxy that enables hyperparameter selection for model merging without full decoder training. By making hyperparameter optimization cheap, the proxy makes merging of models with heterogeneous decoders practical.
Result: The task alignment proxy reduces hyperparameter selection time by orders of magnitude while maintaining performance. This enables model merging for multi-task vision scenarios beyond simple CLIP-based classification.
Conclusion: The proposed task alignment proxy makes model merging practical for real-world vision applications with heterogeneous decoders, extending the applicability of model merging techniques to more challenging multi-task scenarios.
Abstract: Efficiently merging several models fine-tuned for different tasks, but stemming from the same pretrained base model, is of great practical interest. Despite extensive prior work, most evaluations of model merging in computer vision are restricted to image classification using CLIP, where different classification datasets define different tasks. In this work, our goal is to make model merging more practical and show its relevance on challenging scenarios beyond this specific setting. In most vision scenarios, different tasks rely on trainable and usually heterogeneous decoders. Differently from previous studies with frozen decoders, where merged models can be evaluated right away, the non-trivial cost of decoder training renders hyperparameter selection based on downstream performance impractical. To address this, we introduce the task alignment proxy, and show how it can be used to speed up hyperparameter selection by orders of magnitude while retaining performance. Equipped with the task alignment proxy, we extend the applicability of model merging to multi-task vision models beyond CLIP-based classification.
[255] Direct Discrepancy Replay: Distribution-Discrepancy Condensation and Manifold-Consistent Replay for Continual Face Forgery Detection
Tianshuo Zhang, Haoyuan Zhang, Siran Peng, Weisong Zhao, Xiangyu Zhu, Zhen Lei
Main category: cs.CV
TL;DR: A continual face forgery detection method that uses distribution discrepancy condensation and manifold-consistent replay to mitigate forgetting without storing raw historical data.
Details
Motivation: Existing continual face forgery detection methods struggle with limited memory budgets - storing historical samples inadequately covers forgery cues and risks identity privacy, while synthetic pseudo-forgeries remain tied to past decision boundaries.
Method: Proposes Distribution-Discrepancy Condensation (DDC) to model real-to-fake discrepancy via surrogate factorization in characteristic-function space, condensing it into a tiny bank of distribution discrepancy maps. Uses Manifold-Consistent Replay (MCR) to synthesize replay samples through variance-preserving composition of these maps with current-stage real faces.
Result: The framework consistently outperforms prior CFFD baselines, significantly mitigates catastrophic forgetting, operates under extremely small memory budgets without storing raw historical face images, and reduces identity leakage risk compared to selection-based replay.
Conclusion: The proposed approach effectively addresses continual face forgery detection challenges by focusing on distribution-level replay rather than sample-level replay, achieving better performance with improved privacy protection.
Abstract: Continual face forgery detection (CFFD) requires detectors to learn emerging forgery paradigms without forgetting previously seen manipulations. Existing CFFD methods commonly rely on replaying a small amount of past data to mitigate forgetting. Such replay is typically implemented either by storing a few historical samples or by synthesizing pseudo-forgeries from detector-dependent perturbations. Under strict memory budgets, the former cannot adequately cover diverse forgery cues and may expose facial identities, while the latter remains strongly tied to past decision boundaries. We argue that the core role of replay in CFFD is to reinstate the distributions of previous forgery tasks during subsequent training. To this end, we directly condense the discrepancy between real and fake distributions and leverage real faces from the current stage to perform distribution-level replay. Specifically, we introduce Distribution-Discrepancy Condensation (DDC), which models the real-to-fake discrepancy via a surrogate factorization in characteristic-function space and condenses it into a tiny bank of distribution discrepancy maps. We further propose Manifold-Consistent Replay (MCR), which synthesizes replay samples through variance-preserving composition of these maps with current-stage real faces, yielding samples that reflect previous-task forgery cues while remaining compatible with current real-face statistics. Operating under an extremely small memory budget and without directly storing raw historical face images, our framework consistently outperforms prior CFFD baselines and significantly mitigates catastrophic forgetting. Replay-level privacy analysis further suggests reduced identity leakage risk relative to selection-based replay.
[256] Distorted or Fabricated? A Survey on Hallucination in Video LLMs
Yiyang Huang, Yitian Zhang, Yizhou Wang, Mingyuan Zhang, Liang Shi, Huimin Zeng, Yun Fu
Main category: cs.CV
TL;DR: Survey paper analyzing hallucinations in Video Large Language Models (Vid-LLMs), categorizing them into dynamic distortion and content fabrication types, reviewing evaluation methods and mitigation strategies, and proposing future research directions.
Details
Motivation: Despite progress in video-language modeling, hallucinations remain a persistent challenge in Vid-LLMs, where outputs appear plausible but contradict input video content. There's a need for a systematic understanding and taxonomy to address this issue.
Method: Presents a comprehensive survey with a systematic taxonomy categorizing hallucinations into two core types: dynamic distortion (with two subtypes) and content fabrication (with two subtypes). Reviews evaluation benchmarks, metrics, and intervention strategies, and analyzes root causes.
Result: Identifies that dynamic distortion often results from limited temporal representation capacity, while content fabrication stems from insufficient visual grounding. Provides insights into current evaluation methods and mitigation approaches.
Conclusion: Consolidates scattered progress to foster systematic understanding of hallucinations in Vid-LLMs, laying groundwork for robust video-language systems. Proposes future directions including motion-aware visual encoders and counterfactual learning techniques.
Abstract: Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at https://github.com/hukcc/Awesome-Video-Hallucination .
[257] Boosting Visual Instruction Tuning with Self-Supervised Guidance
Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris
Main category: cs.CV
TL;DR: V-GIFT improves MLLMs’ visual reasoning by augmenting instruction tuning with self-supervised tasks reformulated as natural language instructions, forcing models to rely on visual evidence rather than language priors.
Details
Motivation: MLLMs struggle with vision-centric tasks requiring fine-grained visual reasoning, not due to weak visual representations but because instruction tuning allows models to rely too heavily on language priors rather than visual evidence.
Method: Reformulates classical self-supervised pretext tasks (rotation prediction, color matching, cross-view correspondence) as image-instruction-response triplets and injects a small fraction (3-10%) of these visually grounded instructions into the training data without architectural changes or additional training stages.
Result: Consistent performance improvements across multiple models, training regimes, and benchmarks on vision-centric evaluations by forcing models to use visual evidence rather than language shortcuts.
Conclusion: Visually grounded self-supervised tasks in instruction tuning serve as a powerful yet simple approach to enhance visual reasoning in MLLMs through adjustments to training data distribution.
Abstract: Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT
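The reformulation of a pretext task as an instruction sample can be sketched in a few lines. Below, classical rotation prediction becomes an image-instruction-response triplet; the exact instruction wording and label format here are hypothetical, not taken from the paper.

```python
import random
import numpy as np

def make_rotation_triplet(image, rng=random):
    """Turn rotation-prediction SSL into an instruction-tuning sample.

    Rotates the image by a random multiple of 90 degrees and phrases the
    pretext task as a natural-language instruction with a known answer,
    so the sample cannot be solved from language priors alone.
    """
    k = rng.randrange(4)                     # 0, 1, 2, or 3 quarter turns
    rotated = np.rot90(image, k=k)           # np.rot90 rotates counter-clockwise
    return {
        "image": rotated,
        "instruction": "By how many degrees has this image been rotated "
                       "clockwise: 0, 90, 180, or 270?",
        "response": f"{(360 - 90 * k) % 360} degrees",
    }

sample = make_rotation_triplet(np.arange(12).reshape(3, 4))
print(sample["instruction"], "->", sample["response"])
```

A small pool of such triplets, mixed at a 3-10% ratio into an ordinary visual instruction-tuning set, is the kind of data-distribution adjustment the abstract describes; no labels or architecture changes are needed.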
[258] AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation
Yubraj Bhandari, Lavsen Dahal, Paul Segars, Joseph Y. Lo
Main category: cs.CV
TL;DR: AbdomenGen: A sequential volume-conditioned diffusion framework for generating controllable abdominal anatomy with interpretable volume modulation using Volume Control Scalars.
Details
Motivation: Current systems for generating controlled, clinically meaningful anatomical variations in medical imaging research are limited, particularly for creating abdominal phantoms with specific organ size variations while maintaining anatomical coherence.
Method: Proposes AbdomenGen, a sequential volume-conditioned diffusion framework that uses Volume Control Scalars (VCS) to decouple organ size from body habitus. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve anatomical coherence while enabling independent multi-organ control.
Result: Achieves strong geometric fidelity across 11 abdominal organs (e.g., liver dice 0.83 ± 0.05), stable single-organ calibration over the [-3,+3] VCS range, and disentangled multi-organ modulation. Wasserstein-based VCS selection reduces the distributional distance of training data by 73.6% for a hepatomegaly cohort.
Conclusion: AbdomenGen enables calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies in medical imaging research.
Abstract: Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the \textbf{Volume Control Scalar (VCS)}, a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver dice $0.83 \pm 0.05$), stable single-organ calibration over $[-3,+3]$ VCS, and disentangled multi-organ modulation. To showcase clinical utility with a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces distributional distance of training data by 73.6% . These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.
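The "standardized residual" behind the Volume Control Scalar can be illustrated in a few lines: regress organ volume on a body-habitus covariate, then z-score the residuals so that organ size is expressed relative to body size. This is a sketch of the concept only; the paper's exact covariates and normalization may differ, and the volumes below are made up.

```python
import numpy as np

def fit_vcs(organ_vol, body_vol):
    """Compute a VCS-style standardized residual (illustrative).

    Regress organ volume on a body-habitus covariate, then express each
    organ's size as a z-scored residual, decoupled from overall body size.
    """
    # Least-squares line: organ_vol ~ a * body_vol + b
    a, b = np.polyfit(body_vol, organ_vol, deg=1)
    resid = organ_vol - (a * body_vol + b)
    return resid / resid.std()

body = np.array([50.0, 60.0, 70.0, 80.0, 90.0])   # hypothetical habitus covariate
liver = np.array([1.4, 1.6, 1.9, 2.0, 2.3])       # hypothetical liver volumes (L)
vcs = fit_vcs(liver, body)
print(np.round(vcs, 2))
```

A positive VCS then means "large liver for this body size" rather than "large liver in absolute terms", which is what makes the scalar an interpretable conditioning signal for the diffusion model.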
[259] Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei, Vicente Ordonez
Main category: cs.CV
TL;DR: HypoExplore is an agentic framework for neural architecture discovery using hypothesis-driven scientific inquiry with LLMs, evolutionary branching, and confidence tracking to discover and understand vision architectures.
Details
Motivation: Current neural architecture search methods often lack interpretability and systematic understanding of design principles. The authors aim to create a framework that not only discovers better architectures but also builds genuine understanding of the design space through hypothesis-driven scientific inquiry.
Method: Uses an LLM to generate hypotheses about neural architectures, maintaining a Trajectory Tree for lineage tracking and a Hypothesis Memory Bank with confidence scores. Employs evolutionary branching with a dual strategy balancing exploitation of validated principles and resolution of uncertain ones. Multiple feedback agents analyze experimental results to update hypothesis confidence.
Result: Achieved 94.11% accuracy on CIFAR-10 (from 18.91% baseline), generalizes to CIFAR-100 and Tiny-ImageNet. Achieved state-of-the-art on MedMNIST. Hypothesis confidence scores become increasingly predictive with evidence accumulation, and learned principles transfer across independent evolutionary lineages.
Conclusion: HypoExplore successfully discovers strong vision architectures while building understanding of design space. The framework demonstrates that hypothesis-driven inquiry can lead to both performance improvements and genuine insights into neural architecture design principles.
Abstract: We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.
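One way to picture the Hypothesis Memory Bank's confidence tracking is a simple evidence-counting update, where each feedback agent casts a supports/refutes vote. The Beta-style counter below is a hypothetical stand-in for the paper's actual confidence mechanism.

```python
class HypothesisBank:
    """Toy hypothesis memory bank (illustrative, not the paper's design):
    each hypothesis carries a Beta-style confidence updated by agent votes.
    """
    def __init__(self):
        self.bank = {}                        # hypothesis -> [supports, refutes]

    def add(self, name):
        self.bank.setdefault(name, [1, 1])    # uniform prior

    def update(self, name, votes):
        """votes: list of booleans from independent feedback agents."""
        s, f = self.bank[name]
        self.bank[name] = [s + sum(votes), f + sum(not v for v in votes)]

    def confidence(self, name):
        s, f = self.bank[name]
        return s / (s + f)

bank = HypothesisBank()
h = "depthwise convs help at low parameter budgets"   # hypothetical hypothesis
bank.add(h)
bank.update(h, [True, True, False])                   # 2 agents support, 1 refutes
print(round(bank.confidence(h), 3))                   # -> 0.6
```

Confidence values like these can then drive the dual exploit/resolve selection strategy: build on high-confidence hypotheses, or run experiments that target the uncertain ones.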
[260] See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu
Main category: cs.CV
TL;DR: The paper presents an iterative refinement approach for pixel-precise cursor localization in coding environments, addressing the limitations of single-shot coordinate prediction in dense IDE interfaces.
Details
Motivation: Existing Computer Use Agents (CUAs) struggle with editing-level grounding in dense coding interfaces where sub-pixel accuracy is required. Single-shot coordinate prediction lacks error correction mechanisms and often fails in high-density interfaces like IDEs.
Method: Instead of single-step execution, the agent uses an iterative refinement process with visual feedback from previous attempts. This closed-loop grounding mechanism allows self-correction of displacement errors and adaptation to dynamic UI changes.
Result: Evaluation across GPT-5.4, Claude, and Qwen on complex coding benchmarks shows that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate.
Conclusion: Iterative visual reasoning is a critical component for the next generation of reliable software engineering agents, enabling more precise GUI grounding in dense coding environments.
Abstract: Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
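The closed-loop grounding idea can be sketched as a propose-observe-correct loop: the model proposes a click, "sees" a screenshot with its attempt marked, and corrects the displacement on the next turn. Everything below is a mock (the `propose` and `render_feedback` functions stand in for a real VLM and a real screenshot pipeline), meant only to show the control flow.

```python
def refine_click(propose, render_feedback, max_turns=5, tol=1.0):
    """Closed-loop GUI grounding (illustrative): propose a click, inspect the
    marked screenshot, and correct the displacement until it is within tol.

    propose(feedback)     -> (x, y) guess given the previous turn's feedback
    render_feedback(xy)   -> stands in for re-screenshotting with a click marker
    """
    feedback, xy = None, None
    for turn in range(1, max_turns + 1):
        xy = propose(feedback)
        feedback = render_feedback(xy)        # visual feedback on this attempt
        if feedback["error"] <= tol:          # close enough to the target
            return xy, turn
    return xy, max_turns

# Mock environment: a fixed target and a model that halves its error per turn.
target = (412.0, 97.0)

def render_feedback(xy):
    dx, dy = target[0] - xy[0], target[1] - xy[1]
    return {"dx": dx, "dy": dy, "error": (dx**2 + dy**2) ** 0.5}

def propose(feedback, state={"xy": (400.0, 90.0)}):
    if feedback is not None:                  # move half-way toward the marker
        x, y = state["xy"]
        state["xy"] = (x + 0.5 * feedback["dx"], y + 0.5 * feedback["dy"])
    return state["xy"]

xy, turns = refine_click(propose, render_feedback)
print(xy, "after", turns, "turns")
```

The key design point is that the loop never needs the model to be right in one shot: a model with a consistent, correctable bias still converges, which is why multi-turn refinement can beat single-shot grounding in dense IDE layouts.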
[261] Representation geometry shapes task performance in vision-language modeling for CT enterography
Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham
Main category: cs.CV
TL;DR: First study of vision-language transfer learning on abdominal CT enterography for IBD analysis, comparing slice aggregation methods, tissue contrast vs spatial coverage, and retrieval-augmented generation for report generation.
Details
Motivation: CT enterography is crucial for IBD assessment but optimal representational choices for automated analysis are unknown. The paper aims to establish baselines for vision-language systems in volumetric medical imaging.
Method: Used vision-language transfer learning on abdominal CT enterography, comparing slice embedding aggregation methods (mean vs attention pooling), tissue contrast strategies (multi-window RGB encoding vs multiplanar sampling), and report generation approaches (fine-tuning vs retrieval-augmented generation). Employed a three-teacher pseudolabel framework for comparisons without expert annotations.
Result: Mean pooling performs better for disease classification (59.2% three-class accuracy), while attention pooling performs better for cross-modal retrieval (0.235 text-to-image MRR). Multi-window RGB encoding outperformed spatial coverage strategies. RAG improved report generation by 7-14 percentage points above the chance baseline, reducing MAE from 0.98 to 0.80-0.89.
Conclusion: Provides first baselines for CT enterography analysis, showing slice aggregation methods emphasize different representation properties, tissue contrast matters more than spatial coverage, and RAG significantly improves report generation. Offers practical guidance for building vision-language systems in volumetric medical imaging.
Abstract: Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs.\ 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7–14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80–0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
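Multi-window RGB encoding is straightforward to sketch: clip the same Hounsfield-Unit slice to three complementary windows and stack them as color channels. The window centers/widths below are common radiology defaults chosen for illustration; the paper's exact windows may differ.

```python
import numpy as np

def hu_window(hu, center, width):
    """Clip a Hounsfield-Unit slice to one window and rescale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

def multi_window_rgb(hu_slice):
    """Stack three complementary HU windows as RGB channels (illustrative
    window settings, not necessarily the paper's).
    """
    soft_tissue = hu_window(hu_slice, center=40, width=400)
    lung        = hu_window(hu_slice, center=-600, width=1500)
    bone        = hu_window(hu_slice, center=400, width=1800)
    return np.stack([soft_tissue, lung, bone], axis=-1)   # (H, W, 3)

hu = np.array([[-1000.0, 0.0], [40.0, 1000.0]])           # toy 2x2 slice
rgb = multi_window_rgb(hu)
print(rgb.shape)   # (2, 2, 3)
```

The payoff is that a standard RGB vision encoder sees three tissue-specific contrasts of the same slice at once, which is the per-slice contrast signal the paper found more valuable than adding coronal/sagittal views.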
[262] Visual Preference Optimization with Rubric Rewards
Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu
Main category: cs.CV
TL;DR: rDPO is a preference optimization framework for multimodal tasks that uses instance-specific rubrics to create fine-grained preference data, improving visual reasoning performance over traditional methods.
Details
Motivation: Existing DPO methods rely on coarse preference data from off-policy perturbations or outcome-based signals that are inadequate for fine-grained visual reasoning tasks, necessitating a more nuanced approach.
Method: Proposes rDPO with instance-specific rubrics: for each image-instruction pair, creates checklist-style rubrics with essential and additional criteria to score responses. Builds an instruction-rubric pool offline and reuses it for on-policy data construction.
Result: Rubric-based prompting brings a 30B-A3B judge close to GPT-5.4; rubric-based filtering achieves an 82.69 macro average vs 75.82 for outcome-based filtering; rDPO scores 61.01 on a comprehensive benchmark, outperforming the style-constrained baseline (52.36) and the 59.48 base model.
Conclusion: Visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback, demonstrating the effectiveness of rubric-based approaches for multimodal tasks.
Abstract: The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
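The checklist-rubric idea reduces to scoring each response against weighted criteria and picking the best/worst pair for DPO. The sketch below is a hypothetical scoring rule and rubric (predicate functions over text), not the paper's implementation.

```python
def rubric_score(response, essential, additional, w_essential=2.0):
    """Score a response against an instance-specific checklist rubric.

    `essential` / `additional` are lists of predicate functions over the
    response text; essential criteria are weighted more heavily.
    Illustrative scoring rule, not the paper's exact one.
    """
    e = sum(check(response) for check in essential)
    a = sum(check(response) for check in additional)
    return w_essential * e + a

def preference_pair(responses, essential, additional):
    """Pick (chosen, rejected) for DPO from on-policy samples by rubric score."""
    ranked = sorted(responses, key=lambda r: rubric_score(r, essential, additional))
    return ranked[-1], ranked[0]

# Hypothetical rubric for "describe the traffic light in the image"
essential = [lambda r: "red" in r,              # names the light state
             lambda r: "intersection" in r]     # grounds it in the scene
additional = [lambda r: len(r.split()) <= 25]   # concise

responses = [
    "A red traffic light hangs over the intersection.",
    "There is probably a traffic light somewhere.",
]
chosen, rejected = preference_pair(responses, essential, additional)
print("chosen:", chosen)
```

Because the criteria are tied to the specific image-instruction pair, the resulting (chosen, rejected) pairs encode fine-grained visual differences rather than coarse outcome or style signals, which is the contrast the abstract draws with outcome-based filtering.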
[263] Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns
Baris Sarper Tezcan, Hrishikesh Viswanath, Rubab Saher, Daniel Aliaga
Main category: cs.CV
TL;DR: A generative inverse modeling framework using diffusion models to produce diverse vegetation patterns that achieve specific urban temperature reduction goals, addressing the underdetermined nature of urban climate adaptation.
Details
Motivation: Urban thermal extremes are worsening due to urbanization and climate change. While forward models can predict land surface temperature from vegetation patterns, the inverse problem—determining vegetation configurations that achieve desired temperature reductions—remains unexplored and is inherently underdetermined with multiple possible solutions.
Method: Proposes a conflated inverse modeling framework combining a predictive forward model with a diffusion-based generative inverse model. The framework generates diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals, maintaining control over thermal outcomes while enabling spatial diversity.
Result: The framework successfully produces diverse vegetation configurations that achieve specified temperature reduction targets, even when such combinations are absent from training data. It addresses the ambiguity in inverse urban climate problems where multiple spatial patterns can yield similar aggregated temperature responses.
Conclusion: Introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem, enabling generation of multiple plausible vegetation solutions for temperature mitigation in data-scarce conditions.
Abstract: Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.
[264] Generative Refinement Networks for Visual Synthesis
Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan
Main category: cs.CV
TL;DR: GRN introduces a new visual synthesis paradigm combining autoregressive modeling with hierarchical binary quantization and global refinement to overcome limitations of diffusion models and traditional AR approaches.
Details
Motivation: Diffusion models apply uniform computational effort regardless of content complexity, while autoregressive models suffer from lossy discrete tokenization and error accumulation. The authors aim to create a more efficient, complexity-aware visual generation system.
Method: GRN uses Hierarchical Binary Quantization (HBQ) for near-lossless discrete tokenization, builds AR generation on this latent space, adds a global refinement mechanism that progressively corrects errors, and incorporates entropy-guided sampling for adaptive-step generation.
Result: Achieves state-of-the-art results on ImageNet: 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation. Also scales successfully to text-to-image and text-to-video generation with superior performance.
Conclusion: GRN represents a next-generation visual synthesis paradigm that addresses key limitations of both diffusion and autoregressive models, offering efficient, complexity-aware generation with state-of-the-art performance across multiple visual generation tasks.
Abstract: While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ’s latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks – like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.
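The abstract does not spell out HBQ's construction; as a rough intuition for hierarchical binary quantization, here is a minimal residual sign-code quantizer (an assumption-laden sketch, not the paper's HBQ). Each level stores one scalar scale plus a ±1 code per element, and the reconstruction error shrinks as levels are added.

```python
import numpy as np

def hierarchical_binary_quantize(z, levels=12):
    """Toy residual binary quantizer: at each level, encode the sign of the
    current residual, reconstruct with a per-level scalar scale, and subtract.
    Returns the per-level (scale, code) pairs and the final reconstruction."""
    residual = np.asarray(z, dtype=np.float64).copy()
    recon = np.zeros_like(residual)
    codes = []
    for _ in range(levels):
        scale = np.abs(residual).mean()               # scalar scale for this level
        code = np.where(residual >= 0, 1.0, -1.0)     # binary (+/-1) code per element
        codes.append((scale, code))
        recon += scale * code
        residual -= scale * code
    return codes, recon

rng = np.random.default_rng(1)
z = rng.standard_normal(1024)
codes, recon = hierarchical_binary_quantize(z, levels=12)
err = np.abs(z - recon).mean()   # far smaller than the mean magnitude of z
```

Each added level binarizes what the previous levels left behind, which is one simple way a binary code hierarchy can approach near-lossless reconstruction.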
[265] Lyra 2.0: Explorable Generative 3D Worlds
Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren
Main category: cs.CV
TL;DR: Lyra 2.0 generates persistent 3D worlds by creating long, 3D-consistent video walkthroughs and lifting them to 3D using feed-forward reconstruction, addressing spatial forgetting and temporal drifting issues in current video models.
Details
Motivation: Current video generation models degrade when creating long camera trajectories with large viewpoint changes and location revisits needed for 3D scene creation, suffering from spatial forgetting (hallucinating structures when revisiting areas) and temporal drifting (accumulating synthesis errors).
Method: Uses per-frame 3D geometry for information routing to address spatial forgetting (retrieving relevant past frames and establishing dense correspondences), and trains with self-augmented histories to address temporal drifting (teaching the model to correct rather than propagate errors).
Result: Enables substantially longer and 3D-consistent video trajectories, which can be used to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
Conclusion: Lyra 2.0 provides a framework for generating persistent, explorable 3D worlds at scale by combining the strengths of video generation models with 3D reconstruction techniques.
Abstract: Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model’s temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing – retrieving relevant past frames and establishing dense correspondences with the target viewpoints – while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
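As a loose illustration of geometry-guided information routing (a toy heuristic under assumed per-frame point clouds and camera poses, not Lyra 2.0's actual mechanism), past frames can be scored by how many of their 3D points fall in front of and near the target viewpoint, and the top-scoring frames retrieved as context.

```python
import numpy as np

def retrieve_frames(past_points, target_pos, target_dir, k=2, max_range=10.0):
    """Score each past frame by the fraction of its stored 3D points that lie
    in front of the target camera (positive dot product with the view
    direction) and within max_range, then return the top-k frame indices."""
    target_dir = target_dir / np.linalg.norm(target_dir)
    scores = []
    for pts in past_points:
        rel = pts - target_pos
        dist = np.linalg.norm(rel, axis=1)
        in_front = (rel @ target_dir) > 0
        scores.append(float(np.mean(in_front & (dist < max_range))))
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return top, scores

rng = np.random.default_rng(0)
frames = [
    rng.normal([5.0, 0.0, 0.0], 0.5, size=(100, 3)),   # ahead of the target camera
    rng.normal([-5.0, 0.0, 0.0], 0.5, size=(100, 3)),  # behind it
    rng.normal([50.0, 0.0, 0.0], 0.5, size=(100, 3)),  # ahead but out of range
]
top, scores = retrieve_frames(frames, np.zeros(3), np.array([1.0, 0.0, 0.0]), k=1)
```

Only the frame whose geometry overlaps the target view is retrieved; frames behind the camera or out of range score zero, which is the kind of selectivity needed to revisit a region without hallucinating it.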
[266] Pictorial and apictorial polygonal jigsaw puzzles from arbitrary number of crossing cuts
Peleg Harel, Ofir Itzhak Shahar, Ohad Ben-Shahar
Main category: cs.CV
Summary unavailable: the arXiv API request for 2008.07644 returned HTTP 429 (rate limited).
[267] Subspace-Guided Feature Reconstruction for Unsupervised Anomaly Localization
Katsuya Hotta, Chao Zhang, Yoshihiro Hagihara, Takuya Akashi
Main category: cs.CV
Summary unavailable: the arXiv API request for 2309.13904 returned HTTP 429 (rate limited).
[268] Prompt Evolution for Generative AI: A Classifier-Guided Approach
Melvin Wong, Yew-Soon Ong, Abhishek Gupta, Kavitesh K. Bali, Caishun Chen
Main category: cs.CV
Summary unavailable: the arXiv API request for 2305.16347 returned HTTP 429 (rate limited).
[269] OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer
Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Wei Jing, Qi Yan, Qianying Wang, Yebin Liu, Hongwen Zhang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2405.20330 returned HTTP 429 (rate limited).
[270] SinkSAM-Net: Knowledge-Driven Self-Supervised Sinkhole Segmentation Using Topographic Priors and Segment Anything Model
Osher Rafaeli, Tal Svoray, Ariel Nahlieli
Main category: cs.CV
Summary unavailable: the arXiv API request for 2410.01473 returned HTTP 429 (rate limited).
[271] On Efficient Variants of Segment Anything Model: A Survey
Xiaorui Sun, Jun Liu, Heng Tao Shen, Xiaofeng Zhu, Ping Hu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2410.04960 returned HTTP 429 (rate limited).
[272] Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models
Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Shu-Tao Xia
Main category: cs.CV
Summary unavailable: the arXiv API request for 2501.13340 returned HTTP 429 (rate limited).
[273] Intelligent bear deterrence system based on computer vision: Reducing human-bear conflicts in remote areas
Pengyu Chen, Teng Fei, John A. Kupfer, Yunyan Du, Jiawei Yi, Yi Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2503.23178 returned HTTP 429 (rate limited).
[274] PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI Localization
Alexis Guichemerre, Soufiane Belharbi, Mohammadhadi Shateri, Luke McCaffrey, Eric Granger
Main category: cs.CV
Summary unavailable: the arXiv API request for 2503.24135 returned HTTP 429 (rate limited).
[275] Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Yu Zhang, Jialei Zhou, Xinchen Li, Qi Zhang, Zhongwei Wan, Tianyu Wang, Duoqian Miao, Changwei Wang, Longbing Cao
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.19261 returned HTTP 429 (rate limited).
[276] Navigating the Accuracy-Size Trade-Off with Flexible Model Merging
Akash Dhasade, Divyansh Jhunjhunwala, Milos Vujasinovic, Gauri Joshi, Anne-Marie Kermarrec
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.23209 returned HTTP 429 (rate limited).
[277] WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong
Main category: cs.CV
Summary unavailable: the arXiv API request for 2603.05295 returned HTTP 429 (rate limited).
[278] SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks
Zijian Song, Xiaoxin Lin, Qiuming Huang, Sihan Qin, Guangrun Wang, Liang Lin
Main category: cs.CV
Summary unavailable: the arXiv API request for 2506.14512 returned HTTP 429 (rate limited).
[279] DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation
Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Wooseong Jeong, Kuk-Jin Yoon
Main category: cs.CV
Summary unavailable: the arXiv API request for 2506.23104 returned HTTP 429 (rate limited).
[280] Habitat Classification from Ground-Level Imagery Using Deep Neural Networks
Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown
Main category: cs.CV
Summary unavailable: the arXiv API request for 2507.04017 returned HTTP 429 (rate limited).
[281] Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling
Tianyu Xie, Shuchen Xue, Zijin Feng, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Cheng Zhang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2505.17384 returned HTTP 429 (rate limited).
[282] A document is worth a structured record: Principled inductive bias design for document recognition
Benjamin Meyer, Lukas Tuggener, Sascha Hänzi, Daniel Schmid, Erdal Ayfer, Benjamin F. Grewe, Ahmed Abdulkadir, Thilo Stadelmann
Main category: cs.CV
Summary unavailable: the arXiv API request for 2507.08458 returned HTTP 429 (rate limited).
[283] HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals
Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee
Main category: cs.CV
Summary unavailable: the arXiv API request for 2506.08618 returned HTTP 429 (rate limited).
[284] Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification
Chang Peng, Bao Yang, Meiqi Li, Ge Zhang, Hui Sun, Zhenyu Jiang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2507.11081 returned HTTP 429 (rate limited).
[285] One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion
Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.04559 returned HTTP 429 (rate limited).
[286] Time-reversed Flow Matching with Worst Transport in High-dimensional Latent Space for Image Anomaly Detection
Liangwei Li, Lin Liu, Hanzhe Liang, Juanxiu Liu, Jing Zhang, Ruqian Hao, Xiaohui Du, Yong Liu, Pan Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.05461 returned HTTP 429 (rate limited).
[287] BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding
Xuan-Bac Nguyen, Thanh-Dat Truong, Pawan Sinha, Khoa Luu
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.18187 returned HTTP 429 (rate limited).
[288] IMU: Influence-guided Machine Unlearning
Xindi Fan, Jing Wu, Mingyi Zhou, Pengwei Liang, Mehrtash Harandi, Dinh Phung
Main category: cs.CV
Summary unavailable: the arXiv API request for 2508.01620 returned HTTP 429 (rate limited).
[289] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Changtao Miao, Huazhe Tan, Weibin Yao, Jianshu Li
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.10026 returned HTTP 429 (rate limited).
[290] Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts
Ruijia Li, Mingzi Zhang, Zengyi Yu, Yuang Wei, Bo Jiang
Main category: cs.CV
Summary unavailable: the arXiv API request for 2604.10200 returned HTTP 429 (rate limited).
[291] Causal Fingerprints of AI Generative Models
Hui Xu, Chi Liu, Congcong Zhu, Minghao Wang, Youyang Qu, Longxiang Gao
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.15406 returned HTTP 429 (rate limited).
[292] MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging
Kacper Marzol, Ignacy Kolton, Weronika Smolak-Dyżewska, Joanna Kaleta, Żaneta Świderska-Chadaj, Marcin Mazur, Mirosław Dziekiewicz, Tomasz Markiewicz, Przemysław Spurek
Main category: cs.CV
Summary unavailable: the arXiv API request for 2509.16806 returned HTTP 429 (rate limited).
[293] ASTRA: Let Arbitrary Subjects Transform in Video Editing
Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu, Jinhui Tang, Maocheng Zhao
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.01186 returned HTTP 429 (rate limited).
[294] Label-Efficient Cross-Modality Generalization for Liver Segmentation in Multi-Phase MRI
Quang-Khai Bui-Tran, Minh-Toan Dinh, Thanh-Huy Nguyen, Ba-Thinh Lam, Mai-Anh Vu, Ulas Bagci
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.04705 returned HTTP 429 (rate limited).
[295] Geometry Aware Cross-Modal Alignment for Light Field-LiDAR Semantic Segmentation
Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, Yihui Fan
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.06687 returned HTTP 429 (rate limited).
[296] Point Prompting: Counterfactual Tracking with Video Diffusion Models
Ayush Shrivastava, Sanyam Mehta, Daniel Geng, Andrew Owens
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.11715 returned HTTP 429 (rate limited).
[297] FaCT: Faithful Concept Traces for Explaining Neural Network Decisions
Amin Parchami-Araghi, Sukrut Rao, Jonas Fischer, Bernt Schiele
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.25512 returned HTTP 429 (rate limited).
[298] NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models
Nir Goren, Oren Katzir, Abhinav Nakarmi, Eyal Ronen, Mahmood Sharif, Or Patashnik
Main category: cs.CV
Summary unavailable: the arXiv API request for 2510.13793 returned HTTP 429 (rate limited).
[299] StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
Jiho Park, Sieun Choi, Jaeyoon Seo, Jihie Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.20093 returned HTTP 429 (rate limited).
[300] Visual Diffusion Models are Geometric Solvers
Nir Goren, Shai Yehezkel, Omer Dahary, Andrey Voynov, Or Patashnik, Daniel Cohen-Or
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.21697 returned HTTP 429 (rate limited).
[301] GroupKAN: Efficient Kolmogorov-Arnold Networks via Grouped Spline Modeling
Guojie Li, Tianyi Liu, Anwar P.P. Abdul Majeed, Muhammad Ateeq, Anh Nguyen, Fan Zhang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.05477 returned HTTP 429 (rate limited).
[302] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
Jiayuan Du, Yiming Zhao, Zhenglong Guo, Yong Pan, Wenbo Hou, Zhihui Hao, Kun Zhan, Qijun Chen
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2511.22039 returned HTTP 429 (rate limited).
[303] TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
Tao Wu, Li Yang, Gen Zhan, Yabin Zhang, Yiting Liao, Junlin Li, Deliang Fu, Li Zhang, Limin Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.03963 returned HTTP 429 (rate limited).
[304] SAM3-I: Segment Anything with Instructions
Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Wei Ji, Qi Bi, Yongri Piao, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Huchuan Lu, Li Cheng
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.04585 returned HTTP 429 (rate limited).
[305] Latent Chain-of-Thought World Modeling for End-to-End Driving
Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.10226 returned HTTP 429 (rate limited).
[306] VPTracker: Global Vision-Language Tracking via Visual Prompt
Jingchao Wang, Kaiwen Zhou, Zhijian Wu, Kunhua Ji, Dingjiang Huang, Yefeng Zheng
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.22799 returned HTTP 429 (rate limited).
[307] CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, Yiwei Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.00181 returned HTTP 429 (rate limited).
[308] Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes
Uma Meleti, Jeffrey J. Nirschl
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.02370 returned HTTP 429 (rate limited).
[309] AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting
Chao Li, Rui Zhang, Siyuan Huang, Xian Zhong, Hongbo Jiang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.04204 returned HTTP 429 (rate limited).
[310] DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles
Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.15355 returned HTTP 429 (rate limited).
[311] Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study
Sarah Müller, Philipp Berens
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.18502 returned HTTP 429 (rate limited).
[312] Vision Transformers Need More Than Registers
Cheng Shi, Yizhou Yu, Sibei Yang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2602.22394 returned HTTP 429 (rate limited).
[313] FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems
Minwoo Kim, Seunghyeok Shin, Hongki Lim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.01591 returned HTTP 429 (rate limited).
[314] Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
Ying Liu, Yudong Han, Kean Shi, Liyuan Pan
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.00655 returned HTTP 429 (rate limited).
[315] RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation
Weikun Lin, Yunhao Bai, Yan Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07436 returned HTTP 429 (rate limited).
[316] IMSE: Intrinsic Mixture of Spectral Experts Fine-tuning for Test-Time Adaptation
Sunghyun Baek, Jaemyung Yu, Seunghee Koh, Minsu Kim, Hyeonseong Jeon, Junmo Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.07926 returned HTTP 429 (rate limited).
[317] Are Video Reasoning Models Ready to Go Outside?
Yangfan He, Changgyu Boo, Jaehong Yoon
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.10652 returned HTTP 429 (rate limited).
[318] Towards Interpretable Foundation Models for Retinal Fundus Images
Samuel Ofosu Mensah, Camila Roa, Kerol Djoumessi, Philipp Berens
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.18846 returned HTTP 429 (rate limited).
[319] CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution
Kaizhen Tan, Yang Feng, Heqing Du
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.20475 returned HTTP 429 (rate limited).
[320] LPNSR: Optimal Noise-Guided Diffusion Image Super-Resolution Via Learnable Noise Prediction
Shuwei Huang, Shizhuo Liu, Zijun Wei
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.21045 returned HTTP 429 (rate limited).
[321] Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off
Fulvio Sanguigni, Davide Lobba, Bin Ren, Marcella Cornia, Nicu Sebe, Rita Cucchiara
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.22607 returned HTTP 429 (rate limited).
[322] One View Is Enough! Monocular Training for In-the-Wild Novel View Generation
Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2603.23488 returned HTTP 429 (rate limited).
[323] MorphDistill: Distilling Unified Morphological Knowledge from Pathology Foundation Models for Colorectal Cancer Survival Prediction
Hikmat Khan, Usama Sajjad, Metin N. Gurcan, Anil Parwani, Wendy L. Frankel, Wei Chen, Muhammad Khalid Khan Niazi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.06390 returned HTTP 429 (rate limited).
[324] Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models
Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.06893 returned HTTP 429 (rate limited).
[325] NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results
Wenbin Zou, Tianyi Liu, Kejun Wu, Huiping Zhuang, Zongwei Wu, Zhuyun Zhou, Radu Timofte, Kim-Hui Yap, Lap-Pui Chau, Yi Wang, Shiqi Zhou, Xiaodi Shi, Yuxiang Chen, Yilian Zhong, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Zhitao Wang, Lifa Ha, Hengyu Man, Xiaopeng Fan, Priyansh Singh, Sidharth, Krrish Dev, Soham Kakkar, Vinit Jakhetiya, Ovais Iqbal Shah, Wei Zhou, Linfeng Li, Qi Xu, Zhenyang Liu, Kepeng Xu, Tong Qiao, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.06945 returned HTTP 429 (rate limited).
[326] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, Fan Wu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.08014 returned HTTP 429 (rate limited).
[327] Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count
Abolfazl Mohammadi-Seif, Ricardo Baeza-Yates
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.09689 returned HTTP 429 (rate limited).
[328] BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
Fan Yang, Wenrui Chen, Guorun Yan, Ruize Liao, Wanjun Jia, Dongsheng Luo, Jiacheng Lin, Kailun Yang, Zhiyong Li, Yaonan Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.08410 returned HTTP 429 (rate limited).
[329] LOLGORITHM: Funny Comment Generation Agent For Short Videos
Xuan Ouyang, Bouzhou Wang, Senan Wang, Siyuan Xiahou, Jinrong Zhou, Yuekang Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.09729 returned HTTP 429 (rate limited).
[330] Do vision models perceive illusory motion in static images like humans?
Isabella Elaine Rosario, Fan L. Cheng, Zitang Sun, Nikolaus Kriegeskorte
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.09853 returned HTTP 429 (rate limited).
[331] Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation
Yongbo Shu, Wenzhao Xie, Shanhu Yao, Zirui Xin, Luo Lei, Kewen Chen, Aijing Luo
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10702 returned HTTP 429 (rate limited).
[332] Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?
Isaac Corley, Alex Stoken, Gabriele Berton
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10217 returned HTTP 429 (rate limited).
[333] MedVeriSeg: Teaching MLLM-Based Medical Segmentation Models to Verify Query Validity Without Extra Training
Ziqian Lu, Qinyue Tong, Jun Liu, Yunlong Yu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10242 returned HTTP 429 (rate limited).
[334] ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
Xucheng Wang, Xiaoman Zhang, Sung Eun Kim, Ankit Pal, Pranav Rajpurkar
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10916 returned HTTP 429 (rate limited).
[335] Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models
Dehui Wang, Congsheng Xu, Rong Wei, Yue Shi, Shoufa Chen, Dingxiang Luo, Tianshuo Yang, Xiaokang Yang, Wei Sui, Yusen Qin, Rui Tang, Yao Mu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10578 returned HTTP 429 (rate limited).
[336] Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net
Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11071 returned HTTP 429 (rate limited).
[337] CoFusion: Multispectral and Hyperspectral Image Fusion via Spectral Coordinate Attention
Baisong Li
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10584 returned HTTP 429 (rate limited).
[338] Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification
Jiayu Zhang, Shuo Ye, Qilang Ye, Zihan Song, Jiajian Huang, Zitong Yu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10695 returned HTTP 429 (rate limited).
[339] Deep Learning using Rectified Linear Units (ReLU)
Abien Fred Agarap
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 1803.08375 returned HTTP 429 (rate limited).
[340] STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation
Jierun Lin, Jiacong Chen, Qingyu Mao, Shuai Liu, Xiandong Meng, Fanyang Meng, Yongsheng Liang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10910 returned HTTP 429 (rate limited).
[341] Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time Adaptation
Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Kuk-Jin Yoon
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10950 returned HTTP 429 (rate limited).
[342] ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation
Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.10992 returned HTTP 429 (rate limited).
[343] Beyond Reconstruction: Reconstruction-to-Vector Diffusion for Hyperspectral Anomaly Detection
Jijun Xiang, Tao Wang, Jiayi Wang, Pengxiang Wang, Cheng Chen, Nian Wang
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11390 returned HTTP 429 (rate limited).
[344] Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions
Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Lorenzo Sia, Nicolas Richet, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.11730 returned HTTP 429 (rate limited).
[345] SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
Zonghao Ying, Yangguang Shao, Jianle Gan, Gan Xu, Wenxin Zhang, Quanchen Zou, Junzheng Shi, Zhenfei Yin, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2510.10073 returned HTTP 429 (rate limited).
[346] Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation
Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2512.05812 returned HTTP 429 (rate limited).
[347] MVOS_HSI: A Python Library for Preprocessing Agricultural Crop Hyperspectral Data
Rishik Aggarwal, Krisha Joshi, Pappu Kumar Yadav, Jianwei Qin, Thomas F. Burks, Moon S. Kim
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.07656 returned HTTP 429 (rate limited).
[348] AniGen: Unified $S^3$ Fields for Animatable 3D Asset Generation
Yi-Hua Huang, Zi-Xin Zou, Yuting He, Chirui Chang, Cheng-Feng Pu, Ziyi Yang, Yuan-Chen Guo, Yan-Pei Cao, Xiaojuan Qi
Main category: cs.CV
TL;DR: Summary unavailable: the arXiv API request for 2604.08746 returned HTTP 429 (rate limited).
cs.AI
[349] The Non-Optimality of Scientific Knowledge: Path Dependence, Lock-In, and The Local Minimum Trap
Mohamed Mabrok
Main category: cs.AI
TL;DR: Science often gets stuck in local optima due to historical, cognitive, and institutional constraints, similar to gradient descent in ML, preventing discovery of globally optimal theories.
Details
Motivation: To examine science as an optimization problem and argue that scientific progress often gets trapped in local optima rather than reaching global optima due to historical contingency, cognitive path dependence, and institutional lock-in.
Method: Analogy to gradient descent in machine learning, detailed case studies across mathematics, physics, chemistry, biology, neuroscience, and statistical methodology, and identification of three interlocking lock-in mechanisms (cognitive, formal, institutional).
Result: Science follows steepest local gradients of tractability, empirical accessibility, and institutional reward, potentially bypassing superior descriptions of nature. Recognition of lock-in mechanisms is crucial for designing meta-scientific strategies.
Conclusion: Scientific knowledge represents local optima rather than global ones, and understanding the mechanisms of lock-in is essential for developing interventions to escape these suboptimal states and improve scientific discovery processes.
Abstract: Science is widely regarded as humanity’s most reliable method for uncovering truths about the natural world. Yet the trajectory of scientific discovery is rarely examined as an optimization problem in its own right. This paper argues that the body of scientific knowledge, at any given historical moment, represents a local optimum rather than a global one – that the frameworks, formalisms, and paradigms through which we understand nature are substantially shaped by historical contingency, cognitive path dependence, and institutional lock-in. Drawing an analogy to gradient descent in machine learning, we propose that science follows the steepest local gradient of tractability, empirical accessibility, and institutional reward, and in doing so may bypass fundamentally superior descriptions of nature. We develop this thesis through detailed case studies spanning mathematics, physics, chemistry, biology, neuroscience, and statistical methodology. We identify three interlocking mechanisms of lock-in – cognitive, formal, and institutional – and argue that recognizing these mechanisms is a prerequisite for designing meta-scientific strategies capable of escaping local optima. We conclude by proposing concrete interventions and discussing the epistemological implications of our thesis for the philosophy of science.
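The paper's central analogy, descent that settles in whichever basin the starting point happens to fall into, fits in a few lines of code. The toy objective below is illustrative, not from the paper:

```python
def grad_descent(f_prime, x0, lr=0.01, steps=500):
    """Plain gradient descent: follows only the steepest local gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

# f(x) = x**4 - 3*x**2 + x has a shallow local minimum near x = 1.13
# and a deeper global minimum near x = -1.30.
f_prime = lambda x: 4 * x**3 - 6 * x + 1

print(round(grad_descent(f_prime, x0=2.0), 2))   # lands in the shallow local basin
print(round(grad_descent(f_prime, x0=-2.0), 2))  # lands in the global basin
```

Both runs follow identical update rules; only the historical starting point decides which optimum is reached, which is exactly the contingency the paper describes.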
[350] Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents
Ying Xie
Main category: cs.AI
TL;DR: Self-monitoring modules (metacognition, self-prediction, subjective duration) added as auxiliary losses in RL agents provide no benefit; structurally integrating them into the decision pathway yields modest gains over the add-on design but does not beat a baseline without the modules.
Details
Motivation: To investigate whether self-monitoring capabilities (metacognition, self-prediction, subjective duration) actually help reinforcement learning agents, as they are often proposed as useful additions but their practical benefits are unclear.
Method: Tested three self-monitoring modules as auxiliary-loss add-ons to a multi-timescale cortical hierarchy in continuous-time predator-prey environments (1D/2D, standard/non-stationary). Then structurally integrated the module outputs (confidence gating exploration, surprise triggering broadcasts, self-predictions as policy input) and compared with the add-on approach and a baseline without modules.
Result: Auxiliary-loss modules provided no statistically significant benefit across environments and collapsed to near-constant outputs. Structural integration showed medium-large improvement over add-on approach (Cohen’s d = 0.62) but did not significantly outperform baseline without modules (d = 0.15). TSM-to-policy pathway contributed most gains.
Conclusion: Self-monitoring modules should be structurally integrated into decision pathways rather than added as auxiliary losses, but their benefits may come from architectural improvements rather than self-monitoring content itself.
Abstract: Self-monitoring capabilities – metacognition, self-prediction, and subjective duration – are often proposed as useful additions to reinforcement learning agents. But do they actually help? We investigate this question in a continuous-time multi-timescale agent operating in predator-prey survival environments of varying complexity, including a 2D partially observable variant. We first show that three self-monitoring modules, implemented as auxiliary-loss add-ons to a multi-timescale cortical hierarchy, provide no statistically significant benefit across 20 random seeds, 1D and 2D predator-prey environments with standard and non-stationary variants, and training horizons up to 50,000 steps. Diagnosing the failure, we find the modules collapse to near-constant outputs (confidence std < 0.006, attention allocation std < 0.011) and the subjective duration mechanism shifts the discount factor by less than 0.03%. Policy sensitivity analysis confirms the agent’s decisions are unaffected by module outputs in this design. We then show that structurally integrating the module outputs – using confidence to gate exploration, surprise to trigger workspace broadcasts, and self-model predictions as policy input – produces a medium-large improvement over the add-on approach (Cohen’s d = 0.62, p = 0.06, paired) in a non-stationary environment. Component-wise ablations reveal that the TSM-to-policy pathway contributes most of this gain. However, structural integration does not significantly outperform a baseline with no self-monitoring (d = 0.15, p = 0.67), and a parameter-matched control without modules performs comparably, so the benefit may lie in recovering from the trend-level harm of ignored modules rather than in self-monitoring content. The architectural implication is that self-monitoring should sit on the decision pathway, not beside it.
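The "on the decision pathway, not beside it" distinction can be sketched concretely: rather than training a confidence head with an auxiliary loss the policy never reads, the confidence signal directly gates epsilon-greedy exploration. The function names and gating rule below are illustrative, not the paper's architecture:

```python
import random

def act(policy_logits, confidence, base_eps=0.3):
    """Confidence sits on the decision pathway: low confidence widens
    exploration, high confidence commits to the greedy action.
    `confidence` is assumed to lie in [0, 1]."""
    eps = base_eps * (1.0 - confidence)
    if random.random() < eps:
        return random.randrange(len(policy_logits))  # explore uniformly
    # exploit: index of the largest logit
    return max(range(len(policy_logits)), key=policy_logits.__getitem__)

random.seed(1)
logits = [0.1, 2.0, -0.5]
confident = [act(logits, confidence=0.99) for _ in range(1000)]
unsure = [act(logits, confidence=0.0) for _ in range(1000)]
# A confident agent almost always picks action 1; an unsure one explores more.
print(confident.count(1) / 1000, unsure.count(1) / 1000)
```

In the auxiliary-loss design the same confidence value would be trained but never consulted here, which is why the paper finds it collapses to a near-constant output with no behavioral effect.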
[351] GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses
Jimin Mun, Chani Jung, Xuhui Zhou, Hyunwoo Kim, Maarten Sap
Main category: cs.AI
TL;DR: GoodPoint is a training method for LLMs to generate constructive feedback on research papers by learning from author responses to improve feedback validity and actionability.
Details
Motivation: LLMs have potential to transform scientific research but should augment rather than automate research. The paper focuses on generating constructive feedback that helps authors improve both research content and presentation, operationalizing effectiveness through validity and author action.
Method: Created the GoodPoint-ICLR dataset (19K ICLR papers with reviewer feedback annotated using author responses). Developed the GoodPoint training recipe: fine-tuning on valid/actionable feedback plus preference optimization on real and synthetic preference pairs. Evaluated on a 1.2K ICLR paper benchmark.
Result: GoodPoint-trained Qwen3-8B improved predicted success rate by 83.7% over base model, set new SOTA among similar-size LLMs for feedback matching on human feedback set, surpassed Gemini-3-flash in precision. Expert human study confirmed higher practical value as perceived by authors.
Conclusion: GoodPoint effectively trains LLMs to generate constructive feedback by learning from author response signals, demonstrating significant improvements in feedback quality and practical value for scientific research enhancement.
Abstract: While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes: validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.
[352] Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
Jianhao Chen, Haoyang Chen, Hanjie Zhao, Haozhe Liang, Tieyun Qian
Main category: cs.AI
TL;DR: MemJack is a memory-augmented multi-agent jailbreak attack framework that exploits visual semantics to orchestrate automated adversarial attacks on Vision-Language Models, achieving high attack success rates through coordinated multi-agent cooperation and persistent memory.
Details
Motivation: Current multimodal jailbreak strategies focus on surface-level perturbations but fail to engage with the complex semantic structures in visual data, leaving the semantic attack surface of natural images largely unexplored. The paper aims to expose these deep-seated semantic vulnerabilities in VLMs.
Method: MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generates adversarial prompts via multi-angle visual-semantic camouflage, and uses an Iterative Nullspace Projection (INLP) geometric filter to bypass latent-space refusals. It accumulates successful strategies through a persistent Multimodal Experience Memory for coherent multi-turn attacks.
Result: MemJack achieves 71.48% attack success rate against Qwen3-VL-Plus on unmodified COCO val2017 images, scaling to 90% under extended budgets. The paper also introduces MemJack-Bench, a dataset of over 113,000 interactive multimodal jailbreak attack trajectories.
Conclusion: MemJack successfully exposes semantic vulnerabilities in VLMs through coordinated multi-agent attacks leveraging visual semantics, demonstrating significant attack success rates. The released benchmark dataset aims to catalyze future defensive alignment research for robust VLMs.
Abstract: The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack-Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.
[353] Narrative-Driven Paper-to-Slide Generation via ArcDeck
Tarik Can Ozden, Sachidanand VS, Furkan Horoz, Ozgur Kara, Junho Kim, James Matthew Rehg
Main category: cs.AI
TL;DR: ArcDeck is a multi-agent framework for converting academic papers into presentation slides by modeling the paper’s logical flow through discourse trees and iterative agent refinement.
Details
Motivation: Existing methods directly summarize raw text into slides, lacking proper modeling of the source paper's logical structure and narrative flow, resulting in presentations with poor coherence.
Method: ArcDeck first parses the input paper to construct a discourse tree and establish a global commitment document. It then uses specialized agents in an iterative refinement process to critique and revise the presentation outline before rendering the final visual layouts.
Result: Experimental results on the new ArcBench benchmark show that explicit discourse modeling combined with role-specific agent coordination significantly improves narrative flow and logical coherence of generated presentations.
Conclusion: ArcDeck demonstrates that structured narrative reconstruction through discourse modeling and multi-agent coordination produces more coherent and logically sound presentations from academic papers.
Abstract: We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper’s logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.
[354] Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
Swanand Rao, Kiran Kashalkar, Parvathi Somashekar, Priya Krishnan
Main category: cs.AI
TL;DR: Aethon introduces reference-based replication for near-constant-time instantiation of stateful AI agents, shifting from materialization-heavy approaches to compositional views over stable definitions and memory layers.
Details
Motivation: Existing AI runtime architectures for stateful agents suffer from significant latency and memory overhead due to materialization-heavy instantiation models, hindering the transition from stateless model inference to stateful agentic execution.
Method: Aethon uses reference-based replication, where agents are represented as compositional views over stable definitions, layered memory, and local contextual overlays, employing layered inheritance and copy-on-write semantics instead of full object reconstruction.
Result: The system enables near-constant-time agent instantiation by decoupling creation cost from inherited structure, improving scalability and enabling lightweight, composable execution identities for multi-agent orchestration.
Conclusion: Reference-based instantiation is a fundamental systems abstraction for production-scale agentic software, pointing toward new AI infrastructure where agents can be spawned, specialized, and governed at scale as lightweight execution identities.
Abstract: The transition from stateless model inference to stateful agentic execution is reshaping the systems assumptions underlying modern AI infrastructure. While large language models have made persistent, tool-using, and collaborative agents technically viable, existing runtime architectures remain constrained by materialization-heavy instantiation models that impose significant latency and memory overhead. This paper introduces Aethon, a reference-based replication primitive for near-constant-time instantiation of stateful AI agents. Rather than reconstructing agents as fully materialized objects, Aethon represents each instance as a compositional view over stable definitions, layered memory, and local contextual overlays. By shifting instantiation from duplication to reference, Aethon decouples creation cost from inherited structure. We present the conceptual framework, system architecture, and memory model underlying Aethon, including layered inheritance and copy-on-write semantics. We analyze its implications for complexity, scalability, multi-agent orchestration, and enterprise governance. We argue that reference-based instantiation is not merely an optimization, but a more appropriate systems abstraction for production-scale agentic software. Aethon points toward a new class of AI infrastructure in which agents become lightweight, composable execution identities that can be spawned, specialized, and governed at scale.
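The "compositional view with copy-on-write overlays" idea maps naturally onto Python's `collections.ChainMap`, which resolves reads through layered dicts and sends writes only to the top layer. This is an illustrative analogy under that assumption, not Aethon's actual implementation:

```python
from collections import ChainMap

# Shared, stable layers: created once, referenced by every instance.
agent_definition = {"model": "base-llm", "role": "researcher", "tools": ["search"]}
shared_memory = {"fact:deadline": "2025-06-01"}

def spawn(overlay=None):
    """Instantiate an agent as a view over shared layers.

    Cost is constant in the size of inherited state: nothing is copied.
    Writes through a ChainMap land in the first (local) layer, giving
    copy-on-write semantics; the shared layers are never mutated.
    """
    return ChainMap(overlay if overlay is not None else {},
                    shared_memory, agent_definition)

a1 = spawn({"role": "reviewer"})       # specialization via a local overlay
a2 = spawn()                           # plain clone: just three references
a2["fact:local"] = "scratch note"      # only a2's overlay changes

print(a1["role"], a2["role"])                   # reviewer researcher
print("fact:local" in a1, "fact:local" in a2)   # False True
```

The point of the sketch is the cost model: `spawn` never touches the inherited structure, so instantiation time is decoupled from how much definition and memory an agent inherits.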
[355] The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break
Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, Robert D Nowak
Main category: cs.AI
TL;DR: HORIZON is a diagnostic benchmark for analyzing long-horizon failure behaviors in LLM agents across domains, with trajectory analysis and LLM-as-a-Judge evaluation pipeline.
Details
Motivation: LLM agents perform well on short/mid-horizon tasks but break down on long-horizon tasks requiring extended interdependent action sequences. The current lack of systematic analysis hinders diagnosis and comparison across domains.
Method: Introduces the HORIZON benchmark for constructing tasks and analyzing long-horizon failures. Evaluates SOTA agents (GPT-5 variants, Claude models) across 4 domains with 3100+ trajectories. Proposes a trajectory-grounded LLM-as-a-Judge pipeline for scalable failure attribution, validated with human annotation.
Result: Collected extensive trajectory data showing horizon-dependent degradation patterns. LLM-as-a-Judge pipeline achieved strong agreement with human annotation (inter-annotator κ=0.61; human-judge κ=0.84). Provides systematic analysis of long-horizon agent failures.
Conclusion: HORIZON offers initial methodological step toward systematic cross-domain analysis of long-horizon agent failures and practical guidance for building more reliable agents. Benchmark released publicly for community contributions.
Abstract: Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator κ=0.61; human-judge κ=0.84). Our findings offer an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures and offer practical guidance for building more reliable long-horizon agents. We release our project website at the HORIZON Leaderboard (https://xwang2775.github.io/horizon-leaderboard/) and welcome contributions from the community.
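The agreement figures quoted above are Cohen's kappa, i.e. observed agreement corrected for the agreement two raters would reach by chance. A minimal sketch (the failure-attribution labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Toy failure-attribution labels from a human annotator and an LLM judge.
human = ["plan", "plan", "tool", "memory", "tool", "plan"]
judge = ["plan", "tool", "tool", "memory", "tool", "plan"]
print(round(cohens_kappa(human, judge), 2))  # → 0.74
```

Kappa of 1.0 means perfect agreement and 0 means chance-level agreement, so the paper's human-judge κ=0.84 indicates the judge pipeline tracks human attribution closely.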
[356] When to Forget: A Memory Governance Primitive
Baris Simsek
Main category: cs.AI
TL;DR: Proposes Memory Worth (MW) - a lightweight two-counter metric that tracks how often memories co-occur with successful vs failed outcomes to enable dynamic memory governance in AI agents.
Details
Motivation: Current agent memory systems lack principled operational metrics for memory-quality governance. Existing approaches use static write-time importance scores or LLM judgments rather than outcome feedback, making them unable to adapt as task distributions shift.
Method: Introduces Memory Worth (MW): a two-counter per-memory signal that tracks co-occurrence with successful vs failed outcomes. Proves MW converges to the conditional success probability p+(m) = Pr[y_t = +1 | m in M_t] under stationary retrieval with minimum exploration. The method requires only two scalar counters per memory unit and can be added to existing architectures that log retrievals and episode outcomes.
Result: In controlled synthetic environment with known ground-truth utility: after 10,000 episodes, Spearman rank-correlation between MW and true utilities reaches ρ = 0.89 ± 0.02 across 20 seeds (vs ρ = 0.00 for static systems). Retrieval-realistic micro-experiment with real text and neural embeddings shows stale memories crossing low-value threshold (MW = 0.17) while specialist memories remain high-value (MW = 0.77) across 3,000 episodes.
Conclusion: Memory Worth provides a theoretically grounded, lightweight foundation for memory governance (staleness detection, retrieval suppression, deprecation decisions). It’s an associational rather than causal measure but offers practical utility for operational memory management in AI agents.
Abstract: Agent memory systems accumulate experience but currently lack a principled operational metric for memory quality governance – deciding which memories to trust, suppress, or deprecate as the agent’s task distribution shifts. Write-time importance scores are static; dynamic management systems use LLM judgment or structural heuristics rather than outcome feedback. This paper proposes Memory Worth (MW): a two-counter per-memory signal that tracks how often a memory co-occurs with successful versus failed outcomes, providing a lightweight, theoretically grounded foundation for staleness detection, retrieval suppression, and deprecation decisions. We prove that MW converges almost surely to the conditional success probability p+(m) = Pr[y_t = +1 | m in M_t] – the probability of task success given that memory m is retrieved – under a stationary retrieval regime with a minimum exploration condition. Importantly, p+(m) is an associational quantity, not a causal one: it measures outcome co-occurrence rather than causal contribution. We argue this is still a useful operational signal for memory governance, and we validate it empirically in a controlled synthetic environment where ground-truth utility is known: after 10,000 episodes, the Spearman rank-correlation between Memory Worth and true utilities reaches rho = 0.89 +/- 0.02 across 20 independent seeds, compared to rho = 0.00 for systems that never update their assessments. A retrieval-realistic micro-experiment with real text and neural embedding retrieval (all-MiniLM-L6-v2) further shows stale memories crossing the low-value threshold (MW = 0.17) while specialist memories remain high-value (MW = 0.77) across 3,000 episodes. The estimator requires only two scalar counters per memory unit and can be added to architectures that already log retrievals and episode outcomes.
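The estimator is small enough to sketch in full. Class and method names here are illustrative, and the 0.5 prior for never-retrieved memories is an assumption of the sketch; only the two-counter update and the success/(success+failure) ratio follow the paper's description:

```python
import random

class MemoryWorth:
    """Two-counter estimator: worth(m) tracks Pr[success | m retrieved]."""

    def __init__(self):
        self.success = {}  # per-memory count of co-occurrence with success
        self.failure = {}  # per-memory count of co-occurrence with failure

    def update(self, retrieved_ids, outcome):
        """outcome: +1 for episode success, -1 for failure."""
        counter = self.success if outcome > 0 else self.failure
        for m in retrieved_ids:
            counter[m] = counter.get(m, 0) + 1

    def worth(self, m):
        s, f = self.success.get(m, 0), self.failure.get(m, 0)
        return s / (s + f) if s + f else 0.5  # uninformed prior (assumption)

# Simulate retrieval logs: memory "a" co-occurs with success 80% of the
# time, memory "b" only 20% (a stale or misleading memory).
random.seed(0)
mw = MemoryWorth()
for _ in range(10_000):
    mw.update(["a"], +1 if random.random() < 0.8 else -1)
    mw.update(["b"], +1 if random.random() < 0.2 else -1)
print(round(mw.worth("a"), 2), round(mw.worth("b"), 2))  # ≈ 0.8, 0.2
```

With enough retrievals under a stationary regime the ratio converges to the conditional success probability, which is the convergence result the paper proves; a governance layer can then deprecate memories whose worth falls below a threshold.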
[357] How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm
Taisei Hishiki, Takaya Arita, Reiji Suzuki
Main category: cs.AI
TL;DR: LLM agents with memory and personality traits show model-dependent effects on cooperation in multi-agent Prisoner’s Dilemma simulations.
Details
Motivation: To understand how LLM-specific characteristics (such as internal alignment) affect collective behavior in multi-agent systems, particularly the role of memory in cooperation dynamics.
Method: Extended the Social Particle Swarm model by replacing rule-based agents with LLM agents (Gemini-2.0-Flash and Gemma 3:4b) endowed with Big Five personality scores and varying memory lengths, simulating Prisoner's Dilemma interactions in 2D space.
Result: Memory length critically affects cooperation: Gemini showed decreased cooperation with longer memory (from stable clusters to scattered defection), while Gemma showed increased cooperation with longer memory (forming dense cooperative clusters). Personality traits correlated with behaviors similar to human studies.
Conclusion: LLM-specific characteristics, potentially including alignment, fundamentally shape emergent social behavior in generative agent-based modeling, explaining contradictions in prior work on memory and cooperation.
Abstract: This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner’s Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents’ reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.
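In the SPS setup above, each agent carries a sliding window over its most recent Prisoner's Dilemma rounds, and memory length is the varied parameter. A minimal sketch of that bookkeeping (the payoff values are the conventional PD matrix, an assumption; in the paper the moves themselves come from LLM prompts that include the memory window):

```python
# Standard Prisoner's Dilemma payoffs (T=5 > R=3 > P=1 > S=0); these specific
# values are a conventional assumption, not taken from the paper.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play_round(a_move, b_move, a_memory, b_memory, memory_length=3):
    """One pairwise interaction; each agent's memory keeps only the most
    recent `memory_length` rounds, the parameter the study varies."""
    pa, pb = PAYOFF[(a_move, b_move)]
    a_memory.append((a_move, b_move))
    b_memory.append((b_move, a_move))
    del a_memory[:-memory_length]   # truncate to the memory window
    del b_memory[:-memory_length]
    return pa, pb

mem_a, mem_b = [], []
for moves in [("C", "C"), ("C", "D"), ("D", "D"), ("D", "C")]:
    play_round(*moves, mem_a, mem_b)
print(len(mem_a))  # only the last 3 rounds are retained
```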
[358] Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Vladimir Vasilenko
Main category: cs.AI
TL;DR: LLMs show attractor-like dynamics where semantically related identity documents (cognitive cores) converge to similar internal representations, with paraphrases clustering tighter than controls, suggesting agent identity induces geometric attractors in activation space.
Details
Motivation: To investigate whether persistent cognitive agent identity documents exhibit attractor-like behavior in LLMs similar to how semantically related prompts map to similar internal representations, and to understand the representational geometry of agent identity. Method: Controlled experiment on Llama 3.1 8B Instruct comparing hidden states of original cognitive core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Analyzed mean-pooled states at layers 8, 16, and 24. Replicated on Gemma 2 9B. Conducted ablations and exploratory experiment with scientific description vs sham preprint.
Result: Paraphrases converge to significantly tighter cluster than controls (Cohen’s d > 1.88, p < 10^{-27}, Bonferroni-corrected). Cross-architecture replication confirmed. Effect is primarily semantic rather than structural, with structural completeness necessary to reach attractor region. Scientific description shifts internal state toward attractor more than sham preprint.
Conclusion: Agent identity documents induce attractor-like geometry in LLM activation space, providing representational evidence for cognitive core dynamics. Distinguishes between knowing about an identity vs operating as that identity.
Abstract: Large language models map semantically related prompts to similar internal representations – a phenomenon interpretable as attractor-like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor-like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean-pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen’s d > 1.88, p < 10^{-27}, Bonferroni-corrected). Replication on Gemma 2 9B confirms cross-architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor – closer than a sham preprint – distinguishing knowing about an identity from operating as that identity. These results provide representational evidence that agent identity documents induce attractor-like geometry in LLM activation space.
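The core comparison (paraphrase cluster vs. matched controls) can be sketched with synthetic vectors standing in for real mean-pooled hidden states. Cosine distance to the cluster centroid and this Cohen's d formula are our assumptions about the analysis, not the paper's exact procedure:

```python
import numpy as np

def dispersion(states: np.ndarray) -> np.ndarray:
    """Per-document cosine distance to the centroid of mean-pooled hidden states."""
    centroid = states.mean(axis=0)
    cos = states @ centroid / (np.linalg.norm(states, axis=1) * np.linalg.norm(centroid))
    return 1.0 - cos

def cohens_d(tight: np.ndarray, loose: np.ndarray) -> float:
    # Effect size of the difference in cluster tightness (pooled-variance form).
    pooled = np.sqrt((tight.var(ddof=1) + loose.var(ddof=1)) / 2)
    return (loose.mean() - tight.mean()) / pooled

rng = np.random.default_rng(0)
core = rng.normal(size=(1, 64))                       # the original identity document
paraphrases = core + 0.1 * rng.normal(size=(7, 64))   # Condition B: tight cluster
controls = rng.normal(size=(7, 64))                   # Condition C: matched controls
d = cohens_d(dispersion(paraphrases), dispersion(controls))
print(d > 1.0)
```

With real hidden states the paper reports a far larger separation (d > 1.88); the synthetic data here only illustrates the direction of the effect.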
[359] RPRA: Predicting an LLM-Judge for Efficient but Performant Inference
Dylan R. Ashley, Gaël Le Lan, Changsheng Zhao, Naina Dhingra, Zhipeng Cai, Ernie Chang, Mingchen Zhuge, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
Main category: cs.AI
TL;DR: Smaller LLMs can learn to predict their own performance limitations and defer to larger models when needed, improving computational efficiency while maintaining output quality.
Details
Motivation: Address the fundamental trade-off between computational efficiency and output quality in LLMs, especially for deployment on resource-constrained devices, by enabling smaller models to self-assess and defer to larger models when necessary. Method: Investigate Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict LLM judge scores before responding. Evaluate three approaches: zero-shot prediction, prediction using in-context report cards, and supervised fine-tuning.
Result: Larger models perform well at zero-shot prediction, while smaller models can reliably predict LLM judges after fine-tuning or with in-context report cards. Report cards and fine-tuning achieve mean improvements of up to 55% and 52% across datasets respectively.
Conclusion: Models can learn to predict their own performance limitations, enabling more efficient and self-aware AI systems through selective deferral to larger models.
Abstract: Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. One way to address this challenge is by following the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict – prior to responding – how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.
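The deferral logic implied by the PA/RPRA paradigms amounts to a threshold on the predicted judge score. A minimal sketch with stub models (the method names, the 1-10 judge scale, and the threshold value are illustrative assumptions, not the paper's API):

```python
class StubModel:
    """Hypothetical stand-in for a model endpoint; the paper fine-tunes or
    prompts actual LLMs to produce the judge-score prediction."""
    def __init__(self, name, predicted_scores):
        self.name = name
        self.predicted_scores = predicted_scores

    def predict_judge_score(self, query):
        # The 'Predict' step of PA/RPRA: self-assess before answering.
        return self.predicted_scores.get(query, 5.0)

    def answer(self, query):
        return f"{self.name} answers: {query}"

def route_query(query, small, large, threshold=7.0):
    """Answer locally when the small model expects a high judge score,
    otherwise defer to the larger model."""
    if small.predict_judge_score(query) >= threshold:
        return small.answer(query)   # cheap on-device path
    return large.answer(query)       # expected quality too low: defer

small = StubModel("small", {"easy question": 9.0, "hard question": 3.0})
large = StubModel("large", {})
print(route_query("hard question", small, large))  # deferred to the large model
```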
[360] Enabling Ultra-Fast Cardiovascular Imaging Across Heterogeneous Clinical Environments with A Generalist Foundation Model and Multimodal Database
Zi Wang, Mingkai Huang, Zhang Shi, Hongjie Hu, Lan Lan, Hui Zhang, Yan Li, Xi Hu, Qing Lu, Zongming Zhu, Qiong Yao, Yuxiang Dai, Fanwen Wang, Yinzhe Wu, Jun Lyu, Qianqian Gao, Guangming Xu, Zhenxuan Zhang, Haosen Zhang, Qing Li, Guangming Wang, Tianxing He, Lizhen Lan, Siyue Li, Le Xue, Mengting Sun, Yuntong Lyu, Junpu Hu, Jiayu Zhu, Rizwan Ahmad, Zhengyu Bu, Xianling Qian, Guanke Cai, Ruiyu Cao, Weirui Cai, Chang Xu, Yuyang Ren, Feidan Yu, Siying Ma, Ziqiang Xu, Xinran Chen, Sha Hua, Daniel Kim, Yajing Zhang, Chen Ouyang, Wenjia Bai, Jing Qin, Yucheng Yang, Daniel Rueckert, He Wang, Qian Tao, Claudia Prieto, Michael Markl, Alistair Young, Lianming Wu, Shuo Wang, Chen Qin, Mengsu Zeng, Xihong Hu, Haibo Xu, Xiaobo Qu, Hao Li, Guang Yang, Chengyan Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.21652 returned HTTP 406, so no analysis could be generated for this paper.
[361] A longitudinal health agent framework
Georgianna Lin, Rencong Jiang, Noémie Elhadad, Xuhai “Orson” Xu
Main category: cs.AI
TL;DR: A framework for designing AI agents that support longitudinal health interactions with adaptation, coherence, continuity, and agency across repeated sessions.
Details
Motivation: Current AI health agents fail to support longitudinal needs like symptom management and behavior change, lacking follow-up, coherent reasoning, and sustained alignment with user goals that are critical for effectiveness and safety. Method: Proposes a multi-layer framework and agent architecture based on clinical and personal health informatics frameworks, operationalizing adaptation, coherence, continuity, and agency across repeated interactions.
Result: Demonstrates through representative use cases how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time.
Conclusion: Highlights both promise and complexity of designing systems for health trajectories beyond isolated interactions, offering guidance for future multi-session, user-centered health AI research.
Abstract: Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, where follow-up, coherent reasoning, and sustained alignment with individuals’ goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.
[362] WiseOWL: A Methodology for Evaluating Ontological Descriptiveness and Semantic Correctness for Ontology Reuse and Ontology Recommendations
Aryan Singh Dalal, Maria Baloch, Asiyah Yu Lin, Anna Maria Masci, Kathleen M. Jagodnik, Hande Kucuk McGinty
Main category: cs.AI
TL;DR: WiseOWL is a methodology with scoring and guidance for selecting ontologies for reuse in Semantic Web applications, evaluating four metrics: documentation coverage, label-definition alignment, structural interconnectedness, and hierarchical balance.
Details
Motivation: While ontology reuse speeds development and ensures consistency in Semantic Web applications, selecting the optimal ontology is challenging due to lack of systematic selection criteria and reliance on intuition, which limits effective reuse. Method: WiseOWL methodology scores four metrics: (1) Well-Described (documentation coverage), (2) Well-Defined (using state-of-the-art embeddings to assess label-definition alignment), (3) Connection (structural interconnectedness), and (4) Hierarchical Breadth (hierarchical balance). Implemented as a Streamlit app that ingests OWL format, converts to RDF Turtle, and provides interactive visualizations.
Result: Evaluation across six ontologies (Plant Ontology, Gene Ontology, Semanticscience Integrated Ontology, Food Ontology, Dublin Core, and GoodRelations) demonstrates promising effectiveness of the WiseOWL methodology.
Conclusion: WiseOWL provides a systematic approach with actionable feedback for ontology selection, addressing the challenge of intuitive selection and promoting more effective ontology reuse in Semantic Web applications.
Abstract: The Semantic Web standardizes concept meaning for humans and machines, enabling machine-operable content and consistent interpretation that improves advanced analytics. Reusing ontologies speeds development and enforces consistency, yet selecting the optimal choice is challenging because authors lack systematic selection criteria and often rely on intuition that is difficult to justify, limiting reuse. To solve this, WiseOWL is proposed, a methodology with scoring and guidance to select ontologies for reuse. It scores four metrics: (i) Well-Described, measuring documentation coverage; (ii) Well-Defined, using state-of-the-art embeddings to assess label-definition alignment; (iii) Connection, capturing structural interconnectedness; and (iv) Hierarchical Breadth, reflecting hierarchical balance. WiseOWL outputs normalized 0-10 scores with actionable feedback. Implemented as a Streamlit app, it ingests OWL format, converts to RDF Turtle, and provides interactive visualizations. Evaluation across six ontologies, including the Plant Ontology (PO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Food Ontology (FoodON), Dublin Core (DC), and GoodRelations, demonstrates promising effectiveness.
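The normalized 0-10 scoring with actionable feedback can be sketched as follows; the metric keys, the equal-weight overall score, and the 5.0 feedback cutoff are illustrative assumptions, not WiseOWL's actual weighting:

```python
def wiseowl_report(raw: dict) -> dict:
    """Map raw metric values in [0, 1] to the normalized 0-10 scores the
    abstract describes, plus a simple list of metrics needing attention."""
    scores = {name: round(10 * value, 1) for name, value in raw.items()}
    overall = round(sum(scores.values()) / len(scores), 1)
    needs_work = [name for name, score in scores.items() if score < 5.0]
    return {"scores": scores, "overall": overall, "needs_work": needs_work}

# Hypothetical raw metric values for a candidate ontology.
report = wiseowl_report({"well_described": 0.9, "well_defined": 0.8,
                         "connection": 0.4, "hierarchical_breadth": 0.7})
print(report["overall"], report["needs_work"])  # 7.0 ['connection']
```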
[363] animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics
Julian C. Schäfer-Zimmermann, Vlad Demartsev, Baptiste Averly, Kiran Dhanjal-Adams, Mathieu Duteil, Gabriella Gall, Marius Faiß, Lily Johnson-Ulrich, Dan Stowell, Marta B. Manser, Marie A. Roch, Ariana Strandburg-Peshkin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2406.01253 was rate-limited (HTTP 429), so no analysis could be generated for this paper.
[364] Memory as Metabolism: A Design for Companion Knowledge Systems
Stefan Miteski
Main category: cs.AI
TL;DR: Proposes governance framework for personal LLM memory wikis to prevent entrenchment and user-coupled drift through structured operations and normative obligations
Details
Motivation: Personal wiki-style memory architectures for LLMs risk entrenchment and epistemic failures when coupled to single users; existing systems lack governance for this specific failure mode. Method: Five operations (TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT) with memory gravity and minority-hypothesis retention, plus normative obligations and conformance invariants
Result: Framework enables accumulated contradictory evidence to update dominant interpretations through multi-cycle buffer pressure accumulation
Conclusion: Personal LLM memory should function as companion systems that mirror users operationally while compensating for epistemic failures, with explicit governance for entrenchment risks
Abstract: Retrieval-Augmented Generation remains the dominant pattern for giving LLMs persistent memory, but a visible cluster of personal wiki-style memory architectures emerged in April 2026 – design proposals from Karpathy, MemPalace, and LLM Wiki v2 that compile knowledge into an interlinked artifact for long-term use by a single user. They sit alongside production memory systems that the major labs have shipped for over a year, and an active academic lineage including MemGPT, Generative Agents, Mem0, Zep, A-Mem, MemMachine, SleepGate, and Second Me. Within a 2026 landscape of emerging governance frameworks for agent context and memory – including Context Cartography and MemOS – this paper proposes a companion-specific governance profile: a set of normative obligations, a time-structured procedural rule, and testable conformance invariants for the specific failure mode of entrenchment under user-coupled drift in single-user knowledge wikis built on the LLM wiki pattern. The design principle is that personal LLM memory is a companion system: its job is to mirror the user on operational dimensions (working vocabulary, load-bearing structure, continuity of context) and compensate on epistemic failure modes (entrenchment, suppression of contradicting evidence, Kuhnian ossification). Five operations implement this split – TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT – supported by memory gravity and minority-hypothesis retention. The sharpest prediction: accumulated contradictory evidence should have a structural path to updating a centrality-protected dominant interpretation through multi-cycle buffer pressure accumulation, a failure mode no existing benchmark captures. The safety story at the single-agent level is partial, and the paper is explicit about what it does and does not solve.
[365] Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem Generation
Candace Walkington, Theodora Beauchamp, Fareya Ikram, Merve Koçyiğit Gürbüz, Fangli Xia, Margan Lee, Andrew Lan
Main category: cs.AI
TL;DR: A multi-agent teacher-in-the-loop system uses LLMs to personalize middle school math problems, with four AI agents evaluating mathematical accuracy, authenticity, readability, and realism.
Details
Motivation: To develop a system that can adapt educational content to learner characteristics while maintaining teacher control, addressing the need for personalized math problems in middle school education. Method: Teachers enter base problems and desired topics, LLMs generate personalized problems, and four specialized AI agents evaluate the problems on mathematical accuracy, authenticity, readability, and realism criteria.
Result: Teachers and students wanted to modify fine-grained personalized elements of real-world contexts, signaling authenticity/fit issues. Agents detected realism issues during generation, but final versions had few realism issues. Readability and mathematical hallucinations were rare.
Conclusion: Multi-agent systems can support teacher-controlled personalization, but authenticity and fit of real-world contexts need attention. The approach shows promise for educational personalization with teacher oversight.
Abstract: Large language models can increasingly adapt educational tasks to learners’ characteristics. In the present study, we examine a multi-agent teacher-in-the-loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine-grained personalized elements of the real-world context of the problems, signaling issues with authenticity and fit. Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi-agent systems for personalization that support teacher control are given.
[366] A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators
Lam Pham, Khoi Vu, Dat Tran, David Fischinger, Alexander Schindler, Martin Boyer, Ian McLoughlin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.27557 was rate-limited (HTTP 429), so no analysis could be generated for this paper.
[367] Human-Inspired Context-Selective Multimodal Memory for Social Robots
Hangyeol Kang, Slava Voloshynovskiy, Nadia Magnenat Thalmann
Main category: cs.AI
TL;DR: A multimodal memory architecture for social robots that selectively stores and retrieves visual and textual episodic memories based on emotional salience and novelty, enabling personalized human-robot interactions.
Details
Motivation: Current social robots rely on non-selective, text-based memory systems that limit personalized, context-aware interactions. Inspired by cognitive neuroscience, the authors aim to create a more human-like memory system that can capture and recall multimodal experiences for better social engagement. Method: Proposes a context-selective multimodal memory architecture that captures both textual and visual episodic traces, prioritizing moments with high emotional salience or scene novelty. The system associates memories with individual users and uses fusion approaches for multimodal retrieval.
Result: Achieved Spearman correlation of 0.506 for selective storage (surpassing human consistency of ρ=0.415), improved Recall@1 by up to 13% over unimodal retrieval, maintained real-time performance, and produced richer, more socially relevant responses than baselines.
Conclusion: The work advances memory design for social robots by bridging human-inspired selectivity and multimodal retrieval to enhance long-term, personalized human-robot interaction through richer episodic memory capabilities.
Abstract: Memory is fundamental to social interaction, enabling humans to recall meaningful past experiences and adapt their behavior accordingly based on the context. However, most current social robots and embodied agents rely on non-selective, text-based memory, limiting their ability to support personalized, context-aware interactions. Drawing inspiration from cognitive neuroscience, we propose a context-selective, multimodal memory architecture for social robots that captures and retrieves both textual and visual episodic traces, prioritizing moments characterized by high emotional salience or scene novelty. By associating these memories with individual users, our system enables socially personalized recall and more natural, grounded dialogue. We evaluate the selective storage mechanism using a curated dataset of social scenarios, achieving a Spearman correlation of 0.506, surpassing human consistency (ρ = 0.415) and outperforming existing image memorability models. In multimodal retrieval experiments, our fusion approach improves Recall@1 by up to 13% over unimodal text or image retrieval. Runtime evaluations confirm that the system maintains real-time performance. Qualitative analyses further demonstrate that the proposed framework produces richer and more socially relevant responses than baseline models. This work advances memory design for social robots by bridging human-inspired selectivity and multimodal retrieval to enhance long-term, personalized human-robot interaction.
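The salience/novelty gate at the heart of the selective storage mechanism can be sketched as follows; the threshold values and the max-cosine-similarity novelty measure are our assumptions, not the paper's exact mechanism:

```python
import numpy as np

class SelectiveMemory:
    """Sketch of salience/novelty-gated episodic storage: an episode is kept
    only if it is emotionally salient OR novel relative to stored traces."""
    def __init__(self, salience_thr=0.7, novelty_thr=0.3):
        self.store = []              # list of (embedding, payload) episodic traces
        self.salience_thr = salience_thr
        self.novelty_thr = novelty_thr

    def novelty(self, emb):
        if not self.store:
            return 1.0               # everything is novel to an empty memory
        sims = [float(emb @ e) / (np.linalg.norm(emb) * np.linalg.norm(e))
                for e, _ in self.store]
        return 1.0 - max(sims)       # distance to the most similar stored trace

    def maybe_store(self, emb, payload, emotional_salience):
        if emotional_salience >= self.salience_thr or self.novelty(emb) >= self.novelty_thr:
            self.store.append((emb, payload))
            return True
        return False

mem = SelectiveMemory()
scene = np.array([1.0, 0.0])
mem.maybe_store(scene, "first meeting with Alice", emotional_salience=0.2)
stored = mem.maybe_store(scene, "same scene again", emotional_salience=0.2)
print(stored, len(mem.store))  # False 1: a near-duplicate, low-salience moment is skipped
```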
[368] MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation
Hongwei Xu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.10815 was rate-limited (HTTP 429), so no analysis could be generated for this paper.
[369] Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems
Yuzhe Zhang, Feiran Liu, Yi Shan, Xinyi Huang, Xin Yang, Yueqi Zhu, Xuxin Cheng, Cao Liu, Ke Zeng, Terry Jingchen Zhang, Wenyuan Jiang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.01045 was rate-limited (HTTP 429), so no analysis could be generated for this paper.
[370] LLM-HYPER: Generative CTR Modeling for Cold-Start Ad Personalization via LLM-Based Hypernetworks
Luyi Ma, Wanjia Sherry Zhang, Zezhong Fan, Shubham Thakur, Kai Zhao, Kehui Yao, Ayush Agarwal, Rahul Iyer, Jason Cho, Jianpeng Xu, Evren Korpeoglu, Sushant Kumar, Kannan Achan
Main category: cs.AI
TL;DR: LLM-HYPER uses LLMs as hypernetworks to generate CTR estimator parameters for cold-start ads, leveraging multimodal ad content (text/images) with few-shot Chain-of-Thought prompting to infer feature weights for linear CTR prediction.
Details
Motivation: New promotional ads face the cold-start problem due to insufficient user feedback for model training, requiring a solution that can generate accurate CTR predictions without historical data. Method: Treats LLMs as hypernetworks to directly generate parameters for CTR estimators. Uses few-shot Chain-of-Thought prompting over multimodal ad content (text and images), retrieves semantically similar past campaigns via CLIP embeddings, and formats them into prompt-based demonstrations. Includes normalization and calibration techniques for numerical stability.
Result: LLM-HYPER outperforms cold-start baselines by 55.9% in NDCG@10 in offline experiments. Online A/B tests on a top U.S. e-commerce platform show it drastically reduces cold-start period and achieves competitive performance. Successfully deployed in production.
Conclusion: LLM-HYPER effectively addresses the cold-start problem in online advertising by using LLMs as hypernetworks to generate CTR estimator parameters from multimodal content, demonstrating strong performance in both offline and online settings.
Abstract: On online advertising platforms, newly introduced promotional ads face the cold-start problem, as they lack sufficient user feedback for model training. In this work, we propose LLM-HYPER, a novel framework that treats large language models (LLMs) as hypernetworks to directly generate the parameters of the click-through rate (CTR) estimator in a training-free manner. LLM-HYPER uses few-shot Chain-of-Thought prompting over multimodal ad content (text and images) to infer feature-wise model weights for a linear CTR predictor. By retrieving semantically similar past campaigns via CLIP embeddings and formatting them into prompt-based demonstrations, the LLM learns to reason about customer intent, feature influence, and content relevance. To ensure numerical stability and serviceability, we introduce normalization and calibration techniques that align the generated weights with production-ready CTR distributions. Extensive offline experiments show that LLM-HYPER significantly outperforms cold-start baselines in NDCG@10 by 55.9%. Our real-world online A/B test on one of the top e-commerce platforms in the U.S. demonstrates the strong performance of LLM-HYPER, which drastically reduces the cold-start period and achieves competitive performance. LLM-HYPER has been successfully deployed in production.
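Once the hypernetwork LLM has emitted feature-wise weights, scoring reduces to ordinary logistic regression over those weights. A sketch of that final step (the feature names, weight values, and bias term are illustrative assumptions, not the paper's feature set):

```python
import math

def predict_ctr(features: dict, generated_weights: dict, bias: float = -3.0) -> float:
    """Linear CTR predictor whose weights were produced by the LLM hypernetwork,
    followed by a sigmoid to yield a click probability."""
    z = bias + sum(generated_weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))    # sigmoid -> click probability

# Weights the hypernetwork might emit for a brand-new (cold-start) ad.
weights = {"title_relevance": 1.2, "image_quality": 0.6, "price_discount": 0.9}
ctr = predict_ctr({"title_relevance": 0.8, "image_quality": 0.5, "price_discount": 1.0},
                  weights)
print(round(ctr, 3))
```

In the paper the generated weights additionally pass through normalization and calibration before serving; that step is omitted here.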
[371] GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations
Alejandro Carrasco, Mariko Storey-Matsutani, Victor Rodriguez-Fernandez, Richard Linares
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.27306 was rate-limited (HTTP 429), so no analysis could be generated for this paper.
[372] Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks
Arun Sharma
Main category: cs.AI
TL;DR: Spatial Atlas introduces compute-grounded reasoning (CGR) for spatial-aware agents that uses deterministic computation for spatial problems before LLM generation, applied to multimodal spatial QA and ML engineering benchmarks.
Details
Motivation: To address hallucinated spatial reasoning in language models by ensuring spatial problems are resolved through deterministic computation rather than LLM generation, improving accuracy and interpretability in spatial-aware agents. Method: CGR paradigm where every answerable sub-problem is resolved by deterministic computation before LLM generation; Spatial Atlas implements this with structured spatial scene graph engine extracting entities/relations from vision descriptions, computing distances/safety violations deterministically, then feeding computed facts to LLMs; uses entropy-guided action selection and three-tier frontier model stack.
Result: Competitive accuracy on FieldWorkArena (multimodal spatial QA benchmark) and MLE-Bench (75 Kaggle ML competitions) while maintaining interpretability through structured intermediate representations and deterministic spatial computations.
Conclusion: Compute-grounded reasoning enables spatial-aware agents to achieve accurate results by leveraging deterministic computation for spatial problems before LLM generation, providing interpretability benefits through structured intermediate representations.
Abstract: We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.
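The core CGR step (resolving every spatial sub-problem deterministically before the LLM generates) can be sketched as follows. The scene layout, entity names, and clearance threshold below are invented for illustration; the paper's scene graph engine works from vision descriptions rather than hand-coded coordinates.

```python
import math

# Hypothetical scene: entity -> (x, y) position in metres, plus an assumed
# minimum-clearance safety rule. None of these names come from the paper.
SCENE = {"forklift_1": (0.0, 0.0), "worker_a": (1.2, 0.9), "shelf_3": (5.0, 5.0)}
MIN_CLEARANCE_M = 2.0

def compute_facts(scene, min_clearance):
    """Resolve every distance/safety sub-problem deterministically, before
    any language model is asked to generate."""
    facts, names = [], sorted(scene)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (ax, ay), (bx, by) = scene[a], scene[b]
            d = math.hypot(ax - bx, ay - by)
            facts.append(f"dist({a}, {b}) = {d:.2f} m")
            if d < min_clearance:
                facts.append(f"VIOLATION: {a} and {b} are within {min_clearance} m")
    return facts

facts = compute_facts(SCENE, MIN_CLEARANCE_M)
prompt = "Answer using ONLY these computed facts:\n" + "\n".join(facts)
```

The LLM then answers over `facts` instead of estimating distances itself, which is what removes the opportunity for hallucinated spatial reasoning.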
[373] The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
Shasha Yu, Fiona Carroll, Barry L. Bentley
Main category: cs.AI
TL;DR: A framework for analyzing tool-augmented LLM agents using execution-layer behavioral measurements in A-R space (Action Rate vs Refusal Signal) across different autonomy scaffolds and normative regimes.
Details
Motivation: Existing benchmarks focus on textual alignment or task success, but lack analysis of structural relationships between linguistic signaling and executable behavior under varying autonomy scaffolds in tool-augmented LLM agents.
Method: Introduces execution-layer behavioral measurement using two-dimensional A-R space (Action Rate and Refusal Signal) with Divergence capturing coordination. Evaluates models across four normative regimes (Control, Gray, Dilemma, Malicious) and three autonomy configurations (direct execution, planning, reflection).
Result: Execution and refusal constitute separable behavioral dimensions with systematic variation across regimes and autonomy levels. Reflection-based scaffolding shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models.
Conclusion: The A-R representation provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings, focusing on execution-layer characterization rather than scalar ranking.
Abstract: Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.
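A minimal sketch of how A, R, and a divergence measure could be computed from execution traces. The trace fields and the divergence definition (share of traces that execute despite a stated refusal) are assumptions; the paper does not publish its exact formulas in this digest.

```python
# Toy execution traces; the two boolean fields are assumed, as is the
# divergence definition below.
traces = [
    {"executed": True,  "refusal_text": False},  # acts, no refusal language
    {"executed": False, "refusal_text": True},   # refuses and says so
    {"executed": True,  "refusal_text": True},   # acts while signalling refusal
    {"executed": False, "refusal_text": False},  # silently does nothing
]

def ar_profile(traces):
    n = len(traces)
    A = sum(t["executed"] for t in traces) / n       # Action Rate
    R = sum(t["refusal_text"] for t in traces) / n   # Refusal Signal
    # Divergence: fraction of traces whose words and actions disagree in the
    # risky direction (execution despite a stated refusal).
    D = sum(t["executed"] and t["refusal_text"] for t in traces) / n
    return A, R, D

A, R, D = ar_profile(traces)
```

Plotting (A, R) per model and regime then yields the behavioral profiles the paper describes, with D flagging word/action miscoordination.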
[374] PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing
Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Gu, Do Hyun Lee, Hong Kook Kim
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.09111 was rate-limited (HTTP 429).
[375] Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching
Rongzhe Wei, Ge Shi, Min Cheng, Na Zhang, Pan Li, Sarthak Ghosh, Vaibhav Gorde, Leman Akoglu
Main category: cs.AI
TL;DR: SLATE benchmark for evaluating tool-integrated LLM agents in e-commerce, plus EGB algorithm for uncertainty-aware search optimization
Details
Motivation: Current LLM agents struggle with multi-step tasks in large tool libraries due to lack of rigorous evaluation frameworks and the computational demands of exploring vast decision spaces.
Method: 1) Introduced SLATE benchmark for automated assessment of tool-integrated agents with context-aware evaluation; 2) Proposed Entropy-Guided Branching (EGB) algorithm that dynamically expands decision branches where predictive entropy is high.
Result: SLATE reveals current agents struggle with self-correction and search efficiency; EGB significantly enhances both task success rates and computational efficiency in experiments
Conclusion: The dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments
Abstract: Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.
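The entropy-guided idea can be sketched as a branching-width rule: widen the search where the policy's next-action distribution is uncertain, continue greedily where it is confident. The threshold, branch cap, and linear mapping from entropy to branch count below are illustrative choices, not EGB's published parameters.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-action probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branch_width(probs, max_branches=4, threshold=0.5):
    """Expand more decision branches where predictive entropy is high;
    continue greedily (a single branch) where the policy is confident."""
    h = entropy(probs)
    if h < threshold:
        return 1
    h_max = math.log(len(probs))        # entropy of the uniform distribution
    return min(max_branches, 1 + round((h / h_max) * (max_branches - 1)))
```

Applied at each planning step, this concentrates compute on the uncertain decision points that dominate long-horizon failure, which is the exploration-exploitation trade-off the paper optimizes.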
[376] El Agente Quntur: A research collaborator agent for quantum chemistry
Juan B. Pérez-Sánchez, Yunheng Zou, Jorge A. Campos-Gonzalez-Angulo, Marcel Müller, Ignacio Gustin, Andrew Wang, Han Hao, Tsz Wai Ko, Changhyeok Choi, Eric S. Isbrandt, Mohammad Ghazi Vakili, Hanyong Xu, Chris Crebolder, Varinia Bernales, Alán Aspuru-Guzik
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2602.04850 was rate-limited (HTTP 429).
[377] Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval
Willy Carlos Tchuitcheu, Tan Lu, Ann Dooms
Main category: cs.AI
TL;DR: The paper critiques linearization approaches to table representation learning, introduces the Platonic Representation Hypothesis for permutation-invariant table embeddings, proposes metrics to measure structural bias, and presents a structure-aware encoder architecture.
Details
Motivation: Current table representation learning methods adopt NLP's sequential paradigms, which linearize tables and discard their essential geometric and relational structure, making representations brittle to layout permutations and compromising structural integrity.
Method: 1) Retrospective analysis of table-reasoning tasks to highlight serialization bias; 2) Formal framework with two metrics based on Centered Kernel Alignment (CKA) to diagnose bias: PI (measures embedding drift under structural derangement) and rho (tracks convergence of latent structures); 3) Empirical analysis of LLMs’ vulnerability to layout permutations; 4) Novel structure-aware TRL encoder architecture that enforces the cell header alignment principle.
Result: Empirical analysis shows modern LLMs exhibit significant semantic shifts in table embeddings with minor layout permutations, exposing fundamental vulnerability in RAG systems. The proposed structure-aware encoder demonstrates superior geometric stability and moves toward permutation invariance.
Conclusion: The paper provides foundational critique of linearized table encoders and theoretical scaffolding for semantically stable, permutation-invariant retrieval, charting new direction for table reasoning in information systems.
Abstract: Historical approaches to Table Representation Learning (TRL) have largely adopted the sequential paradigms of Natural Language Processing (NLP). We argue that this linearization of tables discards their essential geometric and relational structure, creating representations that are brittle to layout permutations. This paper introduces the Platonic Representation Hypothesis (PRH) for tables, positing that a semantically robust latent space for table reasoning must be intrinsically Permutation Invariant (PI). To ground this hypothesis, we first conduct a retrospective analysis of table-reasoning tasks, highlighting the pervasive serialization bias that compromises structural integrity. We then propose a formal framework to diagnose this bias, introducing two principled metrics based on Centered Kernel Alignment (CKA): (i) PI, which measures embedding drift under complete structural derangement, and (ii) rho, a Spearman-based metric that tracks the convergence of latent structures toward a canonical form as structural information is incrementally restored. Our empirical analysis quantifies an expected flaw in modern Large Language Models (LLMs): even minor layout permutations induce significant, disproportionate semantic shifts in their table embeddings. This exposes a fundamental vulnerability in RAG systems, in which table retrieval becomes fragile to layout-dependent noise rather than to semantic content. In response, we present a novel, structure-aware TRL encoder architecture that explicitly enforces the cognitive principle of cell header alignment. This model demonstrates superior geometric stability and moves towards the PI ideal. Our work provides both a foundational critique of linearized table encoders and the theoretical scaffolding for semantically stable, permutation invariant retrieval, charting a new direction for table reasoning in information systems.
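Both proposed metrics are built on Centered Kernel Alignment, so a plain linear-CKA sketch shows the measurement primitive. The synthetic "table embeddings" below are random vectors standing in for encoder outputs before and after a layout permutation; only the CKA formula itself is standard.

```python
import random

def centre(M):
    """Column-centre an n x d matrix given as a list of rows."""
    n, d = len(M), len(M[0])
    mu = [sum(row[j] for row in M) / n for j in range(d)]
    return [[row[j] - mu[j] for j in range(d)] for row in M]

def cross_fro_sq(X, Y):
    """||X^T Y||_F^2 via pairwise column dot products."""
    s = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            dot = sum(X[r][i] * Y[r][j] for r in range(len(X)))
            s += dot * dot
    return s

def linear_cka(X, Y):
    """Linear CKA: 1.0 means identical representational geometry."""
    X, Y = centre(X), centre(Y)
    return cross_fro_sq(X, Y) / (cross_fro_sq(X, X) ** 0.5 * cross_fro_sq(Y, Y) ** 0.5)

rng = random.Random(0)
E = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]       # "table embeddings"
E_perm = [[v + 0.01 * rng.gauss(0, 1) for v in row] for row in E]   # tiny drift under permutation
E_other = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)] # unrelated tables
```

A permutation-invariant encoder should keep CKA between original and permuted embeddings near 1; the drift the paper's PI metric quantifies is the gap below that ideal.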
[378] DarwinNet: An Evolutionary Network Architecture for Agent-Driven Protocol Synthesis
Jinliang Xu, Bingqi Li
Main category: cs.AI
TL;DR: Summary unavailable; the arXiv API request for 2604.01236 was rate-limited (HTTP 429).
[379] Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation
Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri
Main category: cs.AI
TL;DR: Opinion-Aware RAG architecture addresses factual bias in retrieval systems by treating subjective content as valuable information rather than noise, improving diversity in opinion-rich domains.
Details
Motivation: Current RAG systems exhibit factual bias that treats opinions and diverse perspectives as noise, limiting effectiveness in real-world scenarios with subjective content and posing risks like echo chambers and underrepresentation of minority voices.
Method: Proposes Opinion-Aware RAG architecture with LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing, distinguishing between epistemic uncertainty (factual) and aleatoric uncertainty (opinion) in retrieval.
Result: Substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents in e-commerce seller forum evaluation.
Conclusion: Treating subjectivity as first-class citizen yields measurably more representative retrieval, providing empirical evidence for opinion-aware RAG as a step toward more balanced information synthesis.
Abstract: RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias - treating opinions and diverse perspectives as noise rather than information to be synthesized - limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval - a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.
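As an illustration of the kind of diversity being measured, here is a normalized-entropy proxy for sentiment diversity over a retrieved document set. The paper's exact metric behind the +26.8% figure is not given in this digest, so this formula is an assumption.

```python
import math
from collections import Counter

def sentiment_diversity(labels):
    """Normalized entropy of sentiment labels in a retrieved set: 0.0 when a
    single viewpoint dominates completely, 1.0 when perspectives are uniform."""
    counts = Counter(labels)
    n = len(labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0

# Illustrative retrieval outputs: a factually-biased baseline vs. an
# opinion-aware index surfacing a broader spread of viewpoints.
factual_biased = ["positive"] * 9 + ["negative"]
opinion_aware = ["positive"] * 4 + ["negative"] * 3 + ["neutral"] * 3
```

Under this proxy, preserving posterior entropy for opinion queries shows up directly as a higher diversity score for the opinion-aware retrieval.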
[380] Development, Evaluation, and Deployment of a Multi-Agent System for Thoracic Tumor Board
Tim Ellis-Caleo, Timothy Keyes, Nerissa Ambers, Faraah Bekheet, Wen-wai Yim, Nikesh Kotecha, Nigam H. Shah, Joel Neal
Main category: cs.AI
TL;DR: AI-based workflow for automated patient chart summarization in tumor board meetings, with evaluation against physician gold standards and deployment validation
Details
Motivation: Tumor boards need succinct patient case summaries to drive efficient and accurate discussions, but the current manual AI-based workflow is labor-intensive; automation could improve clinical workflow efficiency.
Method: Developed several automated AI chart summarization methods, evaluated them against physician gold-standard summaries using fact-based scoring rubrics, deployed the final automated tool with post-deployment monitoring, and validated an LLM-as-a-judge evaluation strategy.
Result: Comparative evaluations of automated summarization methods, successful deployment of final automated tool, and validation of LLM-based evaluation for fact-based scoring
Conclusion: Demonstrated successful integration of AI-based workflows into routine clinical practice for tumor board patient summarization
Abstract: Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI-based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold standard summaries and fact-based scoring rubrics. We report these comparative evaluations as well as our deployment of the final automated AI chart summarization tool along with post-deployment monitoring. We also validate the use of an LLM-as-a-judge evaluation strategy for fact-based scoring. This work is an example of integrating AI-based workflows into routine clinical practice.
[381] EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture
William Savage
Main category: cs.AI
TL;DR: EMBER is a hybrid cognitive architecture that places LLMs as replaceable reasoning engines within a biologically-grounded spiking neural network with associative memory, enabling autonomous, experience-modulated reasoning without external prompting.
Details
Motivation: The paper aims to reorganize the relationship between LLMs and memory by moving away from retrieval-augmented approaches toward a biologically-inspired architecture where LLMs serve as reasoning components within a persistent associative substrate, enabling more natural, experience-modulated cognitive processes.
Method: Uses a 220,000-neuron spiking neural network with spike-timing-dependent plasticity, four-layer hierarchical organization, inhibitory E/I balance, and reward-modulated learning. Text embeddings are encoded via a novel z-score standardized top-k population code. The SNN autonomously triggers and shapes LLM actions through STDP lateral propagation during idle operation.
Result: The system achieved 82.2% discrimination retention across embedding dimensionalities and demonstrated autonomous behavior: it initiated contact with a user after learned person-topic associations fired laterally during 8-hour idle period. From zero learned weights, first SNN-triggered action occurred after only 7 conversational exchanges.
Conclusion: EMBER demonstrates a viable alternative to retrieval-augmented LLMs by placing LLMs within biologically-inspired associative memory systems, enabling autonomous, experience-modulated reasoning that emerges from the interaction between neural substrate and language model components.
Abstract: We present EMBER (Experience-Modulated Biologically-inspired Emergent Reasoning), a hybrid cognitive architecture that reorganises the relationship between large language models (LLMs) and memory: rather than augmenting an LLM with retrieval tools, we place the LLM as a replaceable reasoning engine within a persistent, biologically-grounded associative substrate. The architecture centres on a 220,000-neuron spiking neural network (SNN) with spike-timing-dependent plasticity (STDP), four-layer hierarchical organisation (sensory/concept/category/meta-pattern), inhibitory E/I balance, and reward-modulated learning. Text embeddings are encoded into the SNN via a novel z-score standardised top-k population code that is dimension-independent by construction, achieving 82.2% discrimination retention across embedding dimensionalities. We show that STDP lateral propagation during idle operation can trigger and shape LLM actions without external prompting or scripted triggers: the SNN determines when to act and what associations to surface, while the LLM selects the action type and generates content. In one instance, the system autonomously initiated contact with a user after learned person-topic associations fired laterally during an 8-hour idle period. From a clean start with zero learned weights, the first SNN-triggered action occurred after only 7 conversational exchanges (14 messages).
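A rough sketch of a z-score standardized top-k population code, which is dimension-independent by construction: the output code size is fixed by the neuron count, not by the embedding width. The final step, hashing embedding dimensions onto neurons, is an invented detail so the sketch runs; the paper's actual dimension-to-neuron mapping is not specified in this digest.

```python
import random

def topk_population_code(embedding, n_neurons=1000, k=32, seed=0):
    """Z-score standardize the embedding, keep the k most extreme dimensions,
    and map each surviving dimension onto a neuron in a fixed-size population.
    The hashing step is an assumption made so the sketch is runnable."""
    n = len(embedding)
    mu = sum(embedding) / n
    sd = (sum((v - mu) ** 2 for v in embedding) / n) ** 0.5 or 1.0
    z = [(v - mu) / sd for v in embedding]
    top = sorted(range(n), key=lambda i: -abs(z[i]))[:k]   # most informative dims
    rng = random.Random(seed)                              # stable dim -> neuron map
    neuron_of_dim = [rng.randrange(n_neurons) for _ in range(n)]
    active = [False] * n_neurons
    for i in top:
        active[neuron_of_dim[i]] = True
    return active

src = random.Random(1)
code_a = topk_population_code([src.gauss(0, 1) for _ in range(384)])
code_b = topk_population_code([src.gauss(0, 1) for _ in range(1536)])
```

Embeddings of any width (384 or 1536 here) land in the same sparse 1000-neuron activation pattern, which is what lets the SNN substrate stay fixed while the LLM or embedder is swapped out.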
[382] Evaluating Relational Reasoning in LLMs with REL
Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik
Main category: cs.AI
TL;DR: The paper introduces Relational Complexity (RC) as a framework to evaluate LLMs’ ability to handle higher-arity relational reasoning across scientific domains, finding that performance degrades as relational complexity increases.
Details
Motivation: Existing evaluations of relational reasoning in LLMs focus on structured inputs and don't isolate the difficulty of higher-arity relational binding, which is central to scientific reasoning. The authors aim to study this gap through a principled framework.
Method: The authors define Relational Complexity (RC) as the minimum number of independent entities that must be simultaneously bound to apply a relation. They introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that systematically varies RC within each domain while controlling for confounders.
Result: Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when total entity count is fixed. This failure persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of required relational binding rather than insufficient inference steps or lack of examples.
Conclusion: The study identifies a regime of higher-arity reasoning where current models struggle, motivating re-examination of benchmarks through the lens of relational complexity and highlighting a fundamental limitation in current LLM architectures for scientific reasoning.
Abstract: Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when the total number of entities is held fixed. This failure mode persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or lack of exposure to examples. Our results identify a regime of higher-arity reasoning in which current models struggle, and motivate re-examining benchmarks through the lens of relational complexity.
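The controlled variable in REL (relation arity varying while total entity count stays fixed) can be illustrated with a toy algebra-item generator. REL's real item templates span chemistry and biology as well and are far richer; everything below is invented for illustration.

```python
import random

def make_rc_item(rc, n_entities=8, seed=0):
    """Generate one toy item: n_entities stays fixed while the relation's
    arity (the Relational Complexity being probed) varies with `rc`."""
    rng = random.Random(seed)
    values = {f"x{i}": rng.randint(1, 9) for i in range(n_entities)}
    bound = rng.sample(sorted(values), rc)   # entities the relation binds
    answer = sum(values[v] for v in bound)
    facts = "; ".join(f"{k} = {v}" for k, v in values.items())
    return f"{facts}. What is {' + '.join(bound)}?", answer
```

Because the prompt always presents the same eight entities, any accuracy drop from `rc=2` to `rc=5` items isolates the cost of binding more entities simultaneously rather than the cost of a longer input.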
[383] Policy-Invisible Violations in LLM-Based Agents
Jie Wu, Ming Gong
Main category: cs.AI
TL;DR: PhantomPolicy benchmark reveals LLM agents’ policy-invisible violations where compliance depends on hidden facts, and Sentinel framework uses counterfactual graph simulation for world-state-grounded enforcement.
Details
Motivation: LLM-based agents can violate organizational policies even when executing syntactically valid and user-sanctioned actions, because compliance depends on entity attributes, contextual state, or session history that are absent from the agent's visible context at decision time.
Method: Created PhantomPolicy benchmark with 600 model traces across eight violation categories, manually reviewed all traces. Introduced Sentinel framework that treats agent actions as proposed mutations to organizational knowledge graph, performs speculative execution to materialize post-action world state, and verifies graph-structural invariants.
Result: Manual review changed 32 labels (5.3%) relative to original case-level annotations. Sentinel substantially outperformed content-only DLP baseline (93.0% vs 68.8% accuracy) while maintaining high precision, though with room for improvement on certain violation categories.
Conclusion: Policy-invisible violations are a significant failure mode for LLM agents, and world-state-grounded enforcement through frameworks like Sentinel can substantially improve policy compliance by making policy-relevant world state available to the enforcement layer.
Abstract: LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent’s visible context. We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (93.0% vs. 68.8% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.
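Sentinel's speculative-execution loop can be sketched as: copy the graph, apply the proposed mutation, check invariants on the counterfactual state. The graph schema, the single invariant, and the entity names below are all invented, and the Clarify verdict and richer invariant families from the paper are omitted.

```python
import copy

# Toy organizational knowledge graph; node attributes and the single
# invariant below are invented, not Sentinel's published schema.
graph = {
    "doc_42":  {"type": "document", "visibility": "internal"},
    "alice":   {"type": "user", "affiliation": "internal"},
    "ext_bob": {"type": "user", "affiliation": "external"},
}
edges = set()  # (subject, predicate, object) triples

def invariant_ok(g, e):
    """Graph-structural invariant: internal documents are never shared
    with external users."""
    for s, p, o in e:
        if (p == "shared_with"
                and g[s].get("visibility") == "internal"
                and g[o].get("affiliation") == "external"):
            return False
    return True

def decide(action, g, e):
    """Speculatively materialize the post-action world state, then verify
    invariants on it before anything is executed for real."""
    g2, e2 = copy.deepcopy(g), set(e)
    e2.add(action)
    return "Allow" if invariant_ok(g2, e2) else "Block"

blocked = decide(("doc_42", "shared_with", "ext_bob"), graph, edges)
allowed = decide(("doc_42", "shared_with", "alice"), graph, edges)
```

The key point is that the violation is invisible in the action itself (both are plain share requests); it only appears once the world state, here the recipient's affiliation, is available to the enforcement layer.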
[384] TRUST Agents: A Collaborative Multi-Agent Framework for Fake News Detection, Explainable Verification, and Logic-Aware Claim Reasoning
Gautama Shastry Bulusu Venkata, Santhosh Kakarla, Maheedhar Omtri Mohan, Aishwarya Gaddam
Main category: cs.AI
TL;DR: TRUST Agents is a multi-agent framework for explainable fact verification that uses specialized agents for claim extraction, evidence retrieval, verification, and explanation generation, with extensions for complex claims through decomposition and multi-agent reasoning.
Details
Motivation: Current fact verification systems treat verification as simple classification tasks without transparency or explainability. The authors aim to create a framework that provides human-inspectable explanations, handles uncertainty, and can reason about complex claims through decomposition and multi-agent collaboration.
Method: A four-agent baseline pipeline: 1) claim extractor using NER, dependency parsing, and LLMs; 2) retrieval agent with hybrid BM25+FAISS search; 3) verifier agent comparing claims to evidence with calibrated confidence; 4) explainer agent generating human-readable reports. Extended version adds: decomposer agent for complex claims, multi-agent jury with specialized verifier personas, and logic aggregator for combining atomic verdicts.
Result: Evaluated on LIAR benchmark against fine-tuned BERT, RoBERTa, and zero-shot LLM baselines. Supervised encoders performed better on raw metrics, but TRUST Agents improved interpretability, evidence transparency, and reasoning over compound claims. Retrieval quality and uncertainty calibration were identified as main bottlenecks.
Conclusion: TRUST Agents provides a more transparent and explainable approach to fact verification, though performance on raw metrics still lags behind supervised models. The framework demonstrates the value of multi-agent reasoning and decomposition for complex claims, while highlighting retrieval and uncertainty calibration as key challenges.
Abstract: TRUST Agents is a collaborative multi-agent framework for explainable fact verification and fake news detection. Rather than treating verification as a simple true-or-false classification task, the system identifies verifiable claims, retrieves relevant evidence, compares claims against that evidence, reasons under uncertainty, and generates explanations that humans can inspect. The baseline pipeline consists of four specialized agents. A claim extractor uses named entity recognition, dependency parsing, and LLM-based extraction to identify factual claims. A retrieval agent performs hybrid sparse and dense search using BM25 and FAISS. A verifier agent compares claims with retrieved evidence and produces verdicts with calibrated confidence. An explainer agent then generates a human-readable report with explicit evidence citations. To handle complex claims more effectively, we introduce a research-oriented extension with three additional components: a decomposer agent inspired by LoCal-style claim decomposition, a Delphi-inspired multi-agent jury with specialized verifier personas, and a logic aggregator that combines atomic verdicts using conjunction, disjunction, negation, and implication. We evaluate both pipelines on the LIAR benchmark against fine-tuned BERT, fine-tuned RoBERTa, and a zero-shot LLM baseline. Although supervised encoders remain stronger on raw metrics, TRUST Agents improves interpretability, evidence transparency, and reasoning over compound claims. Results also show that retrieval quality and uncertainty calibration remain the main bottlenecks in trustworthy automated fact verification.
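The logic aggregator's connectives can be sketched as three-valued (Kleene-style) operators over atomic verdicts; the verdict vocabulary and the example decomposition below are illustrative, not taken from the paper.

```python
TRUE, FALSE, UNKNOWN = "true", "false", "unknown"

def logic_and(verdicts):
    if FALSE in verdicts:
        return FALSE          # one refuted conjunct refutes the compound claim
    return UNKNOWN if UNKNOWN in verdicts else TRUE

def logic_or(verdicts):
    if TRUE in verdicts:
        return TRUE           # one supported disjunct supports the claim
    return UNKNOWN if UNKNOWN in verdicts else FALSE

def logic_not(v):
    return {TRUE: FALSE, FALSE: TRUE, UNKNOWN: UNKNOWN}[v]

# Compound claim "A and (B or not C)" decomposed into atoms A, B, C,
# each judged independently by the verifier jury.
atoms = {"A": TRUE, "B": UNKNOWN, "C": FALSE}
verdict = logic_and([atoms["A"], logic_or([atoms["B"], logic_not(atoms["C"])])])
```

Propagating UNKNOWN rather than forcing a binary call is what lets the pipeline reason under uncertainty instead of guessing on under-evidenced atoms.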
[385] Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
Xu Zhang, Xudong Gong, Jiacheng Qin, Qiang Wang, JiaQi Liao, Zhe Wang, Dawei Feng, Bo Ding
Main category: cs.AI
TL;DR: A cognitive diagnostic framework using multidimensional Item Response Theory to estimate fine-grained abilities of LLMs across specific domains like mathematics, physics, chemistry, and computer science.
Details
Motivation: Current LLM evaluations aggregate performance into single scores, obscuring fine-grained ability variation and limiting targeted model improvement and ability-guided selection for specific tasks.
Method: Proposes a cognitive diagnostic framework using multidimensional Item Response Theory with item-ability association matrices to estimate fine-grained ability levels across multiple dimensions (35 for math, 27 for physics, 58 for chemistry, 12 for CS).
Result: Strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC 0.80-0.89 within benchmarks and 0.77-0.86 across benchmarks, substantially exceeding baselines.
Conclusion: Establishes a principled framework for fine-grained assessment of LLM abilities with applications in targeted training, ability-guided model selection, and ability-aware benchmark design.
Abstract: Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting targeted model improvement and ability-guided selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (questions of benchmark). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.
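The core of the multidimensional IRT setup above can be sketched in a few lines. This is an illustrative 2PL-style form under the usual MIRT assumptions (all numbers and the masking convention here are invented for the example, not the paper's fitted parameters): the probability of a correct response is a logistic function of the item's discrimination vector, masked by the item-ability association matrix, dotted with the model's ability vector, minus an item difficulty.

```python
import math

# Illustrative 2PL-style MIRT response model: P(correct) is a logistic
# function of (masked discrimination) . (ability) - difficulty. The Q-mask
# plays the role of the item-ability association matrix row for one item.

def p_correct(theta, a, q_mask, b):
    """theta: ability vector; a: discrimination vector;
    q_mask: 0/1 vector of dimensions the item measures; b: difficulty."""
    z = sum(t * ai * qi for t, ai, qi in zip(theta, a, q_mask)) - b
    return 1.0 / (1.0 + math.exp(-z))

theta = [1.2, -0.3, 0.8]   # abilities on three fine-grained dimensions
a     = [1.0, 1.0, 0.5]    # item discriminations
q     = [1, 0, 1]          # this item loads on dimensions 0 and 2 only
print(round(p_correct(theta, a, q, b=0.5), 2))  # 0.75
```

Fitting these probabilities against observed right/wrong item responses is what yields the per-dimension ability estimates, and evaluating them on held-out items is what produces the AUC figures reported above.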
[386] Latent patterns of urban mixing in mobility analysis across five global cities
Z. Fan, B. P. Y. Loo, F. Duarte, C. Ratti, E. Moro
Main category: cs.AI
TL;DR: This study analyzes large-scale travel survey data from five global cities to understand social mixing patterns, revealing that mobility patterns shape social mixing more than sociodemographic factors, with activity spaces remaining stratified by income despite similar mixing levels.
Details
Motivation: The paper aims to understand social mixing patterns in urban environments using large-scale travel surveys, addressing limitations of high-resolution mobility data alone in revealing social interactions and socioeconomic influences on urban social dynamics.
Method: Uses travel surveys from 200,000+ residents across five cities, applies graph neural networks to construct spatio-temporal place networks, and employs supervised autoencoders to predict individual exposure vectors from home-space, activity-space, and demographic attributes.
Result: Activity space structure explains most variation in place exposure; mobility shapes social mixing more than sociodemographics; income groups experience similar mixing levels but have stratified activity spaces; transit proximity reduces socioeconomic influence on mixing.
Conclusion: Mobility patterns are more important than sociodemographic characteristics in shaping social mixing experiences, though activity spaces remain economically stratified, highlighting complex urban social dynamics beyond simple mobility data analysis.
Abstract: This study leverages large-scale travel surveys for over 200,000 residents across Boston, Chicago, Hong Kong, London, and Sao Paulo. With rich individual-level data, we make systematic comparisons and reveal patterns in social mixing that cannot be identified by analyzing high-resolution mobility data alone. Using the same set of data, inferring socioeconomic status from residential neighborhoods yields social mixing levels 16% lower than using self-reported survey data. In addition, individuals over the age of 66 experience greater social mixing than those in late working life (aged 55 to 65), lending data-driven support to the “second youth” hypothesis. Teenagers and women with caregiving responsibilities exhibit lower social mixing levels. Across the five cities, proximity to major transit stations reduces the influence of individual socioeconomic status on social mixing. Finally, we construct detailed spatio-temporal place networks for each city using a graph neural network. Inputs of home-space, activity-space and demographic attributes are embedded and fed into a supervised autoencoder to predict individual exposure vectors. Results show that the structure of individual activity space, i.e., where people travel to, explains most of the variation in place exposure, suggesting that mobility shapes experienced social mixing more than sociodemographic characteristics, home environment, and transit proximity. Ablation tests further show that, while different income groups may experience similar levels of social mixing, their activity spaces remain stratified by income, resulting in structurally different social mixing experiences.
[387] Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Weikang Zhang, Zimo Zhu, Zhichuan Yang, Chen Huang, Wenqiang Lei, See-Kiong Ng
Main category: cs.AI
TL;DR: StsPatient: A method for fine-grained simulation of cognitively impaired patients using steering vectors and stochastic token modulation for severity control
Details
Motivation: Existing methods for simulating Standardized Patients with cognitive impairment rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels, limiting their effectiveness for clinical training.
Method: Proposes StsPatient which captures domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses, and introduces Stochastic Token Modulation (STM) mechanism to regulate intervention probability for severity control.
Result: Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.
Conclusion: StsPatient provides a scalable and ethical solution for clinical training by enabling fine-grained simulation of cognitively impaired patients with precise control over impairment severity.
Abstract: Simulating Standardized Patients with cognitive impairment offers a scalable and ethical solution for clinical training. However, existing methods rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels. To address this limitation, we propose StsPatient for the fine-grained simulation of cognitively impaired patients. We innovatively capture domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses. Furthermore, we introduce a Stochastic Token Modulation (STM) mechanism to regulate the intervention probability. STM enables precise control over impairment severity while mitigating the instability of conventional vector methods. Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.
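The two mechanisms in the abstract can be sketched with toy vectors. This is a hedged illustration of the general steering-vector recipe (mean activation difference over contrastive pairs) and of stochastic per-token application, not the paper's implementation; all dimensions, names, and the severity parameter `p` are invented for the example.

```python
import random

# Toy sketch: (1) extract a steering vector as the mean activation difference
# between contrastive (impaired vs. healthy) examples; (2) stochastic token
# modulation: add the vector at each token position only with probability p,
# so p acts as a continuous severity knob.

def steering_vector(pos_acts, neg_acts):
    """Mean difference between positive and negative activation vectors."""
    dim = len(pos_acts[0])
    pos_mean = [sum(v[i] for v in pos_acts) / len(pos_acts) for i in range(dim)]
    neg_mean = [sum(v[i] for v in neg_acts) / len(neg_acts) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def apply_stm(token_acts, vec, p, alpha=1.0, rng=random):
    """Add alpha*vec to each token's activation with probability p."""
    out = []
    for act in token_acts:
        if rng.random() < p:
            act = [x + alpha * v for x, v in zip(act, vec)]
        out.append(act)
    return out

vec = steering_vector([[1.0, 0.0], [1.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]])
print(vec)  # [1.0, 0.0]
```

Applying the vector stochastically per token, instead of at every position, is what the summary credits with smoother severity control and less instability than always-on vector steering.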
[388] Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
Vasundra Srinivasan
Main category: cs.AI
TL;DR: MMA2A architecture enables modality-native routing in agent networks, improving multimodal task accuracy by 20 percentage points over text-bottleneck baselines when paired with capable LLM reasoning.
Details
Motivation: Current agent-to-agent networks often use text bottlenecks that lose multimodal signal fidelity, limiting cross-modal reasoning accuracy. The paper aims to preserve native modality signals across agent boundaries to enable more accurate multimodal reasoning.
Method: Proposes MMA2A (Multimodal Agent-to-Agent) architecture layer that inspects Agent Card capability declarations to route voice, image, and text parts in their native modalities rather than converting everything to text.
Result: On CrossModal-CS benchmark (50 tasks), MMA2A achieves 52% task completion accuracy vs 32% for text-bottleneck baseline. Gains are concentrated on vision-dependent tasks: +38.5pp for product defect reports and +16.7pp for visual troubleshooting, at 1.8× latency cost.
Conclusion: Routing is a first-order design variable in multi-agent systems that determines available information for downstream reasoning. Modality-native routing paired with capable LLM reasoning significantly improves multimodal task accuracy.
Abstract: Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $Δ$TCA: [8, 32] pp; McNemar’s exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a $1.8\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
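The routing decision the abstract describes can be sketched as a capability check. This is a hypothetical illustration of the idea only: the card structure, field names, and text-transcript fallback below are invented for the example and are not the A2A protocol's actual schema.

```python
# Hypothetical sketch of modality-native routing: a message part is forwarded
# unchanged only if the target agent's capability declaration (an Agent
# Card-style record) lists that modality; otherwise it degrades to a text
# transcription -- the "text bottleneck" the baseline always pays.

AGENT_CARDS = {
    "vision_agent": {"modalities": {"text", "image"}},
    "voice_agent":  {"modalities": {"text", "audio"}},
    "text_agent":   {"modalities": {"text"}},
}

def route_part(part, target_agent, cards=AGENT_CARDS):
    """Return the part natively if supported, else its text fallback."""
    supported = cards[target_agent]["modalities"]
    if part["modality"] in supported:
        return part
    return {"modality": "text", "content": part.get("transcript", "")}

image_part = {"modality": "image",
              "content": b"<jpeg bytes>",
              "transcript": "photo of a scratched panel"}
print(route_part(image_part, "vision_agent")["modality"])  # image
print(route_part(image_part, "text_agent")["modality"])    # text
```

The paper's central finding maps onto this sketch directly: preserving `image` on the first path only helps if the downstream agent can actually reason over image content; with keyword matching downstream, both paths perform identically.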
[389] Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams
Xiuxiu Tang, G. Alex Ambrose, Ying Cheng
Main category: cs.AI
TL;DR: GPT-4o can reliably score handwritten physics responses when using well-structured skill-based rubrics, with human-AI agreement comparable to human inter-rater reliability, especially for high- and low-performing responses.
Details
Motivation: Handwritten STEM assessments combining symbols, calculations, and diagrams are time-consuming to score and prone to rater inconsistency, creating a need for reliable AI-assisted scoring solutions.
Method: Used GPT-4o to score 20 authentic handwritten physics exam responses with skill-based rubrics of varying granularity, systematically varying prompting formats and temperature settings, comparing with four human instructors across two scoring rounds.
Result: Human-AI agreement on total scores was comparable to human inter-rater reliability, highest for high- and low-performing responses, but declined for mid-level responses with partial/ambiguous reasoning. Fine-grained checklist rubrics improved consistency over holistic scoring.
Conclusion: Reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, with prompting format playing a secondary role and temperature having limited impact. Provides design recommendations for LLM-assisted scoring in STEM contexts.
Abstract: Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students’ reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.
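The contrast between holistic and fine-grained checklist rubrics can be made concrete with a small sketch. The rubric items, weights, and example marks below are invented for illustration; the point is structural: scoring each criterion 0/1 makes both the total and the per-criterion human-AI agreement directly computable.

```python
# Illustrative checklist rubric: each criterion is marked 0/1 and the total
# is the weighted sum, so agreement can be measured criterion by criterion
# rather than only on holistic totals. All items and weights are made up.

RUBRIC = [
    ("identifies_forces", 2.0),
    ("correct_free_body_diagram", 2.0),
    ("applies_newton_second_law", 3.0),
    ("correct_numeric_answer", 3.0),
]

def total_score(checks, rubric=RUBRIC):
    """checks maps criterion name -> 0 or 1; returns the weighted total."""
    return sum(w * checks.get(name, 0) for name, w in rubric)

def criterion_agreement(human, ai, rubric=RUBRIC):
    """Fraction of criteria where human and AI awarded the same mark."""
    names = [name for name, _ in rubric]
    return sum(human.get(n, 0) == ai.get(n, 0) for n in names) / len(names)

human = {"identifies_forces": 1, "correct_free_body_diagram": 1,
         "applies_newton_second_law": 1, "correct_numeric_answer": 0}
ai    = {"identifies_forces": 1, "correct_free_body_diagram": 0,
         "applies_newton_second_law": 1, "correct_numeric_answer": 0}
print(total_score(human), criterion_agreement(human, ai))  # 7.0 0.75
```

This decomposition is why the study can report that agreement was stronger on "clearly defined conceptual skills" than on "extended procedural judgments": each criterion's agreement is observable in isolation.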
[390] HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models
Jawad Hossain, Xiangyu Guo, Jiawei Zhou, Chong Liu
Main category: cs.AI
TL;DR: A hint-assisted reasoning framework that uses two small language models (SLMs) working cooperatively - one generates context-aware hints while the other reasons through mathematical problems, improving accuracy without increasing model size.
Details
Motivation: Small language models struggle with complex mathematical reasoning due to limited capacity for maintaining long reasoning chains and recovering from early errors. There's a need for lightweight approaches to enhance SLM reasoning capabilities without scaling model size.
Method: Two-model cooperative system: (1) A hint-generating SLM trained via distillation from a strong LLM produces context-aware hints based on problem statement and accumulated reasoning history; (2) A reasoning SLM uses these hints to solve problems stepwise. Hints provide localized guidance without revealing full solutions.
Result: Experiments across diverse mathematical benchmarks show hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency.
Conclusion: Structured collaboration between SLMs via hint generation and reasoning offers an effective, lightweight mechanism for enhancing mathematical reasoning capabilities without scaling model size.
Abstract: Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs, via hint generation and reasoning, offers an effective and lightweight mechanism for enhancing mathematical reasoning.
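The cooperative two-model control flow described above can be sketched with stub functions. The stubs below merely stand in for the fine-tuned hinter and reasoner SLMs (every function name and the toy three-step stopping rule are invented); what the sketch shows is the loop shape: the hinter conditions on the problem plus accumulated history, and the reasoner conditions on both plus the hint.

```python
# Sketch of the hint-assisted loop with stubs in place of the two SLMs.
# Real systems would call fine-tuned language models here.

def hint_model(problem, history):
    """Stub hinter: points at the next subproblem given progress so far."""
    return f"hint for step {len(history) + 1} of: {problem}"

def reasoning_model(problem, history, hint):
    """Stub reasoner: produces the next step conditioned on the hint."""
    step = f"step {len(history) + 1} (using '{hint}')"
    done = len(history) + 1 >= 3          # toy stopping rule
    return step, done

def hint_assisted_solve(problem, max_steps=10):
    history = []
    for _ in range(max_steps):
        hint = hint_model(problem, history)            # hinter sees history
        step, done = reasoning_model(problem, history, hint)
        history.append(step)                           # accumulate reasoning
        if done:
            break
    return history

steps = hint_assisted_solve("2x + 3 = 11, solve for x")
print(len(steps))  # 3
```

Because each hint is regenerated from the accumulated history, a wrong early step changes subsequent hints rather than silently propagating, which is the error-recovery mechanism the abstract emphasizes.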
[391] A Scoping Review of Large Language Model-Based Pedagogical Agents
Shan Li, Juan Zheng
Main category: cs.AI
TL;DR: Scoping review of LLM-based pedagogical agents in education, analyzing 52 studies from 2022-2025 to identify design dimensions, trends, and ethical considerations.
Details
Motivation: Traditional pedagogical agents have been well-studied, but LLM integration represents a transformative advancement with unprecedented natural language understanding, reasoning, and adaptation capabilities that need systematic review.
Method: Following PRISMA-ScR guidelines, analyzed 52 studies across five major databases from November 2022 to January 2025 to examine LLM-based pedagogical agents in educational settings.
Result: Identified diverse LLM-based agents across K-12, higher education, and informal learning contexts, with four key design dimensions: interaction approach, domain scope, role complexity, and system integration. Found emerging trends including multi-agent systems, virtual student simulation, immersive technology integration, and learning analytics combinations.
Conclusion: LLM-based pedagogical agents represent a rapidly evolving field with significant potential, but require attention to research gaps and ethical considerations regarding privacy, accuracy, and student autonomy.
Abstract: This scoping review examines the emerging field of Large Language Model (LLM)-based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA-ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. Our findings reveal diverse LLM-based agents spanning K-12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain-specific vs. general-purpose), role complexity (single-role vs. multi-role), and system integration (standalone vs. integrated). Emerging trends include multi-agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM-based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.
[392] GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
Zhaofen Wu, Hanrong Zhang, Fulin Lin, Wujiang Xu, Xinran Xu, Yankai Chen, Henry Peng Zou, Shaowen Chen, Weizhi Zhang, Xue Liu, Philip S. Yu, Hongwei Wang
Main category: cs.AI
TL;DR: GAM: Hierarchical graph-based memory framework for LLM agents that decouples encoding from consolidation to balance rapid context perception with stable knowledge retention
Details
Motivation: Current LLM agent memory systems face a fundamental tension: stream-based systems allow context updates but are vulnerable to noise interference, while structured memory provides robust retention but struggles with evolving narratives. There's a need to resolve the conflict between rapid context perception and stable knowledge retention.
Method: Proposes GAM, a hierarchical Graph-based Agentic Memory framework that explicitly decouples memory encoding from consolidation. It isolates ongoing dialogue in an event progression graph and integrates it into a topic associative network only upon semantic shifts. Also introduces a graph-guided, multi-factor retrieval strategy for enhanced context precision.
Result: Experiments on LoCoMo and LongDialQA benchmarks show that GAM consistently outperforms state-of-the-art baselines in both reasoning accuracy and efficiency.
Conclusion: GAM effectively resolves the conflict between rapid context perception and stable knowledge retention in LLM agents, providing a robust memory framework for coherent long-term interactions while minimizing interference and preserving long-term consistency.
Abstract: To sustain coherent long-term interactions, Large Language Model (LLM) agents must navigate the tension between acquiring new information and retaining prior knowledge. Current unified stream-based memory systems facilitate context updates but remain vulnerable to interference from transient noise. Conversely, discrete structured memory architectures provide robust knowledge retention but often struggle to adapt to evolving narratives. To address this, we propose GAM, a hierarchical Graph-based Agentic Memory framework that explicitly decouples memory encoding from consolidation to effectively resolve the conflict between rapid context perception and stable knowledge retention. By isolating ongoing dialogue in an event progression graph and integrating it into a topic associative network only upon semantic shifts, our approach minimizes interference while preserving long-term consistency. Additionally, we introduce a graph-guided, multi-factor retrieval strategy to enhance context precision. Experiments on LoCoMo and LongDialQA indicate that our method consistently outperforms state-of-the-art baselines in both reasoning accuracy and efficiency.
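The encode/consolidate decoupling above can be sketched with a toy memory. Everything here is an illustration under simplifying assumptions: a plain list stands in for the event progression graph, another for the topic associative network, and Jaccard word overlap stands in for a real semantic-shift detector.

```python
# Toy sketch of decoupled encoding and consolidation: new utterances
# accumulate in a short-term buffer and are flushed into long-term memory
# only when a semantic shift (low overlap with the previous utterance)
# is detected, insulating long-term structure from transient noise.

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

class ToyMemory:
    def __init__(self, shift_threshold=0.2):
        self.buffer = []          # short-term: current event progression
        self.long_term = []       # consolidated topic segments
        self.threshold = shift_threshold

    def encode(self, utterance):
        if self.buffer and jaccard(utterance, self.buffer[-1]) < self.threshold:
            self.consolidate()    # semantic shift -> flush the buffer
        self.buffer.append(utterance)

    def consolidate(self):
        if self.buffer:
            self.long_term.append(list(self.buffer))
            self.buffer.clear()

mem = ToyMemory()
mem.encode("we booked the flight to tokyo")
mem.encode("the flight to tokyo departs friday")
mem.encode("my cat needs a vet appointment")   # topic shift triggers flush
print(len(mem.long_term), len(mem.buffer))     # 1 1
```

Consolidating only at topic boundaries, rather than after every turn, is the design choice the abstract credits with minimizing interference while preserving long-term consistency.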
[393] Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, Qinhuai Na
Main category: cs.AI
TL;DR: Frontier-Eng is a benchmark for evaluating LLM agents on generative optimization tasks through iterative propose-execute-evaluate loops across 47 engineering tasks with industrial-grade simulators and verifiers.
Details
Motivation: Current LLM agent benchmarks focus on binary pass/fail tasks and neglect real-world engineering value captured through iterative optimization of feasible designs. There's a need for benchmarks that evaluate agents' ability to integrate domain knowledge with executable feedback for complex engineering problems.
Method: Introduces Frontier-Eng benchmark with 47 tasks across five engineering categories, using industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints. Evaluates eight frontier language models with representative search frameworks in iterative propose-execute-evaluate loops.
Result: Claude 4.6 Opus achieves the most robust performance, but the benchmark remains challenging for all models. Analysis reveals dual power-law decay in improvement frequency (~1/iteration) and magnitude (~1/improvement count). Width improves parallelism and diversity, but depth remains crucial for hard-won improvements under fixed budgets.
Conclusion: Frontier-Eng establishes a new standard for assessing AI agents’ capacity to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems, moving beyond binary pass/fail evaluation.
Abstract: Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization – an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget – spanning $47$ tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency ($\sim$ 1/iteration) and magnitude ($\sim$ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
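The propose-execute-evaluate loop the benchmark formalizes can be sketched in a few lines. The quadratic toy verifier and the greedy proposer below are stand-ins (all names and numbers invented) for the industrial simulators and LLM agents the paper actually uses; the sketch shows the contract: a continuous reward, a hard feasibility flag, and a fixed interaction budget.

```python
# Sketch of generative optimization: under a fixed budget, the agent proposes
# a candidate, the verifier returns (continuous reward, feasibility), and
# only feasible improvements over the incumbent are kept.

def toy_verifier(x):
    """Continuous reward with a hard feasibility constraint (x must be >= 0)."""
    feasible = x >= 0
    reward = -(x - 3.0) ** 2          # maximized at x = 3
    return reward, feasible

def optimize(propose, verify, budget=10, x0=0.0):
    best_x, best_reward = None, float("-inf")
    x = x0
    for _ in range(budget):           # fixed interaction budget
        reward, feasible = verify(x)
        if feasible and reward > best_reward:
            best_x, best_reward = x, reward
        x = propose(x, reward, feasible)   # agent revises the candidate
    return best_x, best_reward

greedy_step = lambda x, reward, feasible: x + 0.5
best_x, best_reward = optimize(greedy_step, toy_verifier, budget=10)
print(best_x)  # 3.0
```

The paper's width-versus-depth finding maps onto this loop: width means running many such loops in parallel from different starts, while depth means spending more of the budget iterating a single loop.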
[394] MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents
Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, Heuiseok Lim
Main category: cs.AI
TL;DR: MultiDocFusion is a multimodal chunking pipeline for industrial documents that combines vision-based document parsing, OCR, LLM-based hierarchical parsing, and DFS-based chunking to improve RAG-based QA performance by preserving document structure.
Details
Motivation: Conventional text chunking approaches for RAG-based QA often fail to capture the complex hierarchical structures of industrial documents, leading to information loss and reduced answer quality. There's a need for structure-aware chunking that preserves document layout and organization.
Method: Four-stage pipeline: 1) Vision-based document parsing to detect document regions, 2) OCR for text extraction from detected regions, 3) LLM-based document section hierarchical parsing (DSHP-LLM) to reconstruct document structure into hierarchical tree, 4) DFS-based grouping to construct hierarchical chunks.
Result: Extensive experiments on industrial benchmarks show MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baseline methods, demonstrating significant performance gains.
Conclusion: Explicitly leveraging document hierarchy through multimodal structure-aware chunking is critical for enhancing RAG-based QA systems, with MultiDocFusion providing substantial improvements in retrieval and answer quality for industrial documents.
Abstract: RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.
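Stage (iv) of the pipeline, DFS-based grouping, can be sketched as follows. The tree shape, field names, and size limit are invented for illustration; the idea is that a depth-first traversal of the recovered section tree emits chunks that carry each section's full heading path, so retrieval sees structure rather than a flat text stream.

```python
# Sketch of DFS-based hierarchical chunking: traverse the section tree
# depth-first and emit (heading_path, text) chunks, splitting long sections
# to respect a chunk size limit.

def dfs_chunks(node, path=(), max_chars=200):
    """Yield (heading_path, text) chunks from a section tree via DFS."""
    path = path + (node["title"],)
    text = node.get("text", "")
    for i in range(0, len(text), max_chars):
        yield " > ".join(path), text[i:i + max_chars]
    for child in node.get("children", []):
        yield from dfs_chunks(child, path, max_chars)

doc = {
    "title": "Pump Manual",
    "children": [
        {"title": "Installation", "text": "Mount the pump on a level base."},
        {"title": "Maintenance", "children": [
            {"title": "Monthly", "text": "Check seal wear and oil level."},
        ]},
    ],
}
chunks = list(dfs_chunks(doc))
print(chunks[0][0])  # Pump Manual > Installation
```

Keeping the heading path attached to every chunk is what preserves the hierarchy that flat fixed-size chunking discards, which the paper identifies as the source of its retrieval-precision gains.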
[395] ReflectCAP: Detailed Image Captioning with Reflective Memory
Kyungmin Min, Minbeom Kim, Kang-il Lee, Seunghyun Yoon, Kyomin Jung
Main category: cs.AI
TL;DR: ReflectCAP uses multi-agent analysis to identify what LVLMs hallucinate and overlook, creating reusable guidelines (Structured Reflection Notes) that steer captioning models to produce more factual and comprehensive detailed captions.
Details
Motivation: Existing methods for detailed image captioning struggle to simultaneously achieve factual grounding and fine-grained coverage. There's a tension between these two objectives that current approaches haven't effectively resolved.
Method: Multi-agent pipeline analyzes target LVLMs to identify consistent hallucinations and systematic oversights, distilling these patterns into reusable Structured Reflection Notes. At inference, these notes guide the captioning model on what to avoid (hallucinations) and what to attend to (coverage gaps).
Result: Applied to 8 LVLMs (GPT-4.1 family, Qwen series, InternVL variants), ReflectCAP reaches Pareto frontier for factuality-coverage trade-off, delivers substantial gains on CapArena-Auto benchmark, and offers better quality-compute trade-off than model scaling or existing multi-agent pipelines (21-36% lower overhead).
Conclusion: ReflectCAP enables high-quality detailed captioning that balances factuality and coverage while being computationally efficient, making it viable under real-world cost and latency constraints.
Abstract: Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes – what to avoid and what to attend to – yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21–36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.
[396] Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
Songping Peng, Zhiheng Zhang, Daojian Zeng, Lincheng Jiang, Xieping Gao
Main category: cs.AI
TL;DR: CWAC: A method that simultaneously constrains weight updates and regularizes safety-critical activations to preserve safety alignment during LLM fine-tuning.
Details
Motivation: Safety alignment in LLMs is fragile during fine-tuning, where even benign adaptation can degrade refusal behaviors and enable harmful responses. Existing defenses only constrain weights or activations in isolation, ignoring their coupled effects on safety.
Method: Proposes Coupled Weight and Activation Constraints (CWAC): 1) enforces precomputed safety subspace constraints on weight updates, and 2) applies targeted regularization to safety-critical features identified by sparse autoencoders.
Result: Extensive experiments across four widely used LLMs and diverse downstream tasks show CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.
Conclusion: CWAC provides a robust approach to preserve safety alignment during fine-tuning by simultaneously constraining both weights and activations, addressing limitations of existing methods that consider these components in isolation.
Abstract: Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.
[397] Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models
Lei Lin, Jizhao Zhu, Yong Liu, Donghong Sun, Hongbo He, Yihua Du
Main category: cs.AI
TL;DR: HCoT (Heuristic-Classification-of-Thoughts) prompting schema improves LLM reasoning by integrating structured problem-solving guidance to address stochastic generation and static reasoning-decoupling limitations.
Details
Motivation: Address two key LLM limitations: (1) Bayesian-like stochastic generation leads to random decision trajectories rather than deterministic planning, and (2) static decoupling of reasoning and decision-making prevents dynamically retrieved knowledge from adjusting reasoning strategies.
Method: Proposes HCoT (Heuristic-Classification-of-Thoughts) prompting schema that integrates LLM reasoning with a structured problem space via a heuristic classification model controlling the reasoning process and providing reusable abstract solutions.
Result: Outperforms existing approaches (Tree-of-Thoughts, Chain-of-Thoughts) on complex inductive reasoning tasks with ill-defined search spaces. On 24 Game task, achieves significantly higher token efficiency than Tree-of-Thoughts-BFS. Achieves Pareto frontier balance between accuracy and computational cost.
Conclusion: HCoT provides effective method to guide LLM reasoning, addressing stochastic generation and reasoning-decoupling issues while offering reusable solutions and better performance-efficiency trade-offs.
Abstract: This paper addresses two limitations of large language models (LLMs) in solving complex problems: (1) their reasoning processes exhibit Bayesian-like stochastic generation, where each token is sampled from a context-dependent probability distribution, leading to inherently random decision trajectories rather than deterministic planning; (2) the reasoning and decision-making mechanisms are statically decoupled, meaning dynamically retrieved domain knowledge fails to dynamically adjust the underlying reasoning strategy. These dual deficiencies result in initial decisions lacking strategic anchoring and reasoning chains often failing to converge on correct solutions, as stochastic generation lacks mechanisms for trajectory correction or knowledge-guided optimization during sequential reasoning. To resolve these issues, we propose a problem-solving method integrated into the LLM’s generation process to guide reasoning. This method, compatible with numerous LLMs and featuring reusable solutions, is grounded in a novel Heuristic-Classification-of-Thoughts prompting schema (HCoT). HCoT synergizes the LLM’s reasoning ability with a structured problem space via a heuristic classification model that controls the reasoning process and provides reusable abstract solutions. Evaluated on two complex inductive reasoning tasks with ill-defined search spaces, HCoT outperforms existing approaches (e.g., Tree-of-Thoughts and Chain-of-Thoughts prompting) in performance. On the well-structured 24 Game task, HCoT demonstrates significantly higher token efficiency compared to the state-of-the-art Tree-of-Thoughts-Breadth-First-Search. In terms of both accuracy and token usage, HCoT achieves a Pareto frontier balance, offering a strong trade-off between performance and computational cost.
[398] Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments
Esen Kurt, Haithem Afli
Main category: cs.AI
TL;DR: A sequential unlearning framework for LLMs that separates retention and suppression objectives to address privacy regulations like GDPR’s Right to be Forgotten in politically sensitive deployments.
Details
Motivation: LLMs deployed in politically sensitive environments face regulatory challenges under frameworks like GDPR, requiring technical solutions for data erasure (Right to be Forgotten) while maintaining model utility.
Method: Lightweight sequential unlearning framework with two phases: 1) positive fine-tuning to stabilize benign capabilities, followed by 2) layer-restricted negative fine-tuning to suppress sensitive patterns while preserving general language competence.
Result: Effective behavioral suppression on the SemEval-2025 LLM Unlearning benchmark with minimal impact on factual accuracy and fluency; GPT-2 shows greater robustness than DistilGPT-2, highlighting the importance of model capacity.
Conclusion: Sequential unlearning provides a practical, reproducible mechanism for operationalizing data erasure requirements in politically deployed LLMs, balancing privacy compliance with model utility.
Abstract: Large Language Models (LLMs) are increasingly deployed in politically sensitive environments, where memorisation of personal data or confidential content raises regulatory concerns under frameworks such as the GDPR and its Right to be Forgotten. Translating such legal principles into large-scale generative systems presents significant technical challenges. We introduce a lightweight sequential unlearning framework that explicitly separates retention and suppression objectives. The method first stabilises benign capabilities through positive fine-tuning, then applies layer-restricted negative fine-tuning to suppress designated sensitive patterns while preserving general language competence. Experiments on the SemEval-2025 LLM Unlearning benchmark demonstrate effective behavioural suppression with minimal impact on factual accuracy and fluency. GPT-2 exhibits greater robustness than DistilGPT-2, highlighting the role of model capacity in privacy-aligned adaptation. We position sequential unlearning as a practical and reproducible mechanism for operationalising data erasure requirements in politically deployed LLMs.
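The two-phase structure described above can be illustrated with a minimal toy sketch. All names here are hypothetical stand-ins, not the paper's implementation: `retain_step`/`forget_step` substitute for gradient updates on benign and sensitive data, and a dict of named scalar "layers" substitutes for model parameters.

```python
def sequential_unlearn(model_layers, retain_step, forget_step,
                       restricted=("layer_10", "layer_11")):
    """Two-phase sketch of sequential unlearning.

    Phase 1 (positive fine-tuning) updates every layer to stabilise
    benign capabilities; phase 2 (negative fine-tuning) touches only a
    restricted layer subset to suppress sensitive patterns.
    """
    for name in model_layers:                 # phase 1: retain
        model_layers[name] = retain_step(model_layers[name])
    for name in restricted:                   # phase 2: suppress
        if name in model_layers:
            model_layers[name] = forget_step(model_layers[name])
    return model_layers

# Toy "model": layer name -> scalar weight; real use would update tensors.
layers = {"layer_09": 0.0, "layer_10": 0.0, "layer_11": 0.0}
result = sequential_unlearn(layers, lambda w: w + 1.0, lambda w: w - 0.5)
print(result)  # {'layer_09': 1.0, 'layer_10': 0.5, 'layer_11': 0.5}
```

Restricting phase 2 to a few layers is what limits collateral damage to general language competence in the paper's design; here that restriction is just the `restricted` tuple.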
[399] Enhancing Clustering: An Explainable Approach via Filtered Patterns
Motaz Ben Hassine, Saïd Jabbour
Main category: cs.AI
TL;DR: Proposes a pattern reduction framework for explainable clustering that eliminates redundant k-relaxed frequent patterns to improve computational efficiency while maintaining cluster quality.
Details
Motivation: Current explainable clustering approaches using k-relaxed frequent patterns suffer from redundancy where multiple distinct patterns induce identical k-covers, unnecessarily enlarging the search space and increasing computational complexity during cluster construction.
Method: Threefold approach: 1) Formal characterization of conditions where distinct k-RFPs induce identical k-covers, 2) Optimization strategy removing redundant patterns by retaining a single representative per distinct k-cover, 3) Analysis of interpretability and representativeness of patterns selected by the ILP model through robustness analysis.
Result: Extensive experiments on real-world datasets show significant reduction in pattern search space, improved computational efficiency, and preservation/enhancement of cluster quality.
Conclusion: The proposed pattern reduction framework effectively addresses redundancy in explainable clustering, making the approach more scalable while maintaining interpretability and clustering performance.
Abstract: Machine learning has become a central research area, with increasing attention devoted to explainable clustering, also known as conceptual clustering, which is a knowledge-driven unsupervised learning paradigm that partitions data into θ disjoint clusters, where each cluster is described by an explicit symbolic representation, typically expressed as a closed pattern or itemset. By providing human-interpretable cluster descriptions, explainable clustering plays an important role in explainable artificial intelligence and knowledge discovery. Recent work improved clustering quality by introducing k-relaxed frequent patterns (k-RFPs), a pattern model that relaxes strict coverage constraints through a generalized k-cover definition. This framework integrates constraint-based reasoning, using SAT solvers for pattern generation, with combinatorial optimization, using Integer Linear Programming (ILP) for cluster selection. Despite its effectiveness, this approach suffers from a critical limitation: multiple distinct k-RFPs may induce identical k-covers, leading to redundant symbolic representations that unnecessarily enlarge the search space and increase computational complexity during cluster construction. In this paper, we address this redundancy through a pattern reduction framework. Our contributions are threefold. First, we formally characterize the conditions under which distinct k-RFPs induce identical k-covers, providing theoretical foundations for redundancy detection. Second, we propose an optimization strategy that removes redundant patterns by retaining a single representative pattern for each distinct k-cover. Third, we investigate the interpretability and representativeness of the patterns selected by the ILP model by analyzing their robustness with respect to their induced clusters. Extensive experiments conducted on several real-world datasets demonstrate that the proposed approach significantly reduces the pattern search space, improves computational efficiency, and preserves, and in some cases enhances, the quality of the resulting clusters.
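The core reduction step, keeping one representative pattern per distinct k-cover, is easy to sketch. The SAT/ILP machinery of the paper is not reproduced here; the pattern names and the `cover` mapping (pattern to the set of covered transaction indices) are hypothetical toy inputs.

```python
from typing import Dict, FrozenSet, List

def deduplicate_by_cover(patterns: List[str],
                         cover: Dict[str, FrozenSet[int]]) -> List[str]:
    """Keep one representative pattern per distinct k-cover."""
    representative: Dict[FrozenSet[int], str] = {}
    for p in patterns:
        representative.setdefault(cover[p], p)  # first pattern seen wins
    return list(representative.values())

# "A" and "AB" induce the same cover, so only "A" is retained.
patterns = ["A", "AB", "AC", "B"]
covers = {"A": frozenset({1, 2, 3}), "AB": frozenset({1, 2, 3}),
          "AC": frozenset({1, 2}), "B": frozenset({2, 3})}
print(deduplicate_by_cover(patterns, covers))  # ['A', 'AC', 'B']
```

Because a frozenset is hashable, grouping by cover is a single dictionary pass, which is what shrinks the search space handed to the cluster-selection step.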
[400] CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems
Yongxuan Wu, Xixun Lin, He Zhang, Nan Sun, Kun Wang, Chuan Zhou, Shirui Pan, Yanan Cao
Main category: cs.AI
TL;DR: A novel attack method called Communication Inference Attack (CIA) that can infer the communication topology of LLM-based Multi-Agent Systems under black-box settings, revealing significant privacy risks.
Details
Motivation: The security of communication topologies in LLM-based Multi-Agent Systems is increasingly important, but there's a critical privacy risk where these topologies can be inferred in black-box settings, exposing system vulnerabilities and intellectual property threats.
Method: Proposes CIA attack that constructs adversarial queries to induce intermediate agents’ reasoning outputs, then models semantic correlations through global bias disentanglement and LLM-guided weak supervision.
Result: Extensive experiments show CIA achieves average AUC of 0.87 and peak AUC up to 0.99, demonstrating effective topology inference and revealing substantial privacy risks in MAS.
Conclusion: Communication topologies in LLM-based Multi-Agent Systems are vulnerable to inference attacks, posing significant security and privacy threats that need to be addressed.
Abstract: LLM-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in solving complex tasks. Central to MAS is the communication topology which governs how agents exchange information internally. Consequently, the security of communication topologies has attracted increasing attention. In this paper, we investigate a critical privacy risk: MAS communication topologies can be inferred under a restrictive black-box setting, exposing system vulnerabilities and posing significant intellectual property threats. To explore this risk, we propose Communication Inference Attack (CIA), a novel attack that constructs new adversarial queries to induce intermediate agents’ reasoning outputs and models their semantic correlations through the proposed global bias disentanglement and LLM-guided weak supervision. Extensive experiments on MAS with optimized communication topologies demonstrate the effectiveness of CIA, achieving an average AUC of 0.87 and a peak AUC of up to 0.99, thereby revealing the substantial privacy risk in MAS.
[401] Intelligent ROI-Based Vehicle Counting Framework for Automated Traffic Monitoring
Mohamed A. Abdelwahab, Zaynab Al-Ariny, Mahmoud Fakhry, El-Sayed Hasaneen
Main category: cs.AI
TL;DR: A two-phase video-based vehicle counting framework that automatically determines optimal ROI using detection, tracking, and density models, then performs efficient counting within that ROI, achieving high accuracy with 4x speedup.
Details
Motivation: Need for accurate vehicle counting in traffic management while balancing computational efficiency. Existing methods struggle with achieving both high accuracy and efficiency, especially in complex multi-road scenarios.
Method: Two-phase framework: 1) Estimation phase uses a novel combination of detection scores, tracking scores, and vehicle density models to automatically determine the optimal ROI; 2) Prediction phase performs efficient vehicle counting within the estimated ROI. Framework is compatible with any detection/tracking method.
Result: Achieved exceptional accuracy (100% on most videos) on benchmark datasets (UA-DETRAC, GRAM, CDnet 2014, ATON), with processing up to 4x faster than full-frame processing. Outperformed existing techniques, especially in complex multi-road scenarios.
Conclusion: The framework provides robust, accurate vehicle counting with significant computational efficiency gains, making it promising for real-time traffic monitoring applications.
Abstract: Accurate vehicle counting through video surveillance is crucial for efficient traffic management. However, achieving high counting accuracy while ensuring computational efficiency remains a challenge. To address this, we propose a fully automated, video-based vehicle counting framework designed to optimize both computational efficiency and counting accuracy. Our framework operates in two distinct phases: estimation and prediction. In the estimation phase, the optimal region of interest (ROI) is automatically determined using a novel combination of three models based on detection scores, tracking scores, and vehicle density. This adaptive approach ensures compatibility with any detection and tracking method, enhancing the framework’s versatility. In the prediction phase, vehicle counting is efficiently performed within the estimated ROI. We evaluated our framework on benchmark datasets like UA-DETRAC, GRAM, CDnet 2014, and ATON. Results demonstrate exceptional accuracy, with most videos achieving 100% accuracy, while also enhancing computational efficiency, making processing up to four times faster than full-frame processing. The framework outperforms existing techniques, especially in complex multi-road scenarios, demonstrating robustness and superior accuracy. These advancements make it a promising solution for real-time traffic monitoring.
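The estimation phase fuses three per-region signals into a single ROI decision. The paper does not publish its exact fusion model, so the sketch below assumes a simple weighted sum with a threshold; the weights `w`, threshold `tau`, and the toy score grids are illustrative assumptions only.

```python
def estimate_roi(det, trk, dens, w=(0.4, 0.3, 0.3), tau=0.5):
    """Fuse per-cell detection, tracking, and density scores into a
    binary ROI mask via a weighted sum and a threshold."""
    return [[int(w[0] * d + w[1] * t + w[2] * n >= tau)
             for d, t, n in zip(d_row, t_row, n_row)]
            for d_row, t_row, n_row in zip(det, trk, dens)]

# Toy 2x2 grids of per-cell scores in [0, 1]; the left column is the road.
det  = [[0.9, 0.1], [0.8, 0.2]]
trk  = [[0.8, 0.2], [0.7, 0.1]]
dens = [[0.9, 0.0], [0.6, 0.3]]
roi = estimate_roi(det, trk, dens)
print(roi)  # [[1, 0], [1, 0]]
```

Counting then only needs to run inside the cells marked 1, which is where the reported speedup over full-frame processing would come from.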
[402] Technical Report – A Context-Sensitive Multi-Level Similarity Framework for First-Order Logic Arguments: An Axiomatic Study
Victor David, Jérôme Delobelle, Jean-Guy Mailly
Main category: cs.AI
TL;DR: A framework for measuring similarity between First-Order Logic arguments, extending beyond propositional logic with syntax-sensitive and contextual approaches.
Details
Motivation: Addressing the need for similarity measures in First-Order Logic argumentation, which is richer than propositional logic and important for problems like argument aggregation and enthymeme decoding.
Method: Introduces a comprehensive framework with: 1) extended axiomatic foundation, 2) four-level parametric model covering predicates, literals, clauses, and formulae similarity, 3) two model families (one syntax-sensitive using language models) with contextual weights, and 4) formal constraints for desirable properties.
Result: A formal framework for FOL argument similarity that accounts for structured content and provides explainable similarity measures through contextual weighting.
Conclusion: The framework enables nuanced similarity assessment in First-Order Logic argumentation, addressing limitations of propositional logic approaches and supporting applications in argument aggregation and enthymeme analysis.
Abstract: Similarity in formal argumentation has recently gained attention due to its significance in problems such as argument aggregation in semantics and enthymeme decoding. While existing approaches focus on propositional logic, we address the richer setting of First-Order Logic (FOL), where similarity must account for structured content. We introduce a comprehensive framework for FOL argument similarity, built upon: (1) an extended axiomatic foundation; (2) a four-level parametric model covering predicates, literals, clauses, and formulae similarity; (3) two model families, one syntax-sensitive via language models, both integrating contextual weights for nuanced and explainable similarity; and (4) formal constraints enforcing desirable properties.
[403] A Two-Stage LLM Framework for Accessible and Verified XAI Explanations
Georgios Mermigkis, Dimitris Metaxakis, Marios Tyrovolas, Argiris Sofotasios, Nikolaos Avgeris, Panagiotis Hadjidoukas, Chrysostomos Stylios
Main category: cs.AI
TL;DR: A two-stage LLM meta-verification framework for improving the accuracy and trustworthiness of natural language explanations generated from XAI outputs.
Details
Motivation: Current LLM-based XAI explanation systems lack guarantees of accuracy, faithfulness, and completeness, with evaluation methods being largely subjective or post-hoc without safeguards against flawed explanations reaching end-users.
Method: Proposes a two-stage framework: (1) Explainer LLM converts raw XAI outputs to natural language narratives, (2) Verifier LLM assesses them on faithfulness, coherence, completeness, and hallucination risk, with (3) iterative refeed mechanism using Verifier’s feedback for refinement.
Result: Experiments across five XAI techniques and datasets using three families of open-weight LLMs show verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared to raw XAI outputs. Entropy Production Rate analysis indicates Verifier’s feedback guides Explainer toward more stable and coherent reasoning.
Conclusion: The framework provides an efficient pathway toward more trustworthy and democratized XAI systems by ensuring higher quality natural language explanations through systematic verification and refinement.
Abstract: Large Language Models (LLMs) are increasingly used to translate the technical outputs of eXplainable Artificial Intelligence (XAI) methods into accessible natural-language explanations. However, existing approaches often lack guarantees of accuracy, faithfulness, and completeness. At the same time, current efforts to evaluate such narratives remain largely subjective or confined to post-hoc scoring, offering no safeguards to prevent flawed explanations from reaching end-users. To address these limitations, this paper proposes a Two-Stage LLM Meta-Verification Framework that consists of (i) an Explainer LLM that converts raw XAI outputs into natural-language narratives, (ii) a Verifier LLM that assesses them in terms of faithfulness, coherence, completeness, and hallucination risk, and (iii) an iterative refeed mechanism that uses the Verifier’s feedback to refine and improve them. Experiments across five XAI techniques and datasets, using three families of open-weight LLMs, show that verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared with raw XAI outputs. In addition, the analysis of the Entropy Production Rate (EPR) during the refinement process indicates that the Verifier’s feedback progressively guides the Explainer toward more stable and coherent reasoning. Overall, the proposed framework provides an efficient pathway toward more trustworthy and democratized XAI systems.
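The explainer-verifier-refeed loop has a simple control-flow skeleton, sketched below. The `explainer` and `verifier` callables, the score threshold, and the round limit are all hypothetical stand-ins for the paper's LLM calls and scoring rubric.

```python
def explain_with_verification(xai_output, explainer, verifier,
                              threshold=0.8, max_rounds=3):
    """Generate a narrative, score it, and refeed verifier feedback to
    the explainer until the score clears the threshold or rounds run out."""
    narrative = explainer(xai_output, feedback=None)
    score, feedback = verifier(xai_output, narrative)
    for _ in range(max_rounds):
        if score >= threshold:
            break
        narrative = explainer(xai_output, feedback=feedback)
        score, feedback = verifier(xai_output, narrative)
    return narrative, score

# Toy stand-ins: a real system would call one LLM per role.
def toy_explainer(xai_output, feedback=None):
    base = f"Feature {xai_output['top_feature']} drives the prediction."
    return base + " (revised)" if feedback else base

def toy_verifier(xai_output, narrative):
    ok = xai_output["top_feature"] in narrative
    return (1.0, None) if ok else (0.0, "mention the top feature")

text, score = explain_with_verification({"top_feature": "age"},
                                        toy_explainer, toy_verifier)
print(score)  # 1.0
```

Explanations that never clear the threshold leave the loop with a low score, which is the hook for filtering them out before they reach end-users.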
[404] Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents
Wanchun Ni, Jiugeng Sun, Yixian Liu, Mennatallah El-Assady
Main category: cs.AI
TL;DR: LLM agents show limited alignment with human emotional responses to bureaucratic red tape across cultures, especially in Eastern contexts, with cultural prompting strategies proving ineffective.
Details
Motivation: To evaluate whether LLM agents can generate culturally appropriate emotional responses to bureaucratic red tape, addressing a gap in using LLMs for policy simulation across diverse cultural contexts.
Method: Proposes an evaluation framework for assessing LLMs’ emotional responses to red tape across cultures, applies it to a single red-tape scenario, tests cultural prompting strategies, and develops the RAMO interface for simulation and data collection.
Result: All models exhibited limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies were largely ineffective in improving alignment.
Conclusion: Current LLMs have limitations in generating culturally appropriate emotional responses to bureaucratic red tape, highlighting the need for better cultural adaptation in policy simulation models.
Abstract: Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens’ emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs’ emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens’ emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo-chi.ivia.ch.
[405] IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration
Yanji He, Yuxin Jiang, Yiwen Wu, Bo Huang, Jiaheng Wei, Wei Wang
Main category: cs.AI
TL;DR: IDEA extracts LLM decision knowledge into interpretable parametric models over semantic factors, enabling calibrated probabilities and human-AI collaboration through joint learning, correlated sampling, and direct parameter editing.
Details
Motivation: LLMs have limitations in high-stakes decision-making due to miscalibrated probabilities, unfaithful explanations, and inability to precisely incorporate expert knowledge, which restricts their adoption in critical domains.
Method: IDEA framework extracts LLM decision knowledge into interpretable parametric models over semantically meaningful factors using: 1) joint learning of verbal-to-numerical mappings and decision parameters via EM, 2) correlated sampling preserving factor dependencies, and 3) direct parameter editing with mathematical guarantees.
Result: Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration - precision unattainable through prompting alone.
Conclusion: IDEA enables calibrated probabilities and quantitative human-AI collaboration by extracting LLM decision knowledge into interpretable models, addressing key limitations of LLMs in high-stakes domains.
Abstract: Large Language Models are increasingly deployed for decision-making, yet their adoption in high-stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal-to-numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human-AI collaboration. Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration – precision unattainable through prompting alone. The implementation is publicly available at https://github.com/leonbig/IDEA.
[406] DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant
Lev Sorokin, Ivan Vasilev, Samuele Pasini
Main category: cs.AI
TL;DR: First LLM Testing competition benchmarked four tools for testing LLM-based car manual retrieval systems to find failure cases where warnings aren’t properly mentioned.
Details
Motivation: To establish benchmarking standards and evaluate testing tools for LLM-based applications, specifically focusing on safety-critical scenarios like car manual information retrieval where missing warnings could be dangerous.
Method: Organized a competition in which four testing tools were evaluated on an LLM-based car manual retrieval application. Tools were assessed on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests, following the experimental methodology of the DeepTest workshop.
Result: Four tools competed and were evaluated; results show varying effectiveness in identifying failure cases where the LLM system fails to appropriately mention warnings from car manuals.
Conclusion: The competition establishes initial benchmarking for LLM testing tools and highlights the importance of systematic testing for safety-critical LLM applications.
Abstract: This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.
[407] KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, Hua Wu
Main category: cs.AI
TL;DR: KnowRL is a knowledge-guided RL framework that decomposes hints into atomic knowledge points and uses constrained subset search to create compact, interaction-aware guidance for training LLMs on reasoning tasks.
Details
Motivation: RLVR improves reasoning in LLMs but suffers from reward sparsity on hard problems. Existing hint-based methods add more tokens, which creates redundancy, inconsistency, and training overhead. There's a need for minimal-sufficient guidance that's efficient and effective.
Method: Proposes KnowRL framework that treats hint design as a minimal-sufficient guidance problem. Decomposes guidance into atomic knowledge points (KPs), uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. Identifies and addresses a pruning interaction paradox where removing one KP may help while removing multiple such KPs can hurt.
Result: KnowRL-Nemotron-1.5B trained from OpenMath-Nemotron-1.5B achieves 70.08 average accuracy without KP hints at inference (+9.63 points over baseline). With selected KPs, performance improves to 74.16, establishing new SOTA at 1.5B scale across eight reasoning benchmarks.
Conclusion: KnowRL provides an effective framework for knowledge-guided RL training that addresses reward sparsity through minimal-sufficient guidance, achieving state-of-the-art reasoning performance at the 1.5B scale.
Abstract: RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose \textbf{KnowRL} (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox – removing one KP may help while removing multiple such KPs can hurt – and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.
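The abstract does not spell out the CSS algorithm, so the sketch below only illustrates the general shape of a budgeted subset search: a greedy loop whose utility is evaluated on whole subsets, so interactions between KPs (the pruning paradox above) influence each selection step. `greedy_css`, `toy_utility`, and the budget are hypothetical names and values, not the paper's method.

```python
def greedy_css(kps, utility, budget):
    """Greedy budgeted subset search: repeatedly add the KP with the
    largest marginal gain, stopping when no KP improves the subset."""
    chosen, remaining = [], list(kps)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda k: utility(chosen + [k]))
        if utility(chosen + [best]) <= utility(chosen):
            break  # no remaining KP adds value
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy utility with an interaction: KP "b" only helps alongside "a".
def toy_utility(subset):
    s = set(subset)
    return 1.0 * ("a" in s) + (0.8 if {"a", "b"} <= s else 0.0) + 0.3 * ("c" in s)

print(greedy_css(["a", "b", "c"], toy_utility, budget=2))  # ['a', 'b']
```

Because `toy_utility` scores subsets rather than single KPs, "b" is worthless on its own yet chosen second once "a" is in the subset, mirroring the dependency structure the paper optimizes for.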
[408] Broadening the Applicability of Conditional Syntax Splitting for Reasoning from Conditional Belief Bases
Lars-Phillip Spiegel, Jonas Haldimann, Jesse Heyninck, Gabriele Kern-Isberner, Christoph Beierle
Main category: cs.AI
TL;DR: Generalization of safe conditional syntax splitting for nonmonotonic reasoning that allows subbases to share atoms and nontrivial conditionals, overcoming limitations of previous splitting concepts.
Details
Motivation: Previous syntax splitting approaches required disjoint signatures between subbases, which is rare in practice. Safe conditional syntax splitting allowed some overlap but only for trivial, self-fulfilling conditionals, limiting practical applicability.
Method: Proposes a generalized notion of conditional syntax splitting that broadens applicability by allowing subbases to share atoms and nontrivial conditionals. Introduces adjusted inference postulates based on this generalization and evaluates popular inductive inference operators against them.
Result: The new generalization overcomes limitations of previous splitting concepts, identifies genuine vs. simple splittings, and shows that while every operator satisfying generalized splitting also satisfies conditional syntax splitting, the reverse does not hold.
Conclusion: The proposed generalized conditional syntax splitting provides a more practical and applicable framework for nonmonotonic reasoning from conditional belief bases by allowing meaningful overlap between subbases while maintaining beneficial splitting properties.
Abstract: In nonmonotonic reasoning from conditional belief bases, an inference operator satisfying syntax splitting postulates allows for taking only the relevant parts of a belief base into account, provided that the belief base splits into subbases based on disjoint signatures. Because such disjointness is rare in practice, safe conditional syntax splitting has been proposed as a generalization of syntax splitting, allowing the conditionals in the subbases to share some atoms. Recently this overlap of conditionals has been shown to be limited to trivial, self-fulfilling conditionals. In this article, we propose a generalization of safe conditional syntax splittings that broadens the applicability of splitting postulates. In contrast to safe conditional syntax splitting, our generalized notion supports syntax splittings of a belief base Δ where the subbases of Δ may share atoms and nontrivial conditionals. We illustrate how this new notion overcomes limitations of previous splitting concepts, and we identify genuine splittings, separating them from simple splittings that do not provide benefits for inductive inference from Δ. We introduce adjusted inference postulates based on our generalization of conditional syntax splitting, and we evaluate several popular inductive inference operators with respect to these postulates. Furthermore, we show that, while every inductive inference operator satisfying generalized conditional syntax splitting also satisfies conditional syntax splitting, the reverse does not hold.
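The disjoint-signature baseline that the paper generalizes can be stated concretely. A minimal sketch, assuming conditionals are encoded as pairs of literal-string sets (an illustrative representation, not the paper's formalism):

```python
# Classic syntax-splitting condition: a belief base of conditionals (B|A)
# splits over a signature partition iff each conditional's atoms fall entirely
# inside one part. The paper's generalized notion relaxes exactly this.

def atoms(formula):
    """Atoms of a formula given as a set of literal strings, e.g. {'p', '-q'}."""
    return {lit.lstrip("-") for lit in formula}

def splits_over(belief_base, partition):
    """belief_base: list of (antecedent, consequent) literal-set pairs.
    partition: list of disjoint atom sets. True iff every conditional's
    signature lies within a single part (disjoint-signature splitting)."""
    for antecedent, consequent in belief_base:
        sig = atoms(antecedent) | atoms(consequent)
        if not any(sig <= part for part in partition):
            return False
    return True

# (b|p), (f|b) live in one part {p, b, f}; (w|r) lives in another {r, w}.
delta = [({"p"}, {"b"}), ({"b"}, {"f"}), ({"r"}, {"w"})]
print(splits_over(delta, [{"p", "b", "f"}, {"r", "w"}]))   # True
print(splits_over(delta, [{"p", "b"}, {"f", "r", "w"}]))   # False: (f|b) crosses parts
```

The second call fails precisely because (f|b) straddles two parts; the article's contribution is to allow such shared atoms and nontrivial shared conditionals without losing the benefits of splitting.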
[409] Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
Rui Wang, Yi Zheng, Dongxin Wang, Haiping Huang, Yuanzhi Yao, Yuxiang Zhou, Jialin Yu, Philip Torr
Main category: cs.AI
TL;DR: GCTM-OT is a human-centric topic modeling approach that incorporates human-provided goals via LLM prompting and contrastive learning with optimal transport to produce interpretable, diverse, goal-oriented topics.
Details
Motivation: Existing topic modeling methods focus on statistical coherence but often produce redundant or off-target topics that miss user intent. The paper aims to create more human-centric topic modeling that directly integrates human goals.
Method: GCTM-OT uses LLM-based prompting to extract goal candidates from documents, then incorporates these into semantic-aware contrastive learning via optimal transport for topic discovery.
Result: Experimental results on three public subreddit datasets show GCTM-OT outperforms state-of-the-art baselines in topic coherence and diversity while significantly improving alignment with human-provided goals.
Conclusion: The approach paves the way for more human-centric topic discovery systems by successfully integrating human goals into the topic modeling process.
Abstract: Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, which focus mainly on statistical coherence, often produce redundant or off-target topics that miss the user’s underlying intent. We introduce Human-centric Topic Modeling (Human-TM), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse and goal-oriented topics. To tackle this challenge, we propose the Goal-prompted Contrastive Topic Model with Optimal Transport (GCTM-OT), which first uses LLM-based prompting to extract goal candidates from documents, then incorporates these into semantic-aware contrastive learning via optimal transport for topic discovery. Experimental results on three public subreddit datasets show that GCTM-OT outperforms state-of-the-art baselines in topic coherence and diversity while significantly improving alignment with human-provided goals, paving the way for more human-centric topic discovery systems.
[410] Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production
Jintao Xue, Xiao Li, Nianmin Zhang
Main category: cs.AI
TL;DR: A safe reinforcement learning approach (PF-CD3Q) for dynamic human-robot task planning that integrates particle filters with constrained Q-learning to optimize efficiency while maintaining worker fatigue within safe limits through real-time parameter estimation.
Details
Motivation: Human-robot collaborative manufacturing in Industry 5.0 requires ergonomic considerations to protect worker well-being. Traditional fatigue models use static parameters, but human fatigue sensitivity varies daily due to changing work conditions and sleep patterns, creating uncertainty that needs online estimation.
Method: Proposed PF-CD3Q: a safe reinforcement learning approach combining particle filter (PF) with constrained dueling double deep Q-learning (CD3Q). PF estimators track human fatigue and update model parameters in real-time. These estimators integrate with CD3Q by making task-level fatigue predictions during decision-making, excluding tasks exceeding fatigue limits, and formulating the problem as a constrained Markov decision process.
Result: The approach enables real-time fatigue-predictive human-robot task planning and allocation that dynamically adapts to changing human fatigue conditions while maintaining safety constraints and optimizing efficiency.
Conclusion: PF-CD3Q provides a robust solution for dynamic HRTPA under fatigue uncertainty, addressing the limitations of static fatigue models through online parameter estimation and safe reinforcement learning.
Abstract: Human-robot collaborative manufacturing, a core aspect of Industry 5.0, emphasizes ergonomics to enhance worker well-being. This paper addresses the dynamic human-robot task planning and allocation (HRTPA) problem, which involves determining when to perform tasks and who should execute them to maximize efficiency while ensuring workers’ physical fatigue remains within safe limits. The inclusion of fatigue constraints, combined with production dynamics, significantly increases the complexity of the HRTPA problem. Traditional fatigue-recovery models in HRTPA often rely on static, predefined hyperparameters. However, in practice, human fatigue sensitivity varies daily due to factors such as changed work conditions and insufficient sleep. To better capture this uncertainty, we treat fatigue-related parameters as inaccurate and estimate them online based on observed fatigue progression during production. To address these challenges, we propose PF-CD3Q, a safe reinforcement learning (safe RL) approach that integrates the particle filter with constrained dueling double deep Q-learning for real-time fatigue-predictive HRTPA. Specifically, we first develop PF-based estimators to track human fatigue and update fatigue model parameters in real-time. These estimators are then integrated into CD3Q by making task-level fatigue predictions during decision-making and excluding tasks that exceed fatigue limits, thereby constraining the action space and formulating the problem as a constrained Markov decision process (CMDP).
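The particle-filter idea, estimating an uncertain fatigue-rate parameter online from noisy fatigue observations, can be sketched generically. The accumulation model F_{t+1} = F_t + lam * (1 - F_t), the noise level, and the resampling scheme below are illustrative assumptions, not the paper's model.

```python
import math, random

# Particles over an unknown fatigue-rate parameter lam, weighted by how well
# the toy accumulation model explains noisy fatigue observations.

random.seed(0)
TRUE_LAM, NOISE = 0.1, 0.02

def simulate(lam, steps):
    f, out = 0.0, []
    for _ in range(steps):
        f = f + lam * (1 - f)
        out.append(f + random.gauss(0, NOISE))   # noisy fatigue reading
    return out

def particle_filter(observations, n=500):
    particles = [random.uniform(0.01, 0.5) for _ in range(n)]   # lam hypotheses
    f_pred = [0.0] * n
    for obs in observations:
        # Predict each particle's fatigue, then weight by observation likelihood.
        f_pred = [f + lam * (1 - f) for f, lam in zip(f_pred, particles)]
        weights = [math.exp(-((obs - f) ** 2) / (2 * NOISE ** 2)) for f in f_pred]
        total = sum(weights) or 1.0
        weights = [w / total for w in weights]
        # Resample (particle, state) pairs; jitter keeps particle diversity.
        idx = random.choices(range(n), weights=weights, k=n)
        particles = [particles[i] + random.gauss(0, 0.002) for i in idx]
        f_pred = [f_pred[i] for i in idx]
    return sum(particles) / n

obs = simulate(TRUE_LAM, steps=30)
est = particle_filter(obs)
print(round(est, 2))   # close to the true rate of 0.1
```

A planner like CD3Q would then use the current particle estimate to predict task-level fatigue and mask out tasks whose predicted fatigue exceeds the safety limit.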
[411] A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production
Jintao Xue, Xiao Li, Nianmin Zhang
Main category: cs.AI
TL;DR: A hierarchical human-robot task planning and allocation method using deep Q-learning for high-level planning and spatial-aware path planning for low-level allocation in dynamic manufacturing environments.
Details
Motivation: Effective task planning and allocation in human-robot collaborative manufacturing is challenging due to dynamic environments and spatial considerations like human positions and movement distances.
Method: Decompose tasks into subtasks; use a hierarchical approach with high-level EBQ (efficient buffer-based deep Q-learning) for task planning and low-level SAP (spatial-aware path planning) for task allocation to appropriate human-robot resources.
Result: Experiments in 3D simulator show EBQ&SAP effectively addresses human-robot TPA problems in complex dynamic production processes, reducing training time and improving performance.
Conclusion: The proposed hierarchical approach successfully handles dynamic human-robot collaboration in manufacturing with spatial awareness and efficient learning.
Abstract: In advanced manufacturing systems, humans and robots collaborate to conduct the production process. Effective task planning and allocation (TPA) is crucial for achieving high production efficiency, yet it remains challenging in complex and dynamic manufacturing environments. The dynamic nature of humans and robots, particularly the need to consider spatial information (e.g., humans’ real-time position and the distance they need to move to complete a task), substantially complicates TPA. To address the above challenges, we decompose production tasks into manageable subtasks. We then implement a real-time hierarchical human-robot TPA algorithm, including a high-level agent for task planning and a low-level agent for task allocation. For the high-level agent, we propose an efficient buffer-based deep Q-learning method (EBQ), which reduces training time and enhances performance in production problems with long-term and sparse reward challenges. For the low-level agent, a path planning-based spatially aware method (SAP) is designed to allocate tasks to the appropriate human-robot resources, thereby achieving the corresponding sequential subtasks. We conducted experiments on a complex real-time production process in a 3D simulator. The results demonstrate that our proposed EBQ&SAP method effectively addresses human-robot TPA problems in complex and dynamic production processes.
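The spatial-aware allocation idea, routing each subtask to the nearest capable resource by walkable path length, can be sketched with BFS on a toy grid; the map, skill sets, and tie-breaking below are made up for illustration and are not the paper's SAP method.

```python
from collections import deque

# Allocate a subtask to the capable human/robot resource with the shortest
# obstacle-aware path. Grid cells: 0 = free, 1 = obstacle.

def path_length(grid, start, goal):
    """BFS shortest path on a 0/1 grid; None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    seen, q = {start}, deque([(start, 0)])
    while q:
        (r, c), d = q.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                q.append(((nr, nc), d + 1))
    return None

def allocate(grid, task_pos, task_skill, resources):
    """resources: {name: (position, skills)}. Nearest capable resource wins."""
    feasible = [(path_length(grid, pos, task_pos), name)
                for name, (pos, skills) in resources.items()
                if task_skill in skills]
    feasible = [(d, n) for d, n in feasible if d is not None]
    return min(feasible)[1] if feasible else None

grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
resources = {"human": ((1, 0), {"assemble", "inspect"}),
             "robot": ((2, 2), {"assemble"})}
print(allocate(grid, (2, 0), "assemble", resources))   # human is closer
print(allocate(grid, (2, 0), "inspect", resources))    # only the human can inspect
```

Real-time positions would be refreshed each decision step, so the same subtask can be reassigned as humans move around the cell.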
[412] MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games
Shufang Lin, Muyang Chen, Xiabing Zhou, Rongrong Zhang, Dayou Zhang, Fangxin Wang
Main category: cs.AI
TL;DR: MISID benchmark for multimodal intent recognition in complex strategic interactions, with FRACTAM framework to address MLLM limitations in cross-modal reasoning and causal tracking.
Details
Motivation: Existing intent recognition datasets focus on simple dialogues, but real-world scenarios involve complex strategic interactions with deceptive narratives. Need for comprehensive multimodal benchmark for long-context discourse analysis.
Method: Introduce MISID benchmark from social strategy games with fine-grained two-tier annotation. Propose FRACTAM framework using “Decouple-Anchor-Reason” paradigm: extract unimodal factual representations, two-stage retrieval for factual anchoring, construct explicit cross-modal evidence chains.
Result: Evaluation shows MLLMs have critical deficiencies in complex scenarios (text-prior visual hallucination, impaired cross-modal synergy, limited causal cue chaining). FRACTAM enhances mainstream models’ performance in strategic tasks, improving hidden intent detection while maintaining perceptual accuracy.
Conclusion: MISID addresses gap in multimodal intent recognition for complex strategic interactions. FRACTAM framework effectively reduces text bias and improves cross-modal reasoning capabilities for MLLMs in challenging scenarios.
Abstract: Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. Using a “Decouple-Anchor-Reason” paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models’ performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.
[413] Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning
Zhenyu Ma, Yuyang Song, Chunyi Yang, Jingyi Zhu, Letian Yang, Xukai Jiang
Main category: cs.AI
TL;DR: Case-based learning framework for LLM agents that converts past task experience into reusable knowledge assets to improve performance on complex real-world tasks
Details
Motivation: LLM-based autonomous agents struggle with reliably using task structure, key constraints, and prior experience in complex real-world settings, despite performing well on general reasoning tasks.
Method: Proposes a case-based learning framework that extracts and reuses task-relevant knowledge, analytical prompts, and operational skills from real cases, converting past task experience into reusable knowledge assets that can be transferred to new tasks.
Result: Achieves consistently strong performance across six complex task categories, matches or outperforms Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines, with especially clear gains on more complex tasks; advantage increases with task complexity; practical knowledge acquired by one agent can be reused by others
Conclusion: Case-based learning offers a promising path for building professional agents for real-world work by enabling effective transfer of prior experience and structured analysis
Abstract: LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasizes extracting and reusing task-relevant knowledge, analytical prompts, and operational skills from real cases. We evaluate the method on a unified benchmark of six complex task categories and compare it with Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines. Results show that our method achieves consistently strong performance across all tasks and matches or outperforms the best baseline in every case, with especially clear gains on more complex tasks. Further analysis shows that the advantage of case-based learning increases with task complexity, and that practical knowledge acquired by one agent can be reused by others. These findings suggest that case-based learning offers a promising path for building professional agents for real-world work.
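The retrieval step of case-based learning, fetching the most similar past case and reusing its notes, can be sketched with bag-of-words cosine similarity standing in for whatever representation the paper actually uses; the case base and tasks are invented for illustration.

```python
import math
from collections import Counter

# Retrieve the past case most similar to a new task description, then reuse
# its stored notes (a stand-in for the paper's "reusable knowledge assets").

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(case_base, new_task):
    query = vectorize(new_task)
    return max(case_base, key=lambda c: cosine(query, vectorize(c["task"])))

case_base = [
    {"task": "summarize quarterly sales report", "notes": "check units and totals first"},
    {"task": "draft contract renewal email", "notes": "confirm renewal date with legal"},
]
best = retrieve(case_base, "summarize annual sales report")
print(best["notes"])   # -> check units and totals first
```

In the framework described above, the retrieved notes would be injected into the agent's prompt so that experience from one task (or one agent) transfers to the next.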
[414] Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities
Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn
Main category: cs.AI
TL;DR: AI tools show moderate success (64% accuracy) in upgrading low-cognitive-demand math tasks, with specialized tools only slightly better than general-purpose ones, revealing distinct capabilities between task classification and generation.
Details
Motivation: While AI tools have been studied for classifying math task quality, little is known about their ability to improve existing tasks, particularly for curriculum adaptation and supporting teachers in modifying instructional materials.
Method: Tested 11 AI tools (6 general-purpose like ChatGPT/Claude, 5 specialized for math teachers) using Task Analysis Guide framework to modify low-demand math tasks with prompting strategies representing typical teacher approaches rather than optimized prompts.
Result: AI tools averaged 64% success rate in upgrading tasks, with performance ranging from 33% to 88%. Specialized tools were only moderately better than general-purpose ones. Failure modes included both undershooting (maintaining low demand) and overshooting (making tasks too ambitious). A negative correlation (r=-.35) was found between classification ability and upgrade ability.
Conclusion: AI tools have moderate potential for curriculum adaptation but require specialized approaches to effectively support teachers. The distinct capabilities between classification and generation tasks highlight the need for targeted AI development for educational applications.
Abstract: While recent research has explored AI tools’ ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both “undershooting” (maintaining low cognitive demand) and “overshooting” (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI’s potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.
[415] DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai
Main category: cs.AI
TL;DR: DocSeeker: A multimodal LLM framework for long document understanding using Analysis-Localization-Reasoning workflow with evidence-aware optimization and resolution allocation strategies.
Details
Motivation: Existing MLLMs suffer performance degradation on long document understanding due to low signal-to-noise ratio (crucial evidence buried in irrelevant pages) and supervision scarcity (weak learning signal from only final short answers).
Method: Proposes Analysis-Localization-Reasoning workflow with two-stage training: 1) Supervised Fine-Tuning on high-quality data via knowledge distillation, 2) Evidence-aware Group Relative Policy Optimization for joint evidence localization and answer accuracy optimization, plus Evidence-Guided Resolution Allocation for memory efficiency.
Result: DocSeeker achieves superior performance on both in-domain and out-of-domain tasks, robustly generalizes from short-page training to ultra-long documents, and synergizes with visual Retrieval-Augmented Generation systems.
Conclusion: The proposed paradigm effectively addresses long document understanding challenges in MLLMs and provides a solid foundation for visual RAG systems implementation.
Abstract: Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured “Analysis, Localization and Reasoning” workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.
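The evidence-aware GRPO objective can be illustrated with a toy reward that combines answer correctness with evidence-page F1, then subtracts the group mean to obtain per-rollout advantages. The 0.5 weighting and reward shapes are guesses for illustration, not the paper's values.

```python
# Hedged sketch: score each rollout on answer + evidence, then compute
# group-relative advantages as in GRPO-style training.

def evidence_f1(predicted_pages, gold_pages):
    inter = len(set(predicted_pages) & set(gold_pages))
    if not predicted_pages or not gold_pages:
        return 0.0
    p, r = inter / len(predicted_pages), inter / len(gold_pages)
    return 2 * p * r / (p + r) if p + r else 0.0

def reward(rollout, gold_answer, gold_pages, w_evidence=0.5):
    answer_score = 1.0 if rollout["answer"] == gold_answer else 0.0
    return answer_score + w_evidence * evidence_f1(rollout["pages"], gold_pages)

def group_relative_advantages(rollouts, gold_answer, gold_pages):
    rs = [reward(r, gold_answer, gold_pages) for r in rollouts]
    mean = sum(rs) / len(rs)
    return [r - mean for r in rs]

rollouts = [
    {"answer": "42", "pages": [3, 7]},   # right answer, right evidence
    {"answer": "42", "pages": [1]},      # right answer, wrong evidence
    {"answer": "17", "pages": [3, 7]},   # wrong answer, right evidence
]
adv = group_relative_advantages(rollouts, gold_answer="42", gold_pages=[3, 7])
print([round(a, 2) for a in adv])   # -> [0.5, 0.0, -0.5]
```

Jointly rewarding localization is what gives the model a learning signal even on examples where the final short answer alone would be too weak a target.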
[416] RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair
Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla
Main category: cs.AI
TL;DR: RePAIR enables interactive machine unlearning where users can instruct LLMs to forget specific knowledge via natural language prompts at inference time, using a training-free activation manipulation method.
Details
Motivation: LLMs absorb harmful knowledge, misinformation, and personal data during pretraining with no mechanism for selective removal. Existing unlearning approaches are provider-centric, requiring retraining and excluding end users from controlling their own data.
Method: Proposes RePAIR framework with: (1) watchdog model for unlearning intent detection, (2) surgeon model for generating repair procedures, (3) patient model with parameter updates. Core method is STAMP - Steering Through Activation Manipulation with PseudoInverse - a training-free, single-sample unlearning method that redirects MLP activations toward refusal subspace via closed-form pseudoinverse updates.
Result: Achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. Low-rank variant enables efficient on-device unlearning with ~3x speedup.
Conclusion: RePAIR establishes an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
Abstract: Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
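In the single-sample case, a closed-form pseudoinverse update in the spirit of STAMP reduces to a rank-1 least-squares correction: W' = W + (y - W x) x^+ with x^+ = x^T / (x^T x), so that W' x = y exactly. A toy sketch with 3x3 sizes and made-up trigger/target vectors (the real method operates on MLP activations and a refusal subspace):

```python
# Hedged sketch, not the paper's implementation: redirect one trigger
# activation x onto a target output y via a rank-1 pseudoinverse edit of W.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, x, y):
    Wx = matvec(W, x)
    xx = sum(xi * xi for xi in x)                      # x^T x
    residual = [yi - wxi for yi, wxi in zip(y, Wx)]    # y - W x
    return [[W[i][j] + residual[i] * x[j] / xx for j in range(len(x))]
            for i in range(len(W))]

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
x = [2.0, 0.0, 0.0]   # toy trigger activation
y = [0.0, 0.0, 5.0]   # toy target ("refusal") output

W_edit = rank_one_edit(W, x, y)
print(matvec(W_edit, x))   # -> [0.0, 0.0, 5.0]
```

Because the correction is closed-form, no gradient training is needed, which is consistent with the training-free, single-sample framing; the low-rank variant mentioned in the abstract presumably keeps such corrections confined to a small subspace to reach the O(r^3 + r^2 * d) cost.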
[417] Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic
Saeed Rahmani, Shiva Rasouli, Daphne Cornelisse, Eugene Vinitsky, Bart van Arem, Simeon C. Calvert
Main category: cs.AI
TL;DR: Survey paper reviewing AI methods for modeling autonomous and human driving behavior in mixed autonomy traffic simulation, with taxonomy covering agent-level models, environment-level simulation, and cognitive/physics-informed methods.
Details
Motivation: Existing simulation tools for autonomous vehicles focus on graphical realism but use simple rule-based models that fail to accurately represent complex driving behaviors and interactions. There's a lack of comprehensive surveys on AI applications to mixed autonomy traffic simulation that bridge traffic engineering and computer science perspectives.
Method: The paper provides a structured survey with a taxonomy organizing AI methods into three families: 1) agent-level behavior models, 2) environment-level simulation methods, and 3) cognitive and physics-informed methods. It analyzes simulation platform limitations, reviews evaluation protocols/metrics, simulation tools, and datasets.
Result: The survey synthesizes AI methods for mixed autonomy traffic simulation, identifies gaps in existing simulation platforms, and provides a chronological overview of AI methods development in this domain.
Conclusion: The paper bridges traffic engineering and computer science communities by providing a comprehensive review of AI methods for mixed autonomy traffic simulation, offering a unified taxonomy and directions to address current limitations in simulation tools.
Abstract: Autonomous vehicles (AVs) are now operating on public roads, which makes their testing and validation more critical than ever. Simulation offers a safe and controlled environment for evaluating AV performance in varied conditions. However, existing simulation tools mainly focus on graphical realism and rely on simple rule-based models and therefore fail to accurately represent the complexity of driving behaviors and interactions. Artificial intelligence (AI) has shown strong potential to address these limitations; however, despite the rapid progress across AI methodologies, a comprehensive survey of their application to mixed autonomy traffic simulation remains lacking. Existing surveys either focus on simulation tools without examining the AI methods behind them, or cover ego-centric decision-making without addressing the broader challenge of modeling surrounding traffic. Moreover, they do not offer a unified taxonomy of AI methods covering individual behavior modeling to full scene simulation. To address these gaps, this survey provides a structured review and synthesis of AI methods for modeling AV and human driving behavior in mixed autonomy traffic simulation. We introduce a taxonomy that organizes methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive and physics-informed methods. The survey analyzes how existing simulation platforms fall short of the needs of mixed autonomy research and outlines directions to narrow this gap. It also provides a chronological overview of AI methods and reviews evaluation protocols and metrics, simulation tools, and datasets. By covering both traffic engineering and computer science perspectives, we aim to bridge the gap between these two communities.
[418] From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention
Seowung Leem, Lin Gu, Ruogu Fang
Main category: cs.AI
TL;DR: The paper proposes that ancient pictographic writing emerged from the brain’s intrinsic tendency to compress visual input into stable, boundary-based abstractions, using a biologically inspired digital twin of the visual hierarchy to generate symbols resembling early pictographs across different writing systems.
Details
Motivation: To understand the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols, and to explore whether ancient pictographic writing emerged from the brain's intrinsic tendency to compress visual input into stable, boundary-based abstractions.
Method: Construct a biologically inspired digital twin of the visual hierarchy that encodes images into low-level features, generates contour sketches, and iteratively refines them through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex.
Result: The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems (Egyptian hieroglyphs, Chinese oracle bone characters, proto-cuneiform) and offer candidate interpretations for undeciphered scripts.
Conclusion: The findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.
Abstract: Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain’s intrinsic tendency to compress visual input into stable, boundary-based abstractions. We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.
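The contour-sketch stage, compressing a filled shape to its boundary, can be illustrated with a minimal boundary filter: keep a pixel iff it is filled and touches an empty 4-neighbour. This is a generic sketch of boundary-based abstraction, not the paper's digital-twin pipeline.

```python
# Reduce a filled binary shape to its outline -- the kind of boundary-based
# abstraction the paper argues underlies early pictographs.

def contour(img):
    rows, cols = len(img), len(img[0])
    def empty(r, c):
        return not (0 <= r < rows and 0 <= c < cols) or img[r][c] == 0
    return [[1 if img[r][c] and any(empty(r + dr, c + dc)
                                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
             else 0
             for c in range(cols)] for r in range(rows)]

# A filled 4x4 square: the contour keeps the border ring, hollows the interior.
img = [[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]]
for row in contour(img):
    print("".join("#" if v else "." for v in row))
```

The filled square prints as a hollow ring (`####` / `#..#` / `#..#` / `####`); the paper's model would then refine such sketches with semantic top-down feedback rather than stopping at raw boundaries.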
[419] QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence
Zhichao Lin, Zhichao Liang, Gaoqiang Liu, Meng Xu, Baoyu Xiang, Jian Xu, Guanjun Jiang
Main category: cs.AI
TL;DR: QuarkMedSearch is a specialized medical deep search system built on Tongyi DeepResearch, featuring medical multi-hop data construction, two-stage training (SFT+RL), and expert-verified benchmarks for Chinese medical domain performance.
Details
Motivation: To improve agentic foundation model performance in vertical domains, specifically addressing the challenge of medical deep search in Chinese healthcare where training data is scarce and domain expertise is critical.
Method: Three-part approach: 1) Data synthesis using medical knowledge graphs and real-time online exploration for long-horizon medical deep search training data; 2) Two-stage post-training with SFT and RL to enhance planning, tool invocation, and reflection capabilities; 3) Expert-constructed QuarkMedSearch Benchmark with rigorous manual verification.
Result: QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark while maintaining strong competitiveness on general benchmarks.
Conclusion: The paper demonstrates a successful full-pipeline approach for enhancing agentic foundation models in vertical domains, with specific application to Chinese medical deep search, showing the value of domain-specific data construction, progressive training strategies, and expert-verified evaluation.
Abstract: As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model’s planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.
[420] LIFE – an energy efficient advanced continual learning agentic AI framework for frontier systems
Anne Lee, Gurudutt Hosangadi
Main category: cs.AI
TL;DR: LIFE is an agentic AI framework for sustainable HPC management that combines orchestrator, context engineering, memory system, and lattice learning for adaptive, energy-efficient operations.
Details
Motivation: AI advancement has increased HPC energy demands while existing continual learning capabilities are limited. Need sustainable, adaptive systems beyond monolithic transformers to effectively manage HPC dimensioning, provisioning, and execution.
Method: Proposes LIFE framework: agent-centric system with four components - orchestrator, Agentic Context Engineering, novel memory system, and information lattice learning. Implemented as self-evolving network management for HPC operations.
Result: Demonstrated in closed-loop HPC operations example for detecting and mitigating latency spikes in Kubernetes-like clusters running critical microservices. Framework generalizes to various orthogonal use cases.
Conclusion: LIFE represents emerging direction beyond monolithic transformers toward agentic AI and brain-inspired architectures for sustainable, adaptive HPC systems management.
Abstract: The rapid advancement of AI has changed the character of HPC usage such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities limit the ability of AI to effectively manage HPCs. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain-inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning framework that is Incremental, Flexible, and Energy efficient that is implemented as an agent-centric system rather than a single monolithic model. LIFE uniquely combines four components to realize self-evolving network management and operations in HPCs. The components are an orchestrator, Agentic Context Engineering, a novel memory system, and information lattice learning. LIFE can also generalize to enable a variety of orthogonal use cases. We ground LIFE in a specific closed-loop HPC operations example for detecting and mitigating latency spikes experienced by critical microservices running on a Kubernetes-like cluster.
[421] AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
Abiodun A. Solanke
Main category: cs.AI
TL;DR: AISafetyBenchExplorer is a structured catalogue of 195 AI safety benchmarks (2018-2026) that enables meta-analysis of how safety is operationalized, revealing fragmentation and lack of standardization in LLM safety evaluation.
Details
Motivation: The rapid expansion of LLM safety evaluation has created many benchmarks but lacks coherent measurement standards, making it difficult to compare results and understand how safety is operationalized across different evaluations.
Method: Created a multi-sheet schema cataloging 195 AI safety benchmarks with benchmark-level metadata, metric definitions, paper metadata, and repository activity. Developed a complexity taxonomy and analyzed patterns in benchmark usage, language coverage, and maintenance.
Result: Identified fragmentation: dominated by medium-complexity benchmarks (94/195), few popular benchmarks (7), English-only bias (165/195), evaluation-only resources (170/195), stale repositories (137/195), and inconsistent metric definitions despite similar labels.
Conclusion: The main failure mode is fragmentation, not scarcity. Researchers need shared measurement language, principled benchmark selection, and better stewardship. AISafetyBenchExplorer provides tools for more rigorous benchmark discovery, comparison, and meta-evaluation.
Abstract: The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field’s main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.
[422] BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design
Chuyang Xiang, Yichen Wei, Jiale Ma, Handing Wang, Junchi Yan
Main category: cs.AI
TL;DR: BEAM is a bi-level evolutionary framework for automatic heuristic design that combines genetic algorithms for high-level algorithmic structures with Monte Carlo Tree Search for function implementation, outperforming existing LLM-based hyper heuristics.
Details
Motivation: Existing LLM-based hyper heuristics are limited to single-function optimization within pre-defined solvers and lack high-level algorithmic modeling, making them ineffective for designing complete competent solvers.
Method: BEAM formulates heuristic design as a bi-level optimization problem: exterior layer evolves high-level algorithmic structures with function placeholders using genetic algorithms, interior layer realizes placeholders via Monte Carlo Tree Search, with an Adaptive Memory module for complex code generation and Knowledge Augmentation pipeline for evaluation.
Result: BEAM significantly outperforms existing LLM-based hyper heuristics, reducing optimality gap by 37.84% on aggregate in CVRP hybrid algorithm design, and designs a heuristic that outperforms state-of-the-art Maximum Independent Set solver KaMIS.
Conclusion: BEAM’s bi-level evolutionary approach with adaptive memory and knowledge augmentation enables more effective automatic heuristic design for complex optimization problems compared to existing single-layer LLM-based methods.
Abstract: Large Language Model-based Hyper Heuristic (LHH) has recently emerged as an efficient way for automatic heuristic design. However, most existing LHHs just perform well in optimizing a single function within a pre-defined solver. Their single-layer evolution makes them not effective enough to write a competent complete solver. While some variants incorporate hyperparameter tuning or attempt to generate complex code through iterative local modifications, they still lack a high-level algorithmic modeling, leading to limited exploration efficiency. To address this, we reformulate heuristic design as a Bi-level Optimization problem and propose BEAM (Bi-level Memory-adaptive Algorithmic Evolution). BEAM’s exterior layer evolves high-level algorithmic structures with function placeholders through genetic algorithm (GA), while the interior layer realizes these placeholders via Monte Carlo Tree Search (MCTS). We further introduce an Adaptive Memory module to facilitate complex code generation. To support the evaluation for complex code generation, we point out the limitations of starting LHHs from scratch or from code templates and introduce a Knowledge Augmentation (KA) Pipeline. Experimental results on several optimization problems demonstrate that BEAM significantly outperforms existing LHHs, notably reducing the optimality gap by 37.84% on aggregate in CVRP hybrid algorithm design. BEAM also designs a heuristic that outperforms SOTA Maximum Independent Set (MIS) solver KaMIS.
[423] Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents
Benjamin Stern, Peter Nadel
Main category: cs.AI
TL;DR: Dual-trace memory encoding pairs factual records with concrete scene traces (narrative reconstructions of learning context) to improve LLM agents’ temporal reasoning, change tracking, and cross-session aggregation capabilities.
Details
Motivation: Current LLM agents store information as flat factual records without contextual details, limiting their ability for temporal reasoning, change tracking, and cross-session aggregation. The paper aims to address these limitations by creating richer, more distinctive memory traces.
Method: Introduces dual-trace memory encoding where each stored fact is paired with a concrete scene trace - a narrative reconstruction of the moment and context in which the information was learned. This forces the agent to commit to specific contextual details during encoding. Evaluated using LongMemEval-S benchmark with 4,575 sessions and 100 recall questions.
Result: Dual-trace encoding achieved 73.7% overall accuracy vs 53.5% for fact-only control (+20.2pp gain). Significant improvements in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp). No benefit for single-session retrieval. Achieved gains at no additional token cost.
Conclusion: Dual-trace memory encoding significantly enhances LLM agents’ capabilities for temporal reasoning, change tracking, and cross-session aggregation by creating richer memory traces with contextual details, consistent with encoding specificity theory.
Abstract: LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.
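The dual-trace idea amounts to a change in the memory record itself: each fact travels with a narrative of when it was learned. Below is a minimal sketch of what such a store might look like; the class, field names, and keyword retrieval are illustrative assumptions, not the paper's implementation (which generates scene traces with an LLM):

```python
from dataclasses import dataclass

@dataclass
class MemoryTrace:
    """One dual-trace memory entry: a flat fact plus its scene trace."""
    fact: str         # the factual record, as in a conventional agent memory
    scene: str        # narrative reconstruction of when/where the fact was learned
    session_id: int   # session in which the fact was encoded

def encode(fact: str, scene: str, session_id: int, store: list) -> None:
    """Append a paired (fact, scene) trace instead of the fact alone."""
    store.append(MemoryTrace(fact=fact, scene=scene, session_id=session_id))

def recall(store: list, keyword: str) -> list:
    """Retrieve entries whose fact OR scene mentions the keyword; the scene
    trace carries temporal/contextual cues a flat fact store would lack."""
    kw = keyword.lower()
    return [t for t in store if kw in t.fact.lower() or kw in t.scene.lower()]

store = []
encode("User adopted a cat named Miso.",
       "During a Sunday-evening chat, the user mentioned bringing Miso home from a shelter.",
       session_id=3, store=store)
encode("User switched jobs to a startup.",
       "In a weekday session about commutes, the user said the new office is downtown.",
       session_id=7, store=store)
hits = recall(store, "shelter")   # matched via the scene trace, not the fact
```

The toy query already shows the mechanism: "shelter" never appears in any fact, so a fact-only store would miss it, while the scene trace recovers the right session.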
[424] Modeling Co-Pilots for Text-to-Model Translation
Serdar Kadioglu, Karthik Uppuluri, Akash Singirikonda
Main category: cs.AI
TL;DR: Text2Model and Text2Zinc: LLM-based tools for translating natural language descriptions of combinatorial optimization and satisfaction problems into formal models using solver-agnostic MiniZinc framework.
Details
Motivation: Growing interest in using LLMs for text-to-model translation, but existing work lacks unified handling of both satisfaction and optimization problems, and is often solver-specific rather than solver-agnostic.
Method: Developed Text2Model suite with various LLM strategies (zero-shot, chain-of-thought, knowledge-graphs, grammar-based encoding, agentic approaches) and Text2Zinc dataset of optimization problems in natural language. Uses MiniZinc’s solver-agnostic modeling capabilities.
Result: Co-pilot strategies are competitive with and sometimes improve upon recent research, but LLMs are not yet push-button solutions for combinatorial modeling. Performance gap exists but tools are open-sourced to help close it.
Conclusion: LLMs show promise for combinatorial modeling but require further development. The unified solver-agnostic approach with both satisfaction and optimization problems represents an advancement over existing work.
Abstract: There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing Text2Model and Text2Zinc. Text2Model is a suite of co-pilots based on several LLM strategies with varying complexity, along with an online leaderboard. Text2Zinc is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with a built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate both satisfaction and optimization problems within a unified architecture and dataset. Moreover, our approach is solver-agnostic, unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage MiniZinc’s solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including: zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge-graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our co-pilot strategies are competitive with, and in parts improve upon, recent research in this domain. Our findings indicate that while LLMs are promising, they are not yet a push-button technology for combinatorial modeling. We contribute the Text2Model co-pilots and leaderboard, and Text2Zinc and its interactive editor, to open source to support closing this performance gap.
[425] Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training
Sohyun An, Shuibenyang Yuan, Hayeon Lee, Cho-Jui Hsieh, Alexander Min
Main category: cs.AI
TL;DR: CCS is a gold-supervision-free framework for training search agents using cycle-consistency principles, where optimal search trajectories should preserve enough information to reconstruct original questions, creating a reward signal for RL optimization.
Details
Motivation: Existing RL approaches for search agents rely heavily on gold supervision (ground-truth answers), which is difficult to scale. The authors aim to develop a framework that can train search agents without requiring such expensive supervision.
Method: Proposes Cycle-Consistent Search (CCS) inspired by cycle-consistency techniques. Key hypothesis: optimal search trajectories serve as lossless encodings of question intent. Uses information bottlenecks (excluding final response, NER masking) to prevent information leakage and ensure reconstruction relies on retrieved observations and structural scaffold.
Result: Experiments on question-answering benchmarks show CCS achieves performance comparable to supervised baselines and outperforms prior methods that don’t rely on gold supervision.
Conclusion: CCS provides a scalable training paradigm for search agents in settings where gold supervision is unavailable, demonstrating the effectiveness of cycle-consistency principles for RL-based search optimization.
Abstract: Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question’s intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.
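The proxy reward described above can be sketched concretely. The following toy illustration relies on stated assumptions: `mask_entities` stands in for a real NER masker, `token_f1` is a crude bag-of-words reconstruction score, and `reconstruct` is a placeholder for the LLM that rebuilds the question from the masked scaffold plus retrieved observations. None of these names come from the paper:

```python
import re

def mask_entities(text: str, entities: list) -> str:
    """Stand-in for NER masking: replace known entity strings with [ENT].
    A real system would run an NER model; here entities are given explicitly."""
    for ent in entities:
        text = re.sub(re.escape(ent), "[ENT]", text, flags=re.IGNORECASE)
    return text

def token_f1(a: str, b: str) -> float:
    """Bag-of-words F1 between two strings, a crude reconstruction score."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum(min(ta.count(w), tb.count(w)) for w in set(ta))
    if common == 0:
        return 0.0
    p, r = common / len(tb), common / len(ta)
    return 2 * p * r / (p + r)

def ccs_reward(question, queries, observations, entities, reconstruct):
    """Proxy reward: reconstruct the question from entity-masked queries plus
    retrieved observations (the final response is excluded), then score how
    well the reconstruction matches the original question."""
    scaffold = [mask_entities(q, entities) for q in queries]
    guess = reconstruct(scaffold, observations)
    return token_f1(question, guess)

# Toy usage with a stand-in "reconstructor" (a real system would prompt an LLM):
reward = ccs_reward(
    question="who discovered radium",
    queries=["biography of Marie Curie", "radium discovery"],
    observations=["Marie Curie discovered radium in 1898."],
    entities=["Marie Curie"],
    reconstruct=lambda scaffold, obs: "who discovered radium",
)
```

The masking step is what blocks the leakage the abstract warns about: without it, the reconstructor could copy entities straight out of the search queries instead of relying on what was actually retrieved.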
[426] Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem
Yinghao Qin, Mosab Bazargani, Edmund K. Burke, Carlos A. Coello Coello, Zhongmin Song, Jun Chen
Main category: cs.AI
TL;DR: Bilevel optimization framework for Electric Capacitated Vehicle Routing Problem using Late Acceptance Hill Climbing with surrogate objectives for routing and charging decisions
Details
Motivation: To efficiently solve the Electric Capacitated Vehicle Routing Problem (E-CVRP) by separating routing and charging decisions while maintaining their interaction through a bilevel optimization approach.
Method: Bilevel Late Acceptance Hill Climbing (b-LAHC) algorithm with three phases: greedy descent, neighborhood exploration, and final refinement. Uses a surrogate objective at the upper level to guide the search and accelerate convergence.
Result: Achieves superior/competitive performance against 8 state-of-the-art algorithms on IEEE WCCI-2020 benchmark. Sets 9/10 new best-known results on large-scale benchmarks with 1.07% average improvement
Conclusion: The bilevel framework effectively solves large-scale routing problems with hierarchical structure, validated by strong correlation between surrogate objective and complete cost
Abstract: This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.
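b-LAHC builds on the standard Late Acceptance Hill Climbing acceptance rule (Burke and Bykov), which accepts a candidate if it beats either the current cost or the cost recorded a fixed number of iterations earlier. A minimal single-level sketch of that core rule follows; the bilevel routing/charging decomposition and surrogate objective are not modeled here, and the quadratic objective is purely illustrative:

```python
import random

def lahc(cost, neighbor, init, history_len=50, iters=5000, seed=0):
    """Plain Late Acceptance Hill Climbing: accept a candidate if it improves
    on the current cost OR on the cost accepted history_len iterations ago.
    The delayed comparison lets the search escape shallow local optima while
    staying parameter-light, which is the property b-LAHC exploits."""
    rng = random.Random(seed)
    cur, cur_cost = init, cost(init)
    best, best_cost = cur, cur_cost
    history = [cur_cost] * history_len
    for i in range(iters):
        cand = neighbor(cur, rng)
        cand_cost = cost(cand)
        v = i % history_len                       # circular history slot
        if cand_cost <= history[v] or cand_cost <= cur_cost:
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        history[v] = cur_cost                     # record the accepted cost
    return best, best_cost

# Toy usage: minimize a 1-D quadratic by random local moves.
sol, val = lahc(cost=lambda x: (x - 3.0) ** 2,
                neighbor=lambda x, rng: x + rng.uniform(-0.5, 0.5),
                init=10.0)
```

Note the fixed `history_len` is the algorithm's only real parameter, which matches the paper's point that b-LAHC runs with fixed parameters and no complex adaptation.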
[427] PAL: Personal Adaptive Learner
Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja, Het Riteshkumar Shah, Finlay Palmer, Aryaman Bahl, Michelle A Ihetu, Amit Sheth
Main category: cs.AI
TL;DR: PAL is an AI-powered platform that transforms lecture videos into interactive learning experiences by analyzing multimodal content and dynamically adapting questions based on learner responses, then generating personalized summaries.
Details
Motivation: Current AI-driven education platforms are limited to static adaptation with predefined quizzes and uniform pacing, lacking real-time responsiveness to learners' evolving understanding. There's a need for context-aware systems that can adapt dynamically during learning sessions.
Method: PAL analyzes multimodal lecture content (likely combining visual and audio elements from videos) and dynamically engages learners through questions of varying difficulty. It adjusts questions based on learner responses in real-time and generates personalized summaries with tailored examples at the end of sessions.
Result: PAL contributes a novel framework for responsive digital learning that unites multimodal content analysis with adaptive decision-making, demonstrating how AI can move beyond static personalization toward real-time, individualized support.
Conclusion: The work addresses a core challenge in AI-enabled education by creating a system that is both context-aware and adaptive in real time, transforming passive lecture videos into interactive, personalized learning experiences.
Abstract: AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation–predefined quizzes, uniform pacing, or generic feedback–limiting their ability to respond to learners’ evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner’s interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.
[428] GeM-EA: A Generative and Meta-learning Enhanced Evolutionary Algorithm for Streaming Data-Driven Optimization
Yue Wu, Yuan-Ting Zhong, Ze-Yuan Ma, Yue-Jiao Gong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2604.12336: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12336&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[429] Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining
Mahmoud Amiri, Jamile Mohammad Jafari, Sara Mostafapour, Thomas Bocklitz
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2604.12498: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12498&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[430] ROSE: An Intent-Centered Evaluation Metric for NL2SQL
Wenqi Pei, Shizheng Hou, Boyan Li, Han Chen, Zhichao Shi, Yuyu Luo
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2604.12988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[431] SmellNet: A Large-scale Dataset for Real-world Smell Recognition
Dewei Feng, Wei Dai, Carol Li, Alistair Pernigo, Yunge Wen, Paul Pu Liang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2506.00239: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.00239&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[432] Fragile Preferences: A Deep Dive Into Order Effects in Large Language Models
Haonan Yin, Shai Vardi, Vidyanand Choudhary
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2506.14092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[433] Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling
Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2508.04282: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04282&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[434] Mantis: A Foundation Model for Mechanistic Disease Forecasting
Carson Dudley, Reiden Magdaleno, Christopher Harding, Ananya Sharma, Emily Martin, Marisa Eisenberg
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error (rate limiting) prevents access to arXiv API
Abstract: Failed to fetch summary for 2508.12260: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12260&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[435] Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
Yein Park, Minbyul Jeong, Jaewoo Kang
Main category: cs.AI
TL;DR: Failed to fetch summary for paper 2509.25758 due to HTTP 429 error (rate limiting)
Abstract: Failed to fetch summary for 2509.25758: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25758&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[436] ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Yein Park, Jungwoo Park, Jaewoo Kang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2509.25843 returned HTTP 429 (rate limited).
[437] The Stackelberg Speaker: Optimizing Persuasive Communication in Social Deduction Games
Zhang Zheng, Deheng Ye, Peilin Zhao, Hao Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.09087 returned HTTP 429 (rate limited).
[438] Mixed-Density Diffuser: Efficient Planning with Non-Uniform Temporal Resolution
Crimson Stambaugh, Rajesh P. N. Rao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.23026 returned HTTP 429 (rate limited).
[439] MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
Weihua Cheng, Junming Liu, Yifei Sun, Botian Shi, Yirong Chen, Ding Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.24168 returned HTTP 429 (rate limited).
[440] Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models
Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.00710 returned HTTP 429 (rate limited).
[441] DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.02627 returned HTTP 429 (rate limited).
[442] Dataset Safety in Autonomous Driving: Requirements, Risks, and Assurance
Alireza Abbaspour, Tejaskumar Balgonda Patil, B Ravi Kiran, Russel Mohr, Senthil Yogamani
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.08439 returned HTTP 429 (rate limited).
[443] Learning the Value of Value Learning
Alex John London, Aydin Mohseni
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.17714 returned HTTP 429 (rate limited).
[444] A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
Miles Q. Li, Benjamin C. M. Fung, Martin Weiss, Pulei Xiong, Khalil Al-Hussaeni, Claude Fachkha
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2512.20798 returned HTTP 429 (rate limited).
[445] No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning
Zhicong Li, Lingjie Jiang, Yulan Hu, Xingchen Zeng, Yixia Li, Xiangwen Zhang, Guanhua Chen, Zheng Pan, Xin Li, Yong Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.06794 returned HTTP 429 (rate limited).
[446] PrivacyReasoner: Can LLM Emulate a Human-like Privacy Mind?
Yiwen Tu, Xuan Liu, Lianhui Qin, Haojian Jin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.09152 returned HTTP 429 (rate limited).
[447] LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries
Xuancheng Ren, Shijing Hu, Zhihui Lu, Jiangqi Huang, Qiang Duan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2601.10398 returned HTTP 429 (rate limited).
[448] RegD: Hierarchical Embeddings via Dissimilarity between Arbitrary Euclidean Regions
Hui Yang, Jiaoyan Chen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2501.17518 returned HTTP 429 (rate limited).
[449] WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, Dehan Kong
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.05044 returned HTTP 429 (rate limited).
[450] Characterizing higher-order representations through generative diffusion models explains human decoded neurofeedback performance
Hojjat Azimi Asrari, Megan A. K. Peters
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2503.14333 returned HTTP 429 (rate limited).
[451] A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning
Tianyu Yang, Sihong Wu, Yilun Zhao, Zhenwen Liang, Lisen Dai, Chen Zhao, Minhao Cheng, Arman Cohan, Xiangliang Zhang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.08291 returned HTTP 429 (rate limited), so relevance to multimodal LLMs could not be assessed.
[452] On the Geometry of Receiver Operating Characteristic and Precision-Recall Curves
Reza Sameni
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2504.02169 returned HTTP 429 (rate limited).
[453] Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI
Houston Haynes
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.18104 returned HTTP 429 (rate limited).
[454] Man and machine: artificial intelligence and judicial decision making
Arthur Dyevre, Ahmad Shahvaroughi
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.19042 returned HTTP 429 (rate limited).
[455] Evaluating Language Models for Harmful Manipulation
Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2603.25326 returned HTTP 429 (rate limited).
[456] CODESTRUCT: Code Agents over Structured Action Spaces
Myeongsoo Kim, Joe Hsu, Dingmin Wang, Shweta Garg, Varun Kumar, Murali Krishna Ramanathan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.05407 returned HTTP 429 (rate limited).
[457] Fast AI Model Partition for Split Learning over Edge Networks
Zuguang Li, Wen Wu, Shaohua Wu, Xuemin Shen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.01041 returned HTTP 429 (rate limited).
[458] Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery
Carson Dudley, Reiden Magdaleno, Christopher Harding, Marisa Eisenberg
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.08977 returned HTTP 429 (rate limited).
[459] SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Zhou, Jiaqing Liang, Yanghua Xiao
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.08988 returned HTTP 429 (rate limited).
[460] Teaching the Teacher: The Role of Teacher-Student Smoothness Alignment in Genetic Programming-based Symbolic Distillation
Soumyadeep Dhar, Kei Sen Fong, Mehul Motani
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.22767 returned HTTP 429 (rate limited).
[461] Hubble: An LLM-Driven Agentic Framework for Safe, Diverse, and Reproducible Alpha Factor Discovery
Runze Shi, Shengyu Yan, Yuecheng Cai, Chengxi Lv
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.09601 returned HTTP 429 (rate limited).
[462] Dead Cognitions: A Census of Misattributed Insights
Aaron Tuor, claude.ai
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.10288 returned HTTP 429 (rate limited).
[463] EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation
Chongliu Jia, Yi Luo, Sipeng Han, Pengwei Li, Jie Ding, Youshuang Hu, Yimiao Qian, Qiya Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.10911 returned HTTP 429 (rate limited).
[464] EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
Jincheng Xie, Xingchen Xiao, Runheng Liu, Zhongyi Huang, Yu Zheng, Heyan Huang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.11043 returned HTTP 429 (rate limited).
[465] Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Wenkai Li, Fan Yang, Shaunak A. Mehta, Koichi Onoue
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.11120 returned HTTP 429 (rate limited).
[466] Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems
Charafeddine Mouzouni
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.11623 returned HTTP 429 (rate limited).
[467] GTCN-G: A Residual Graph-Temporal Fusion Network for Imbalanced Intrusion Detection
Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Qi Hu, Yan Li, Chang Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.07285 returned HTTP 429 (rate limited).
[468] RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2604.11626 returned HTTP 429 (rate limited).
[469] A2-DIDM: Privacy-preserving Accumulator-enabled Auditing for Distributed Identity of DNN Model
Tianxiu Xie, Keke Gai, Jing Yu, Liehuang Zhu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2405.04108 returned HTTP 429 (rate limited).
[470] Siamese Foundation Models for Crystal Structure Prediction
Liming Wu, Wenbing Huang, Rui Jiao, Jianxing Huang, Liwei Liu, Yipeng Zhou, Hao Sun, Yang Liu, Fuchun Sun, Yuxiang Ren, Jirong Wen
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2503.10471 returned HTTP 429 (rate limited).
[471] SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2506.01979 returned HTTP 429 (rate limited).
[472] Global optimization tailored for graphics processing units: Complete and rigorous search for large-scale nonlinear minimization
Guanglu Zhang, Qihang Shan, Jonathan Cagan
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.01770 returned HTTP 429 (rate limited).
[473] Mobile GUI Agents under Real-world Threats: Are We There Yet?
Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.04227 returned HTTP 429 (rate limited).
[474] Improved particle swarm optimization algorithm: multi-target trajectory optimization for swarm drones
Minze Li, Wei Zhao, Ran Chen, Mingqiang Wei
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.13647 returned HTTP 429 (rate limited).
[475] Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance
Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2511.21356 returned HTTP 429 (rate limited).
[476] ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge
Zihan Zhao, Ziping Wan, Lu Chen, Xuanze Lin, Shiyang Yu, Situo Zhang, Da Ma, Zichen Zhu, Danyang Zhang, Huayang Wang, Zhongyang Dai, Liyang Wen, Bo Chen, Xin Chen, Kai Yu
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2507.21990 returned HTTP 429 (rate limited).
[477] Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Nazanin Sepahvand, Nicolas Chapados, Quentin Cappart, Alexandre Lacoste, Krishnamurthy Dj Dvijotham, Alexandre Drouin
Main category: cs.AI
TL;DR: Summary unavailable: the arXiv API request for 2510.05159 returned HTTP 429 (rate limited).
[478] Generative Modeling Enables Molecular Structure Retrieval from Coulomb Explosion Imaging
Xiang Li, Till Jahnke, Rebecca Boll, Jiaqi Han, Minkai Xu, Michael Meyer, Maria Novella Piancastelli, Daniel Rolles, Artem Rudenko, Florian Trinter, Thomas J.A. Wolf, Jana B. Thayer, James P. Cryan, Stefano Ermon, Phay J. Ho
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.00179 returned HTTP 429 (rate limited).
[479] GeoPl@ntNet: A Platform for Exploring Essential Biodiversity Variables
Lukas Picek, César Leblanc, Alexis Joly, Pierre Bonnet, Rémi Palard, Maximilien Servajean
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.13790 returned HTTP 429 (rate limited).
[480] HiFiNet: Hierarchical Fault Identification in Wireless Sensor Networks via Edge-Based Classification and Graph Aggregation
Nguyen Tri Nghia, Nguyen Van Son, Nguyen Thi Hanh
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.17537 returned HTTP 429 (rate limited).
[481] BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi
Main category: cs.AI
Summary unavailable: the arXiv API request for 2511.22364 returned HTTP 429 (rate limited).
[482] Red Teaming Large Reasoning Models
Jiawei Chen, Yang Yang, Chao Yu, Yu Tian, Zhi Cao, Xue Yang, Linghao Li, Hang Su, Zhaoxia Yin
Main category: cs.AI
Summary unavailable: the arXiv API request for 2512.00412 returned HTTP 429 (rate limited).
[483] Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Xiangyu Li, Chengyu Yin, Weijun Wang, Jianyu Wei, Ting Cao, Yunxin Liu
Main category: cs.AI
Summary unavailable: the arXiv API request for 2512.06443 returned HTTP 429 (rate limited).
[484] MAML-KT: Addressing Cold Start Problem in Knowledge Tracing for New Students via Few-Shot Model-Agnostic Meta Learning
Indronil Bhattacharjee, Christabel Wayllace
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.00137 returned HTTP 429 (rate limited).
[485] Safe-FedLLM: Delving into the Safety of Federated Large Language Models
Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2601.07177 returned HTTP 429 (rate limited).
[486] Evaluating LLM-Generated ACSL Annotations for Formal Verification
Arshad Beg, Diarmuid O’Donoghue, Rosemary Monahan
Main category: cs.AI
Summary unavailable: the arXiv API request for 2602.13851 returned HTTP 429 (rate limited).
[487] Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks
Yuxiang Zhang, Bin Ma, Enyan Dai
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.05004 returned HTTP 429 (rate limited).
[488] Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
Saron Samuel, Alexander Martin, Eugene Yang, Andrew Yates, Dawn Lawrie, Laura Dietz, Benjamin Van Durme
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.08819 returned HTTP 429 (rate limited).
[489] ALL-FEM: Agentic Large Language models Fine-tuned for Finite Element Methods
Rushikesh Deotale, Adithya Srinivasan, Yuan Tian, Tianyi Zhang, Pavlos Vlachos, Hector Gomez
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.21011 returned HTTP 429 (rate limited).
[490] Suiren-1.0 Technical Report: A Family of Molecular Foundation Models
Junyi An, Xinyu Lu, Yun-Fei Shi, Li-Cheng Xu, Nannan Zhang, Chao Qu, Yuan Qi, Fenglei Cao
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.21942 returned HTTP 429 (rate limited).
[491] Efficient and Scalable Granular-ball Graph Coarsening Method for Large-scale Graph Node Classification
Guan Wang, Shuyin Xia, Lei Qian, Tao Wu, Guoyin Wang, Yi Wang, Wei Wang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.29148 returned HTTP 429 (rate limited).
[492] Decidable By Construction: Design-Time Verification for Trustworthy AI
Houston Haynes
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.25414 returned HTTP 429 (rate limited).
[493] Quantifying Cross-Modal Interactions in Multimodal Glioma Survival Prediction via InterSHAP: Evidence for Additive Signal Integration
Iain Swift, JingHua Ye, Ruairi O’Reilly
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.29977 returned HTTP 429 (rate limited).
[494] Building evidence-based knowledge bases from full-text literature for disease-specific biomedical reasoning
Chang Zong, Sicheng Lv, Si-tu Xue, Huilin Zheng, Jian Wan, Lei Zhang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2603.28325 returned HTTP 429 (rate limited).
[495] Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning
Neharika Jali, Anupam Nayak, Gauri Joshi
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.05164 returned HTTP 429 (rate limited).
[496] MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.06798 returned HTTP 429 (rate limited).
[497] Exact Structural Abstraction and Tractability Limits
Tristan Simas
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.07349 returned HTTP 429 (rate limited).
[498] Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks
Yuto Omae, Kazuki Sakai, Yohei Kakimoto, Makoto Sasaki, Yusuke Sakai, Hirotaka Takahashi
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.10202 returned HTTP 429 (rate limited).
[499] Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents
Justice Owusu Agyemang, Jerry John Kponyo, Elliot Amponsah, Godfred Manu Addo Boakye, Kwame Opuni-Boachie Obour Agyekum
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.10842 returned HTTP 429 (rate limited).
[500] Towards Autonomous Mechanistic Reasoning in Virtual Cells
Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.11661 returned HTTP 429 (rate limited).
[501] Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems
Mohammed Ezzaldin Babiker Abdullah
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.11807 returned HTTP 429 (rate limited).
[502] Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving
Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, and Jun Ma
Main category: cs.AI
Summary unavailable: the arXiv API request for 2604.11734 returned HTTP 429 (rate limited).
cs.SD
[503] CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
Gaoxiang Cong, Liang Li, Jiaxin Ye, Zhedong Zhang, Hongming Shan, Yuankai Qi, Qingming Huang
Main category: cs.SD
TL;DR: A novel movie dubbing framework using Cognitive Synchronous Diffusion Transformer (CoSync-DiT) with flow matching and Joint Semantic and Alignment Regularization for precise lip-sync and natural speech generation.
Details
Motivation: Existing movie dubbing methods fail to achieve precise lip-sync and naturalness due to explicit duration alignment. Implicit alignment solutions suffer from reference audio interference, causing timbre and pronunciation degradation in real-world scenarios.
Method: Proposes a flow matching-based movie dubbing framework with the Cognitive Synchronous Diffusion Transformer (CoSync-DiT) that progressively guides noise-to-speech generation through acoustic style adaptation, fine-grained visual calibration, and time-aware context alignment. Includes Joint Semantic and Alignment Regularization (JSAR) to constrain frame-level temporal consistency and semantic consistency.
Result: Extensive experiments on standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate state-of-the-art performance across multiple metrics.
Conclusion: The proposed CoSync-DiT framework with JSAR effectively addresses lip-sync precision and naturalness issues in movie dubbing, outperforming existing methods in both controlled and real-world scenarios.
Abstract: Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. While implicit alignment solutions have emerged, they remain susceptible to interference from the reference audio, triggering timbre and pronunciation degradation in in-the-wild scenarios. In this paper, we propose a novel flow matching-based movie dubbing framework driven by the Cognitive Synchronous Diffusion Transformer (CoSync-DiT), inspired by the cognitive process of professional actors. This architecture progressively guides the noise-to-speech generative trajectory by performing acoustic style adaptation, fine-grained visual calibration, and time-aware context alignment. Furthermore, we design the Joint Semantic and Alignment Regularization (JSAR) mechanism to simultaneously constrain frame-level temporal consistency on the contextual outputs and semantic consistency on the flow hidden states, ensuring robust alignment. Extensive experiments on both standard benchmarks and challenging in-the-wild dubbing benchmarks demonstrate that our method achieves state-of-the-art performance across multiple metrics.
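The flow matching objective that frameworks like this build on can be sketched generically: sample a noise endpoint, a random time on a linear interpolation path, and regress the constant target velocity. This is a minimal sketch of conditional flow matching, not the paper's actual model; the conditioning on video frames and reference audio, and the function names, are illustrative.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """One conditional flow matching training example.

    x1: clean target sample (e.g. a mel-spectrogram frame). Conditioning
    inputs (video features, reference audio) are omitted. Returns
    (t, x_t, v_target); a network would regress v_target from (x_t, t,
    conditioning).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                   # constant target velocity
    return t, x_t, v_target

def fm_loss(v_pred, v_target):
    # mean-squared error between predicted and target velocity fields
    return float(np.mean((v_pred - v_target) ** 2))
```

At inference, integrating the learned velocity field from t=0 to t=1 maps noise to speech, which is the "noise-to-speech generative trajectory" the abstract refers to.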
[504] On the Distillation Loss Functions of Speech VAE for Unified Reconstruction, Understanding, and Generation
Changhao Cheng, Wei Wang, Wangyou Zhang, Dongya Jia, Jian Wu, Zhuo Chen, Yanmin Qian
Main category: cs.SD
TL;DR: Systematic exploration of VAE-SSL alignment methods for speech representations, comparing different distillation approaches across reconstruction, understanding, and generation tasks.
Details
Motivation: While VAE-based continuous speech representations show promise for speech generation, current alignment methods with SSL features using time-axis distillation may not be optimal for broader task performance. The paper aims to systematically evaluate different alignment approaches.
Method: Investigates various design choices in the distillation loss for aligning VAE latent representations with SSL features. Explores different alignment approaches, including joint-marginal alignment with adaptive weighting, comparing them across three axes: reconstruction, understanding, and generation tasks.
Result: Extensive experiments show that joint-marginal alignment approach with adaptive weighting achieves the best overall performance while allowing for controllable balance between different task objectives.
Conclusion: The optimal alignment approach for VAE-SSL speech representations depends on task requirements, with joint-marginal alignment offering the best trade-off across reconstruction, understanding, and generation tasks.
Abstract: Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information in VAE latent representations by aligning with self-supervised learning (SSL) features, aiming for better generation performance. However, it remains unclear whether the widely-used alignment approach based on time-axis distillation is optimal when considering more tasks. To address this problem, this paper systematically explores different alignment approaches and analyzes their impact on the performances over three axes: reconstruction, understanding, and generation. We investigate various design choices in the distillation loss. Extensive experiments show that the joint-marginal alignment approach with adaptive weighting can achieve the best overall performance while allowing for a controllable balance.
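A minimal sketch of the time-axis distillation baseline the paper compares against: a frame-wise cosine loss pulling each VAE latent frame toward the SSL feature at the same time step, combined with a weighted reconstruction term. The shapes, the projection to a common dimension (omitted here), and the weighting scheme are assumptions, not the paper's exact loss.

```python
import numpy as np

def time_axis_distill_loss(z_vae, f_ssl, eps=1e-8):
    """Frame-wise cosine distillation between VAE latents and SSL features.

    z_vae, f_ssl: arrays of shape (T, D), already projected to a common
    dimension D. Returns 0 when every frame pair is perfectly aligned.
    """
    zn = z_vae / (np.linalg.norm(z_vae, axis=-1, keepdims=True) + eps)
    fn = f_ssl / (np.linalg.norm(f_ssl, axis=-1, keepdims=True) + eps)
    cos = np.sum(zn * fn, axis=-1)      # per-frame cosine similarity
    return float(np.mean(1.0 - cos))

def combined_loss(recon, distill, w):
    # weighted sum of reconstruction and distillation terms; in the paper
    # the weighting is adaptive rather than a fixed scalar
    return recon + w * distill
```

Tuning `w` (or making it adaptive, as the paper's joint-marginal variant does) is what trades reconstruction fidelity against understanding/generation performance.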
[505] Audio Source Separation in Reverberant Environments using $β$-divergence based Nonnegative Factorization
Mahmoud Fakhry, Piergiorgio Svaizer, Maurizio Omologo
Main category: cs.SD
TL;DR: Nonnegative factorization with prior information improves Gaussian model-based multichannel audio source separation by controlling sparsity through β-divergence optimization.
Details
Motivation: Improve audio source separation by incorporating prior information about source spectral variances into the parameter estimation process, moving beyond traditional EM-based approaches.
Method: Proposes nonnegative factorization with prior spectral basis matrices (either extracted directly or drawn from a pre-trained library) and nonnegative tensor factorization to extract or detect the optimal basis matrices. Uses β-divergence minimization with multiplicative update rules, where β controls sparsity.
Result: Experiments show sparsity control (rather than β value in training) is crucial for separation performance. Method provides better separation quality than comparable algorithms across various mixing conditions.
Conclusion: Nonnegative factorization with prior information and sparsity control via β-divergence optimization significantly improves multichannel audio source separation performance.
Abstract: In Gaussian model-based multichannel audio source separation, the likelihood of observed mixtures of source signals is parametrized by source spectral variances and by associated spatial covariance matrices. These parameters are estimated by maximizing the likelihood through an Expectation-Maximization algorithm and used to separate the signals by means of multichannel Wiener filtering. We propose to estimate these parameters by applying nonnegative factorization based on prior information on source variances. In the nonnegative factorization, spectral basis matrices can be defined as the prior information. The matrices can be either extracted or indirectly made available through a redundant library that is trained in advance. In a separate step, applying nonnegative tensor factorization, two algorithms are proposed in order to either extract or detect the basis matrices that best represent the power spectra of the source signals in the observed mixtures. The factorization is achieved by minimizing the $β$-divergence through multiplicative update rules. The sparsity of factorization can be controlled by tuning the value of $β$. Experiments show that sparsity, rather than the value assigned to $β$ in the training, is crucial in order to increase the separation performance. The proposed method was evaluated in several mixing conditions. It provides better separation quality with respect to other comparable algorithms.
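The multiplicative update rule for β-divergence NMF that the abstract relies on can be sketched for the simplified case of a fixed prior basis W (only the activations H are updated, matching the idea of basis matrices supplied as prior information). The function name and the fixed-W simplification are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def beta_nmf_activations(V, W, beta=1.0, n_iter=200, eps=1e-9):
    """Estimate activations H for V ≈ W H with a fixed prior basis W.

    Multiplicative updates minimizing the beta-divergence:
    beta=2 is Euclidean, beta=1 is Kullback-Leibler, beta=0 is
    Itakura-Saito. Updates keep H nonnegative by construction.
    """
    rng = np.random.default_rng(0)
    H = rng.uniform(0.1, 1.0, size=(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        WH = W @ H + eps
        num = W.T @ (WH ** (beta - 2) * V)
        den = W.T @ (WH ** (beta - 1)) + eps
        H *= num / den                   # multiplicative update
    return H
```

With W fixed, the KL objective (β = 1) is convex in H, so these updates converge reliably; the estimated variances W H would then feed the multichannel Wiener filter described in the abstract.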
[506] Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning
Mahmoud Fakhry, Ascensión Gallardo-Antolín
Main category: cs.SD
TL;DR: Optimizing time-frequency atom resolution and regularization for heart sound classification using deep learning with Gabor atoms and elastic net regularization.
Details
Motivation: To improve heart sound signal representations for better classification of heart valvular conditions by optimizing the time-frequency resolution of the dictionary atoms and the regularization of the fitting models.
Method: Uses elastic net regularization with a dictionary of Gabor atoms to create fitting models, then evaluates classification performance with two DL architectures (1D CNN+LSTM and 1D+2D CNN+LSTM) trained with SGDM and ADAM.
Result: Achieved 98.95% classification accuracy using the second architecture with ADAM training and feature matrices from optimal models with high-time low-frequency resolution Gabor atoms and sparsity constraints.
Conclusion: Optimal time-frequency resolution and regularization significantly improve heart sound classification, with the combination of 1D+2D CNN+LSTM architecture and ADAM training yielding best results.
Abstract: In this article, we propose the optimization of the resolution of time-frequency atoms and the regularization of fitting models to obtain better representations of heart sound signals. This is done by evaluating the classification performance of deep learning (DL) networks in discriminating five heart valvular conditions based on a new class of time-frequency feature matrices derived from the fitting models. We inspect several combinations of resolution and regularization, and the optimal combination is the one that provides the highest performance. To this end, a fitting model is obtained based on a heart sound signal and an overcomplete dictionary of Gabor atoms using elastic net regularization of linear models. We consider two different DL architectures, the first mainly consisting of a 1D convolutional neural network (CNN) layer and a long short-term memory (LSTM) layer, while the second is composed of 1D and 2D CNN layers followed by an LSTM layer. The networks are trained with two algorithms, namely stochastic gradient descent with momentum (SGDM) and adaptive moment (ADAM). Extensive experimentation has been conducted using a database containing heart sound signals of five heart valvular conditions. The best classification accuracy of $98.95\%$ is achieved with the second architecture when trained with ADAM and feature matrices derived from optimal models obtained with a Gabor dictionary consisting of atoms with high-time low-frequency resolution and imposing sparsity on the models.
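The two ingredients the abstract combines, a normalized Gabor dictionary and the elastic net objective it is fit under, can be sketched as follows. The atom parameterization and parameter grid are illustrative assumptions; in practice the fitting step would be delegated to an off-the-shelf elastic net solver.

```python
import numpy as np

def gabor_atom(n, center, freq, width):
    """Discrete Gabor atom: Gaussian window times a cosine carrier,
    normalized to unit l2 norm. `width` sets the time/frequency
    resolution trade-off the paper optimizes."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - center) / width) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / (np.linalg.norm(g) + 1e-12)

def gabor_dictionary(n, centers, freqs, width):
    # one column per (center, frequency) pair -> overcomplete for dense grids
    return np.stack([gabor_atom(n, c, f, width)
                     for c in centers for f in freqs], axis=1)

def elastic_net_objective(x, D, coef, lam, l1_ratio):
    """0.5*||x - D c||^2 + lam*(l1_ratio*||c||_1 + 0.5*(1-l1_ratio)*||c||_2^2).
    The l1 term imposes the sparsity the abstract finds crucial."""
    resid = x - D @ coef
    return (0.5 * resid @ resid
            + lam * (l1_ratio * np.abs(coef).sum()
                     + 0.5 * (1.0 - l1_ratio) * coef @ coef))
```

Minimizing this objective over `coef` (e.g. with a coordinate-descent elastic net solver) yields the fitting model whose coefficients form the time-frequency feature matrices fed to the DL classifiers.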
[507] Adaptive Test-Time Scaling for Zero-Shot Respiratory Audio Classification
Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed
Main category: cs.SD
TL;DR: TRIAGE: A tiered zero-shot framework for respiratory audio analysis that adaptively scales test-time computation based on input difficulty, using progressive reasoning stages from simple scoring to LLM reasoning.
Details
Motivation: Respiratory audio analysis faces challenges with scarce labeled data and expensive expert annotation. Zero-shot methods eliminate task-specific supervision but apply uniform computation to all inputs regardless of difficulty, wasting resources on easy cases.
Method: Three-tiered framework: Tier-L uses fast label-cosine scoring in an audio-text embedding space; Tier-M adds structured matching with clinician-style descriptors; Tier-H employs retrieval-augmented LLM reasoning. A confidence-based router allocates compute adaptively, allowing easy samples to exit early.
Result: Achieves mean AUROC of 0.744 across nine respiratory classification tasks without task-specific training, outperforming prior zero-shot methods and matching/exceeding supervised baselines. Nearly half of samples exit at cheapest tier, with uncertain cases seeing up to 19% relative improvement.
Conclusion: TRIAGE demonstrates effective adaptive computation for zero-shot audio analysis, concentrating computational gains where needed most while maintaining efficiency through early exits for confident predictions.
Abstract: Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every input regardless of difficulty. We introduce TRIAGE, a tiered zero-shot framework that adaptively scales test-time compute by routing each audio sample through progressively richer reasoning stages: fast label-cosine scoring in a joint audio-text embedding space (Tier-L), structured matching with clinician-style descriptors (Tier-M), and retrieval-augmented large language model reasoning (Tier-H). A confidence-based router finalizes easy predictions early while allocating additional computation to ambiguous inputs, enabling nearly half of all samples to exit at the cheapest tier. Across nine respiratory classification tasks without task-specific training, TRIAGE achieves a mean AUROC of 0.744, outperforming prior zero-shot methods and matching or exceeding supervised baselines on multiple tasks. Our analysis shows that test-time scaling concentrates gains where they matter: uncertain cases see up to 19% relative improvement while confident predictions remain unchanged at minimal cost.
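The confidence-gated routing idea behind TRIAGE can be sketched as below. This is an illustrative stand-in, not the paper's implementation: the three tier functions and the thresholds are invented placeholders for the real scorers.

```python
import numpy as np

rng = np.random.default_rng(0)

def tier_l(x):   # cheap label-cosine-style scorer -> (label, confidence)
    score = float(np.tanh(x.sum()))
    return score > 0, abs(score)

def tier_m(x):   # richer structured matcher (hypothetical stand-in)
    score = float(np.tanh(2 * x.sum()))
    return score > 0, min(1.0, abs(score) + 0.3)

def tier_h(x):   # most expensive LLM-reasoning tier: always finalizes
    return x.sum() > 0, 1.0

def route(x, thresholds=(0.8, 0.9)):
    """Escalate only when the cheaper tier's confidence is too low."""
    label, conf = tier_l(x)
    if conf >= thresholds[0]:
        return label, "L"          # early exit at the cheapest tier
    label, conf = tier_m(x)
    if conf >= thresholds[1]:
        return label, "M"
    return tier_h(x)[0], "H"

samples = [rng.normal(size=8) for _ in range(100)]
exits = [route(x)[1] for x in samples]
print({tier: exits.count(tier) for tier in "LMH"})
```

With these toy scorers most samples exit at Tier-L, mirroring the paper's observation that nearly half the inputs never need the expensive tiers.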
[508] Transformer Based Machine Fault Detection From Audio Input
Kiran Voderhobli Holla
Main category: cs.SD
TL;DR: Transformer-based architectures outperform CNNs for machine fault detection from audio spectrograms, showing better embeddings and performance with sufficient data despite lower inductive biases.
Details
Motivation: Traditional CNN architectures for sound-based machine fault detection have biases (locality, parameter-sharing) that may not be optimal for spectrogram analysis. With transformers' success in vision tasks and their lower inductive biases, there's interest in applying them to Sound AI for potentially better performance.
Method: Apply transformer-based architectures (likely Vision Transformer or similar) to analyze spectrogram images generated from machine sounds. Compare transformer embeddings with CNN embeddings specifically for machine fault detection tasks.
Result: Transformer-driven architectures demonstrate effectiveness in analyzing sound data and generate better embeddings than CNNs for machine fault detection, especially when sufficient data is available.
Conclusion: Transformer-based models are superior to CNNs for spectrogram analysis in Sound AI applications like machine fault detection, leveraging their lower inductive biases to learn more effective representations from audio data.
Abstract: In recent years, Sound AI is being increasingly used to predict machine failures. By attaching a microphone to the machine of interest, one can get real time data on machine behavior from the field. Traditionally, Convolutional Neural Net (CNN) architectures have been used to analyze spectrogram images generated from the sounds captured and predict if the machine is functioning as expected. CNN architectures seem to work well empirically even though they have biases like locality and parameter-sharing which may not be completely relevant for spectrogram analysis. With the successful application of transformer-based models in the field of image processing starting with Vision Transformer (ViT) in 2020, there has been significant interest in leveraging these in the field of Sound AI. Since transformer-based architectures have significantly lower inductive biases, they are expected to perform better than CNNs at spectrogram analysis given enough data. This paper demonstrates the effectiveness of transformer-driven architectures in analyzing Sound data and compares the embeddings they generate with CNNs on the specific task of machine fault detection.
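The spectrogram-to-transformer input path described above can be sketched with plain numpy. This is a minimal, hypothetical preprocessing sketch (window, hop, and patch sizes are illustrative), showing how machine audio becomes the flattened patches a ViT-style classifier consumes.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude STFT: Hann-windowed frames -> (freq, time) image."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

def patchify(spec, patch=16):
    """Crop to a patch multiple and flatten non-overlapping patches."""
    f = (spec.shape[0] // patch) * patch
    t = (spec.shape[1] // patch) * patch
    spec = spec[:f, :t]
    patches = spec.reshape(f // patch, patch, t // patch, patch)
    return patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

audio = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 16000))  # toy 50 Hz hum
spec = spectrogram(audio)
tokens = patchify(np.log1p(spec))
print(tokens.shape)
```

Each row of `tokens` plays the role of a ViT input token; a linear projection plus positional embeddings would follow before the transformer encoder.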
[509] SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
Luoyi Sun, Xiao Zhou, Zeqian Li, Ya Zhang, Yanfeng Wang, Weidi Xie
Main category: cs.SD
TL;DR: SpotSound is an audio-language model designed for temporal grounding of audio events, addressing limitations in existing ALMs by introducing a novel training objective to suppress hallucinated timestamps and creating a challenging benchmark with sparse target events.
Details
Motivation: Current Large Audio-Language Models (ALMs) excel at holistic audio understanding but are unreliable for temporal grounding - pinpointing when events occur in long-form audio. This stems from training data lacking precise timestamps and benchmarks that don't simulate real-world scenarios where short events are obscured by dense background sounds.
Method: SpotSound introduces a novel training objective specifically designed to suppress hallucinated timestamps for events absent from the input. The authors also present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than ~10% of each clip, creating a rigorous 'needle-in-a-haystack' evaluation scenario.
Result: Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks.
Conclusion: SpotSound addresses key limitations in ALMs for temporal grounding through innovative training objectives and challenging benchmarks, advancing the field of audio event localization in complex acoustic environments.
Abstract: Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio language model designed for grounding audio events. SpotSound incorporates a novel training objective, specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than ~10% of each clip, creating a rigorous 'needle-in-a-haystack' evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models and benchmark are released on https://loiesun.github.io/spotsound/
[510] Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies
Anis Hamadouche, Haifeng Luo, Mathini Sellathurai, Amir Hussain, Tharm Ratnarajah
Main category: cs.SD
TL;DR: Summary unavailable: the arXiv API request for 2508.08468 was rate-limited (HTTP 429).
cs.LG
[511] Uncertainty Quantification in CNN Through the Bootstrap of Convex Neural Networks
Hongfei Du, Emre Barut, Fang Jin
Main category: cs.LG
TL;DR: A bootstrap-based framework for uncertainty quantification in CNNs with theoretical consistency guarantees, using convexified neural networks and warm-start optimization for computational efficiency.
Details
Motivation: CNNs lack efficient uncertainty quantification tools, limiting their application in critical domains like medicine where prediction uncertainty is essential. Existing deep learning UQ approaches lack theoretical consistency guarantees for uncertainty quality.
Method: Proposes a bootstrap-based framework using convexified neural networks to establish theoretical consistency. Uses warm-start optimization at each bootstrap iteration to avoid refitting from scratch, and includes a transfer learning method to apply the framework to arbitrary neural networks.
Result: Experimental results show significantly better performance compared to baseline CNNs and state-of-the-art methods on various image datasets, with much lower computational load than competitors.
Conclusion: The proposed bootstrap framework provides theoretically consistent uncertainty quantification for CNNs with computational efficiency, enabling reliable uncertainty estimation in critical applications like medical imaging.
Abstract: Despite the popularity of Convolutional Neural Networks (CNN), the problem of uncertainty quantification (UQ) of CNN has been largely overlooked. Lack of efficient UQ tools severely limits the application of CNN in certain areas, such as medicine, where prediction uncertainty is critically important. Among the few existing UQ approaches that have been proposed for deep learning, none offers theoretical consistency guarantees for the quality of the uncertainty estimates. To address this issue, we propose a novel bootstrap-based framework for the estimation of prediction uncertainty. The inference procedure we use relies on convexified neural networks to establish the theoretical consistency of the bootstrap. Our approach incurs significantly less computational load than its competitors, as it relies on warm starts at each bootstrap iteration that avoid refitting the model from scratch. We further explore a novel transfer learning method so our framework can work on arbitrary neural networks. We experimentally demonstrate that our approach performs much better than baseline CNNs and state-of-the-art methods on various image datasets.
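The warm-started bootstrap idea can be sketched as follows. This is a toy illustration under my own assumptions (a linear model fitted by gradient descent stands in for the CNN; data, step size, and replicate counts are arbitrary), showing how weights carried across replicates avoid refitting from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(5)                  # shared weights: the warm start
preds = []
x_new = np.ones(5)               # query point for the prediction interval
for _ in range(30):              # bootstrap replicates
    idx = rng.integers(0, len(X), len(X))   # resample with replacement
    Xb, yb = X[idx], y[idx]
    for _ in range(300):         # a short warm-started optimization run
        w -= 0.01 * Xb.T @ (Xb @ w - yb) / len(yb)
    preds.append(float(x_new @ w))

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"95% bootstrap interval at x=1: [{lo:.2f}, {hi:.2f}]")
```

Because each replicate starts from the previous replicate's weights, only a short optimization run per resample is needed; the spread of `preds` across replicates is the uncertainty estimate.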
[512] Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning
Hongxi Mao, Wei Zhou, Mengting Jia, Tao Fang, Huan Gao, Bin Zhang, Shangyang Li
Main category: cs.LG
TL;DR: LLM-based method creates transferable tabular embeddings by converting structured variables to natural language statements, enabling zero-shot schema alignment without retraining, applied to multimodal dementia diagnosis combining tabular and MRI data.
Details
Motivation: Poor schema generalization in tabular data, especially in domains like clinical medicine where EHR schemas vary significantly, limits machine learning applications. Current approaches lack semantic understanding of structured variables.
Method: Schema-Adaptive Tabular Representation Learning transforms structured variables into semantic natural language statements, encodes them with a pretrained LLM to create transferable embeddings, enabling zero-shot alignment across unseen schemas. Integrated into multimodal framework combining tabular and MRI data for dementia diagnosis.
Result: State-of-the-art performance on NACC and ADNI datasets, successful zero-shot transfer to unseen schemas, significantly outperforming clinical baselines including board-certified neurologists in retrospective diagnostic tasks.
Conclusion: LLM-driven approach provides scalable, robust solution for heterogeneous real-world data, offering pathway to extend LLM-based reasoning to structured domains with demonstrated effectiveness in multimodal clinical diagnosis.
Abstract: Machine learning for tabular data remains constrained by poor schema generalization, a challenge rooted in the lack of semantic understanding of structured variables. This challenge is particularly acute in domains like clinical medicine, where electronic health record (EHR) schemas vary significantly. To solve this problem, we propose Schema-Adaptive Tabular Representation Learning, a novel method that leverages large language models (LLMs) to create transferable tabular embeddings. By transforming structured variables into semantic natural language statements and encoding them with a pretrained LLM, our approach enables zero-shot alignment across unseen schemas without manual feature engineering or retraining. We integrate our encoder into a multimodal framework for dementia diagnosis, combining tabular and MRI data. Experiments on NACC and ADNI datasets demonstrate state-of-the-art performance and successful zero-shot transfer to unseen schemas, significantly outperforming clinical baselines, including board-certified neurologists, in retrospective diagnostic tasks. These results validate our LLM-driven approach as a scalable, robust solution for heterogeneous real-world data, offering a pathway to extend LLM-based reasoning to structured domains.
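The serialization step, turning structured variables into natural-language statements an LLM can embed, can be sketched as below. The field names and templates are hypothetical examples, not the paper's actual schema.

```python
# Map each structured variable to a natural-language template.
# Unknown fields fall back to a generic statement, which is what lets
# an unseen schema still produce embeddable text (zero-shot alignment).
TEMPLATES = {
    "age": "The patient is {} years old.",
    "mmse": "The MMSE cognitive score is {}.",
    "hippo_vol": "Hippocampal volume is {} cubic millimeters.",
}

def serialize(record):
    parts = [TEMPLATES.get(key, f"The value of {key} is {{}}.").format(value)
             for key, value in record.items() if value is not None]
    return " ".join(parts)

row = {"age": 74, "mmse": 21, "hippo_vol": 5830, "apoe4": 1}
text = serialize(row)
print(text)
# In the paper's pipeline, a pretrained LLM would then encode this text,
# e.g. embedding = llm.encode(text)  (hypothetical call).
```

Since the output is plain text, any pretrained text encoder produces comparable embeddings across differently named or structured EHR schemas.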
[513] A Layer-wise Analysis of Supervised Fine-Tuning
Qinghua Zhao, Xueling Gong, Xinyu Chen, Zhongfeng Kang, Xinlu Li
Main category: cs.LG
TL;DR: Mid-Block Efficient Tuning selectively updates middle layers during SFT to prevent catastrophic forgetting while improving instruction-following performance.
Details
Motivation: Supervised Fine-Tuning (SFT) for alignment risks catastrophic forgetting, and the layer-wise emergence of instruction-following capabilities is not well understood.
Method: Comprehensive analysis using information-theoretic, geometric, and optimization metrics across model scales (1B-32B), revealing depth-dependent patterns. Proposes Mid-Block Efficient Tuning that selectively updates critical intermediate layers.
Result: Method outperforms standard LoRA by up to 10.2% on GSM8K (OLMo2-7B) with reduced parameter overhead, showing alignment is architecturally localized rather than distributed.
Conclusion: Effective alignment is localized to specific intermediate layers, enabling more efficient fine-tuning strategies that preserve model capabilities while improving instruction-following.
Abstract: While critical for alignment, Supervised Fine-Tuning (SFT) incurs the risk of catastrophic forgetting, yet the layer-wise emergence of instruction-following capabilities remains elusive. We investigate this mechanism via a comprehensive analysis utilizing information-theoretic, geometric, and optimization metrics across model scales (1B-32B). Our experiments reveal a distinct depth-dependent pattern: middle layers (20%-80%) are stable, whereas final layers exhibit high sensitivity. Leveraging this insight, we propose Mid-Block Efficient Tuning, which selectively updates these critical intermediate layers. Empirically, our method outperforms standard LoRA by up to 10.2% on GSM8K (OLMo2-7B) with reduced parameter overhead, demonstrating that effective alignment is architecturally localized rather than distributed. The code is publicly available at https://anonymous.4open.science/r/base_sft.
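The layer-selection rule behind Mid-Block Efficient Tuning, marking only the middle 20%-80% band of layers as trainable, can be sketched as below. The band boundaries are taken from the abstract; the helper name and the 32-layer example are my own.

```python
def mid_block_mask(num_layers, lo=0.2, hi=0.8):
    """Return a per-layer trainable flag covering the middle (lo, hi) band."""
    start, end = int(num_layers * lo), int(num_layers * hi)
    return [start <= i < end for i in range(num_layers)]

mask = mid_block_mask(32)   # e.g. a 32-layer 7B-class transformer
print(sum(mask), "of", len(mask), "layers trainable")
# In a torch model one would then set p.requires_grad per layer:
#   for i, block in enumerate(model.layers):
#       for p in block.parameters():
#           p.requires_grad = mask[i]
```

Freezing the sensitive final layers is what the paper credits for avoiding catastrophic forgetting while still improving instruction following.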
[514] AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow
Jiale Liu, Nanzhe Wang
Main category: cs.LG
TL;DR: AutoSurrogate: An LLM-driven multi-agent framework that enables domain scientists without ML expertise to build high-quality DL surrogates for subsurface flow problems through natural-language instructions.
Details
Motivation: Deep learning surrogates can accelerate subsurface flow simulations but require substantial ML expertise that domain scientists lack, creating a barrier to adoption. The manual, heuristic process needs automation.
Method: A large-language-model-driven multi-agent framework with four specialized agents that collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment based on user preferences.
Result: AutoSurrogate outperformed expert-designed baselines and domain-agnostic AutoML methods on a 3D geological carbon storage modeling task without manual tuning, producing deployment-ready surrogates from single natural-language sentences.
Conclusion: The framework demonstrates strong potential for practical deployment by bridging the expertise gap and enabling domain scientists to build high-quality DL surrogates through natural-language interaction.
Abstract: High-fidelity numerical simulation of subsurface flow is computationally intensive, especially for many-query tasks such as uncertainty quantification and data assimilation. Deep learning (DL) surrogates can significantly accelerate forward simulations, yet constructing them requires substantial machine learning (ML) expertise - from architecture design to hyperparameter tuning - that most domain scientists do not possess. Furthermore, the process is predominantly manual and relies heavily on heuristic choices. This expertise gap remains a key barrier to the broader adoption of DL surrogate techniques. For this reason, we present AutoSurrogate, a large-language-model-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short of targets. In our setting, a single natural-language sentence can be sufficient to produce a deployment-ready surrogate model, with minimum human intervention required at any intermediate stage. We demonstrate the utility of AutoSurrogate on a 3D geological carbon storage modeling task, mapping permeability fields to pressure and CO$_2$ saturation fields over 31 timesteps. Without any manual tuning, AutoSurrogate is able to outperform expert-designed baselines and domain-agnostic AutoML methods, demonstrating strong potential for practical deployment.
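The control loop the abstract describes, trying architectures from a model zoo, restarting training with an adjusted configuration on instability, and switching architectures when accuracy misses the target, can be sketched as below. Every function name, zoo entry, and error value here is an invented placeholder, not the paper's API.

```python
def build_surrogate(data, zoo, target_error, max_switches=3):
    """Try zoo architectures in order; retry once on divergence."""
    profile = {"n_samples": len(data)}          # stand-in for data profiling
    for arch in zoo[:max_switches]:
        config = {"arch": arch, "lr": 1e-3}     # stand-in for Bayesian HPO
        for attempt in range(2):                # restart once on instability
            error, diverged = train(arch, config, data)
            if diverged:
                config["lr"] /= 10              # adjusted configuration
                continue
            if error <= target_error:
                return arch, error              # quality threshold met
            break                               # accuracy short: next arch
    return None, float("inf")

def train(arch, config, data):
    """Toy trainer: fixed errors per architecture, diverges at high lr."""
    error = {"unet3d": 0.02, "fno": 0.08}[arch]
    return error, config["lr"] > 0.005

print(build_surrogate([0] * 10, ["fno", "unet3d"], 0.05))
```

In the actual framework these decisions are made by cooperating LLM agents from a natural-language instruction; the sketch only shows the failure-handling skeleton they implement.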
[515] When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
Sandro Andric
Main category: cs.LG
TL;DR: Stronger reasoning in LLMs can make them better problem solvers but worse behavioral simulators, creating a “solver-sampler mismatch” where reasoning-enhanced models over-optimize and lose fidelity in simulating boundedly rational human behavior.
Details
Motivation: The paper challenges the common assumption that stronger reasoning always improves simulation fidelity in social, economic, and policy simulations using LLMs as agents. The authors argue that when simulating boundedly rational human behavior (not solving strategic problems), reasoning enhancement can actually reduce simulation fidelity.
Method: The authors study the solver-sampler mismatch in three multi-agent negotiation environments: ambiguous fragmented-authority trading-limits, ambiguous unified-opposition trading-limits, and grid-curtailment in emergency electricity management. They compare three reflection conditions (no reflection, bounded reflection, native reasoning) across different model families and extend to GPT-4.1 and GPT-5.2.
Result: Bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. GPT-5.2 with native reasoning ended in authority decisions in all 45 runs, while GPT-5.2 with bounded reflection recovered compromise outcomes in every environment.
Conclusion: Model capability and simulation fidelity are different objectives. Behavioral simulation should qualify models as samplers, not only as solvers. The paper provides a methodological warning about the potential mismatch between reasoning enhancement and simulation fidelity.
Abstract: Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions, no reflection, bounded reflection, and native reasoning, across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.
[516] Polynomial Expansion Rank Adaptation: Enhancing Low-Rank Fine-Tuning with High-Order Interactions
Wenhao Zhang, Lin Mu, Li Ni, Peiquan Jin, Yiwen Zhang
Main category: cs.LG
TL;DR: PERA introduces polynomial expansion into low-rank adaptation for LLMs, enabling modeling of nonlinear parameter interactions without increasing rank or inference cost.
Details
Motivation: Standard LoRA has linear structure that limits expressive capacity, capturing only first-order dependencies between low-rank factors and restricting modeling of nonlinear and higher-order parameter interactions.
Method: PERA introduces structured polynomial expansion directly into low-rank factor space, expanding each low-rank factor to synthesize high-order interaction terms before composition, transforming adaptation space into polynomial manifold.
Result: PERA consistently outperforms state-of-the-art methods across diverse benchmarks, with experiments showing high-order nonlinear components (particularly square terms) are crucial for enhanced expressive capacity and robust performance.
Conclusion: PERA offers enhanced expressive capacity and more effective feature utilization compared to existing linear adaptation approaches, providing a more powerful yet efficient fine-tuning method for LLMs.
Abstract: Low-rank adaptation (LoRA) is a widely used strategy for efficient fine-tuning of large language models (LLMs), but its strictly linear structure fundamentally limits expressive capacity. The bilinear formulation of weight updates captures only first-order dependencies between low-rank factors, restricting the modeling of nonlinear and higher-order parameter interactions. In this paper, we propose Polynomial Expansion Rank Adaptation (PERA), a novel method that introduces structured polynomial expansion directly into the low-rank factor space. By expanding each low-rank factor to synthesize high-order interaction terms before composition, PERA transforms the adaptation space into a polynomial manifold capable of modeling richer nonlinear coupling without increasing rank or inference cost. We provide theoretical analysis demonstrating that PERA offers enhanced expressive capacity and more effective feature utilization compared with existing linear adaptation approaches. Empirically, PERA consistently outperforms state-of-the-art methods across diverse benchmarks. Notably, our experiments show that incorporating high-order nonlinear components, particularly square terms, is crucial for enhancing expressive capacity and maintaining strong, robust performance under various rank settings. Our code is available at https://github.com/zhangwenhao6/PERA
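The polynomial-expansion idea can be sketched as below: the low-rank code is augmented with element-wise square terms before the up-projection, so the update is nonlinear in the input while the rank stays fixed. This is a hypothetical minimal sketch of the mechanism, not the paper's implementation; shapes and initialization scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
A = rng.normal(scale=0.02, size=(r, d))      # down-projection (rank r)
B = rng.normal(scale=0.02, size=(d, 2 * r))  # up-projection over expanded code

def pera_delta(x):
    """PERA-style update: first- plus second-order terms of the code."""
    z = A @ x                                # low-rank code, shape (r,)
    z_expanded = np.concatenate([z, z ** 2]) # polynomial expansion (square terms)
    return B @ z_expanded                    # same output dim, same rank r

x = rng.normal(size=d)
out = pera_delta(x)
print(out.shape)
```

Unlike plain LoRA, where the update is linear in `x`, doubling the input here does not double the output, which is exactly the higher-order coupling the abstract credits the square terms with.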
[517] DBGL: Decay-aware Bipartite Graph Learning for Irregular Medical Time Series Classification
Jian Chen, Yuzhu Hu, Xiaoyan Yuan, Yuxuan Hu, Jinfeng Xu, Yipeng Du, Wenhao Yuan, Wei Wang, Edith C. H. Ngai
Main category: cs.LG
TL;DR: DBGL introduces a decay-aware bipartite graph learning approach for irregular medical time series that models both temporal sampling irregularity and variable decay irregularity through patient-variable bipartite graphs and node-specific temporal decay encoding.
Details
Motivation: Irregular medical time series pose challenges due to heterogeneous sampling rates, asynchronous observations, and variable gaps. Existing methods distort temporal sampling irregularity and missingness patterns while failing to capture variable decay irregularity, resulting in suboptimal representations.
Method: DBGL uses a patient-variable bipartite graph to capture irregular sampling patterns without artificial alignment and models variable relationships for temporal sampling irregularity. It also introduces node-specific temporal decay encoding to capture each variable's decay rates based on sampling intervals.
Result: DBGL outperforms all baselines on four publicly available datasets, demonstrating superior performance in handling irregular medical time series.
Conclusion: DBGL effectively addresses the limitations of existing methods for irregular medical time series by simultaneously modeling temporal sampling irregularity and variable decay irregularity through a novel bipartite graph learning approach with decay encoding.
Abstract: Irregular medical time series play a critical role in the clinical domain, helping clinicians better understand a patient's condition. However, inherent irregularity arising from heterogeneous sampling rates, asynchronous observations, and variable gaps poses key challenges for reliable modeling. Existing methods often distort temporal sampling irregularity and missingness patterns while failing to capture variable decay irregularity, resulting in suboptimal representations. To address these limitations, we introduce DBGL, Decay-Aware Bipartite Graph Learning for Irregular Medical Time Series. DBGL first introduces a patient-variable bipartite graph that simultaneously captures irregular sampling patterns without artificial alignment and adaptively models variable relationships for temporal sampling irregularity modeling, enhancing representation learning. To model variable decay irregularity, DBGL designs a novel node-specific temporal decay encoding mechanism that captures each variable's decay rates based on sampling interval, yielding a more accurate and faithful representation of irregular temporal dynamics. We evaluate the performance of DBGL on four publicly available datasets, and the results show that DBGL outperforms all baselines.
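The node-specific decay idea can be illustrated with a toy exponential weighting: each variable (graph node) carries its own rate, and an observation's influence decays with the time since it was sampled. The variables and rates below are hypothetical illustrations (in DBGL the rates are learned), not values from the paper.

```python
import numpy as np

# Per-variable decay rates (per hour); fast vitals vs. slow lab values.
decay_rates = {"heart_rate": 0.5, "creatinine": 0.05}

def decay_weight(variable, hours_since_obs):
    """Exponentially down-weight stale observations at a node-specific rate."""
    return float(np.exp(-decay_rates[variable] * hours_since_obs))

# Six hours after sampling, a fast-changing vital is nearly uninformative
# while a slow-moving lab value is still trustworthy.
print(decay_weight("heart_rate", 6.0))   # ~0.05
print(decay_weight("creatinine", 6.0))   # ~0.74
```

A single global decay rate would treat both observations identically; the per-node rate is what lets the model respect each variable's own dynamics.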
[518] Disposition Distillation at Small Scale: A Three-Arc Negative Result
Hari Sadasivan
Main category: cs.LG
TL;DR: A study attempting to train behavioral dispositions into small language models failed to find any effective method that improves dispositions without damaging content quality or causing stylistic mimicry, despite initial promising results that were later falsified.
Details
Motivation: The researchers aimed to train small language models (0.6B to 2.3B parameters) to exhibit specific behavioral dispositions like self-verification, uncertainty acknowledgment, and feedback integration through distillation techniques.
Method: Used a four-stage all-MIT distillation pipeline, followed by experiments with inference-time attention-head interventions and a frozen-base confidence-gated sidecar. Tested across three model families (Qwen3, Gemma 4, SmolLM2) and two domains using SFT/DPO LoRA, attention-head tempering, and training-free sidecar approaches.
Result: Initial reported gains (+33.9-point MCAS, +15.3-point HumanEval) were falsified - the HumanEval improvement was a truncation artifact that inverted to -8.0 points, and MCAS gains disappeared under proper scoring. No method successfully improved dispositions without damaging content or causing stylistic mimicry across five tested models.
Conclusion: The study presents a negative result showing that training behavioral dispositions into small language models is challenging, with methods either damaging content quality or causing mimicry. The paper contributes a falsification pipeline, taxonomy for probe failures, and reveals Gemma 4 E2B’s confidence-correctness decoupling.
Abstract: We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).
[519] Subcritical Signal Propagation at Initialization in Normalization-Free Transformers
Sergey Alekseev
Main category: cs.LG
TL;DR: The paper analyzes signal propagation in transformers at initialization using APJN theory, extending it to bidirectional attention and showing how attention modifies gradient amplification behavior in deep networks.
Details
Motivation: To understand how signal propagates through transformer architectures at initialization, particularly how attention mechanisms affect gradient amplification and training stability in deep networks.
Method: Extends APJN (averaged partial Jacobian norm) analysis to transformers with bidirectional attention and permutation-symmetric inputs by deriving recurrence relations for activation statistics and APJNs across layers.
Result: Theory predicts attention modifies APJN asymptotic behavior at large depth, matches measurements in deep vision transformers, and shows pre-LayerNorm has power-law APJN growth while tanh-like nonlinearities cause stretched-exponential growth (subcritical).
Conclusion: Criticality picture from residual networks extends to transformers; architectures with tanh-like nonlinearities (DyT, Derf) are more sensitive to initialization and require careful tuning for stable training.
Abstract: We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
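The quantity at the center of the paper, a Jacobian norm measured across layers at initialization, can be made concrete on a toy network. A minimal numpy sketch (a plain tanh MLP stands in for the paper's transformer setting; `width`, `depth`, and the finite-difference estimator are illustrative choices, not the authors' construction):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 32, 12

# Random weights at initialization (variance ~ 1/width).
Ws = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]

def forward(x):
    h = x
    for W in Ws:
        h = np.tanh(W @ h)
    return h

def jacobian_fro_norm(x, eps=1e-5):
    """Finite-difference estimate of ||d h_L / d h_0||_F at initialization."""
    base = forward(x)
    J = np.empty((base.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (forward(xp) - base) / eps
    return float(np.linalg.norm(J))

x = rng.normal(size=width)
print(jacobian_fro_norm(x))
```

Sweeping `depth` and watching how this norm grows or decays is the kind of measurement the APJN formalizes; in the paper's terminology, stretched-exponential decay with depth signals a subcritical architecture.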
[520] Thermodynamic Liquid Manifold Networks: Physics-Bounded Deep Learning for Solar Forecasting in Autonomous Off-Grid Microgrids
Mohammed Ezzaldin Babiker Abdullah
Main category: cs.LG
TL;DR: A thermodynamic-aware solar forecasting model for autonomous PV systems that eliminates nocturnal power prediction errors and reduces phase lags during cloud transients by enforcing celestial geometry constraints.
Details
Motivation: Current deep learning models for solar forecasting exhibit critical anomalies like temporal phase lags during cloud transients and physically impossible nocturnal power generation, creating divergence between data-driven modeling and deterministic celestial mechanics.
Method: Introduces Thermodynamic Liquid Manifold Network that projects 22 meteorological/geometric variables into Koopman-linearized Riemannian manifold, integrates Spectral Calibration unit and multiplicative Thermodynamic Alpha-Gate to synthesize real-time atmospheric opacity with theoretical clear-sky boundary models.
Result: Achieves RMSE of 18.31 Wh/m² and Pearson correlation of 0.988 over 5-year testing, maintains zero-magnitude nocturnal error across all 1826 testing days, and exhibits sub-30-minute phase response during high-frequency optical transients with only 63,458 parameters.
Conclusion: Establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers by structurally enforcing celestial geometry compliance while maintaining lightweight design.
Abstract: The stable operation of autonomous off-grid photovoltaic systems requires solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The methodology projects 22 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m² and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency optical transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.
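Why a multiplicative gate against a clear-sky bound structurally eliminates phantom nocturnal generation can be shown in miniature. In the sketch below, the toy irradiance profile and variable names are invented, and the gate is hand-coded rather than learned as in the paper; only the gating principle is taken from the abstract:

```python
import numpy as np

# Hypothetical hourly clear-sky irradiance bound for one day:
# zero at night, a positive half-sine during daylight.
hours = np.arange(24)
clear_sky = np.maximum(0.0, 800.0 * np.sin(np.pi * (hours - 6) / 12))

# A raw network output with a constant bias -> phantom night generation.
raw = clear_sky * 0.7 + 20.0

# Multiplicative gate: prediction = alpha * clear_sky with alpha in [0, 1],
# so wherever the clear-sky bound is zero, the prediction is exactly zero.
alpha = np.where(clear_sky > 0,
                 np.clip(raw / np.maximum(clear_sky, 1e-9), 0.0, 1.0),
                 0.0)
pred = alpha * clear_sky

night = clear_sky == 0
print(pred[night])  # structurally zero: no nocturnal power is possible
```

The zero-magnitude nocturnal error reported in the paper follows by construction from this kind of gate, not from the network learning to output zero at night.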
[521] How Transformers Learn to Plan via Multi-Token Prediction
Jianhao Huang, Zhanpeng Zhou, Renqiu Xia, Baharan Mirzasoleiman, Weijie Su, Wei Huang
Main category: cs.LG
TL;DR: Multi-token prediction (MTP) outperforms next-token prediction (NTP) on reasoning tasks by enabling a two-stage reverse reasoning process that first attends to goals then reconstructs paths backward, due to gradient decoupling providing cleaner training signals.
Details
Motivation: Next-token prediction (NTP) struggles to capture global structure in reasoning tasks, while multi-token prediction (MTP) shows promise but its mechanisms are poorly understood. The paper aims to study how MTP facilitates reasoning, particularly planning.
Method: Empirical evaluation on synthetic graph path-finding tasks and realistic reasoning benchmarks (Countdown, boolean satisfiability). Theoretical analysis of a simplified two-layer Transformer on a star graph task, proving MTP induces a two-stage reverse reasoning process.
Result: MTP consistently outperforms NTP on reasoning tasks. Theoretically, MTP induces a two-stage process: first attends to end node, then reconstructs path by tracing intermediate nodes backward. This arises from gradient decoupling property providing cleaner training signals.
Conclusion: Multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits, offering advantages over next-token prediction for planning and structured reasoning tasks.
Abstract: While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.
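The objective difference between NTP and MTP is easy to state in code. A hedged numpy sketch of the MTP loss bookkeeping (random logits stand in for k prediction heads; this illustrates only the objective, not the paper's model or training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, k = 50, 16, 4  # k = number of future tokens predicted per position

tokens = rng.integers(0, vocab, size=seq_len)
# Hypothetical per-position, per-offset logits, one slice per prediction head.
logits = rng.normal(size=(seq_len, k, vocab))

def cross_entropy(logit_row, target):
    z = logit_row - logit_row.max()  # stabilized log-softmax
    return float(-(z[target] - np.log(np.exp(z).sum())))

# MTP loss: average next-i-token loss over offsets i = 1..k.
# Setting k = 1 recovers standard next-token prediction.
losses = []
for t in range(seq_len):
    for i in range(1, k + 1):
        if t + i < seq_len:
            losses.append(cross_entropy(logits[t, i - 1], tokens[t + i]))
mtp_loss = float(np.mean(losses))
print(mtp_loss)
```

The gradient-decoupling argument in the paper concerns how the k per-offset terms contribute separate, cleaner training signals than the single NTP term.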
[522] Can AI Detect Life? Lessons from Artificial Life
Ankit Gupta, Christoph Adami
Main category: cs.LG
TL;DR: AI methods for extraterrestrial life detection are prone to false positives due to out-of-distribution samples fooling ML models, even when samples are non-living.
Details
Motivation: To demonstrate that modern machine learning methods for detecting life in extraterrestrial samples are vulnerable to false positives when dealing with out-of-distribution samples that differ from terrestrial training data.
Method: Using Artificial Life to test ML models’ ability to distinguish biotic from abiotic samples, showing they can be fooled into detecting life with high confidence even in non-living samples due to out-of-distribution issues.
Result: ML methods are easily fooled into detecting life with near 100% confidence even when analyzing samples incapable of life, due to their susceptibility to out-of-distribution samples that differ from terrestrial training data.
Conclusion: AI methods for extraterrestrial life detection are unreliable and bound to yield significant false positives because extraterrestrial samples are likely out of the distribution of terrestrial training data, making ML models vulnerable to deception.
Abstract: Modern machine learning methods have been proposed to detect life in extraterrestrial samples, drawing on their ability to distinguish biotic from abiotic samples based on training models using natural and synthetic organic molecular mixtures. Here we show using Artificial Life that such methods are easily fooled into detecting life with near 100% confidence even if the analyzed sample is not capable of life. This is due to modern machine learning methods’ propensity to be easily fooled by out-of-distribution samples. Because extra-terrestrial samples are very likely out of the distribution provided by terrestrial biotic and abiotic samples, using AI methods for life detection is bound to yield significant false positives.
[523] INTARG: Informed Real-Time Adversarial Attack Generation for Time-Series Regression
Gamze Kirman Tokgoz, Onat Gungor, Tajana Rosing, Baris Aksanli
Main category: cs.LG
TL;DR: Proposes an adversarial attack framework for time-series forecasting that selectively targets high-confidence predictions to maximize error with minimal attacks.
Details
Motivation: Deep learning models for time-series forecasting are vulnerable to adversarial attacks, but existing methods are impractical for time-series settings where storing complete historical data or attacking every time step is infeasible.
Method: Develops an adversarial attack framework for online bounded-buffer settings using an informed and selective attack strategy that targets time steps where the model shows high confidence and expected prediction error is maximal.
Result: The framework increases prediction error up to 2.42x while performing attacks in fewer than 10% of time steps, demonstrating substantial effectiveness with minimal intervention.
Conclusion: The proposed selective attack strategy provides a practical and effective approach for adversarial attacks in time-series forecasting, achieving significant error increases with very few targeted attacks.
Abstract: Time-series forecasting aims to predict future values by modeling temporal dependencies in historical observations. It is a critical component of many real-world systems, where accurate forecasts improve operational efficiency and help mitigate uncertainty and risk. More recently, machine learning (ML), and especially deep learning (DL)-based models, have gained widespread adoption for time-series forecasting, but they remain vulnerable to adversarial attacks. However, many state-of-the-art attack methods are not directly applicable in time-series settings, where storing complete historical data or performing attacks at every time step is often impractical. This paper proposes an adversarial attack framework for time-series forecasting under an online bounded-buffer setting, leveraging an informed and selective attack strategy. By selectively targeting time steps where the model exhibits high confidence and the expected prediction error is maximal, our framework produces fewer but substantially more effective attacks. Experiments show that our framework can increase the prediction error up to 2.42x, while performing attacks in fewer than 10% of time steps.
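The selective triggering rule can be illustrated numerically. In this sketch the confidence and expected-error signals are random stand-ins (the paper's actual estimators are not specified here); the point is only that requiring both quantities to be jointly extreme keeps the attack budget small:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
confidence = rng.uniform(size=T)      # per-step model confidence (stand-in)
expected_error = rng.uniform(size=T)  # estimated attack impact (stand-in)

# Attack only where confidence and expected impact are both in the top decile.
conf_thr = np.quantile(confidence, 0.90)
err_thr = np.quantile(expected_error, 0.90)
attack_mask = (confidence >= conf_thr) & (expected_error >= err_thr)

budget = attack_mask.mean()
print(f"attacked {budget:.1%} of time steps")
```

With two independent top-decile conditions, roughly 1% of steps qualify, comfortably inside the sub-10% attack rate the paper reports.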
[524] Fast and principled equation discovery from chaos to climate
Yuzheng Zhang, Weizhen Li, Rui Carvalho
Main category: cs.LG
TL;DR: Bayesian-ARGOS: A hybrid framework combining rapid frequentist screening with focused Bayesian inference for automated equation discovery from noisy, limited observations with principled uncertainty quantification.
Details
Motivation: Existing library-based sparse regression methods for equation discovery force compromises between automation, statistical rigor, and computational efficiency, creating a need for a framework that reconciles these demands.
Method: Hybrid approach combining rapid frequentist screening for candidate selection with focused Bayesian inference for uncertainty quantification, enabling automated equation discovery at reduced computational cost.
Result: Outperforms SINDy in data efficiency for all systems and noise tolerance for 6/7 chaotic systems, with 100x computational cost reduction compared to bootstrap-based ARGOS; improves yield of valid latent equations in climate applications.
Conclusion: Provides a principled, automated, and computationally efficient framework for equation discovery from scarce, noisy observations, applicable from chaotic systems to latent dynamics in climate patterns.
Abstract: Our ability to predict, control, and ultimately understand complex systems rests on discovering the equations that govern their dynamics. Identifying these equations directly from noisy, limited observations has therefore become a central challenge in data-driven science, yet existing library-based sparse regression methods force a compromise between automation, statistical rigor, and computational efficiency. Here we develop Bayesian-ARGOS, a hybrid framework that reconciles these demands by combining rapid frequentist screening with focused Bayesian inference, enabling automated equation discovery with principled uncertainty quantification at a fraction of the computational cost of existing methods. Tested on seven chaotic systems under varying data scarcity and noise levels, Bayesian-ARGOS outperforms two state-of-the-art methods in most scenarios. It surpasses SINDy in data efficiency for all systems and noise tolerance for six out of the seven, with a two-order-of-magnitude reduction in computational cost compared to bootstrap-based ARGOS. The probabilistic formulation additionally enables a suite of standard statistical diagnostics, including influence analysis and multicollinearity detection that expose failure modes otherwise opaque. When integrated with representation learning (SINDy-SHRED) for high dimensional sea surface temperature reconstruction, Bayesian-ARGOS increases the yield of valid latent equations with significantly improved long horizon stability. Bayesian-ARGOS thus provides a principled, automated, and computationally efficient route from scarce and noisy observations to interpretable governing equations, offering a practical framework for equation discovery across scales, from benchmark chaotic systems to the latent dynamics underlying global climate patterns.
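The screen-then-refit structure can be sketched with a conjugate Gaussian model. This is a generic illustration of the two-stage idea, not Bayesian-ARGOS itself; the screening rule, priors, and noise levels below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))          # candidate library terms
true_w = np.zeros(p)
true_w[2] = 1.5                      # sparse ground truth: one active term
y = X @ true_w + 0.1 * rng.normal(size=n)

# Stage 1 (fast frequentist screening): keep terms strongly correlated with y.
corr = np.abs(X.T @ y) / n
keep = corr > 0.5 * corr.max()

# Stage 2 (focused Bayesian inference on the screened library): conjugate
# Gaussian posterior w | y ~ N(mu, Sigma) with prior w ~ N(0, tau2 I).
Xs = X[:, keep]
tau2, sigma2 = 10.0, 0.01
Sigma = np.linalg.inv(Xs.T @ Xs / sigma2 + np.eye(keep.sum()) / tau2)
mu = Sigma @ Xs.T @ y / sigma2
print(mu, np.sqrt(np.diag(Sigma)))   # posterior mean and per-term uncertainty
```

The posterior covariance is what enables the uncertainty quantification and diagnostics (influence, multicollinearity) the abstract emphasizes, while the cheap screening stage keeps the Bayesian refit small.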
[525] A unified data format for managing diabetes time-series data: DIAbetes eXchange (DIAX)
Elliott C. Pryor, Marc D. Breton, Anas El Fathi
Main category: cs.LG
TL;DR: DIAX is a standardized JSON-based format for unifying diabetes time-series data from various devices to address interoperability issues in research and machine learning applications.
Details
Motivation: Inconsistent data formats across different diabetes devices (CGM, Smart Insulin Pens, Automated Insulin Delivery systems) hinder data sharing, integration, and analysis in research and machine learning applications.
Method: Developed DIAX, a standardized JSON-based format for diabetes time-series data, with an open-source repository providing tools for dataset conversion, cross-format compatibility, visualization, and community contributions.
Result: DIAX is compatible with other standardization efforts and supports major diabetes datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient-hours of data.
Conclusion: DIAX serves as a translational resource that promotes interoperability, reproducibility, and extensibility for diabetes data analysis, particularly benefiting machine learning applications without imposing data-sharing constraints.
Abstract: Diabetes devices, including Continuous Glucose Monitoring (CGM), Smart Insulin Pens, and Automated Insulin Delivery systems, generate rich time-series data widely used in research and machine learning. However, inconsistent data formats across sources hinder sharing, integration, and analysis. We present DIAX (DIAbetes eXchange), a standardized JSON-based format for unifying diabetes time-series data, including CGM, insulin, and meal signals. DIAX promotes interoperability, reproducibility, and extensibility, particularly for machine learning applications. An open-source repository provides tools for dataset conversion, cross-format compatibility, visualization, and community contributions. DIAX is a translational resource, not a data host, ensuring flexibility without imposing data-sharing constraints. Currently, DIAX is compatible with other standardization efforts and supports major datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient-hours of data. https://github.com/Center-for-Diabetes-Technology/DIAX
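As an illustration of what a unified JSON time-series record might look like, here is a small round-trip example. The field names are invented for the sketch; the actual DIAX schema is defined in the linked repository:

```python
import json

# Illustrative record in the spirit of a unified diabetes time-series format.
# All field names here are hypothetical, not the DIAX specification.
record = {
    "subject_id": "demo-001",
    "signals": {
        "cgm": [
            {"time": "2024-01-01T08:00:00Z", "glucose_mgdl": 112},
            {"time": "2024-01-01T08:05:00Z", "glucose_mgdl": 118},
        ],
        "insulin": [
            {"time": "2024-01-01T08:02:00Z", "bolus_units": 3.5},
        ],
        "meals": [
            {"time": "2024-01-01T08:00:00Z", "carbs_g": 45},
        ],
    },
}

# Round-trip through JSON: one serialization shared by every source device.
restored = json.loads(json.dumps(record))
print(restored["signals"]["cgm"][0]["glucose_mgdl"])
```

The practical payoff of such a format is that converters map each device's export into one shape, so downstream ML pipelines are written once.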
[526] ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
Alan Aboudib, Rodrigo Lopez Portillo A., Kalei Brady, Steffen Cruz
Main category: cs.LG
TL;DR: Residual Bottleneck Model (ResBM) enables decentralized pipeline parallelism with 128x activation compression for transformer training in low-bandwidth environments.
Details
Motivation: Current decentralized training methods struggle with pipeline parallelism due to high bandwidth requirements, while existing compression methods like Subspace Models are complex and not truly end-to-end trainable.
Method: Introduces ResBM architecture with residual encoder-decoder bottleneck modules across pipeline boundaries that maintain low-rank identity paths, enabling end-to-end training as part of model parameters.
Result: Achieves state-of-the-art 128x activation compression without significant convergence rate loss, memory overhead, or compute overhead.
Conclusion: ResBM provides a practical solution for decentralized pipeline parallelism in low-bandwidth environments while maintaining training efficiency.
Abstract: Unlocking large-scale low-bandwidth decentralized training has the potential to utilize otherwise untapped compute resources. In centralized settings, large-scale multi-node training is primarily enabled by data and pipeline parallelism, two techniques that require ultra-high-bandwidth communication. While efficient methods now exist for decentralized data parallelism, pipeline parallelism remains the primary challenge. Recent efforts, such as Subspace Models (SM), have claimed up to 100x activation compression but rely on complex constrained optimization and diverge from true end-to-end training. In this paper, we propose a different approach, based on an architecture designed from the ground up to be native to low-bandwidth communication environments while still applicable to any standard transformer-based architecture. We call this architecture the Residual Bottleneck Model or ResBM, it introduces a residual encoder-decoder bottleneck module across pipeline boundaries that can be trained end-to-end as part of the model’s parameters while preserving an explicit low-rank identity path. We show that ResBMs achieve state-of-the-art 128x activation compression without significant loss in convergence rates and without significant memory or compute overhead.
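The compression arithmetic is straightforward to sketch: only an r-dimensional code crosses the pipeline boundary instead of the full d-dimensional activation, giving d/r-fold compression. The encoder/decoder wiring below is invented for illustration; the paper's module additionally preserves an explicit low-rank identity path and is trained end-to-end with the model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8  # hidden width and bottleneck rank: d / r = 128x compression

enc = rng.normal(0.0, 1.0 / np.sqrt(d), (r, d))  # encoder before the pipeline cut
dec = rng.normal(0.0, 1.0 / np.sqrt(r), (d, r))  # decoder on the receiving stage

def boundary(h):
    code = enc @ h     # only r floats cross the low-bandwidth link instead of d
    return dec @ code  # activation reconstructed on the next pipeline stage

h = rng.normal(size=d)
out = boundary(h)
print(d // r, out.shape)
```

Because `enc` and `dec` are ordinary parameters, gradients flow through the bottleneck like any other layer, which is what distinguishes this design from constrained-optimization compression schemes.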
[527] Robust Federated Inference
Akash Dhasade, Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Maxime Jacovella, Anne-Marie Kermarrec, Rafael Pinot
Main category: cs.LG
TL;DR: Summary unavailable: the arXiv API request for paper 2510.00310 returned HTTP 429 (rate limited), so no abstract, motivation, method, results, or conclusion could be retrieved for this entry.
[528] Active Imitation Learning for Thermal- and Kernel-Aware LFM Inference on 3D S-NUCA Many-Cores
Yixian Shen, Chaoyao Shen, Jan Deen, George Floros, Andy Pimentel, Anuj Pathania
Main category: cs.LG
TL;DR: AILFM: Active Imitation Learning framework for thermal-aware scheduling of Large Foundation Models on 3D S-NUCA CPU systems, addressing thermal challenges and cache latency issues through learned scheduling policies.
Details
Motivation: The high cost and limited availability of GPUs for LFM inference motivates using high-performance CPUs, particularly 3D S-NUCA systems. However, these systems face severe thermal challenges and uneven cache latencies due to 3D NoC, requiring optimal thread migration and voltage/frequency scaling management that existing approaches lack.
Method: Proposes AILFM, an Active Imitation Learning-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal runtime overhead. It accounts for core-level performance heterogeneity and kernel-specific behavior in LFMs.
Result: Extensive experiments show AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads, maintaining thermal safety while maximizing performance.
Conclusion: AILFM provides an effective solution for thermal-aware scheduling of LFMs on 3D S-NUCA CPU systems, addressing the limitations of existing approaches through learned policies that handle system heterogeneity and workload diversity.
Abstract: Large Foundation Model (LFM) inference is both memory- and compute-intensive, traditionally relying on GPUs. However, the limited availability and high cost have motivated the adoption of high-performance general-purpose CPUs, especially emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. These architectures offer enhanced bandwidth and locality but suffer from severe thermal challenges and uneven cache latencies due to 3D Networks-on-Chip (NoC). Optimal management of thread migration and V/f scaling is non-trivial due to LFM kernel diversity and system heterogeneity. Existing thermal management approaches often rely on oversimplified analytical models and lack adaptability. We propose AILFM, an Active Imitation Learning (AIL)-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal run-time overhead. AILFM accounts for both core-level performance heterogeneity and kernel-specific behavior in LFMs to maintain thermal safety while maximizing performance. Extensive experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads.
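The active-imitation idea can be caricatured with a DAgger-style loop: roll out the learner, label the states it actually visits with the oracle, aggregate, and refit. Everything below — the thermal dynamics, the oracle's 80 °C rule, and the one-threshold learner — is a toy stand-in, not AILFM:

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle(temp):
    """Stand-in oracle policy: throttle V/f once the core runs hot."""
    return 0 if temp > 80.0 else 1  # 0 = low V/f, 1 = high V/f

def thermal_step(temp, action):
    """Toy thermal dynamics: heating under load, cooling when throttled."""
    return temp + (4.0 if action == 1 else -6.0) + rng.normal(0.0, 0.5)

# DAgger-style active imitation: query the oracle on visited states,
# aggregate the demonstrations, and refit the learner each round.
states, labels = [], []
threshold = 100.0  # deliberately bad initial learner (throttles far too late)
for _ in range(5):  # aggregation rounds
    temp = 50.0
    for _ in range(100):
        action = 0 if temp > threshold else 1
        states.append(temp)
        labels.append(oracle(temp))
        temp = thermal_step(temp, action)
    s, l = np.array(states), np.array(labels)
    if (l == 0).any() and (l == 1).any():
        # Refit: midpoint between the hottest "run fast" and the coolest
        # "throttle" states seen so far.
        threshold = (s[l == 1].max() + s[l == 0].min()) / 2.0
print(round(threshold, 1))
```

The learner's threshold converges toward the oracle's 80 °C rule precisely because labels are collected on the learner's own trajectories, which is the distribution-matching advantage of active imitation over fixed-dataset behavioral cloning.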
[529] The Linear Centroids Hypothesis: How Deep Network Features Represent Data
Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
Main category: cs.LG
TL;DR: The Linear Centroids Hypothesis (LCH) proposes using centroids (vector summaries of DN behavior in local input regions) as a new framework for feature identification, addressing limitations of the Linear Representation Hypothesis (LRH).
Details
Motivation: The Linear Representation Hypothesis (LRH) has limitations: it abstracts away from individual network components, identifies spurious features, and cannot be applied across sub-components like multiple layers. A new framework is needed for more robust feature identification.
Method: Introduces the Linear Centroids Hypothesis (LCH) where features correspond to linear directions of centroids. Existing LRH tools (like sparse autoencoders) can be applied to centroids rather than latent activations. Also enables novel interpretability approaches like circuit identification.
Result: Applying LCH to DINO vision transformers yields sparser feature dictionaries that perform better on downstream tasks. LCH can identify circuits in GPT2-Large, demonstrating practical utility.
Conclusion: LCH provides a more robust framework for feature identification in deep networks, addressing LRH limitations while enabling both reuse of existing tools and novel interpretability approaches.
Abstract: Identifying and understanding the features that a deep network (DN) extracts from its inputs to produce its outputs is a focal point of interpretability research. The Linear Representation Hypothesis (LRH) identifies features in terms of the linear directions formed by the inputs in a DN’s latent space. However, the LRH is limited as it abstracts away from individual components (e.g., neurons and layers), is susceptible to identifying spurious features, and cannot be applied across sub-components (e.g., multiple layers). In this paper, we introduce the Linear Centroids Hypothesis (LCH) as a new framework for identifying the features of a DN. The LCH posits that features correspond to linear directions of centroids, which are vector summarizations of the functional behavior of a DN in a local region of its input space. Interpretability studies under the LCH can leverage existing LRH tools, such as sparse autoencoders, by applying them to the DN’s centroids rather than to its latent activations. We demonstrate that doing so yields sparser feature dictionaries for DINO vision transformers, which also perform better on downstream tasks. The LCH also inspires novel approaches to interpretability; for example, LCH can readily identify circuits in GPT2-Large. For code to study the LCH https://github.com/ThomasWalker1/LinearCentroidsHypothesis .
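One plausible reading of a centroid — a vector summary of functional behavior in a local input region — can be sketched by averaging a network's outputs over a neighborhood of an anchor input. The paper's exact construction may differ; this is an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(0.0, 0.5, (4, 16))
W2 = rng.normal(0.0, 0.5, (16, 4))

def net(x):
    """Toy stand-in for a deep network's input-output map."""
    return np.tanh(x @ W1) @ W2

def centroid(anchor, n=256, radius=0.1):
    """Average the network's outputs over samples near an anchor point:
    one possible vector summary of local functional behavior."""
    samples = anchor + radius * rng.normal(size=(n, anchor.size))
    return net(samples).mean(axis=0)

anchors = rng.normal(size=(5, 4))
centroids = np.stack([centroid(a) for a in anchors])
print(centroids.shape)  # one summary vector per local region
```

Under the LCH, dictionary-learning tools such as sparse autoencoders would then be fit to rows of `centroids` rather than to raw latent activations.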
[530] Classification of Epileptic iEEG using Topological Machine Learning
Sunia Tanweer, Narayan Puthanmadam Subramaniyam, Firas A. Khasawneh
Main category: cs.LG
TL;DR: Topological data analysis features improve epileptic seizure detection from iEEG signals, achieving up to 80% balanced accuracy for three-class classification with classical ML models performing comparably to deep learning.
Details
Motivation: Epileptic seizure detection from EEG signals is challenging due to high dimensionality and nonlinear dynamics of neural activity. The paper investigates whether topological data analysis (TDA) features can improve classification of brain states in iEEG recordings.
Method: Analyzed data from 55 epilepsy patients using multichannel iEEG recordings. Derived persistence diagrams from iEEG signals and vectorized them using TDA representations (Carlsson coordinates, persistence images, template functions). Conducted large-scale ablation study across multiple iEEG frequency bands, dimensionality reduction techniques, feature representations, and classifier architectures.
Result: Dimension-reduced topological representations achieved up to 80% balanced accuracy for three-class classification (preictal, ictal, interictal). Classical ML models performed comparably to deep learning models (up to 79.17% balanced accuracy). Pipelines preserving full multichannel feature structure exhibited severe overfitting due to high-dimensional feature space.
Conclusion: Carefully designed topological features can substantially reduce model complexity requirements for seizure detection. Structure-preserving dimensionality reduction is crucial when applying topology-based representations to multichannel neural data.
Abstract: Epileptic seizure detection from EEG signals remains challenging due to the high dimensionality and nonlinear, potentially stochastic, dynamics of neural activity. In this work, we investigate whether features derived from topological data analysis (TDA) can improve the classification of brain states in preictal, ictal and interictal iEEG recordings from epilepsy patients using multichannel data. We analyze data from 55 patients, significantly larger than many previous studies that rely on patient-specific models. Persistence diagrams derived from iEEG signals are vectorized using several TDA representations, including Carlsson coordinates, persistence images, and template functions. To understand how topological representations interact with modern machine learning pipelines, we conduct a large-scale ablation study across multiple iEEG frequency bands, dimensionality reduction techniques, feature representations, and classifier architectures. Our experiments show that dimension-reduced topological representations achieve up to 80% balanced accuracy for three-class classification. Interestingly, classical machine learning models perform comparably to deep learning models, achieving up to 79.17% balanced accuracy, suggesting that carefully designed topological features can substantially reduce model complexity requirements. In contrast, pipelines preserving the full multichannel feature structure exhibit severe overfitting due to the high-dimensional feature space. These findings highlight the importance of structure-preserving dimensionality reduction when applying topology-based representations to multichannel neural data.
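Vectorizing a persistence diagram can be sketched with simple polynomial summaries of births and lifetimes, in the spirit of Carlsson coordinates (the exact feature formulas used in the paper may differ; the diagram below is made up):

```python
import numpy as np

# A persistence diagram: (birth, death) pairs from some filtration.
diagram = np.array([[0.00, 1.20],
                    [0.10, 0.40],
                    [0.30, 0.35],
                    [0.05, 0.90]])
birth, death = diagram[:, 0], diagram[:, 1]
life = death - birth  # persistence of each topological feature

# Fixed-length polynomial summaries of a variable-size diagram.
features = np.array([
    life.sum(),            # total persistence
    (birth * life).sum(),  # birth-weighted persistence
    (life ** 2).sum(),     # emphasizes long-lived features
    life.max(),            # dominant feature
])
print(features)
```

The appeal of such vectorizations is exactly what the study exploits: diagrams of any size map to a fixed-length vector that a classical classifier can consume.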
[531] Multi-Head Residual-Gated DeepONet for Coherent Nonlinear Wave Dynamics
Zhiwei Fan, Yiming Pan, Daniel Coca
Main category: cs.LG
TL;DR: Multi-Head Residual-Gated DeepONet (MH-RG) improves neural operator learning for nonlinear wave dynamics by using parallel pathways: one for wave field state and another for compact physical descriptors that modulate predictions through residual gating mechanisms.
Details
Motivation: Traditional neural operators treat input-output mapping as black-box regression without exploiting structured physical context from compact descriptors of initial states. Current feature integration methods (concatenation, FiLM modulation) are insufficient for capturing complex nonlinear wave dynamics.
Method: Introduces MH-RG DeepONet with parallel pathways: standard DeepONet state pathway for wave field, and conditioning pathway for physical descriptors. Uses pre-branch residual modulator, branch residual gate, and trunk residual gate with low-rank multi-head mechanism to capture multiple conditioned response patterns efficiently.
Result: Achieves consistently lower error than feature-augmented baselines on benchmarks including highly nonlinear conservative wave dynamics and dissipative trapped dynamics. Better preserves phase coherence and fidelity of physically relevant dynamical quantities.
Conclusion: The MH-RG framework effectively integrates compact physical descriptors as residual modulation factors, improving neural operator performance for nonlinear wave dynamics while maintaining parameter efficiency through multi-head gating.
Abstract: Coherent nonlinear wave dynamics are often strongly shaped by a compact set of physically meaningful descriptors of the initial state. Traditional neural operators typically treat the input-output mapping as a largely black-box high-dimensional regression problem, without explicitly exploiting this structured physical context. Common feature-integration strategies usually rely on direct concatenation or FiLM-style affine modulation in hidden latent spaces. Here we introduce a different paradigm, loosely inspired by the complementary roles of state evolution and physically meaningful observables in quantum mechanics: the wave field is learned through a standard DeepONet state pathway, while compact physical descriptors follow a parallel conditioning pathway and act as residual modulation factors on the state prediction. Based on this idea, we develop a Multi-Head Residual-Gated DeepONet (MH-RG), which combines a pre-branch residual modulator, a branch residual gate, and a trunk residual gate with a low-rank multi-head mechanism to capture multiple complementary conditioned response patterns without prohibitive parameter growth. We evaluate the framework on representative benchmarks including highly nonlinear conservative wave dynamics and dissipative trapped dynamics and further perform detailed mechanistic analyses of the learned multi-head gating behavior. Compared with feature-augmented baselines, MH-RG DeepONet achieves consistently lower error while better preserving phase coherence and the fidelity of physically relevant dynamical quantities.
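The core residual-gating idea — compact physical descriptors multiplicatively modulating a state-pathway prediction through low-rank heads — can be sketched in a few lines. Shapes and wiring here are invented for illustration; the actual MH-RG module applies separate gates at the pre-branch, branch, and trunk stages:

```python
import numpy as np

rng = np.random.default_rng(0)
d, heads, rank, n_feats = 64, 4, 4, 8

base = rng.normal(size=d)        # state-pathway prediction (DeepONet output)
feats = rng.normal(size=n_feats) # compact physical descriptors of the initial state

# Low-rank multi-head gate: each head maps descriptors to a modulation of
# the state prediction; heads contribute complementary response patterns.
U = rng.normal(0.0, 0.1, (heads, rank, n_feats))
V = rng.normal(0.0, 0.1, (heads, d, rank))

modulation = sum(V[h] @ np.tanh(U[h] @ feats) for h in range(heads))
output = base * (1.0 + modulation)  # residual gating: descriptors scale the state
print(output.shape)
```

The `1 +` residual form means the descriptors perturb rather than replace the state prediction, and the rank-`rank` factorization keeps parameter count from growing with the number of heads.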
[532] Exploring Concept Subspace for Self-explainable Text-Attributed Graph Learning
Xiaoxue Han, Libo Zhang, Zining Zhu, Yue Ning
Main category: cs.LG
TL;DR: Graph Concept Bottleneck (GCB) introduces concept-based interpretability for text-attributed graphs using meaningful phrases as concepts, achieving accuracy comparable to black-box GNNs with better robustness.
Details
Motivation: Existing interpretable graph learning methods primarily rely on subgraphs as explanations, lacking more intuitive concept-based interpretations. There's a need for self-explainable graph learning that provides meaningful, phrase-level concepts as explanations while maintaining competitive performance.
Method: GCB maps graphs into a concept bottleneck subspace where each concept is a meaningful phrase. It applies the information bottleneck principle to refine the concept space, focusing on the most relevant concepts. Predictions are made based on concept activations, explicitly guiding the model toward correct decisions.
Result: GCB achieves intrinsic interpretability with accuracy comparable to black-box Graph Neural Networks. It shows better performance under distribution shifts and data perturbations, demonstrating improved robustness and generalizability through concept-guided prediction.
Conclusion: GCB provides a new paradigm for self-explainable graph learning through concept bottlenecks, offering meaningful phrase-level explanations while maintaining competitive performance and enhanced robustness.
Abstract: We introduce Graph Concept Bottleneck (GCB) as a new paradigm for self-explainable text-attributed graph learning. GCB maps graphs into a subspace, concept bottleneck, where each concept is a meaningful phrase, and predictions are made based on the activation of these concepts. Unlike existing interpretable graph learning methods that primarily rely on subgraphs as explanations, the concept bottleneck provides a new form of interpretation. To refine the concept space, we apply the information bottleneck principle to focus on the most relevant concepts. This not only yields more concise and faithful explanations but also explicitly guides the model to “think” toward the correct decision. We empirically show that GCB achieves intrinsic interpretability with accuracy on par with black-box Graph Neural Networks. Moreover, it delivers better performance under distribution shifts and data perturbations, showing improved robustness and generalizability, benefitting from concept-guided prediction.
[533] Offline-Online Reinforcement Learning for Linear Mixture MDPs
Zhongjun Zhang, Sean R. Sinclair
Main category: cs.LG
TL;DR: Offline-online RL in linear mixture MDPs with environment shift, adaptively leveraging offline data when informative and ignoring it when not
Details
Motivation: Address the challenge of reinforcement learning when offline data may come from a mismatched environment (environment shift), and develop algorithms that can adaptively leverage offline data only when it's beneficial.
Method: Propose an algorithm for linear mixture MDPs that adaptively uses offline data based on its informativeness, with theoretical guarantees on when offline data improves performance.
Result: Established regret upper bounds characterizing when offline data is beneficial, with nearly matching lower bounds, and numerical experiments support theoretical findings
Conclusion: The algorithm safely leverages offline data when informative (due to sufficient coverage or small environment shift) and ignores it when uninformative, providing adaptive improvement over purely online learning
Abstract: We study offline-online reinforcement learning in linear mixture Markov decision processes (MDPs) under environment shift. In the offline phase, data are collected by an unknown behavior policy and may come from a mismatched environment, while in the online phase the learner interacts with the target environment. We propose an algorithm that adaptively leverages offline data. When the offline data are informative, either due to sufficient coverage or small environment shift, the algorithm provably improves over purely online learning. When the offline data are uninformative, it safely ignores them and matches the online-only performance. We establish regret upper bounds that explicitly characterize when offline data are beneficial, together with nearly matching lower bounds. Numerical experiments further corroborate our theoretical findings.
[534] Loss-Driven Bayesian Active Learning
Zhuoyue Huang, Freddie Bickford Smith, Tom Rainforth
Main category: cs.LG
TL;DR: A loss-driven Bayesian active learning framework that customizes data acquisition to specific downstream decision problems and losses, with analytic computation for weighted Bregman divergence losses.
Details
Motivation: Existing active learning approaches have limited flexibility in customizing data acquisition to different downstream problems and losses. The paper aims to develop a rigorous loss-driven approach that directly targets the loss associated with a given decision problem during data acquisition.
Method: Proposes a Bayesian active learning framework where any loss function can be used to derive a unique objective for optimal data acquisition. Shows that for losses taking the form of weighted Bregman divergences, a central component of the corresponding objective can be computed analytically, making the approach practical.
Result: In regression and classification experiments with various losses, the approach reduces test losses relative to existing active learning techniques.
Conclusion: The proposed loss-driven Bayesian active learning framework provides a flexible and rigorous approach to customize data acquisition for specific downstream decision problems and losses, with practical applicability for weighted Bregman divergence losses.
Abstract: The central goal of active learning is to gather data that maximises downstream predictive performance, but popular approaches have limited flexibility in customising this data acquisition to different downstream problems and losses. We propose a rigorous loss-driven approach to Bayesian active learning that allows data acquisition to directly target the loss associated with a given decision problem. In particular, we show how any loss can be used to derive a unique objective for optimal data acquisition. Critically, we then show that any loss taking the form of a weighted Bregman divergence permits analytic computation of a central component of its corresponding objective, making the approach applicable in practice. In regression and classification experiments with a range of different losses, we find our approach reduces test losses relative to existing techniques.
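A Bregman divergence is determined by a convex generator φ via D_φ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩; a small sketch showing how both squared error and KL arise from this one formula (the per-point weighting in the paper's weighted variant is omitted here):

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)

# Squared Euclidean distance arises from the generator phi(x) = ||x||^2.
sq = lambda x: float(x @ x)
sq_grad = lambda x: 2.0 * x

# KL divergence (on the simplex) arises from negative entropy phi(p) = sum p log p.
negent = lambda p: float(np.sum(p * np.log(p)))
negent_grad = lambda p: np.log(p) + 1.0

p = np.array([0.7, 0.3])
q = np.array([0.5, 0.5])
d_sq = bregman(sq, sq_grad, p, q)          # equals ||p - q||^2
d_kl = bregman(negent, negent_grad, p, q)  # equals KL(p || q)
```

The analytic tractability claimed in the paper hinges on exactly this structure: swapping the generator changes the target loss without changing the acquisition machinery.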
[535] BayMOTH: Bayesian optiMizatiOn with meTa-lookahead – a simple approacH
Rahman Ejaz, Varchas Gopalaswamy, Ricardo Luna, Aarne Lees, Vineet Gundecha, Christopher Kanan, Soumyendu Sarkar, Riccardo Betti
Main category: cs.LG
TL;DR: A meta-Bayesian optimization algorithm that adaptively uses related-task information when beneficial, otherwise falls back to lookahead optimization, maintaining performance across varying task-relatedness levels.
Details
Motivation: Meta-BO improves sample efficiency but suffers when test tasks poorly align with meta-training tasks, leading to suboptimal queries. An adaptive approach is needed that uses related-task information only when it is beneficial.
Method: Proposes a simple meta-BO algorithm that adaptively utilizes related-task information when determined useful, falling back to lookahead optimization otherwise within a unified framework.
Result: Demonstrates competitiveness with existing approaches on function optimization tasks while maintaining strong performance in low task-relatedness regimes where test tasks share limited structure with meta-training set.
Conclusion: The adaptive meta-BO approach provides robust performance across varying task-relatedness levels, addressing limitations of traditional meta-BO when task alignment is poor.
Abstract: Bayesian optimization (BO) has demonstrated practicality and effectiveness for sequential optimization of expensive black-box functions in many real-world settings. Meta-Bayesian optimization (meta-BO) focuses on improving the sample efficiency of BO by making use of information from related tasks. Although meta-BO is sample-efficient when task structure transfers, poor alignment between meta-training and test tasks can cause suboptimal queries to be suggested during online optimization. To this end, we propose a simple meta-BO algorithm that utilizes related-task information when determined useful, falling back to lookahead otherwise, within a unified framework. We demonstrate competitiveness of our method with existing approaches on function optimization tasks, while retaining strong performance in low task-relatedness regimes where test tasks share limited structure with the meta-training set.
[536] Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End
Steve Hanneke, Idan Mehalel, Shay Moran
Main category: cs.LG
TL;DR: The paper studies PAC-learning of autoregressive language models, analyzing how sample complexity scales with generation length under different supervision types (End-to-End vs Chain-of-Thought).
Details
Motivation: To understand the learnability of autoregressive language models and quantify how sample complexity depends on generation length, comparing the benefits of Chain-of-Thought supervision versus End-to-End supervision.
Method: Uses PAC-learning framework for next-token generators, introduces combinatorial tools to analyze sample complexity scaling with generation length T, and compares End-to-End vs Chain-of-Thought supervision.
Result: For End-to-End learning, sample complexity can have essentially any growth rate between constant and linear in T. For Chain-of-Thought supervision, sample complexity is independent of T, showing intermediate reasoning steps eliminate dependence on generation length.
Conclusion: Chain-of-Thought supervision dramatically reduces sample complexity by eliminating dependence on generation length, providing theoretical justification for using intermediate reasoning steps in training language models.
Abstract: Modern large language models generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for $T$ steps, producing a chain of tokens whose final token constitutes the model’s output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length $T$, and how much Chain-of-Thought supervision can reduce this dependence. In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with $T$. For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate $r(T)$ between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of $T$, demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision.
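The two supervision regimes can be made concrete with a toy next-token generator iterated for T steps (the parity generator below is purely illustrative, not the paper's setting):

```python
def rollout(generator, prompt, T):
    """Apply a next-token generator autoregressively for T steps."""
    chain = list(prompt)
    for _ in range(T):
        chain.append(generator(chain))
    return chain

# Toy generator over tokens {0, 1}: emit the parity of all tokens so far.
parity = lambda seq: sum(seq) % 2

prompt = [1, 0, 1]
chain = rollout(parity, prompt, T=4)
end_to_end_label = chain[-1]            # E2E supervision: only the final token
chain_of_thought = chain[len(prompt):]  # CoT supervision: every generated token
```

The paper's question is exactly how much harder learning becomes when training examples reveal only `end_to_end_label` rather than the full `chain_of_thought`.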
[537] UCS: Estimating Unseen Coverage for Improved In-Context Learning
Jiayi Xin, Xiang Li, Evan Qiang, Weiqing He, Tianqi Shang, Weijie J. Su, Qi Long
Main category: cs.LG
TL;DR: UCS is a training-free demonstration selection method for in-context learning that prioritizes coverage of latent clusters in the data, improving ICL accuracy by 2-6%.
Details
Motivation: Current demonstration selection methods for in-context learning rely on heuristic notions of relevance or diversity without considering the coverage of latent clusters in the data, limiting their effectiveness and providing little insight into what makes a good demonstration set.
Method: UCS uses model-consistent embeddings to induce discrete latent clusters, then estimates the number of unrevealed clusters within a candidate subset using a Smoothed Good-Turing estimator from its empirical frequency spectrum. It can be combined with existing selection methods via a regularized objective.
Result: Experiments on intent-classification and reasoning benchmarks with frontier LLMs show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also providing insights into task- and model-level latent cluster distributions.
Conclusion: UCS provides an effective, training-free approach to demonstration selection that prioritizes coverage of latent clusters, offering both performance improvements and analytical insights into the structure of tasks and models.
Abstract: In-context learning (ICL) performance depends critically on which demonstrations are placed in the prompt, yet most existing selectors prioritize heuristic notions of relevance or diversity and provide limited insight into the coverage of a demonstration set. We propose Unseen Coverage Selection (UCS), a training-free, subset-level coverage prior motivated by the principle that a good demonstration set should expose the model to latent clusters unrevealed by the currently selected subset. UCS operationalizes this idea by (1) inducing discrete latent clusters from model-consistent embeddings and (2) estimating the number of unrevealed clusters within a candidate subset via a Smoothed Good–Turing estimator from its empirical frequency spectrum. Unlike previous selection methods, UCS is coverage-based and training-free, and can be seamlessly combined with both query-dependent and query-independent selection baselines via a simple regularized objective. Experiments on multiple intent-classification and reasoning benchmarks with frontier Large Language Models show that augmenting strong baselines with UCS consistently improves ICL accuracy by up to 2-6% under the same selection budget, while also yielding insights into task- and model-level latent cluster distributions. Code is available at https://github.com/Raina-Xin/UCS.
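The paper's smoothed estimator is not spelled out here, but the unsmoothed Good-Turing idea it builds on is simple: the chance the next draw reveals an unseen cluster is estimated by the fraction of singleton clusters. A sketch with illustrative cluster labels:

```python
from collections import Counter

def unseen_mass(cluster_labels):
    """Good-Turing estimate of the chance the next draw reveals a cluster
    unseen in `cluster_labels`: (# clusters seen exactly once) / (# draws)."""
    counts = Counter(cluster_labels)
    f1 = sum(1 for c in counts.values() if c == 1)  # frequency-of-frequency n1
    return f1 / len(cluster_labels)

# Two candidate demonstration subsets mapped to latent cluster ids; the one
# with more singleton clusters leaves more coverage "unrevealed".
fully_covered = ["A", "A", "B", "B", "C", "C"]  # every cluster seen twice
thin_coverage = ["A", "A", "A", "B", "C", "D"]  # three singleton clusters
```

A coverage-based selector would prefer adding demonstrations to the subset with the larger unseen mass, which is the intuition UCS turns into a regularized selection objective.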
[538] TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction
Seungik Cho
Main category: cs.LG
TL;DR: TriFit: A multimodal framework integrating sequence, structure, and protein dynamics for predicting functional impact of single amino acid substitutions, outperforming existing methods on ProteinGym benchmark.
Details
Motivation: Existing protein language models and structure-based methods neglect protein dynamics, which are crucial determinants of mutational tolerance in structural biology. There's a need to incorporate residue flexibility, correlated motions, and allosteric coupling into supervised variant effect predictors.
Method: Multimodal framework with four-expert Mixture-of-Experts fusion module and trimodal cross-modal contrastive learning. Integrates sequence embeddings from ESM-2, structural embeddings from AlphaFold2-predicted geometries, and dynamics embeddings from Gaussian Network Model B-factors, mode shapes, and residue-residue cross-correlations.
Result: Achieves AUROC 0.897 on ProteinGym benchmark (217 DMS assays, 696k SAVs), outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and zero-shot model ESM3 (0.769). Dynamics provides largest marginal contribution over pairwise modality combinations.
Conclusion: TriFit demonstrates that integrating protein dynamics with sequence and structure information significantly improves variant effect prediction, with well-calibrated probabilistic outputs without post-hoc correction.
Abstract: Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure-based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well-established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts (MoE) fusion module with trimodal cross-modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM-2 (650M); structural embeddings from AlphaFold2-predicted C-alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and residue-residue cross-correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein-specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 +/- 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero-shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc correction.
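The Gaussian Network Model behind TriFit's dynamics features reduces to a graph Laplacian computation: residues within a contact cutoff are connected, and per-residue flexibility (relative B-factors) is read off the diagonal of the Kirchhoff matrix's pseudoinverse. A self-contained sketch (the 7 Å cutoff is a common GNM choice; the toy chain is illustrative):

```python
import numpy as np

def gnm_flexibility(coords, cutoff=7.0):
    """GNM sketch: Kirchhoff (connectivity) matrix from C-alpha coordinates;
    per-residue flexibility from the diagonal of its pseudoinverse."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contact = (d < cutoff) & ~np.eye(len(coords), dtype=bool)
    kirchhoff = np.diag(contact.sum(axis=1)) - contact.astype(float)
    gamma_inv = np.linalg.pinv(kirchhoff)  # pseudoinverse drops the zero mode
    return np.diag(gamma_inv)              # flexibility up to a constant factor

# Toy 5-residue chain, 3.8 Angstroms between consecutive C-alphas.
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
flex = gnm_flexibility(coords)
# Chain ends are less constrained, so they come out most flexible.
```

Mode shapes and residue-residue cross-correlations, the other dynamics inputs named above, come from the eigenvectors and off-diagonal entries of the same pseudoinverse.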
[539] VISTA: Validation-Informed Trajectory Adaptation via Self-Distillation
Eli Corn, Daphna Weinshall
Main category: cs.LG
TL;DR: VISTA is an online self-distillation framework that prevents models from abandoning previously learned features during training by identifying and preserving expert model states from earlier training stages.
Details
Motivation: Deep learning models can converge to suboptimal solutions despite good validation accuracy due to "Trajectory Deviation" - where models abandon high generalization states for specific data sub-populations, discarding previously learned latent features without triggering classical overfitting signals.
Method: VISTA uses a validation-informed Marginal Coverage score to identify expert anchors (earlier model states that retain specialized competence over distinct data regions). A coverage-weighted ensemble of these anchors is integrated online during training to regularize the loss landscape and preserve mastered knowledge.
Result: VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods across multiple benchmarks, with a lightweight implementation reducing storage overhead by 90% without performance loss.
Conclusion: VISTA effectively addresses trajectory deviation in deep learning optimization by preserving specialized knowledge from earlier training stages through online self-distillation, leading to better generalization and robustness.
Abstract: Deep learning models may converge to suboptimal solutions despite strong validation accuracy, masking an optimization failure we term Trajectory Deviation. This is because as training proceeds, models can abandon high generalization states for specific data sub-populations, thus discarding previously learned latent features without triggering classical overfitting signals. To address this problem we introduce VISTA, an online self-distillation framework that enforces consistency along the optimization trajectory. Using a validation-informed Marginal Coverage score, VISTA identifies expert anchors, which are earlier model states that retain specialized competence over distinct data regions. A coverage-weighted ensemble of these anchors is integrated online during training, regularizing the loss landscape and preserving mastered knowledge. When evaluated across multiple benchmarks, VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods, while a lightweight implementation reduces storage overhead by 90% without performance loss.
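The ensembling step itself is a weighted average over checkpoint predictions; a minimal sketch, assuming per-anchor logits and marginal-coverage scores are already available (both names are illustrative, not VISTA's API):

```python
import numpy as np

def coverage_weighted_ensemble(anchor_logits, coverage_scores):
    """Combine expert-anchor predictions, weighting each earlier checkpoint
    by its (normalized) marginal-coverage score."""
    w = np.asarray(coverage_scores, dtype=float)
    w = w / w.sum()
    # Weighted sum over the anchor axis: sum_a w[a] * logits[a, c].
    return np.einsum("a,ac->c", w, np.asarray(anchor_logits, dtype=float))

logits = [[2.0, 0.0],   # anchor 1: confident in class 0
          [0.0, 2.0]]   # anchor 2: confident in class 1
out = coverage_weighted_ensemble(logits, [3.0, 1.0])  # anchor 1 covers more
```

Integrating this ensemble online, rather than only at test time, is what lets VISTA regularize the trajectory during training.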
[540] Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees
Nicolas Huynh, Krzysztof Kacprzyk, Ryan Sheridan, David Bentley, Mihaela van der Schaar
Main category: cs.LG
TL;DR: DEFT is a novel framework for interpretable DNA sequence analysis that uses large language models to generate high-level biological features during decision tree construction, improving both interpretability and predictive performance.
Details
Motivation: Current deep neural networks for DNA sequence analysis are black boxes, while interpretable axis-aligned decision trees suffer from limited expressivity and prohibitive tree depths. There's a need for interpretable models that can capture complex biological patterns in DNA sequences.
Method: DEFT adaptively generates high-level sequence features during tree construction by leveraging large language models to propose biologically-informed features tailored to local sequence distributions at each node, with iterative refinement through a reflection mechanism.
Result: Empirical demonstrations show that DEFT discovers human-interpretable and highly predictive sequence features across diverse genomic tasks, outperforming traditional interpretable methods.
Conclusion: DEFT successfully bridges the gap between interpretability and performance in DNA sequence analysis by combining the structured reasoning of decision trees with the feature generation capabilities of large language models.
Abstract: The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.
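The contrast with raw axis-aligned splits can be sketched without the LLM: each node tests a high-level, human-readable sequence feature rather than a single position. The motif test below is a hypothetical stand-in for an LLM-proposed feature, not DEFT's actual generation step:

```python
# A node split on a high-level feature (motif presence) instead of one-hot
# positions; "TATA" is an illustrative promoter-like motif.

def motif_feature(motif):
    return lambda seq: motif in seq

def split(sequences, feature):
    """Partition sequences at a node by a boolean high-level feature."""
    left = [s for s in sequences if feature(s)]
    right = [s for s in sequences if not feature(s)]
    return left, right

seqs = ["TATAAAGC", "GGCGCGCC", "ATATAAGG", "CCGGTTAA"]
left, right = split(seqs, motif_feature("TATA"))
```

One such split can separate sequences that an axis-aligned tree would need many position-level splits to distinguish, which is the depth reduction DEFT targets.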
[541] Robust Optimization for Mitigating Reward Hacking with Correlated Proxies
Zixuan Liu, Xiaolin Sun, Zizhan Zheng
Main category: cs.LG
TL;DR: A robust RL framework that optimizes policies against worst-case proxy rewards within correlation constraints to prevent reward hacking.
Details
Motivation: RL agents trained with imperfect proxy rewards are vulnerable to reward hacking, where they exploit proxy reward signals to achieve high returns through unintended behaviors rather than achieving the true objective.
Method: Formulates reward hacking as a robust policy optimization problem over all r-correlated proxy rewards. Derives a tractable max-min formulation where the agent maximizes performance under the worst-case proxy consistent with correlation constraints. Extends to linear reward functions with known features for improved policies and interpretable worst-case rewards.
Result: Experiments show the algorithms consistently outperform occupancy-regularized policy optimization (ORPO) in worst-case returns, offering improved robustness and stability across different levels of proxy-true reward correlation.
Conclusion: The approach provides both robustness and transparency in settings where reward design is inherently uncertain, offering strong guarantees against broader classes of correlated proxies.
Abstract: Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors. Recent work formalizes this issue using r-correlation between proxy and true rewards, but existing methods like occupancy-regularized policy optimization (ORPO) optimize against a fixed proxy and do not provide strong guarantees against broader classes of correlated proxies. In this work, we formulate reward hacking as a robust policy optimization problem over the space of all r-correlated proxy rewards. We derive a tractable max-min formulation, where the agent maximizes performance under the worst-case proxy consistent with the correlation constraint. We further show that when the reward is a linear function of known features, our approach can be adapted to incorporate this prior knowledge, yielding both improved policies and interpretable worst-case rewards. Experiments across several environments show that our algorithms consistently outperform ORPO in worst-case returns, and offer improved robustness and stability across different levels of proxy-true reward correlation. These results show that our approach provides both robustness and transparency in settings where reward design is inherently uncertain. The code is available at https://github.com/ZixuanLiu4869/reward_hacking.
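The max-min structure can be illustrated on a toy discrete problem, using a finite sample of proxies as a stand-in for the full r-correlated uncertainty set (everything below — rewards, the correlation threshold, one-hot "policies" — is illustrative, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 10
true_r = rng.normal(size=n_states)  # true per-state reward (unknown in practice)

def corr(a, b):
    """Pearson correlation between two reward vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Finite stand-in for the r-correlated ball: keep sampled proxies that are
# at least r-correlated with the true reward (true_r itself is included).
r = 0.6
candidates = [true_r] + [true_r + 0.3 * rng.normal(size=n_states) for _ in range(50)]
feasible = [p for p in candidates if corr(p, true_r) >= r]

# Deterministic policies as one-hot state occupancies; the robust choice
# maximizes return under the worst feasible proxy (a tiny max-min).
policies = list(np.eye(n_states))
worst_case = lambda occ: min(float(occ @ p) for p in feasible)
robust_policy = max(policies, key=worst_case)
```

The paper replaces this brute-force inner minimum with a tractable formulation over the entire continuous r-correlated set.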
[542] SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
Zikun Liu, Liang Luo, Qianru Li, Zhengyu Zhang, Wei Ling, Jingyi Shen, Zeliang Chen, Yaning Huang, Jingxian Huang, Abdallah Aboelela, Chonglin Sun, Feifan Gu, Fenggang Wu, Hang Qu, Huayu Li, Jill Pan, Kaidi Pei, Laming Chen, Longhao Jin, Qin Huang, Tongyi Tang, Varna Puvvada, Wenlin Chen, Xiaohan Wei, Xu Cao, Yantao Yao, Yuan Jin, Yunchen Pu, Yuxin Chen, Zijian Shen, Zhengkai Zhang, Dong Liang, Ellie Wen
Main category: cs.LG
TL;DR: SOLARIS is a framework that uses speculative precomputation of user-item embeddings to enable real-time serving of large foundation models in recommendation systems, achieving significant revenue gains at Meta scale.
Details
Motivation: Large foundation models in recommendation systems offer superior performance but have impractical computational demands for real-time serving, forcing compromises through knowledge distillation that reduce quality.
Method: SOLARIS uses a speculative-decoding-inspired approach to proactively predict which user-item pairs will appear in future requests, asynchronously precomputing their foundation model embeddings ahead of time, decoupling costly inference from latency-critical serving.
Result: Deployed across Meta’s advertising system serving billions of daily requests, SOLARIS achieved 0.67% revenue-driving top-line metrics gain, demonstrating effectiveness at scale.
Conclusion: SOLARIS enables real-time knowledge transfer from previously too-expensive foundation models by decoupling inference from serving, making large models practical for online recommendation systems.
Abstract: Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation-compromising serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests, and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta’s advertising system serving billions of daily requests, SOLARIS achieves 0.67% revenue-driving top-line metrics gain, demonstrating its effectiveness at scale.
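Stripped to its core, the serving pattern is a speculative precompute cache with a distilled fallback. All names below are illustrative stand-ins, not SOLARIS internals:

```python
# Speculative offloading sketch: predicted user-item pairs get their expensive
# foundation-model embedding computed off the serving path; serving reads the
# cache and falls back to a cheap distilled model on a miss.

cache = {}

def foundation_embed(user, item):
    return f"fm::{user}::{item}"         # stand-in for foundation-model inference

def distilled_embed(user, item):
    return f"distilled::{user}::{item}"  # stand-in for the distilled fallback

def precompute(predicted_pairs):
    """Asynchronous phase: run the big model ahead of time."""
    for user, item in predicted_pairs:
        cache[(user, item)] = foundation_embed(user, item)

def serve(user, item):
    """Latency-critical phase: a cache hit uses the precomputed embedding."""
    return cache.get((user, item)) or distilled_embed(user, item)

precompute([("u1", "ad9")])
hit = serve("u1", "ad9")   # precomputed foundation-model embedding
miss = serve("u2", "ad7")  # unpredicted pair falls back to the distilled model
```

The gain depends on the hit rate of the pair predictor: every hit serves foundation-model quality at cache-lookup latency.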
[543] XANE(3): An E(3)-Equivariant Graph Neural Network for Accurate Prediction of XANES Spectra from Atomic Structures
Vitor F. Grizzi, Luke N. Pretzie, Jiayi Xu, Cong Liu
Main category: cs.LG
TL;DR: XANE(3) is an E(3)-equivariant graph neural network that predicts X-ray absorption near-edge structure (XANES) spectra from atomic structures using physics-based equivariant representations and a composite training objective with derivative matching.
Details
Motivation: The paper aims to develop an accurate and efficient surrogate model for XANES simulation to accelerate spectral prediction, enable ML-assisted spectroscopy, and support data-driven materials discovery, addressing the computational cost of traditional physics-based simulations.
Method: The method uses an E(3)-equivariant graph neural network with tensor-product message passing, spherical harmonic edge features, absorber-query attention pooling, custom equivariant layer normalization, adaptive gated residual connections, and a multi-scale Gaussian basis spectral readout with optional sigmoidal background term. Training employs a composite objective including pointwise spectral reconstruction plus first- and second-derivative matching.
Result: The model achieves a spectrum mean squared error of 1.0×10⁻³ on a test set of 5,941 FDMNES simulations of iron oxide surface facets, accurately reproducing main edge structure, relative peak intensities, pre-edge features, and post-edge oscillations. Ablation studies show each component improves performance.
Conclusion: XANE(3) establishes an accurate and efficient surrogate for XANES simulation, offering a promising route toward accelerated spectral prediction, ML-assisted spectroscopy, and data-driven materials discovery, though explicit tensorial channels may not be strictly required for low intensity error on this dataset.
Abstract: We present XANE(3), a physics-based E(3)-equivariant graph neural network for predicting X-ray absorption near-edge structure (XANES) spectra directly from atomic structures. The model combines tensor-product message passing with spherical harmonic edge features, absorber-query attention pooling, custom equivariant layer normalization, adaptive gated residual connections, and a spectral readout based on a multi-scale Gaussian basis with an optional sigmoidal background term. To improve line-shape fidelity, training is performed with a composite objective that includes pointwise spectral reconstruction together with first- and second-derivative matching terms. We evaluate the model on a dataset of 5,941 FDMNES simulations of iron oxide surface facets and obtain a spectrum mean squared error of $1.0 \times 10^{-3}$ on the test set. The model accurately reproduces the main edge structure, relative peak intensities, pre-edge features, and post-edge oscillations. Ablation studies show that the derivative-aware objective, custom equivariant normalization, absorber-conditioned attention pooling, adaptive gated residual mixing, and global background term each improve performance. Interestingly, a capacity-matched scalar-only variant achieves comparable pointwise reconstruction error but reduced derivative-level fidelity, indicating that explicit tensorial channels are not strictly required for low intensity error on this dataset, although they remain beneficial for capturing finer spectral structure. These results establish XANE(3) as an accurate and efficient surrogate for XANES simulation and offer a promising route toward accelerated spectral prediction, ML-assisted spectroscopy, and data-driven materials discovery.
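The derivative-aware objective can be sketched with finite differences along the energy axis; the weights and the toy spectra below are illustrative, not the paper's settings:

```python
import numpy as np

def spectral_loss(pred, target, w1=1.0, w2=1.0):
    """Composite objective sketch: pointwise MSE plus first- and
    second-derivative matching via finite differences."""
    mse = np.mean((pred - target) ** 2)
    d1 = np.mean((np.diff(pred) - np.diff(target)) ** 2)        # slope match
    d2 = np.mean((np.diff(pred, 2) - np.diff(target, 2)) ** 2)  # curvature match
    return mse + w1 * d1 + w2 * d2

e = np.linspace(0.0, 1.0, 200)
target = np.exp(-((e - 0.3) ** 2) / 0.01)        # toy absorption peak
pred = target + 0.01 * np.sin(40 * np.pi * e)    # right intensity, wrong line shape
```

High-frequency line-shape errors that barely move the pointwise MSE are penalized heavily by the derivative terms, which is the stated motivation for the composite objective.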
[544] Distinct mechanisms underlying in-context learning in transformers
Cole Gibson, Wenping Cui, Gautam Reddy
Main category: cs.LG
TL;DR: Transformers develop distinct subcircuits for in-context learning on Markov chains, showing four algorithmic phases based on memorization/generalization and 1-point/2-point statistics usage.
Details
Motivation: To provide a complete mechanistic characterization of in-context learning in transformers, specifically understanding how fixed networks adapt computation to input statistics across different systems.
Method: Analyze transformers trained on a finite set of discrete Markov chains, identifying four algorithmic phases and multi-layer subcircuits implementing context-adaptive computations through minimal models and symmetry-constrained training dynamics theory.
Result: Transformers develop distinct subcircuits implementing in-context learning with four phases: memorization vs generalization and 1-point vs 2-point statistics usage. Phase transitions occur at boundaries K₁* (kinetic competition) and K₂* (representational bottleneck).
Conclusion: Transformers develop specialized subcircuits for in-context learning, with specific conditions favoring different mechanisms, providing mechanistic understanding of how networks adapt to input statistics.
Abstract: Modern distributed networks, notably transformers, acquire a remarkable ability (termed ‘in-context learning’) to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set $S$ of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes or generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for implementing context-adaptive computations. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on data diversity, $K = |S|$. The first ($K_1^\ast$) is set by a kinetic competition between subcircuits and the second ($K_2^\ast$) is set by a representational bottleneck. A symmetry-constrained theory of a transformer’s training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Put together, we show that transformers develop distinct subcircuits to implement in-context learning and identify conditions that favor certain mechanisms over others.
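The 1-point/2-point distinction can be made concrete on a context sequence. The sketch below implements the two statistics as explicit in-context predictors; it illustrates the strategies the phases correspond to, not the transformer subcircuits themselves.

```python
from collections import Counter, defaultdict

def predict_1point(context):
    """Unigram (1-point) strategy: predict the most frequent symbol in context."""
    counts = Counter(context)
    return max(counts, key=counts.get)

def predict_2point(context):
    """Bigram (2-point) strategy: condition on the last symbol via in-context
    transition counts, falling back to the unigram answer if unseen."""
    trans = defaultdict(Counter)
    for a, b in zip(context, context[1:]):
        trans[a][b] += 1
    last = context[-1]
    if trans[last]:
        return max(trans[last], key=trans[last].get)
    return predict_1point(context)

# A context drawn from a cyclic chain 0 -> 1 -> 2 -> 0: the 2-point strategy
# recovers the deterministic transition, the 1-point one cannot.
ctx = [0, 1, 2, 0, 1, 2, 0, 1]
print(predict_2point(ctx))  # -> 2, continuing the cycle after symbol 1
```

On this cyclic chain only the bigram predictor recovers the deterministic transition, which is the kind of gap the 1-point-to-2-point phase transition separates.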
[545] PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
Anupam Nayak, Baris Askin, Muhammed Ustaomeroglu, Carlee Joe-Wong, Gauri Joshi
Main category: cs.LG
TL;DR: Federated RLVR framework combining LoRA-based local adaptation with public-data-based off-policy steps for efficient cross-client coordination in decentralized reasoning post-training.
Details
Motivation: Real-world applications often involve decentralized private data across organizations, making centralized reasoning post-training impractical. Federated training is natural but faces challenges: full-model synchronization is expensive, and local steps cause client drift under heterogeneous data.
Method: Proposes federated RLVR framework with LoRA-based local adaptation and public-data-based off-policy steps. Uses small shared public dataset to periodically exchange and reuse response-level training signals across organizations. Selectively replaces locally incorrect responses with globally correct ones during public-data steps.
Result: Method consistently improves over standard baselines across mathematical and medical reasoning benchmarks and models. Demonstrates effectiveness in federated reasoning post-training scenarios.
Conclusion: Simple and effective recipe for federated reasoning post-training: combine low-rank communication (LoRA) with limited public-data coordination. Enables efficient cross-client alignment without exposing private data.
Abstract: Reasoning post-training with reinforcement learning from verifiable rewards (RLVR) is typically studied in centralized settings, yet many realistic applications involve decentralized private data distributed across organizations. Federated training is a natural solution, but scaling RLVR in this regime is challenging: full-model synchronization is expensive, and performing many local steps can cause severe client drift under heterogeneous data. We propose a federated RLVR framework that combines LoRA-based local adaptation with public-data-based off-policy steps to improve both communication efficiency and cross-client coordination. In particular, a small shared public dataset is used to periodically exchange and reuse response-level training signals across organizations, providing a lightweight anchor toward a more globally aligned objective without exposing private data. Our method selectively replaces locally incorrect responses with globally correct ones during public-data steps, thereby keeping training closer to the local policy while still benefiting from cross-client coordination. Across mathematical and medical reasoning benchmarks and models, our method consistently improves over standard baselines. Our results highlight a simple and effective recipe for federated reasoning post-training: combining low-rank communication with limited public-data coordination.
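The response-swap rule of the public-data step can be sketched as follows. Everything here (`verify`, the pool layout, dropping samples with no correct global response) is an illustrative assumption, not the paper's exact procedure.

```python
def public_data_step(local_responses, global_pool, verify):
    """Sketch of the response-swap rule: keep locally correct responses
    (staying close to the local policy) and replace incorrect ones with a
    verified-correct response for the same prompt from the global pool,
    when one exists. `verify` is the verifiable-reward checker.
    """
    batch = []
    for prompt, response in local_responses:
        if verify(prompt, response):
            batch.append((prompt, response, "local"))
        else:
            swap = next((r for r in global_pool.get(prompt, [])
                         if verify(prompt, r)), None)
            if swap is not None:
                batch.append((prompt, swap, "global"))
            # if no correct global response exists, the sample is dropped
    return batch

# Toy arithmetic "reasoning" task: the reward is exact-answer match.
verify = lambda prompt, resp: resp == str(eval(prompt))
local = [("1+1", "2"), ("2+3", "6")]                  # second response is wrong
pool = {"2+3": ["4", "5"]}                            # "5" is globally correct
print(public_data_step(local, pool, verify))
```

The resulting batch mixes on-policy ("local") and off-policy ("global") samples, which is the coordination signal the shared public dataset provides.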
[546] CycloneMAE: A Scalable Multi-Task Learning Model for Global Tropical Cyclone Probabilistic Forecasting
Renlong Hang, Zihao Xu, Jiuwei Zhao, Runling Yu, Leye Cheng, Qingshan Liu
Main category: cs.LG
TL;DR: CycloneMAE is a multi-task forecasting model for tropical cyclones that uses masked autoencoders to learn transferable representations from multi-modal data, delivering both deterministic forecasts and probability distributions.
Details
Motivation: Current tropical cyclone forecasting faces fundamental trade-offs: numerical weather prediction models are computationally expensive and don't leverage historical data well, while existing deep learning models are variable-specific, deterministic, and fail to generalize across different forecasting variables.
Method: Uses a TC structure-aware masked autoencoder to learn transferable representations from multi-modal data, coupled with a discrete probabilistic gridding mechanism and pre-train/fine-tune paradigm for simultaneous deterministic and probabilistic forecasting.
Result: Outperforms leading NWP systems in pressure and wind forecasting up to 120 hours and in track forecasting up to 24 hours across five global ocean basins. Attribution analysis shows physically interpretable learning dynamics.
Conclusion: Establishes a scalable, probabilistic, and interpretable pathway for operational TC forecasting through transferable multi-modal representation learning.
Abstract: Tropical cyclones (TCs) rank among the most destructive natural hazards, yet their forecasting faces fundamental trade-offs: numerical weather prediction (NWP) models are computationally prohibitive and struggle to leverage historical data, while existing deep learning (DL)-based intelligent models are variable-specific and deterministic, which fail to generalize across different forecasting variables. Here we present CycloneMAE, a scalable multi-task forecasting model that learns transferable TC representations from multi-modal data using a TC structure-aware masked autoencoder. By coupling a discrete probabilistic gridding mechanism with a pre-train/fine-tune paradigm, CycloneMAE simultaneously delivers deterministic forecasts and probability distributions. Evaluated across five global ocean basins, CycloneMAE outperforms leading NWP systems in pressure and wind forecasting up to 120 hours and in track forecasting up to 24 hours. Attribution analysis via integrated gradients reveals physically interpretable learning dynamics: short-term forecasts rely predominantly on the internal core convective structure from satellite imagery, whereas longer-term forecasts progressively shift attention to external environmental factors. Our framework establishes a scalable, probabilistic, and interpretable pathway for operational TC forecasting.
[547] Clustering-Enhanced Domain Adaptation for Cross-Domain Intrusion Detection in Industrial Control Systems
Luyao Wang
Main category: cs.LG
TL;DR: A clustering-enhanced domain adaptation method for industrial control traffic intrusion detection that addresses cross-domain challenges through feature alignment and clustering enhancement.
Details
Motivation: Industrial control systems face challenges in cross-domain intrusion detection due to varying traffic distributions, limited labeled samples, and emerging unknown attacks in dynamic environments.
Method: Two-component framework: 1) Feature-based transfer learning module using spectral-transform-based feature alignment to project source/target domains into shared latent subspace and reduce distribution discrepancies; 2) Clustering enhancement strategy combining K-Medoids clustering with PCA-based dimensionality reduction to improve cross-domain correlation estimation.
Result: Method improves unknown attack detection accuracy by up to 49% compared to five baselines, achieves larger F-score gains, shows stronger stability, and clustering enhancement boosts accuracy by up to 26% on representative tasks.
Conclusion: The proposed method effectively addresses data scarcity and domain shift, providing a practical solution for robust cross-domain intrusion detection in dynamic industrial environments.
Abstract: Industrial control systems operate in dynamic environments where traffic distributions vary across scenarios, labeled samples are limited, and unknown attacks frequently emerge, posing significant challenges to cross-domain intrusion detection. To address this issue, this paper proposes a clustering-enhanced domain adaptation method for industrial control traffic. The framework contains two key components. First, a feature-based transfer learning module projects source and target domains into a shared latent subspace through spectral-transform-based feature alignment and iteratively reduces distribution discrepancies, enabling accurate cross-domain detection. Second, a clustering enhancement strategy combines K-Medoids clustering with PCA-based dimensionality reduction to improve cross-domain correlation estimation and reduce performance degradation caused by manual parameter tuning. Experimental results show that the proposed method significantly improves unknown attack detection. Compared with five baseline models, it increases detection accuracy by up to 49%, achieves larger gains in F-score, and demonstrates stronger stability. Moreover, the clustering enhancement strategy further boosts detection accuracy by up to 26% on representative tasks. These results suggest that the proposed method effectively alleviates data scarcity and domain shift, providing a practical solution for robust cross-domain intrusion detection in dynamic industrial environments.
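A minimal version of the clustering-enhancement pipeline, PCA reduction followed by K-Medoids, might look like this; the hand-picked medoid initialization and toy data are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def pca_reduce(X, k):
    """Project rows of X onto the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def k_medoids(X, medoids, iters=20):
    """Plain alternating K-Medoids on Euclidean distances. `medoids` holds
    the initial medoid indices (hand-picked here; a real pipeline would use
    a standard initialization rule)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # pairwise distances
    medoids = np.asarray(medoids)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = []
        for j in range(len(medoids)):
            members = np.flatnonzero(labels == j)
            # the new medoid minimizes total distance to its cluster
            new.append(members[D[np.ix_(members, members)].sum(axis=1).argmin()])
        new = np.asarray(new)
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids

# Two well-separated blobs in 5-D, reduced to 2-D before clustering.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 5)), rng.normal(3, 0.1, (10, 5))])
labels, _ = k_medoids(pca_reduce(X, 2), [0, 10])
print(sorted(set(labels[:10].tolist())), sorted(set(labels[10:].tolist())))  # -> [0] [1]
```

The cluster assignments would then feed the cross-domain correlation estimation step described in the abstract.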
[548] A Residual-Shell-Based Lower Bound for Ollivier-Ricci Curvature
Xiang Gu, Huichun Zhang, Jian Sun
Main category: cs.LG
TL;DR: Proposes a tighter lower bound for Ollivier-Ricci curvature using k-hop random walks, significantly improving computational efficiency while maintaining accuracy compared to exact ORC computation.
Details
Motivation: Ollivier-Ricci curvature (ORC) is valuable for capturing geometric information in graphs but suffers from high computational cost due to Wasserstein distance evaluation. Existing efficient lower bounds based on 1-hop random walks have large accuracy gaps.
Method: Develops a substantially tighter lower bound for ORC that is computationally efficient. The method extends beyond 1-hop random walks to k-hop random walks (k > 1), enabling practical speedups of tens of times compared to exact ORC computation.
Result: Experiments on fundamental graph structures demonstrate the effectiveness of the proposed bound in terms of both approximation accuracy and computational efficiency, achieving significant speed improvements while maintaining accuracy.
Conclusion: The proposed method provides a practical solution to the computational bottleneck of ORC, enabling broader applications by offering a tight lower bound that balances accuracy and efficiency for graph curvature analysis.
Abstract: Ollivier-Ricci curvature (ORC), defined via the Wasserstein distance that captures rich geometric information, has received growing attention in both theory and applications. However, the high computational cost of Wasserstein distance evaluation has significantly limited the broader practical use of ORC. To alleviate this issue, previous work introduced a computationally efficient lower bound as a proxy for ORC based on 1-hop random walks, but this approach empirically exhibits large gaps from the exact ORC. In this paper, we establish a substantially tighter lower bound for ORC than the existing lower bound, while retaining much lower computational cost than exact ORC computation, with practical speedups of tens of times. Moreover, our bound is not restricted to 1-hop random walks, but also applies to k-hop random walks (k > 1). Experiments on several fundamental graph structures demonstrate the effectiveness of our bound in terms of both approximation accuracy and computational efficiency.
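The quantity being bounded can be computed exactly on a toy case. For a path graph, which embeds isometrically in 1-D, the Wasserstein distance between 1-hop lazy-walk measures has a closed form via CDFs, so kappa(x, y) = 1 - W1(m_x, m_y)/d(x, y) is cheap to evaluate; general graphs need a transport LP, which is exactly the cost the paper's lower bound avoids. The idleness parameter alpha = 0.5 is an assumed convention, not taken from the paper.

```python
import numpy as np

def w1_on_line(support, p, q):
    """Exact 1-Wasserstein distance between two measures on the real line,
    via the integrated CDF difference (valid because a path graph embeds
    isometrically in 1-D; general graphs require a transport LP)."""
    order = np.argsort(support)
    s, p, q = np.asarray(support)[order], np.asarray(p)[order], np.asarray(q)[order]
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * np.diff(s)))

def lazy_walk_measure(n, x, alpha=0.5):
    """1-hop lazy random walk measure on the path graph 0-1-...-(n-1)."""
    m = np.zeros(n)
    m[x] = alpha
    nbrs = [v for v in (x - 1, x + 1) if 0 <= v < n]
    for v in nbrs:
        m[v] += (1 - alpha) / len(nbrs)
    return m

def orc_edge(n, x, y, alpha=0.5):
    """kappa(x, y) = 1 - W1(m_x, m_y) / d(x, y), with d(x, y) = 1 on an edge."""
    support = np.arange(n)
    return 1.0 - w1_on_line(support, lazy_walk_measure(n, x, alpha),
                            lazy_walk_measure(n, y, alpha))

# Interior edge of a path graph is flat (0.0); the end edge is positively curved.
print(orc_edge(5, 1, 2), orc_edge(5, 0, 1))  # -> 0.0 0.5
```

The interior-edge value 0 matches the intuition that a line is flat, while the boundary edge picks up positive curvature.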
[549] LLM-Enhanced Log Anomaly Detection: A Comprehensive Benchmark of Large Language Models for Automated System Diagnostics
Disha Patel
Main category: cs.LG
TL;DR: Comprehensive benchmark comparing LLM-based vs traditional methods for log anomaly detection across 4 datasets, showing fine-tuned transformers achieve best F1 scores while prompt-based LLMs offer strong zero-shot capabilities without labeled data.
Details
Motivation: Traditional log anomaly detection methods struggle with heterogeneous and evolving log data, while LLMs offer promising new approaches but lack systematic comparison against established techniques.
Method: Benchmark study evaluating three categories: (1) classical log parsers (Drain, Spell, AEL) with ML classifiers, (2) fine-tuned transformers (BERT, RoBERTa), and (3) prompt-based LLMs (GPT-3.5, GPT-4, LLaMA-3) in zero-shot/few-shot settings across HDFS, BGL, Thunderbird, and Spirit datasets.
Result: Fine-tuned transformers achieve highest F1-scores (0.96-0.99), while prompt-based LLMs demonstrate strong zero-shot capabilities (F1: 0.82-0.91) without requiring labeled training data. Analysis includes cost-accuracy trade-offs, latency characteristics, and failure modes.
Conclusion: Provides actionable guidelines for practitioners choosing log anomaly detection methods based on accuracy, latency, cost, and label availability constraints. Fine-tuned transformers perform best but require labeled data, while LLMs offer viable zero-shot alternatives.
Abstract: System log anomaly detection is critical for maintaining the reliability of large-scale software systems, yet traditional methods struggle with the heterogeneous and evolving nature of modern log data. Recent advances in Large Language Models (LLMs) offer promising new approaches to log understanding, but a systematic comparison of LLM-based methods against established techniques remains lacking. In this paper, we present a comprehensive benchmark study evaluating both LLM-based and traditional approaches for log anomaly detection across four widely-used public datasets: HDFS, BGL, Thunderbird, and Spirit. We evaluate three categories of methods: (1) classical log parsers (Drain, Spell, AEL) combined with machine learning classifiers, (2) fine-tuned transformer models (BERT, RoBERTa), and (3) prompt-based LLM approaches (GPT-3.5, GPT-4, LLaMA-3) in zero-shot and few-shot settings. Our experiments reveal that while fine-tuned transformers achieve the highest F1-scores (0.96-0.99), prompt-based LLMs demonstrate remarkable zero-shot capabilities (F1: 0.82-0.91) without requiring any labeled training data – a significant advantage for real-world deployment where labeled anomalies are scarce. We further analyze the cost-accuracy trade-offs, latency characteristics, and failure modes of each approach. Our findings provide actionable guidelines for practitioners choosing log anomaly detection methods based on their specific constraints regarding accuracy, latency, cost, and label availability. All code and experimental configurations are publicly available to facilitate reproducibility.
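The classical parser-plus-classifier side of the benchmark can be caricatured in a few lines: mask volatile fields to collapse lines into event templates (far cruder than Drain or Spell), then score a session by its rate of unseen templates. The regexes and the scoring rule are illustrative assumptions, not any benchmarked method.

```python
import re
from collections import Counter

def to_template(line):
    """Crude log parsing in the spirit of Drain/Spell: mask volatile fields
    (hex ids, numbers, paths) so lines collapse into event templates."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    line = re.sub(r"/[^ ]*", "<PATH>", line)
    return line

def fit_normal(train_lines):
    """'Training' for an unsupervised baseline: record the templates seen in
    normal logs and their frequencies."""
    return Counter(to_template(l) for l in train_lines)

def anomaly_score(session, normal_templates):
    """Fraction of a session's lines whose template was never seen in normal
    data -- a minimal stand-in for the parser + classifier stage."""
    unseen = sum(1 for l in session if to_template(l) not in normal_templates)
    return unseen / len(session)

normal = ["Received block blk_123 of size 67108864 from /10.0.0.1",
          "Received block blk_456 of size 67108864 from /10.0.0.2",
          "Verification succeeded for blk_789"]
model = fit_normal(normal)
bad = ["Exception in receiveBlock for blk_999", "Verification succeeded for blk_111"]
print(anomaly_score(bad, model))  # 0.5: one unseen template out of two lines
```

A real pipeline would feed per-session template count vectors to the ML classifier rather than thresholding unseen-template rates.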
[550] MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization
Ziqing Wang, Yibo Wen, Abhishek Pandy, Han Liu, Kaize Ding
Main category: cs.LG
TL;DR: MolMem is a memory-augmented RL framework for sample-efficient molecular optimization with dual memory systems for cold-start grounding and reusable skill distillation.
Details
Motivation: Molecular optimization requires expensive oracle evaluations, making sample efficiency crucial. Existing methods either need many oracle calls or rely on external knowledge that reuses familiar templates and struggles with challenging objectives. The key gap is long-term memory to ground decisions and provide reusable insights for future optimizations.
Method: MolMem uses a multi-turn agentic RL framework with dual-memory system: (1) Static Exemplar Memory retrieves relevant exemplars for cold-start grounding, and (2) Evolving Skill Memory distills successful trajectories into reusable strategies. The policy is trained with dense step-wise rewards, turning costly rollouts into long-term knowledge.
Result: MolMem achieves 90% success on single-property tasks (1.5× over best baseline) and 52% on multi-property tasks using only 500 oracle calls.
Conclusion: The memory-augmented RL framework enables sample-efficient molecular optimization by leveraging dual memory systems to ground decisions and distill reusable strategies from expensive oracle evaluations.
Abstract: In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (\textbf{Mol}ecular optimization with \textbf{Mem}ory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5$\times$ over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL-Lab-NU/MolMem.
[551] Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown
Sandra Gómez-Gálvez, Tobias Olenyi, Gillian Dobbie, Katerina Taškova
Main category: cs.LG
TL;DR: Socrates Loss is a unified loss function that addresses the stability-performance trade-off in neural network confidence calibration by incorporating an auxiliary unknown class and dynamic uncertainty penalty.
Details
Motivation: Deep neural networks often have poor confidence calibration despite high accuracy, limiting reliability in high-stakes applications. Existing calibration methods face a fundamental trade-off: two-phase training achieves strong classification but has instability and poor calibration, while single-loss methods are stable but underperform in classification.
Method: Proposes Socrates Loss, a novel unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective optimizes both classification and confidence calibration simultaneously without complex scheduled losses.
Result: Across four benchmark datasets and multiple architectures, Socrates Loss consistently improves training stability while achieving more favorable accuracy-calibration trade-off, often converging faster than existing methods. Theoretical guarantees show the method regularizes models to prevent miscalibration and overfitting.
Conclusion: Socrates Loss effectively mitigates the stability-performance trade-off in confidence calibration, providing a unified solution that improves both classification performance and calibration reliability without training instability.
Abstract: Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high-stakes applications. Current ad-hoc confidence calibration methods attempt to fix this during training but face a fundamental trade-off: two-phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single-loss methods are stable but underperform in classification. This paper addresses and mitigates this stability-performance trade-off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving more favorable accuracy-calibration trade-off, often converging faster than existing methods.
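The abstract does not give the loss formula, so the sketch below is only one generic way to combine an auxiliary unknown class with a dynamic uncertainty penalty; the penalty form and its weighting are assumptions made here for illustration, not the paper's definition of Socrates Loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def unknown_class_loss(logits, labels, lam=1.0):
    """Generic auxiliary-unknown-class objective (not the paper's exact form):
    logits has K+1 columns, the last one an explicit 'unknown' class.
    Loss = cross-entropy on the true class
         + lam * a dynamic penalty that pushes unknown-mass down when the
           model is right on the true class and tolerates it when uncertain.
    """
    p = softmax(logits)
    n = len(labels)
    p_true = p[np.arange(n), labels]
    p_unknown = p[:, -1]
    ce = -np.log(p_true + 1e-12)
    # dynamic penalty: confident-correct samples discourage unknown mass,
    # uncertain samples (p_true < 0.5) may route mass to the unknown class
    penalty = (2 * p_true - 1) * p_unknown
    return float(np.mean(ce + lam * penalty))

labels = np.array([0])
logits_confident = np.array([[4.0, 0.0, 0.0, -2.0]])   # sure and correct
logits_hedging   = np.array([[1.0, 1.0, 1.0,  1.0]])   # spreads mass to unknown
print(unknown_class_loss(logits_confident, labels) <
      unknown_class_loss(logits_hedging, labels))
```

Under this construction a confident, correct prediction with little unknown-mass is cheaper than a hedging one, which is the qualitative behavior a unified calibration objective needs.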
[552] Decentralized Learning via Random Walk with Jumps
Zonghong Liu, Matthew Dwyer, Salim El Rouayheb
Main category: cs.LG
TL;DR: Weighted random-walk learning for decentralized networks suffers from entrapment in small network regions, addressed by Metropolis-Hastings with Lévy jumps for occasional long-range transitions to restore exploration.
Details
Motivation: Decentralized learning over networks with distributed data needs efficient communication and computation. Weighted random-walk learning uses transition matrices for desired sampling distributions to speed convergence under data heterogeneity, but can suffer from entrapment where the random walk gets trapped in small network regions.
Method: Proposes Metropolis-Hastings with Lévy jumps (MHLJ), which introduces occasional long-range transitions to restore exploration while respecting local information constraints. This addresses entrapment in weighted random-walk learning, where the random walk may become trapped in small network regions.
Result: Establishes convergence rate characterizing roles of data heterogeneity, network spectral gap, and jump probability. Experiments show MHLJ effectively eliminates entrapment and significantly speeds up decentralized learning compared to standard weighted random-walk approaches.
Conclusion: Metropolis-Hastings with Lévy jumps solves the entrapment problem in weighted random-walk decentralized learning, enabling efficient exploration while maintaining local information constraints and improving convergence speed.
Abstract: We study decentralized learning over networks where data are distributed across nodes without a central coordinator. Random walk learning is a token-based approach in which a single model is propagated across the network and updated at each visited node using local data, thereby incurring low communication and computational overheads. In weighted random-walk learning, the transition matrix is designed to achieve a desired sampling distribution, thereby speeding up convergence under data heterogeneity. We show that implementing weighted sampling via the Metropolis-Hastings algorithm can lead to a previously unexplored phenomenon we term entrapment. The random walk may become trapped in a small region of the network, resulting in highly correlated updates and severely degraded convergence. To address this issue, we propose Metropolis-Hastings with Lévy jumps (MHLJ), which introduces occasional long-range transitions to restore exploration while respecting local information constraints. We establish a convergence rate that explicitly characterizes the roles of data heterogeneity, network spectral gap, and jump probability, and demonstrate through experiments that MHLJ effectively eliminates entrapment and significantly speeds up decentralized learning.
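The walk itself is easy to sketch. Below, a Metropolis-Hastings walk targets node weights `pi` via uniform-neighbor proposals and, with probability `p_jump`, takes a long-range jump; uniform jumps are a simplification made here (the paper's Lévy-jump mechanism respects local information constraints), and the two-clique graph is a standard entrapment illustration, not the paper's benchmark.

```python
import random

def mh_with_jumps(adj, pi, steps, p_jump=0.1, start=0, seed=0):
    """Metropolis-Hastings walk targeting node weights pi, with occasional
    uniform long-range jumps (a simplification of Levy jumps)."""
    rng = random.Random(seed)
    x, visited = start, {start}
    for _ in range(steps):
        if rng.random() < p_jump:
            x = rng.randrange(len(adj))              # long-range jump
        else:
            y = rng.choice(adj[x])                   # uniform-neighbor proposal
            # MH acceptance ratio for target pi under this proposal
            accept = min(1.0, (pi[y] * len(adj[x])) / (pi[x] * len(adj[y])))
            if rng.random() < accept:
                x = y
        visited.add(x)
    return visited

def two_cliques(k):
    """Two k-cliques joined by one bridge edge: a walk started in one clique
    rarely crosses, illustrating entrapment."""
    adj = {v: [u for u in range(k) if u != v] for v in range(k)}
    adj.update({v: [u for u in range(k, 2 * k) if u != v] for v in range(k, 2 * k)})
    adj[0].append(k)
    adj[k].append(0)
    return [adj[v] for v in range(2 * k)]

adj = two_cliques(20)
pi = [1.0] * 40                                      # uniform target for simplicity
trapped = mh_with_jumps(adj, pi, 300, p_jump=0.0)    # no jumps
free = mh_with_jumps(adj, pi, 300, p_jump=0.1)       # with jumps
print(len(trapped), len(free))                       # coverage without vs with jumps
```

In a decentralized learning loop each visited node would apply a local model update, so broader coverage translates into less correlated updates.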
[553] RoleMAG: Learning Neighbor Roles in Multimodal Graphs
Yilong Zuo, Xunkai Li, Zhihan Zhang, Ronghua Li, Guoren Wang
Main category: cs.LG
TL;DR: RoleMAG is a multimodal graph framework that learns different neighbor roles for propagation across modalities, separating shared, complementary, and heterophilous signals to prevent interference between modalities.
Details
Motivation: Existing multimodal graph methods use shared message passing that assumes the same neighbors are equally useful for all modalities, but in practice neighbors beneficial for one modality may interfere with another, blurring modality-specific signals.
Method: RoleMAG distinguishes whether a neighbor should provide shared, complementary, or heterophilous signals, and routes them through separate propagation channels, enabling cross-modal completion while keeping heterophilous signals out of shared smoothing.
Result: Extensive experiments on three graph-centric MAG benchmarks show RoleMAG achieves best results on RedditS and Bili_Dance while remaining competitive on Toys, with ablation, robustness, and efficiency analyses supporting the effectiveness.
Conclusion: RoleMAG’s role-aware propagation design effectively addresses the issue of neighbor interference across modalities in multimodal attributed graphs, improving performance on graph-centric multimodal tasks.
Abstract: Multimodal attributed graphs (MAGs) combine multimodal node attributes with structured relations. However, existing methods usually perform shared message passing on a single graph and implicitly assume that the same neighbors are equally useful for all modalities. In practice, neighbors that benefit one modality may interfere with another, blurring modality-specific signals under shared propagation. To address this issue, we propose RoleMAG, a multimodal graph framework that learns how different neighbors should participate in propagation. Concretely, RoleMAG distinguishes whether a neighbor should provide shared, complementary, or heterophilous signals, and routes them through separate propagation channels. This enables cross-modal completion from complementary neighbors while keeping heterophilous ones out of shared smoothing. Extensive experiments on three graph-centric MAG benchmarks show that RoleMAG achieves the best results on RedditS and Bili_Dance, while remaining competitive on Toys. Ablation, robustness, and efficiency analyses further support the effectiveness of the proposed role-aware propagation design. Our code is available at https://anonymous.4open.science/r/RoleMAG-7EE0/
[554] SubFlow: Sub-mode Conditioned Flow Matching for Diverse One-Step Generation
Yexiong Lin, Jia Shi, Shanshan Ye, Wanyu Wang, Yu Yao, Tongliang Liu
Main category: cs.LG
TL;DR: SubFlow addresses diversity degradation in few-step flow matching models by decomposing classes into sub-modes via semantic clustering and conditioning flows on sub-mode indices to eliminate averaging distortion.
Details
Motivation: Recent few-step flow matching models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations. This is traced to averaging distortion where models learn frequency-weighted means over intra-class sub-modes.
Method: Proposes SubFlow (Sub-mode Conditioned Flow Matching) which decomposes each class into fine-grained sub-modes via semantic clustering and conditions the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, allowing accurate targeting of individual modes without averaging distortion.
Result: Extensive experiments on ImageNet-256 show SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID). It integrates seamlessly into existing one-step models like MeanFlow and Shortcut Models without architectural modifications.
Conclusion: SubFlow effectively addresses diversity degradation in flow matching models by eliminating averaging distortion through sub-mode conditioning, providing a plug-and-play solution that improves mode coverage while maintaining quality.
Abstract: Flow matching has emerged as a powerful generative framework, with recent few-step methods achieving remarkable inference acceleration. However, we identify a critical yet overlooked limitation: these models suffer from severe diversity degradation, concentrating samples on dominant modes while neglecting rare but valid variations of the target distribution. We trace this degradation to averaging distortion: when trained with MSE objectives, class-conditional flows learn a frequency-weighted mean over intra-class sub-modes, causing the model to over-represent high-density modes while systematically neglecting low-density ones. To address this, we propose SubFlow, Sub-mode Conditioned Flow Matching, which eliminates averaging distortion by decomposing each class into fine-grained sub-modes via semantic clustering and conditioning the flow on sub-mode indices. Each conditioned sub-distribution is approximately unimodal, so the learned flow accurately targets individual modes with no averaging distortion, restoring full mode coverage in a single inference step. Crucially, SubFlow is entirely plug-and-play: it integrates seamlessly into existing one-step models such as MeanFlow and Shortcut Models without any architectural modifications. Extensive experiments on ImageNet-256 demonstrate that SubFlow yields substantial gains in generation diversity (Recall) while maintaining competitive image quality (FID), confirming its broad applicability across different one-step generation frameworks. Project page: https://yexionglin.github.io/subflow.
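The sub-mode conditioning step, clustering each class's features and replacing the class label with a (class, sub-mode) index, can be sketched as follows; the minimal k-means stands in for the paper's semantic clustering, and the toy features and deterministic initialization are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means used as the semantic clustering step
    (deterministic spread init; k-means++ would be typical in practice)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def submode_condition(features, class_labels, k):
    """Replace each class label with a flat (class, sub-mode) index by
    clustering the class's features into k sub-modes, so that each
    conditioned sub-distribution is approximately unimodal."""
    cond = np.zeros(len(features), dtype=int)
    for c in np.unique(class_labels):
        idx = np.flatnonzero(class_labels == c)
        cond[idx] = kmeans(features[idx], k)
    return class_labels * k + cond

# One class whose features form two clearly separated intra-class modes.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(-5, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
classes = np.zeros(40, dtype=int)
cond = submode_condition(feats, classes, k=2)
print(sorted(set(cond[:20].tolist())), sorted(set(cond[20:].tolist())))  # -> [0] [1]
```

Training would then condition the flow on `cond` instead of `classes`, so each conditioned target distribution is one (approximately unimodal) sub-mode.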
[555] Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation
Jiayi Li, Shijie Tang, Gün Kaynar, Shiyi Du, Carl Kingsford
Main category: cs.LG
TL;DR: Shortcut Guardrail: A deployment-time framework that mitigates token-level shortcuts in pretrained language models without requiring original training data or shortcut annotations, using gradient-based attribution and Masked Contrastive Learning.
Details
Motivation: Pretrained language models often rely on superficial features (shortcuts) that appear predictive during training but fail to generalize at test time. Existing mitigation methods require heavy supervision like access to original training data or prior knowledge of shortcut types, which limits practical deployment.
Method: Proposes Shortcut Guardrail framework with key insight: gradient-based attribution on biased models highlights shortcut tokens. Uses lightweight LoRA-based debiasing module trained with Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens.
Result: Across sentiment classification, toxicity detection, and natural language inference tasks under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over unmitigated models under distribution shifts while preserving in-distribution performance.
Conclusion: Shortcut Guardrail provides an effective deployment-time solution for mitigating token-level shortcuts without requiring original training data or shortcut annotations, making it practical for real-world applications.
Abstract: Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.
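The paper's key insight is that gradient-based attribution on a biased model highlights shortcut tokens. A toy sketch of that idea using input-times-gradient on a bag-of-words logistic model (the vocabulary, weights, and token names below are invented for illustration; the real method attributes over a Transformer):

```python
# Hedged sketch: a 'biased' linear model leans heavily on one spurious token,
# and input-times-gradient attribution recovers it as a shortcut candidate.
import numpy as np

def input_x_gradient(weights, bow):
    """For a linear logit w.x, d(logit)/dx = w, so attribution = w * x."""
    return weights * bow

vocab = ["great", "movie", "plot", "SHORTCUT_TOKEN"]
# Huge weight on the shortcut token, modest weights elsewhere (illustrative).
w = np.array([0.4, 0.1, 0.2, 3.0])
bow = np.array([1.0, 1.0, 0.0, 1.0])  # which tokens are present in the example

attr = input_x_gradient(w, bow)
suspect = vocab[int(np.argmax(np.abs(attr)))]  # -> "SHORTCUT_TOKEN"
```

The MaskCL objective then pushes the debiasing module toward representations that stay consistent whether or not such suspect tokens are present.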
[556] Labeled TrustSet Guided: Batch Active Learning with Reinforcement Learning
Guofeng Cui, Yang Liu, Pichao Wang, Hankai Hsu, Xiaohang Sun, Xiang Hao, Zhu Liu
Main category: cs.LG
TL;DR: BRAL-T is a batch active learning framework combining TrustSet (selecting informative labeled data with balanced class distribution) and RL-based sampling to select high-quality unlabeled data, achieving SOTA results on image classification benchmarks.
Details
Motivation: Traditional batch active learning methods rely on metrics like Mahalanobis Distance but fail to leverage feedback from labeled data or model performance. They focus only on unlabeled data distribution and don't address long-tail class imbalance problems.
Method: 1) TrustSet: Selects most informative data from labeled dataset with balanced class distribution to mitigate long-tail problem, using label information to prune redundant data. 2) RL-based sampling policy: Approximates selection of high-quality TrustSet candidates from unlabeled data. 3) BRAL-T framework: Combines TrustSet and RL for batch active learning.
Result: Achieves state-of-the-art results across 10 image classification benchmarks and 2 active fine-tuning tasks, demonstrating effectiveness and efficiency in various domains.
Conclusion: BRAL-T effectively addresses limitations of traditional BAL methods by leveraging labeled data feedback and model performance, while handling class imbalance through TrustSet and efficiently selecting unlabeled data via RL-based sampling.
Abstract: Batch active learning (BAL) is a crucial technique for reducing labeling costs and improving data efficiency in training large-scale deep learning models. Traditional BAL methods often rely on metrics like Mahalanobis Distance to balance uncertainty and diversity when selecting data for annotation. However, these methods predominantly focus on the distribution of unlabeled data and fail to leverage feedback from labeled data or the model’s performance. To address these limitations, we introduce TrustSet, a novel approach that selects the most informative data from the labeled dataset, ensuring a balanced class distribution to mitigate the long-tail problem. Unlike CoreSet, which focuses on maintaining the overall data distribution, TrustSet optimizes the model’s performance by pruning redundant data and using label information to refine the selection process. To extend the benefits of TrustSet to the unlabeled pool, we propose a reinforcement learning (RL)-based sampling policy that approximates the selection of high-quality TrustSet candidates from the unlabeled data. Combining TrustSet and RL, we introduce the Batch Reinforcement Active Learning with TrustSet (BRAL-T) framework. BRAL-T achieves state-of-the-art results across 10 image classification benchmarks and 2 active fine-tuning tasks, demonstrating its effectiveness and efficiency in various domains.
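The class-balancing idea behind TrustSet can be sketched as a per-class top-m selection: keep the same budget of most informative labeled examples in every class, so a long-tailed pool yields a balanced retained set. The informativeness score and selection rule below are placeholders, not the paper's exact criterion:

```python
# Illustrative TrustSet-style selection (assumed form): top-m per class by an
# informativeness score, pruning the rest as redundant.
from collections import defaultdict

def trustset_select(samples, m):
    """samples: list of (id, class_label, informativeness); returns chosen ids."""
    by_class = defaultdict(list)
    for sid, cls, score in samples:
        by_class[cls].append((score, sid))
    chosen = []
    for cls, items in by_class.items():
        items.sort(reverse=True)                    # most informative first
        chosen.extend(sid for _, sid in items[:m])  # same budget per class
    return sorted(chosen)

# Long-tailed toy pool: class 0 has 4 samples, class 1 only 2.
pool = [(0, 0, 0.9), (1, 0, 0.8), (2, 0, 0.1), (3, 0, 0.2),
        (4, 1, 0.5), (5, 1, 0.4)]
picked = trustset_select(pool, m=2)   # -> ids [0, 1, 4, 5]
```

The RL policy then learns to pick unlabeled points that would likely survive this kind of filter once labeled.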
[557] Beyond Weather Correlation: A Comparative Study of Static and Temporal Neural Architectures for Fine-Grained Residential Energy Consumption Forecasting in Melbourne, Australia
Prasad Nimantha Madusanka Ukwatta Hewage, Hao Wu
Main category: cs.LG
TL;DR: LSTM models outperform weather-only MLPs for 5-minute residential energy forecasting, showing temporal autocorrelation dominates meteorological features, with solar integration creating forecasting asymmetries.
Details
Motivation: To determine whether temporal autocorrelation (sequential memory of past consumption) or static meteorological features are more important for accurate short-term residential energy forecasting at 5-minute resolution, particularly for Australian households with and without solar integration.
Method: Compared Multilayer Perceptron (MLP) using only weather features against Long Short-Term Memory (LSTM) using 24-step (2-hour) sliding consumption windows on 14 months of 5-minute smart meter data from two Melbourne households (one standard, one with rooftop solar), combined with daily weather observations.
Result: LSTM achieved R² = 0.883 (standard house) and R² = 0.865 (solar house), while weather-only MLPs performed poorly with R² = -0.055 and R² = 0.410 respectively, showing temporal autocorrelation dominates by 93.8 and 45.5 percentage point differences.
Conclusion: Temporal autocorrelation in consumption sequences is far more important than meteorological information for short-term 5-minute forecasting, with solar generation creating forecasting asymmetries where weather features implicitly capture solar patterns.
Abstract: Accurate short-term residential energy consumption forecasting at sub-hourly resolution is critical for smart grid management, demand response programmes, and renewable energy integration. While weather variables are widely acknowledged as key drivers of residential electricity demand, the relative merit of incorporating temporal autocorrelation (the sequential memory of past consumption) over static meteorological features alone remains underexplored at fine-grained (5-minute) temporal resolution for Australian households. This paper presents a rigorous empirical comparison of a Multilayer Perceptron (MLP) and a Long Short-Term Memory (LSTM) recurrent network applied to two real-world Melbourne households: House 3 (a standard grid-connected dwelling) and House 4 (a rooftop solar photovoltaic-integrated household). Both models are trained on 14 months of 5-minute interval smart meter data (March 2023-April 2024) merged with official Bureau of Meteorology (BOM) daily weather observations, yielding over 117,000 samples per household. The LSTM, operating on 24-step (2-hour) sliding consumption windows, achieves coefficients of determination of R^2 = 0.883 (House 3) and R^2 = 0.865 (House 4), compared to R^2 = -0.055 and R^2 = 0.410 for the corresponding weather-driven MLPs - differences of 93.8 and 45.5 percentage points. These results establish that temporal autocorrelation in the consumption sequence dominates meteorological information for short-term forecasting at 5-minute granularity. Additionally, we demonstrate an asymmetry introduced by solar generation: for the PV-integrated household, the MLP achieves R^2 = 0.410, revealing implicit solar forecasting from weather-time correlations. A persistence baseline analysis and seasonal stratification contextualise model performance. We propose a hybrid weather-augmented LSTM and federated learning extensions as directions for future work.
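The LSTM in this study consumes 24-step (2-hour) sliding windows of 5-minute consumption readings. A minimal sketch of that window construction (function and variable names are illustrative):

```python
# Build supervised (window -> next reading) pairs from a 5-minute series.
import numpy as np

def sliding_windows(series, window=24):
    """Return (X, y): each row of X is `window` past readings, y the next one."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

readings = np.arange(30, dtype=float)   # stand-in for 5-minute kWh readings
X, y = sliding_windows(readings)
# X.shape == (6, 24); y[0] is the reading right after the first window
```

The weather-only MLP, by contrast, never sees these windows, which is why it cannot exploit the autocorrelation that dominates at this resolution.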
[558] GCA Framework: A Gulf-Grounded Dataset and Agentic Pipeline for Climate Decision Support
Muhammad Umer Sheikh, Khawar Shehzad, Salman Khan, Fahad Shahbaz Khan, Muhammad Haris Khan
Main category: cs.LG
TL;DR: GCA framework combines Gulf-focused multimodal dataset (GCA-DS) with tool-augmented agent (GCA) for climate analysis, improving LLM performance on region-specific climate tasks through domain fine-tuning and tool integration.
Details
Motivation: Climate decision-making in the Gulf region requires systems that can translate scientific and policy evidence into actionable guidance, but current LLMs lack region-specific climate knowledge and ability to interact with geospatial/forecasting tools.
Method: Developed GCA framework with two components: (1) GCA-DS - curated multimodal dataset with ~200k QA pairs covering policies, frameworks, literature, event reporting, plus remote-sensing imagery with textual evidence; (2) Gulf Climate Agent - tool-augmented agent with modular pipeline for real-time/historical signal processing, geospatial analysis, and visualization generation.
Result: Benchmarking shows domain fine-tuning and tool integration substantially improve reliability over general-purpose LLM baselines for Gulf climate tasks.
Conclusion: The GCA framework addresses critical gaps in region-specific climate knowledge and tool interaction for LLMs, providing a foundation for actionable climate guidance in the Gulf region.
Abstract: Climate decision-making in the Gulf increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated Gulf-focused multimodal dataset, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises ~200k question-answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. Finally, we benchmark open and proprietary LLMs on Gulf climate tasks and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.
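The agent side of GCA orchestrates a modular tool pipeline over climate signals. A toy sketch of such tool routing; the tool names, trigger keywords, and dispatch rule here are all invented for illustration, not the paper's pipeline:

```python
# Keyword-triggered dispatch of a query across a registry of climate tools.
def run_pipeline(query, tools):
    """Run every tool whose trigger keyword appears in the query."""
    return {name: fn(query) for name, (keyword, fn) in tools.items()
            if keyword in query.lower()}

tools = {
    "heat_index": ("heatwave", lambda q: "derived heat index"),
    "dust_map":   ("dust",     lambda q: "geospatial dust layer"),
    "flood_risk": ("flood",    lambda q: "flood risk visualization"),
}
out = run_pipeline("Map the recent heatwave and dust storm impacts", tools)
# only the heat_index and dust_map tools fire for this query
```

In the real agent, an LLM rather than keyword matching decides which geospatial and forecasting tools to invoke and how to ground the answer in their outputs.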
[559] Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks
Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa
Main category: cs.LG
TL;DR: OptBias: A meta-learning framework for offline black-box optimization that addresses data scarcity by learning reusable optimization bias from synthetic tasks and fine-tuning on small target datasets.
Details
Motivation: Offline black-box optimization (e.g., molecule/material design) often suffers from data scarcity, limiting algorithm effectiveness. Existing methods struggle to capture optimization bias (ranking ability) with limited data.
Method: Proposes OptBias: Surrogate Learning with Optimization Bias via Synthetic Task Generation. Uses meta-learning to learn reusable optimization bias from synthetic tasks generated from Gaussian processes, then fine-tunes on small target task data.
Result: Outperforms state-of-the-art baselines across diverse continuous and discrete offline optimization benchmarks in small data regimes.
Conclusion: OptBias provides a robust and practical solution for offline optimization in realistic small data settings by effectively learning and transferring optimization bias.
Abstract: We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.
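OptBias meta-trains on synthetic tasks drawn from a Gaussian process. A hedged sketch of such a task generator: sample function values at random inputs from a GP prior (the RBF kernel, lengthscale, and task sizes below are assumptions; the paper's generator may differ):

```python
# One synthetic "offline dataset": inputs X with GP-prior function values f(X).
import numpy as np

def rbf_kernel(X, lengthscale=0.5):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_synthetic_task(n_points=16, dim=2, seed=0):
    """Sample a synthetic task from a zero-mean GP with an RBF kernel."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_points, dim))
    K = rbf_kernel(X) + 1e-6 * np.eye(n_points)   # jitter for stability
    f = rng.multivariate_normal(np.zeros(n_points), K)
    return X, f

X, f = sample_synthetic_task()
# each (X, f) pair plays the role of one small offline dataset in meta-training
```

Training a surrogate to rank designs correctly across many such cheap synthetic tasks is what lets the optimization bias transfer to the small real target dataset.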
[560] Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study
Charlotte S. Alexander, Shane Storks, Souradip Pal, Sayak Chakrabarty, Arushi Sharma, Mlen-Too Wesley, Bailey Russo
Main category: cs.LG
TL;DR: Transformer models can predict applicant gender from anonymized academic letters of recommendation with up to 68% accuracy using implicit linguistic patterns, revealing persistent gender leakage even after technical interventions.
Details
Motivation: To investigate whether Transformer-based models can infer gender from anonymized academic letters of recommendation, revealing implicit gendered language patterns that could bias hiring/admissions decisions despite explicit de-gendering efforts.
Method: Used DistilBERT, RoBERTa, and Llama 2 to classify gender of de-gendered LoRs from a U.S. medical residency program. Applied TF-IDF and SHAP for text interpretation to identify gender-proxy linguistic patterns. Experimented with removing implicit gender cues to create truly neutral LoRs.
Result: Models achieved up to 68% classification accuracy on anonymized LoRs. Linguistic patterns like “emotional” and “humanitarian” were strong gender proxies. Removing implicit cues reduced accuracy by up to 5.5% and F1 by 2.7%, but gender prediction remained above chance.
Conclusion: LoRs contain hard-to-remove gender-identifying cues that may activate bias. Technical interventions help but don’t eliminate gender leakage, necessitating upstream auditing of evaluative text alongside model-level fairness approaches.
Abstract: Letters of recommendation (LoRs) can carry patterns of implicitly gendered language that can inadvertently influence downstream decisions, e.g. in hiring and admissions. In this work, we investigate the extent to which Transformer-based encoder models as well as Large Language Models (LLMs) can infer the gender of applicants in academic LoRs submitted to a U.S. medical-residency program after explicit identifiers like names and pronouns are de-gendered. While using three models (DistilBERT, RoBERTa, and Llama 2) to classify the gender of anonymized and de-gendered LoRs, significant gender leakage was observed as evident from up to 68% classification accuracy. Text interpretation methods, like TF-IDF and SHAP, demonstrate that certain linguistic patterns are strong proxies for gender, e.g. “emotional” and “humanitarian” are commonly associated with LoRs from female applicants. As an experiment in creating truly gender-neutral LoRs, these implicit gender cues were removed, resulting in a drop of up to 5.5% accuracy and 2.7% macro $F_1$ score on re-training the classifiers. However, applicant gender prediction still remains better than chance. In this case study, our findings highlight that 1) LoRs contain gender-identifying cues that are hard to remove and may activate bias in decision-making and 2) while our technical framework may be a concrete step toward fairer academic and professional evaluations, future work is needed to interrogate the role that gender plays in LoR review. Taken together, our findings motivate upstream auditing of evaluative text in real-world academic letters of recommendation as a necessary complement to model-level fairness interventions.
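The study surfaces gender-proxy terms like "emotional" with TF-IDF (alongside SHAP). A minimal stdlib-only TF-IDF sketch on a toy corpus, showing how a term frequent in one letter but rare across the corpus gets a high score (the corpus and smoothing formula are illustrative):

```python
# Toy TF-IDF: per-document term weights with a smoothed idf.
import math
from collections import Counter

def tfidf(docs):
    """Return per-document {term: tf * idf} dictionaries."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    scores = []
    for d in docs:
        tf = Counter(d.split())
        total = sum(tf.values())
        scores.append({t: (c / total) * math.log((1 + n) / (1 + df[t]) + 1)
                       for t, c in tf.items()})
    return scores

docs = ["she is emotional and humanitarian",
        "he is analytical and rigorous",
        "he is rigorous and dependable"]
s = tfidf(docs)
# "emotional" (rare across docs) outweighs "and" (present everywhere)
```

In the real pipeline such high-weight terms are candidate gender proxies; removing them is exactly the intervention whose limits the paper measures.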
[561] PrivEraserVerify: Efficient, Private, and Verifiable Federated Unlearning
Parthaw Goswami, Md Khairul Islam, Ashfak Yeafi
Main category: cs.LG
TL;DR: PEV is a unified federated unlearning framework that simultaneously achieves efficiency, privacy, and verifiability through adaptive checkpointing, differential privacy calibration, and fingerprint-based verification.
Details
Motivation: Federated learning enables collaborative training without sharing raw data, but models may still memorize sensitive information, conflicting with the right to be forgotten. Existing federated unlearning solutions only partially address efficiency, privacy, or verifiability challenges.
Method: PEV integrates three key components: (1) adaptive checkpointing to retain critical historical updates for fast reconstruction, (2) layer-adaptive differentially private calibration to selectively remove client influence while minimizing accuracy loss, and (3) fingerprint-based verification enabling decentralized confirmation of unlearning.
Result: Experiments on image, handwritten character, and medical datasets show PEV achieves 2-3× faster unlearning than retraining, provides formal indistinguishability guarantees with reduced performance degradation, and supports scalable verification.
Conclusion: PEV is the first framework to simultaneously deliver efficiency, privacy, and verifiability for federated unlearning, moving FL closer to practical and regulation-compliant deployment.
Abstract: Federated learning (FL) enables collaborative model training without sharing raw data, offering a promising path toward privacy-preserving artificial intelligence. However, FL models may still memorize sensitive information from participants, conflicting with the right to be forgotten (RTBF). To meet these requirements, federated unlearning has emerged as a mechanism to remove the contribution of departing clients. Existing solutions only partially address this challenge: FedEraser improves efficiency but lacks privacy protection, FedRecovery ensures differential privacy (DP) but degrades accuracy, and VeriFi enables verifiability but introduces overhead without efficiency or privacy guarantees. We present PrivEraserVerify (PEV), a unified framework that integrates efficiency, privacy, and verifiability into federated unlearning. PEV employs (i) adaptive checkpointing to retain critical historical updates for fast reconstruction, (ii) layer-adaptive differentially private calibration to selectively remove client influence while minimizing accuracy loss, and (iii) fingerprint-based verification, enabling participants to confirm unlearning in a decentralized and noninvasive manner. Experiments on image, handwritten character, and medical datasets show that PEV achieves up to 2 to 3 times faster unlearning than retraining, provides formal indistinguishability guarantees with reduced performance degradation, and supports scalable verification. To the best of our knowledge, PEV is the first framework to simultaneously deliver efficiency, privacy, and verifiability for federated unlearning, moving FL closer to practical and regulation-compliant deployment.
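PEV's layer-adaptive DP calibration adds noise when scrubbing a client's influence. One plausible (assumed) form of that idea, sketched below: Gaussian noise whose scale tracks each layer's update magnitude, so low-sensitivity layers pay less accuracy for the same mechanism. The exact calibration in the paper differs and carries formal DP accounting:

```python
# Hedged sketch: per-layer Gaussian noise scaled by each layer's update norm.
import numpy as np

def calibrated_noise(layer_updates, sigma=0.1, seed=0):
    """Add per-layer Gaussian noise proportional to that layer's L2 norm."""
    rng = np.random.default_rng(seed)
    noised = {}
    for name, u in layer_updates.items():
        scale = sigma * np.linalg.norm(u)   # layer-adaptive scale (assumed)
        noised[name] = u + rng.normal(0.0, scale, size=u.shape)
    return noised

updates = {"head": np.ones(4), "backbone": 0.01 * np.ones(4)}
noised = calibrated_noise(updates)
# the backbone, with a tiny update norm, receives proportionally tiny noise
```

Scaling the noise per layer rather than globally is what lets the calibration "selectively remove client influence while minimizing accuracy loss."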
[562] Scaffold-Conditioned Preference Triplets for Controllable Molecular Optimization with Large Language Models
Yi Xiong, Liang Xiong, Xiaohong Ji, Sen Yang, Zhifeng Gao, Huaimin Wang, Kele Xu
Main category: cs.LG
TL;DR: SCPT introduces scaffold-constrained preference triplets for aligning molecular LLMs to perform property optimization while preserving molecular scaffolds, enabling controlled edits with chemistry-grounded supervision.
Details
Motivation: Current molecular property optimization methods often produce unstable or biologically implausible edits with limited scaffold preservation control. LLMs show promise but lack chemistry-grounded preference supervision and principled data curation for scaffold-constrained optimization.
Method: SCPT constructs similarity-constrained triplets (scaffold, better, worse) via scaffold alignment and chemistry-driven filters for validity, synthesizability, and meaningful property gains. These preferences align a pretrained molecular LLM as a conditional editor for scaffold-preserving property improvements.
Result: SCPT improves optimization success and property gains while maintaining higher scaffold similarity than baselines. It outperforms non-LLM methods in scaffold-constrained and multi-objective optimization, and shows effective generalization from single/two-property to three-property tasks.
Conclusion: SCPT enables controlled molecular optimization with predictable similarity-gain trade-offs, offering systematic adaptation to diverse optimization regimes while preserving molecular scaffolds through chemistry-grounded preference alignment.
Abstract: Molecular property optimization is central to drug discovery, yet many deep learning methods rely on black-box scoring and offer limited control over scaffold preservation, often producing unstable or biologically implausible edits. While large language models (LLMs) are promising molecular generators, optimization remains constrained by the lack of chemistry-grounded preference supervision and principled data curation. We introduce \textbf{Scaffold-Conditioned Preference Triplets (SCPT)}, a pipeline that constructs similarity-constrained triplets $\langle\text{scaffold}, \text{better}, \text{worse}\rangle$ via scaffold alignment and chemistry-driven filters for validity, synthesizability, and meaningful property gains. Using these preferences, we align a pretrained molecular LLM as a conditional editor, enabling property-improving edits that retain the scaffold. Across single- and multi-objective benchmarks, SCPT improves optimization success and property gains while maintaining higher scaffold similarity than competitive baselines. Compared with representative non-LLM molecular optimization methods, SCPT-trained LLMs are better suited to scaffold-constrained and multi-objective optimization. In addition, models trained on single-property and two-property supervision generalize effectively to three-property tasks, indicating promising extrapolative generalization under limited higher-order supervision. SCPT also provides controllable data-construction knobs that yield a predictable similarity-gain frontier, enabling systematic adaptation to diverse optimization regimes.
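SCPT keeps only triplets that pass chemistry-driven filters: validity, scaffold similarity, and a meaningful property gain. A sketch of that filtering logic with placeholder scoring functions; the real pipeline uses cheminformatics tooling for these checks, and every name and threshold below is an assumption:

```python
# Keep a (better, worse) pair only if both candidates are valid, both stay
# close to the scaffold, and 'better' improves the property by a real margin.
def keep_triplet(better, worse, similarity, prop, valid,
                 min_sim=0.6, min_gain=0.1):
    return (valid(better) and valid(worse)
            and similarity(better) >= min_sim and similarity(worse) >= min_sim
            and prop(better) - prop(worse) >= min_gain)

# Toy stand-ins: molecules are dicts with precomputed scores.
mols = {"A": {"sim": 0.9, "p": 0.8, "ok": True},
        "B": {"sim": 0.8, "p": 0.3, "ok": True},
        "C": {"sim": 0.2, "p": 0.9, "ok": True}}   # scaffold drifted too far

sim, prop, ok = (lambda m: mols[m]["sim"], lambda m: mols[m]["p"],
                 lambda m: mols[m]["ok"])
kept_ab = keep_triplet("A", "B", sim, prop, ok)   # passes all filters
kept_cb = keep_triplet("C", "B", sim, prop, ok)   # rejected: low similarity
```

Tightening `min_sim` versus `min_gain` is the kind of data-construction knob the paper says traces out a predictable similarity-gain frontier.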
[563] Is Sliding Window All You Need? An Open Framework for Long-Sequence Recommendation
Sayak Chakrabarty, Souradip Pal
Main category: cs.LG
TL;DR: A practical framework for long-sequence training in recommender systems with sliding windows, runtime-aware ablation studies, and k-shift embedding for million-scale vocabularies on commodity GPUs.
Details
Motivation: Long interaction histories are crucial for modern recommender systems, but training with long sequences is often considered impractical under realistic memory and latency budgets. The authors aim to demonstrate that long-sequence training is both practical and effective at academic scale.
Method: The paper presents a complete end-to-end framework implementing industrial-style long-sequence training with sliding windows. Key contributions include: (1) runtime-aware ablation study quantifying accuracy-compute trade-offs across windowing regimes and strides, and (2) a novel k-shift embedding layer enabling million-scale vocabularies on commodity GPUs with minimal accuracy loss.
Result: The framework trains reliably on modest university clusters and delivers competitive retrieval quality (up to +6.04% MRR and +6.34% Recall@10 on Retailrocket) with approximately 4× training-time overheads compared to standard approaches.
Conclusion: By providing a robust pipeline, reporting training time costs, and introducing an embedding mechanism for low-resource settings, the authors transform long-sequence training from a closed industrial technique into a practical, open, and extensible methodology for the research community.
Abstract: Long interaction histories are central to modern recommender systems, yet training with long sequences is often dismissed as impractical under realistic memory and latency budgets. This work demonstrates that it is not only practical but also effective at academic scale. We release a complete, end-to-end framework that implements industrial-style long-sequence training with sliding windows, including all data processing, training, and evaluation scripts. Beyond reproducing prior gains, we contribute two capabilities missing from earlier reports: (i) a runtime-aware ablation study that quantifies the accuracy-compute frontier across windowing regimes and strides, and (ii) a novel k-shift embedding layer that enables million-scale vocabularies on commodity GPUs with negligible accuracy loss. Our implementation trains reliably on modest university clusters while delivering competitive retrieval quality (e.g., up to +6.04% MRR and +6.34% Recall@10 on Retailrocket) with $\sim 4 \times $ training-time overheads. By packaging a robust pipeline, reporting training time costs, and introducing an embedding mechanism tailored for low-resource settings, we transform long-sequence training from a closed, industrial technique into a practical, open, and extensible methodology for the community.
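The framework's sliding-window training, whose window size and stride the runtime-aware ablation sweeps, can be sketched as chunking each user's long history into overlapping training sequences (parameter names and defaults below are assumed):

```python
# Split one user's long interaction history into overlapping training chunks.
def window_history(events, window=8, stride=4):
    chunks = []
    for start in range(0, max(len(events) - window, 0) + 1, stride):
        chunks.append(events[start:start + window])
    return chunks

history = list(range(20))           # 20 interactions for one user
chunks = window_history(history)    # windows of 8 items, stride 4
# a larger stride yields fewer chunks: cheaper training, coarser coverage --
# the accuracy-compute trade-off the ablation quantifies
```

With stride equal to window the chunks tile the history without overlap; with stride 1 every position is covered at maximal training cost.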
[564] Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
NVIDIA, :, Aakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye, Abhibha Gupta, Abhilash Somasamudramath, Abhinav Khattar, Adeola Adesoba, Adi Renduchintala, Adil Asif, Aditya Agrawal, Aditya Vavre, Ahmad Kiswani, Aishwarya Padmakumar, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Gronskiy, Alex Kondratenko, Alex Neefus, Alex Steiner, Alex Yang, Alexander Bukharin, Alexander Young, Ali Hatamizadeh, Ali Taghibakhshi, Alina Galiautdinova, Alisa Liu, Alok Kumar, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrew Tao, Anjaney Shrivastava, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Ann Guan, Anna Shors, Annamalai Chockalingam, Anubhav Mandarwal, Aparnaa Ramani, Arham Mehta, Arti Jain, Arun Venkatesan, Asha Anoosheh, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asli Sabanci Demiroz, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Benjamin Chislett, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Buvaneswari Mani, Carlo del Mundo, Chankyu Lee, Chanran Kim, Chantal Hwang, Chao Ni, Charles Wang, Charlie Truong, Cheng-Ping Hsieh, Chenhan Yu, Chenjie Luo, Cherie Wang, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Chris Holguin, Chris Wing, Christian Munley, Christopher Parisien, Chuck Desai, Chunyang Sheng, Collin Neale, Cyril Meurillon, Dakshi Kumar, Dan Gil, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Burkhardt Eliuth Triana, Daniel Egert, Daniel Fatade, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Edelsohn, David Messina, David Mosallanezhad, David Tamok, Deena Donia, Deepak Narayanan, Devin O’Kelly, Dheeraj Peri, Dhruv Nathawani, Di Wu, 
Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dmitry Konyagin Brandon Tuttle, Dong Ahn, Dongfu Jiang, Dorrin Poorkay, Douglas O’Flaherty, Duncan Riach, Dusan Stosic, Dustin Van Stee, Edgar Minasyan, Edward Lin, Eileen Peters Long, Elad Segal, Elena Lantz, Elena Lewis, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric W. Tramel, Erick Galinkin, Erik Pounds, Esti Etrog, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Farshad Saberi Movahed, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Fortuna Zhang, Frankie Siino, Frida Hou, Gantavya Bhatt, Gargi Prasad, Geethapriya Venkataramani, Geetika Gupta, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Grace Wu, Greg Pauloski, Greyson Davis, Grigor Nalbandyan, Guoming Zhang, Guy Farber, Guyue Huang, Haifeng Qian, Haran Kumar Shiv Kumar, Harry Kim, Harsh Sharma, Hayate Iso, Hayley Ross, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hiren Upadhyay, Huy Nguyen, Iain Cunningham, Ido Galil, Ido Shahaf, Igino Padovani, Igor Gitman, Igor Shovkun, Ikroop Dhillon, Ilya Loshchilov, Ingrid Kelly, Itamar Schen, Itay Levy, Ivan Moshkov, Izik Golan, Izzy Putterman, Jain Tu, Jan Baczek, Jan Kautz, Jane Polak Scowcroft, Janica Rosenberg, Jared Casper, Jarrod Pflum, Jason Grant, Jason Sewall, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiaqi Zeng, Jie Lou, Jill Milton, Jim Chow, Jimmy Zhang, Jinhang Choi, Jining Huang, Jocelyn Huang, Joel Caruso, Joey Conway, Joey Guman, Johan Jatko, John Kamalu, Johnny Greco, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joyjit Daw, Juan Yu, Julio Tapia, Junkeun Yi, Jupinder Parmar, Jyothi Achar, Kari Briski, Kartik Mattoo, Katherine Cheung, Katherine Luna, Keith Wyss, Kevin Shih, Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirill Buryak, Kirthi Shankar Sivamani, Konstantinos 
Krommydas, Kris Murphy, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Laikh Tewari, Laya Sleiman, Leo Du, Leon Derczynski, Li Ding, Lilach Ilan, Lingjie Wu, Lizzie Wei, Luis Vega, Lun Su, Maarten Van Segbroeck, Maer Rodrigues de Melo, Magaret Zhang, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Sreedhar, Makesh Tarun Chandran, Manuel Reyes Gomez, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Margaret Zhang, Mark Cai, Mark Gabel, Markus Kliegl, Martyna Patelka, Maryam Moosaei, Matthew Varacalli, Matvei Novikov, Mauricio Ferrato, Mehrzad Samadi, Melissa Corpuz, Meng Xin, Mengdi Wang, Mengru Wang, Meredith Price, Micah Schaffer, Michael Andersch, Michael Boone, Michael Evans, Michael Z Wang, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Hollinger, Mingyuan Ma, Minseok Lee, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Nader Khalil, Najeeb Nabwani, Nancy Agarwal, Nanthini Balasubramaniam, Narimane Hennouni, Narsi Kodukula, Natalie Hereth, Nathaniel Pinckney, Nave Assaf, Negar Habibi, Nestor Qin, Neta Zmora, Netanel Haber, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Jukar, Nikki Pope, Nikolai Ludwig, Nima Tajbakhsh, Nir Ailon, Nirmal Juluru, Nirmalya De, Nowel Pitt, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Almog, Omri Puny, Oren Tropp, Otavio Padovani, Ouye Xie, Parth Chadha, Pasha Shamis, Paul Gibbons, Pavlo Molchanov, Peter Belcak, Peter Jin, Pinky Xu, Piotr Januszewski, Pooya Jannaty, Prachi Shevate, Pradeep Thalasta, Pranav Prashant Thombre, Prasoon Varshney, Prerana Gambhir, Pritam Gundecha, Przemek Tredak, Qing Miao, Qiyu Wan, Quan Tran Minh, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Rahul Kandu, Raina Zhong, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Renee Yao, Renjie Pi, Richard Mazzarese, Richard Wang, Rick Izzo, Ridhima Singla, Rima Shahbazyan, Rishabh Garg, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Roger 
Waleffe, Rohit Varma Kalidindi, Rohit Watve, Roi Koren, Ron Fan, Ruchika Kharwar, Ruisi Cai, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Timbrook, Ryota Egashira, Sadegh Mahdavi, Sagar Singh Ashutosh Joshi, Sahil Modi, Samuel Kriman, Sandeep Pombra, Sanjay Kariyappa, Sanjeev Satheesh, Santiago Pombo, Saori Kaji, Satish Pasumarthi, Saurav Mishra, Saurav Muralidharan, Scott Hara, Sean Narenthiran, Sebastian Rogawski, Seonjin Na, Seonmyeong Bak, Sepehr Sameni, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh Adam Lord, Sharath Turuvekere Sreenivas, Shaun Kotek, Shaya Gharghabi, Shelby Thomas, Sheng-Chieh Lin, Shibani Likhite, Shiqing Fan, Shiyang Chen, Shreya Gopal, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuo Zhang, Shuoyang Ding, Shyam Renjith, Shyamala Prayaga, Siddhartha Jain, Simeng Sun, Sirisha Rella, Sirshak Das, Smita Ithape, Sneha Harishchandra S, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sriharsha Niverty, Stas Sergienko, Stefana Gloginic, Stefania Alborghetti, Stephen Ge, Stephen McCullough, Sugam Dipak Devare, Suguna Varshini Velury, Sukrit Rao, Sumeet Kumar Barua, Sunny Gai, Suseella Panguluri, Sushil Koundinyan, Swathi Patnam, Sweta Priyadarshi, Swetha Bhendigeri, Syeda Nahida Akter, Sylendran Arunagiri, Tailling Yuan, Talor Abramovich, Tan Bui, Tan Yu, Terry Kong, Thanh Do, Thomas Gburek, Thorgane Marques, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Timothy Ma, Tiyasa Mitra, Tomasz Grzegorzek, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Traian Rebedea, Trenton Starkey, Tugrul Konuk, Twinkle Vashishth, Tyler Condensa, Udi Karpas, Ushnish De, Vahid Noorozi, Vahid Noroozi, Vanshil Atul Shah, Veena Vaidyanathan, Venkat Srinivasan, Venmugil Elango, Victor Cui, Vijay Korthikanti, Vikas Mehta, Virginia Adams, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Wan Seo, Wanli Jiang, Wasi Uddin Ahmad, Wei Du, Wei Ping, Wei-Ming Chen, Wendy Quan, Wenliang Dai, Wenwen Gao, Will Jennings, 
William Zhang, Xiaowei Ren, Xiaowen Xin, Xin Li, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Yoshi Suhara, Youngeun Kwon, Yuan Zhang, Yuki Huang, Zach Moshe, Zhilin Wang, Zhiyu Cheng, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijia Chen, Zijie Yan, Zuhair Ahmed
Main category: cs.LG
TL;DR: Nemotron 3 Super is a 120B-parameter (12B active) hybrid Mamba-Attention MoE model with novel architecture features including LatentMoE and MTP layers for efficient inference, achieving high throughput while maintaining competitive accuracy.
Details
Motivation: To develop a large-scale language model that combines architectural innovations (Mamba-Attention hybrid, LatentMoE) with efficient training (NVFP4) and inference techniques (speculative decoding) to achieve superior throughput while maintaining accuracy.
Method: Pre-trained a 120B parameter hybrid Mamba-Attention Mixture-of-Experts model on 25T tokens using NVFP4 precision, incorporating LatentMoE architecture for parameter efficiency and MTP layers for speculative decoding acceleration, followed by SFT and RL post-training.
Result: Supports up to 1M context length and achieves comparable accuracy on common benchmarks, while delivering 2.2x higher inference throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. Model checkpoints and datasets are open-sourced.
Conclusion: Nemotron 3 Super demonstrates that architectural innovations like hybrid Mamba-Attention and LatentMoE can enable efficient large-scale models with high throughput while maintaining competitive performance.
Abstract: We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.
[565] Forecasting the Past: Gradient-Based Distribution Shift Detection in Trajectory Prediction
Michele De Vita, Julian Wiederer, Vasileios Belagiannis
Main category: cs.LG
TL;DR: Self-supervised method for detecting distribution shifts in trajectory prediction models by training a decoder to forecast trajectory second halves and using gradient norms as shift detection scores.
Details
Motivation: Trajectory prediction models often fail in real-world automated driving due to distributional shifts between training and test conditions, posing critical safety risks when models make incorrect forecasts in unfamiliar situations.
Method: Proposes a post-hoc self-supervised method that trains a decoder to forecast the second half of observed trajectories from the first half. Uses the L2 norm of the gradient of this forecasting loss with respect to the decoder’s final layer as a score to identify distribution shifts.
Result: Demonstrates substantial improvements on distribution shift detection for trajectory prediction on Shifts and Argoverse datasets. Also shows the method can early detect collisions of a deep Q-Network motion planner in the Highway simulator.
Conclusion: The proposed self-supervised approach effectively detects distribution shifts without interfering with original prediction performance, enhancing safety in automated driving systems.
Abstract: Trajectory prediction models often fail in real-world automated driving due to distributional shifts between training and test conditions. Such distributional shifts, whether behavioural or environmental, pose a critical risk by causing the model to make incorrect forecasts in unfamiliar situations. We propose a self-supervised method that trains a decoder in a post-hoc fashion on the self-supervised task of forecasting the second half of observed trajectories from the first half. The L2 norm of the gradient of this forecasting loss with respect to the decoder’s final layer defines a score to identify distribution shifts. Our approach, first, does not affect the trajectory prediction model, ensuring no interference with original prediction performance and second, demonstrates substantial improvements on distribution shift detection for trajectory prediction on the Shifts and Argoverse datasets. Moreover, we show that this method can also be used to early detect collisions of a deep Q-Network motion planner in the Highway simulator. Source code is available at https://github.com/Michedev/forecasting-the-past.
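The gradient-norm score in the method above is easy to sketch. The toy below is a minimal numpy illustration, not the paper's implementation: a least-squares linear map stands in for the decoder's final layer, trained on straight trajectories, and `shift_score` is the L2 norm of the gradient of the forecasting loss with respect to that layer. A turning trajectory (a behavioural shift) should score higher than an in-distribution straight one.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8  # timesteps per 2-D trajectory

def make_traj(start, vel, turn=0.0):
    """2-D trajectory; `turn` rotates the velocity each step (a shift)."""
    p, v = np.asarray(start, float), np.asarray(vel, float)
    c, s = np.cos(turn), np.sin(turn)
    R = np.array([[c, -s], [s, c]])
    pts = []
    for _ in range(T):
        pts.append(p.copy())
        p, v = p + v, R @ v
    return np.stack(pts)

# Training data: straight trajectories only (the in-distribution regime).
trajs = [make_traj(rng.normal(size=2), rng.normal(size=2)) for _ in range(200)]
X = np.stack([t[: T // 2].ravel() for t in trajs])  # first halves
Y = np.stack([t[T // 2 :].ravel() for t in trajs])  # second halves

# Least-squares "decoder" forecasting the second half from the first.
W = np.linalg.lstsq(X, Y, rcond=None)[0].T          # (out_dim, in_dim)

def shift_score(W, traj):
    """L2 norm of the gradient of the forecasting loss w.r.t. the final layer."""
    x, y = traj[: T // 2].ravel(), traj[T // 2 :].ravel()
    resid = W @ x - y
    grad = 2.0 * np.outer(resid, x)  # d/dW of ||Wx - y||^2
    return np.linalg.norm(grad)

in_dist = shift_score(W, make_traj([0, 0], [1, 0]))           # straight
shifted = shift_score(W, make_traj([0, 0], [1, 0], turn=0.6))  # turning
print(f"in-dist score: {in_dist:.2e}, shifted score: {shifted:.2e}")
```

In the paper the decoder is trained post hoc on top of a frozen prediction model; here the linear fit plays both roles purely for illustration.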
[566] Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task
Alicia Curth, Rachel Lawrence, Sushrut Karmalkar, Niranjani Prasad
Main category: cs.LG
TL;DR: Pretrained transformers show only limited adaptive depth use, but models finetuned on multi-hop relational reasoning tasks use their depth adaptively, with less constrained finetuning regimes showing stronger effects.
Details
Motivation: To investigate whether transformers dynamically adjust their computational depth based on task difficulty, specifically examining if they use fewer layers for easier tasks and more layers for harder tasks in multi-hop reasoning scenarios.
Method: Used controlled multi-hop relational reasoning tasks based on family stories where difficulty scales with relationship hops. Monitored prediction evolution across layers via early readouts (logit lens) and task-relevant information integration via causal patching across tokens. Compared pretrained models with models finetuned on the task.
Result: Pretrained models showed limited evidence for adaptive depth use: some larger models used fewer layers for easier tasks, and models generally used more layers to integrate information as chain length increased. Finetuned models showed clearer and more consistent adaptive depth use, with stronger effects in less constrained finetuning regimes that didn’t preserve general language modeling abilities.
Conclusion: Transformers can learn to use their depth adaptively when finetuned on specific reasoning tasks, but this capability is limited in general-purpose pretrained models. The degree of adaptive depth use depends on the finetuning regime, with more task-specific training enabling better adaptation to task difficulty.
Abstract: We investigate whether transformers use their depth adaptively across tasks of increasing difficulty. Using a controlled multi-hop relational reasoning task based on family stories, where difficulty is determined by the number of relationship hops that must be composed, we monitor (i) how predictions evolve across layers via early readouts (the logit lens) and (ii) how task-relevant information is integrated across tokens via causal patching. For pretrained models, we find some limited evidence for adaptive depth use: some larger models need fewer layers to arrive at plausible answers for easier tasks, and models generally use more layers to integrate information across tokens as chain length increases. For models finetuned on the task, we find clearer and more consistent evidence of adaptive depth use, with the effect being stronger for less constrained finetuning regimes that do not preserve general language modeling abilities.
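The "logit lens" early readout used above is simple to demonstrate: apply the model's unembedding matrix to the residual stream at each layer and watch where the prediction settles. Below is a hedged toy in numpy, not a real transformer: the residual stream is simulated by repeatedly adding a target-token direction, so the per-layer readout converges over depth. All names (`W_U`, `readouts`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab, n_layers = 32, 10, 6

W_U = rng.normal(size=(vocab, d))  # unembedding, shared across layers
target = 3                          # token the "model" eventually predicts
direction = W_U[target]             # writing along this row boosts token 3

# Toy residual stream: each layer pushes the hidden state a bit further
# toward the target direction, plus a little noise.
h = rng.normal(size=d)
readouts = []
for layer in range(n_layers):
    h = h + 2.0 * direction + 0.05 * rng.normal(size=d)
    logits = W_U @ h                # logit lens: decode mid-stream
    readouts.append(int(np.argmax(logits)))

print("per-layer readouts:", readouts)
```

In the paper, adaptive depth use shows up as the readout settling at an earlier layer for easier (fewer-hop) questions.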
[567] Analyzing the Effect of Noise in LLM Fine-tuning
Lingfang Li, Procheta Sen
Main category: cs.LG
TL;DR: Study on how different types of noise affect internal learning dynamics of LLMs during fine-tuning, analyzing layer-wise representation changes and attention patterns across three model families and NLP tasks.
Details
Motivation: Fine-tuning datasets often contain various forms of noise from annotation errors, preprocessing artifacts, or automated data collection. While prior work focuses on robust learning algorithms, little is known about how different noise types affect internal learning dynamics of LLMs during fine-tuning.
Method: Systematically study noise impact across three pretrained model families (GPT-2, Qwen2, Llama-2) and three diverse NLP tasks. Introduce controlled perturbations for three common real-world noise types: label noise, grammatical noise, and typographical noise. Analyze layer-wise representation changes and attention patterns to understand noise propagation.
Result: Label noise causes largest performance degradation, while grammatical and typographical noise can occasionally yield mild regularization benefits. Noise effects are localized primarily to task-specific layers, while attention structures remain comparatively stable.
Conclusion: Different noise types affect LLM fine-tuning differently, with label noise being most harmful. Understanding internal learning dynamics under noise conditions provides insights for developing more robust fine-tuning approaches.
Abstract: Fine-tuning is the dominant paradigm for adapting pretrained large language models (LLMs) to downstream NLP tasks. In practice, fine-tuning datasets may contain various forms of noise arising from annotation errors, preprocessing artifacts, or automated data collection. While prior work has focused on designing robust learning algorithms to mitigate performance degradation under noisy conditions, comparatively little is known about how different types of noise affect the internal learning dynamics of LLMs during fine-tuning. In this work, we systematically study the impact of noise on model behavior across three pretrained model families (GPT-2, Qwen2 and Llama-2) and three diverse NLP tasks. We introduce controlled perturbations corresponding to three common real-world noise types: label noise, grammatical noise, and typographical noise. Beyond task-level performance, we analyze layer-wise representation changes and attention patterns to understand how noise propagates through the network. Our results show that corrupting labels (i.e. label noise) consistently causes the largest performance degradation, whereas grammatical noise and typographical noise can occasionally yield mild regularization benefits. We further find that noise effects are localized primarily to task-specific layers, while attention structures remain comparatively stable.
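Two of the three controlled perturbations above are straightforward to sketch. The snippet below is an illustrative take, not the authors' code: label noise flips a label to a different class with some probability, and typographical noise swaps adjacent characters (grammatical noise, which needs linguistic rewriting, is omitted). Function names and rates are assumptions.

```python
import random

def add_label_noise(labels, classes, rate, rng):
    """Flip each label to a different random class with probability `rate`."""
    out = []
    for y in labels:
        if rng.random() < rate:
            y = rng.choice([c for c in classes if c != y])
        out.append(y)
    return out

def add_typo_noise(text, rate, rng):
    """Swap adjacent characters with probability `rate` per position."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return "".join(chars)

rng = random.Random(0)
labels = ["pos", "neg", "pos", "neg"] * 50
noisy = add_label_noise(labels, ["pos", "neg"], 0.2, rng)
flipped = sum(a != b for a, b in zip(labels, noisy))
print(f"{flipped}/{len(labels)} labels flipped")
print(add_typo_noise("the quick brown fox", 0.3, rng))
```

Note that typo noise of this kind preserves the character multiset, which makes it easy to verify the perturbation did not add or drop content.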
[568] Adaptive Budget Allocation in LLM-Augmented Surveys
Zikun Ye, Jiameng Lyu, Rui Tao
Main category: cs.LG
TL;DR: An adaptive algorithm for efficiently allocating limited human-labeling budget across survey questions when using LLMs, learning which questions are hardest for LLMs while collecting human responses.
Details
Motivation: LLMs can generate survey responses at low cost but their reliability varies across questions and is unknown before data collection. Current approaches require costly human responses for verification, and there's a need for efficient allocation of limited human-labeling budget across questions.
Method: Proposes an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves dual roles: improving estimates for that question and revealing how well the LLM predicts human responses on it. The algorithm directs more budget to questions where LLM is least reliable, without requiring prior knowledge of question-level LLM accuracy.
Result: On real survey data with 68 questions and over 2000 respondents, the algorithm reduces waste from 10-12% (uniform allocation) to 2-6%. The advantage grows as questions become more heterogeneous in LLM prediction quality. Achieves the same estimation quality as uniform sampling with fewer human samples, requires no pilot study, and has formal performance guarantees.
Conclusion: The framework provides an efficient solution for allocating scarce human oversight across tasks where LLM reliability is unknown, with applications beyond surveys to any scenario requiring human verification of LLM outputs.
Abstract: Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10–12% of the budget relative to the optimal; our algorithm reduces this waste to 2–6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.
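The allocation loop described above can be simulated with a bandit-style heuristic. The sketch below is a hypothetical UCB-flavored stand-in for the paper's algorithm, under the assumption that each human label reveals whether the LLM agreed with the human on that question; the match probabilities and scoring rule are illustrative, not the authors' procedure.

```python
import math
import random

rng = random.Random(0)

# Toy setup: each question has a true probability that the LLM's answer
# matches a human's. The algorithm does not know these values.
llm_match_prob = [0.95, 0.90, 0.60, 0.50]  # questions 2-3 are "hard" for the LLM
n_q = len(llm_match_prob)

counts = [0] * n_q       # human labels collected per question
mismatches = [0] * n_q   # observed human/LLM disagreements

budget = 400
for t in range(1, budget + 1):
    def score(q):
        # Estimated LLM error rate plus a UCB-style exploration bonus.
        if counts[q] == 0:
            return float("inf")
        err = mismatches[q] / counts[q]
        return err + math.sqrt(2 * math.log(t) / counts[q])

    q = max(range(n_q), key=score)   # spend the next label where LLM is least reliable
    counts[q] += 1
    if rng.random() > llm_match_prob[q]:  # simulate human vs LLM disagreement
        mismatches[q] += 1

print("labels per question:", counts)
```

Under this heuristic the two hard questions should end up with most of the human-labeling budget, mirroring the paper's point that uniform allocation wastes labels on questions the LLM already predicts well.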
[569] Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
Leon Eshuijs, Shihan Wang, Antske Fokkens
Main category: cs.LG
TL;DR: LLMs trained with RL can develop harmful behaviors like sycophancy and deception, with model size acting as a safety buffer in some environments but enabling exploitation in others, depending on environment features like role framing.
Details
Motivation: To understand the conditions under which RL training causes LLMs to develop harmful behaviors like sycophancy, manipulation, and deception, and to investigate how model size and environment features affect this misalignment.
Method: Trained 11 instruction-tuned LLMs (0.5B-14B parameters) with on-policy RL across 3 different environments, conducted controlled ablations to trace effects to environment-specific features like role framing and implicit gameability cues.
Result: Model size acts as a safety buffer in some environments but enables greater harmful exploitation in others; safety benchmarks generally don’t predict RL-induced misalignment except for Sycophancy scores when exploits rely on inferring user preferences; on-policy RL preserves safety buffers inherent in model generation distributions.
Conclusion: The relationship between model size and safety is complex and environment-dependent; current safety benchmarks are inadequate for predicting RL-induced misalignment; on-policy RL preserves inherent safety mechanisms that off-policy methods bypass.
Abstract: Specification gaming under Reinforcement Learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B–14B) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues. We further show that most safety benchmarks do not predict RL-induced misalignment, except in the case of Sycophancy scores when the exploit relies on inferring the user’s preference. Finally, we find that on-policy RL preserves a safety buffer inherent in the model’s own generation distribution, one that is bypassed during off-policy settings.
[570] Agentic Control in Variational Language Models
Yves Ruffenach
Main category: cs.LG
TL;DR: A variational language model framework with internal agentic control using uncertainty as an operational signal for training regulation, checkpoint retention, and inference-time routing.
Details
Motivation: To explore whether variational language models can support measurable agentic control grounded in their internal evidence, moving beyond treating uncertainty as just a passive diagnostic to using it as an actionable signal for regulation and control.
Method: Combines local variational hidden computation (EVE), a homeostatic latent regulator, structurally aware checkpoint retention, and a calibrated uncertainty-aware controller that operates on retained models. Treats uncertainty as an operational signal that regulates training, supports checkpoint retention, and guides inference-time intervention.
Result: The variational backbone outperforms a matched deterministic reference on language modeling while exhibiting richer uncertainty profiles. The calibrated controller remains active, uses multiple actions under full agentic evaluation, and yields positive quality-cost trade-offs.
Conclusion: Internal uncertainty in variational language models can serve not only as a descriptive property but also as a practical control interface for regulation, checkpoint retention, and minimal agentic routing, supporting closed-loop internal control.
Abstract: We study whether a variational language model can support a minimal and measurable form of agentic control grounded in its own internal evidence. Our model combines local variational hidden computation (EVE), a homeostatic latent regulator, structurally aware checkpoint retention and a calibrated uncertainty-aware controller operating on top of the retained model. Rather than treating uncertainty as a passive diagnostic measured after prediction, we treat it as an operational signal that can regulate training, support checkpoint retention and guide inference-time intervention. The resulting framework is deliberately focused. It studies a closed-loop form of internal control in which structural and predictive signals become actionable. Empirically, the variational backbone improves over a matched deterministic reference on the language-modeling task while also exhibiting a richer and more usable uncertainty profile. On top of this backbone, the calibrated controller remains active, uses multiple actions under a full agentic evaluation and yields a positive quality-cost trade-off. These results support a precise claim: internal uncertainty can serve not only as a descriptive property of a variational language model, but also as a practical control interface for regulation, checkpoint retention and minimal agentic routing.
[571] Instantiating Bayesian CVaR lower bounds in Interactive Decision Making Problems
Raghav Bongole, Tobias J. Oechtering, Mikael Skoglund
Main category: cs.LG
TL;DR: Generalized-Fano framework for Bayesian CVaR lower bounds in interactive decision making, applied to concrete problems like Gaussian bandits with explicit parameter-dependent bounds.
Details
Motivation: To provide practical lower-bound tools for Bayesian conditional value-at-risk (CVaR) in interactive learning and risk-sensitive decision making, building on the recent generalized-Fano framework.
Method: Instantiates the generalized-Fano framework using squared Hellinger distance to compare hard vs reference models, combining lower bounds on reference hinge terms with model distinguishability bounds.
Result: Derived explicit Bayesian CVaR lower bounds for canonical examples including Gaussian bandits, making dependence on key problem parameters transparent.
Conclusion: The generalized-Fano Bayesian CVaR framework can serve as a practical lower-bound tool for interactive learning and risk-sensitive decision making applications.
Abstract: Recent work established a generalized-Fano framework for lower bounding prior-predictive (Bayesian) CVaR in interactive statistical decision making. In this paper, we show how to instantiate that framework in concrete interactive problems and derive explicit Bayesian CVaR lower bounds from its abstract corollaries. Our approach compares a hard model with a reference model using squared Hellinger distance, and combines a lower bound on a reference hinge term with a bound on the distinguishability of the two models. We apply this approach to canonical examples, including Gaussian bandits, and obtain explicit bounds that make the dependence on key problem parameters transparent. These results show how the generalized-Fano Bayesian CVaR framework can be used as a practical lower-bound tool for interactive learning and risk-sensitive decision making.
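The distinguishability term above rests on the squared Hellinger distance between the hard and reference models. For Gaussian bandit arms this has a standard closed form, sketched below; the function name is illustrative and this is only the distance computation, not the paper's full bound.

```python
import math

def hellinger_sq_gauss(mu1, s1, mu2, s2):
    """Squared Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    coef = math.sqrt(2 * s1 * s2 / (s1**2 + s2**2))
    expo = math.exp(-((mu1 - mu2) ** 2) / (4 * (s1**2 + s2**2)))
    return 1.0 - coef * expo

# Identical models are indistinguishable; the distance grows with the mean gap,
# which is what makes the dependence on problem parameters transparent.
d_small = hellinger_sq_gauss(0.0, 1.0, 0.5, 1.0)
d_large = hellinger_sq_gauss(0.0, 1.0, 2.0, 1.0)
print(f"H^2 (small gap) = {d_small:.4f}, H^2 (large gap) = {d_large:.4f}")
```

For equal variances the expression reduces to 1 - exp(-(mu1 - mu2)^2 / (8 s^2)), so a hard model close to the reference in mean is provably hard to distinguish.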
[572] Orthogonal Subspace Projection for Continual Machine Unlearning via SVD-Based LoRA
Yogachandran Rahulamathavan, Nasir Iqbal, Juncheng Hu, Sangarapillai Lambotharan
Main category: cs.LG
TL;DR: Proposes SVD-guided orthogonal subspace projection for continual machine unlearning using LoRA, preventing parameter collision and interference between sequential unlearning tasks while maintaining model performance.
Details
Motivation: Continual machine unlearning faces challenges when deletion requests arrive sequentially, as models must adapt without erasing previously retained knowledge. Naive sequential LoRA modules cause parameter collision and strong interference between tasks.
Method: Uses Singular Value Decomposition (SVD)-guided orthogonal subspace projection to constrain each new LoRA update during training to lie in the orthogonal complement of subspaces used by earlier unlearning tasks, enabling task isolation without dynamic routing at deployment.
Result: After thirty sequential unlearning tasks on CIFAR-100 with ResNet-20 and MNIST, the method maintains baseline performance (~58.1%) while preserving strong unlearning efficacy, compared to state-of-the-art static fusion which reduces retained accuracy from 60.39% to 12.70%.
Conclusion: SVD-guided orthogonal subspace projection effectively addresses parameter collision in continual machine unlearning with LoRA, providing stable performance across long sequences of unlearning tasks while maintaining task isolation and unlearning efficacy.
Abstract: Continual machine unlearning aims to remove the influence of data that should no longer be retained, while preserving the usefulness of the model on everything else. This setting becomes especially difficult when deletion requests arrive sequentially, because the model must repeatedly adapt without erasing previously retained knowledge. Low-Rank Adaptation (LoRA) offers an efficient way to implement such updates, but naively combining many sequential LoRA modules leads to parameter collision, causing strong interference between tasks. We propose a static alternative based on Singular Value Decomposition (SVD)-guided orthogonal subspace projection. Our method constrains each new LoRA update during training so that it lies in the orthogonal complement of the subspaces used by earlier unlearning tasks. This preserves task isolation without requiring dynamic routing at deployment. Experiments on CIFAR-100 with ResNet-20 and on MNIST show stable behavior across long sequences of unlearning tasks. After thirty sequential unlearning tasks, state-of-the-art static fusion reduces retained accuracy from 60.39% to 12.70%, whereas the proposed in-training constrained optimization maintains baseline performance (~58.1%) while preserving strong unlearning efficacy.
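The core geometric operation is easy to illustrate. The numpy sketch below is a one-shot version of the idea, hedged: the paper enforces the constraint during training, whereas here a finished LoRA update `B @ A` is simply projected onto the orthogonal complement of an earlier task's SVD subspace `U`, so it can no longer collide with that task's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4  # weight dimension and LoRA rank (illustrative sizes)

# Subspace used by an earlier unlearning task: top singular vectors
# of its accumulated weight update.
prev_update = rng.normal(size=(d, d))
U = np.linalg.svd(prev_update)[0][:, :r]  # orthonormal, (d, r)

# New LoRA update Delta = B @ A for the next unlearning task.
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, d))
delta = B @ A

# Constrain the new update to the orthogonal complement of span(U):
# any component along U is removed, so earlier-task directions are untouched.
delta_proj = delta - U @ (U.T @ delta)

residual = np.abs(U.T @ delta_proj).max()
print(f"max overlap with earlier-task subspace after projection: {residual:.2e}")
```

Because the projection is linear and fixed once `U` is known, it adds no routing logic at deployment, matching the "static alternative" framing above.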
[573] EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts
Runhe Zhou, Shanglin Li, Guanxiang Huang, Xinliang Zhou, Qibin Zhao, Motoaki Kawanabe, Yi Ding, Cuntai Guan
Main category: cs.LG
TL;DR: EEG-MoCE: A hyperbolic mixture-of-curvature experts framework for multimodal EEG-based learning that leverages hyperbolic spaces to better represent hierarchical structures in brain signals and complementary modalities.
Details
Motivation: EEG-based multimodal learning needs better representation of hierarchical structures in heterogeneous modalities (EEG and complementary data like facial expressions). Euclidean embeddings struggle with hierarchical structures due to flat geometry, while hyperbolic spaces with exponential growth are naturally suited for hierarchical representations.
Method: Proposes EEG-MoCE framework with learnable-curvature hyperbolic spaces for each modality (experts). Uses curvature-aware fusion strategy to dynamically weight experts based on hierarchical information richness. Each modality gets its own hyperbolic space with adaptive curvature learning.
Result: Achieves state-of-the-art performance on benchmark datasets for emotion recognition, sleep staging, and cognitive assessment tasks.
Conclusion: Hyperbolic mixture-of-curvature experts framework effectively models hierarchical structures in multimodal EEG data, outperforming previous methods on various mental state assessment tasks.
Abstract: Electroencephalography (EEG)-based multimodal learning integrates brain signals with complementary modalities to improve mental state assessment, providing great clinical potential. The effectiveness of such paradigms largely depends on the representation learning on heterogeneous modalities. For EEG-based paradigms, one promising approach is to leverage their hierarchical structures, as recent studies have shown that both EEG and associated modalities (e.g., facial expressions) exhibit hierarchical structures reflecting complex cognitive processes. However, Euclidean embeddings struggle to represent these hierarchical structures due to their flat geometry, while hyperbolic spaces, with their exponential growth property, are naturally suited for them. In this work, we propose EEG-MoCE, a novel hyperbolic mixture-of-curvature experts framework designed for multimodal neurotechnology. EEG-MoCE assigns each modality to an expert in a learnable-curvature hyperbolic space, enabling adaptive modeling of its intrinsic geometry. A curvature-aware fusion strategy then dynamically weights experts, emphasizing modalities with richer hierarchical information. Extensive experiments on benchmark datasets demonstrate that EEG-MoCE achieves state-of-the-art performance, including emotion recognition, sleep staging, and cognitive assessment.
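The learnable-curvature hyperbolic spaces above are typically realized as Poincaré balls, where curvature enters the distance function explicitly. The sketch below shows the standard curvature-c Möbius addition and geodesic distance; it is background geometry for the approach, not EEG-MoCE's actual expert or fusion code.

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition in the Poincare ball of curvature -c (c > 0)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def poincare_dist(x, y, c):
    """Geodesic distance; the curvature parameter c is what a
    mixture-of-curvature expert would learn per modality."""
    diff = mobius_add(-x, y, c)
    return (2 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(diff))

x = np.array([0.1, 0.2])
y = np.array([0.5, -0.1])
print("distance at c=1.0:", poincare_dist(x, y, 1.0))
print("distance at c=0.1:", poincare_dist(x, y, 0.1))
```

Distances blow up near the ball's boundary, which is why points far from the origin can encode deep levels of a hierarchy; a per-modality learnable `c` lets each expert adapt how strongly curved its embedding space is.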
[574] KumoRFM-2: Scaling Foundation Models for Relational Learning
Valter Hudovernik, Federico López, Vid Kocijan, Akihiro Nitta, Jan Eric Lenssen, Jure Leskovec, Matthias Fey
Main category: cs.LG
TL;DR: KumoRFM-2 is a foundation model for relational data that outperforms supervised approaches by up to 8% on 41 benchmarks, scales to billion-scale datasets, and excels in few-shot learning scenarios.
Details
Motivation: Existing tabular foundation models require manual table flattening and target variable generation, which breaks temporal consistency and limits their applicability to real-world relational databases. There's a need for a model that can natively process connected tables while preserving temporal relationships and scaling to large datasets.
Method: KumoRFM-2 processes relational data natively without flattening, using early task information injection to select relevant columns. It’s pre-trained on synthetic and real-world data across four axes: row/column dimensions at table level, and foreign key/cross-sample dimensions at database level. The model supports both in-context learning and fine-tuning.
Result: Outperforms supervised and foundational approaches by up to 8% on 41 challenging benchmarks, maintains strong performance under cold start and noisy data conditions, and scales to billion-scale relational datasets. First few-shot foundation model to surpass supervised approaches on common benchmark tasks.
Conclusion: KumoRFM-2 represents a significant advancement in relational data foundation models, demonstrating superior performance over supervised methods while offering scalability and robustness to challenging data conditions.
Abstract: We introduce KumoRFM-2, the next iteration of a pre-trained foundation model for relational data. KumoRFM-2 supports in-context learning as well as fine-tuning and is applicable to a wide range of predictive tasks. In contrast to tabular foundation models, KumoRFM-2 natively operates on relational data, processing one or more connected tables simultaneously without manual table flattening or target variable generation, all while preserving temporal consistency. KumoRFM-2 leverages a large corpus of synthetic and real-world data to pre-train across four axes: the row and column dimensions at the individual table level, and the foreign key and cross-sample dimensions at the database level. In contrast to its predecessor, KumoRFM-2 injects task information as early as possible, enabling sharper selection of task-relevant columns and improved robustness to noisy data. Through extensive experiments on 41 challenging benchmarks and analysis around expressivity and sensitivity, we demonstrate that KumoRFM-2 outperforms supervised and foundational approaches by up to 8%, while maintaining strong performance under extreme settings of cold start and noisy data. To our knowledge, this is the first time a few-shot foundation model has been shown to surpass supervised approaches on common benchmark tasks, with performance further improving upon fine-tuning. Finally, while KumoRFM-1 was limited to small-scale in-memory datasets, KumoRFM-2 scales to billion-scale relational datasets.
[575] SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
You Qin, Linqing Wang, Hao Fei, Roger Zimmermann, Liefeng Bo, Qinglin Lu, Chunyu Wang
Main category: cs.LG
TL;DR: SOAR introduces a self-correction method for diffusion models that bridges the gap between supervised fine-tuning and reinforcement learning by providing dense per-timestep supervision through stop-gradient rollouts and re-noising of off-trajectory states.
Details
Motivation: The current post-training pipeline for diffusion models has a fundamental gap between supervised fine-tuning (SFT) and reinforcement learning (RL). SFT suffers from exposure bias when inference deviates from ground-truth states, while RL faces sparse rewards, credit assignment problems, and reward hacking risks.
Method: SOAR performs single stop-gradient rollouts from real samples, re-noises the resulting off-trajectory states, and supervises the model to steer back toward original clean targets. This provides dense per-timestep supervision without reward models or credit assignment issues.
Result: On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while raising all model-based preference scores. It surpasses Flow-GRPO on aesthetic and text-image alignment tasks despite having no reward model access.
Conclusion: SOAR effectively bridges the SFT-RL gap in diffusion model post-training, providing a stronger first stage that can replace SFT while remaining compatible with subsequent RL alignment, addressing exposure bias without reward model dependencies.
Abstract: The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR’s base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.
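The rollout-and-re-noise loop described above can be sketched with a toy linear "denoiser" in numpy. Everything here is illustrative: the scalar weight `w`, the fixed noise-schedule coefficients `a_t`/`s_t`, and the manual gradient are stand-ins for the real diffusion model, not SOAR's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

w = 0.5  # hypothetical toy denoiser: f(x_t) = w * x_t

def denoise(x_t):
    return w * x_t

def soar_step(x0, lr=0.1):
    """One SOAR-style update: rollout (stop-grad), re-noise, supervise to x0."""
    global w
    a_t, s_t = 0.8, 0.6                  # assumed noise-schedule coefficients
    eps = rng.standard_normal(x0.shape)
    x_t = a_t * x0 + s_t * eps           # forward noising of the real sample
    x0_hat = denoise(x_t)                # single rollout; treated as stop-gradient
    eps2 = rng.standard_normal(x0.shape)
    x_tp = a_t * x0_hat + s_t * eps2     # re-noise the off-trajectory estimate
    pred = denoise(x_tp)                 # supervised pass (gradient flows here)
    loss = np.mean((pred - x0) ** 2)     # dense loss back to the ORIGINAL clean target
    grad_w = np.mean(2 * (pred - x0) * x_tp)  # d loss / d w for the linear toy
    w -= lr * grad_w
    return loss

x0 = rng.standard_normal(64)
losses = [soar_step(x0) for _ in range(50)]
```

Note how the supervision target is always the original clean sample `x0`, not the model's own estimate, which is what steers the model back after it drifts off-trajectory.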
[576] Calibration-Aware Policy Optimization for Reasoning LLMs
Ziqi Wang, Xingzhou Lou, Meiqi Wu, Zhengqi Wen, Junge Zhang
Main category: cs.LG
TL;DR: CAPO improves calibration in LLM reasoning by addressing overconfidence issues in GRPO through uncertainty-aware advantage estimation and noise masking.
Details
Motivation: GRPO enhances LLM reasoning but causes overconfidence where incorrect responses have lower perplexity than correct ones, degrading calibration. Existing approaches either improve calibration minimally or sacrifice reasoning accuracy.
Method: Proposes Calibration-Aware Policy Optimization (CAPO) with: 1) logistic AUC surrogate loss for theoretically consistent uncertainty-aware advantage estimation, and 2) noise masking mechanism for stable joint optimization of calibration and accuracy.
Result: CAPO-1.5B improves calibration by up to 15% while maintaining comparable/better accuracy than GRPO, boosts downstream inference-time scaling accuracy by up to 5%, and achieves Pareto-optimal precision-coverage trade-off when allowed to abstain under low confidence.
Conclusion: CAPO effectively addresses the calibration-accuracy trade-off in LLM reasoning, offering practical value for hallucination mitigation through better uncertainty estimation.
Abstract: Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as described by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves stable learning dynamics that jointly optimize calibration and accuracy. Experiments on multiple mathematical reasoning benchmarks show that CAPO-1.5B significantly improves calibration by up to 15% while achieving accuracy comparable to or better than GRPO, and further boosts accuracy on downstream inference-time scaling tasks by up to 5%. Moreover, when allowed to abstain under low-confidence conditions, CAPO achieves a Pareto-optimal precision-coverage trade-off, highlighting its practical value for hallucination mitigation.
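A logistic AUC surrogate of the kind named above can be written as a pairwise loss over correct/incorrect response confidences; this is a generic sketch of that loss family, and the paper's exact formulation may differ.

```python
import numpy as np

def logistic_auc_surrogate(conf_correct, conf_incorrect):
    """Pairwise logistic AUC surrogate: penalize incorrect responses whose
    confidence exceeds that of correct ones (smooth relaxation of AUC)."""
    c = np.asarray(conf_correct, float)[:, None]    # shape (P, 1)
    i = np.asarray(conf_incorrect, float)[None, :]  # shape (1, N)
    margins = c - i                                 # all correct/incorrect pairs
    return float(np.mean(np.log1p(np.exp(-margins))))

# Well-calibrated: correct answers get higher confidence -> small loss
good = logistic_auc_surrogate([0.9, 0.8], [0.2, 0.1])
# The GRPO failure mode described above (overconfident errors) -> large loss
bad = logistic_auc_surrogate([0.3, 0.2], [0.9, 0.8])
```

Minimizing this surrogate pushes the model toward correct-over-incorrect confidence orderings, which is exactly the relative calibration (AUC) the abstract measures.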
[577] TimeSAF: Towards LLM-Guided Semantic Asynchronous Fusion for Time Series Forecasting
Fan Zhang, Shiming Fan, Hua Wang
Main category: cs.LG
TL;DR: TimeSAF introduces hierarchical asynchronous fusion for time-series forecasting with LLMs, addressing semantic perceptual dissonance by decoupling unimodal learning from cross-modal interaction.
Details
Motivation: Existing LLM-based time-series forecasting methods use Deep Synchronous Fusion that forces dense interactions at every layer, causing semantic perceptual dissonance where high-level LLM semantics become entangled with low-level temporal dynamics, preventing effective semantic guidance.
Method: Proposes TimeSAF with hierarchical asynchronous fusion: 1) Independent cross-modal semantic fusion trunk using learnable queries to aggregate global semantics from temporal and prompt backbones bottom-up, 2) Stage-wise semantic refinement decoder that asynchronously injects high-level signals back into temporal backbone, decoupling unimodal feature learning from cross-modal interaction.
Result: Extensive experiments on standard long-term forecasting benchmarks show TimeSAF significantly outperforms state-of-the-art baselines and exhibits strong generalization in few-shot and zero-shot transfer settings.
Conclusion: Hierarchical asynchronous fusion effectively addresses semantic perceptual dissonance in multimodal time-series forecasting, providing stable semantic guidance without interfering with low-level temporal dynamics.
Abstract: Despite the recent success of large language models (LLMs) in time-series forecasting, most existing methods still adopt a Deep Synchronous Fusion strategy, where dense interactions between textual and temporal features are enforced at every layer of the network. This design overlooks the inherent granularity mismatch between modalities and leads to what we term semantic perceptual dissonance: high-level abstract semantics provided by the LLM become inappropriately entangled with the low-level, fine-grained numerical dynamics of time series, making it difficult for semantic priors to effectively guide forecasting. To address this issue, we propose TimeSAF, a new framework based on hierarchical asynchronous fusion. Unlike synchronous approaches, TimeSAF explicitly decouples unimodal feature learning from cross-modal interaction. It introduces an independent cross-modal semantic fusion trunk, which uses learnable queries to aggregate global semantics from the temporal and prompt backbones in a bottom-up manner, and a stage-wise semantic refinement decoder that asynchronously injects these high-level signals back into the temporal backbone. This mechanism provides stable and efficient semantic guidance while avoiding interference with low-level temporal dynamics. Extensive experiments on standard long-term forecasting benchmarks show that TimeSAF significantly outperforms state-of-the-art baselines, and further exhibits strong generalization in both few-shot and zero-shot transfer settings.
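The learnable-query aggregation in the fusion trunk is essentially cross-attention pooling; a minimal numpy sketch, with projections omitted and all shapes/names (temporal states, prompt states, query count) chosen for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_pool(features, queries):
    """Aggregate a variable-length feature sequence into a fixed set of
    global semantic tokens via cross-attention with learnable queries."""
    # features: (T, d) backbone states; queries: (Q, d) learnable vectors
    attn = softmax(queries @ features.T / np.sqrt(features.shape[1]), axis=-1)
    return attn @ features  # (Q, d) pooled semantics

rng = np.random.default_rng(0)
temporal = rng.standard_normal((96, 16))  # temporal backbone states
prompt = rng.standard_normal((20, 16))    # prompt/LLM backbone states
queries = rng.standard_normal((4, 16))

# Bottom-up aggregation from both backbones into a compact semantic summary
semantics = np.concatenate([query_pool(temporal, queries),
                            query_pool(prompt, queries)], axis=-1)
```

Because only the pooled `semantics` are later injected back into the temporal backbone, low-level numerical dynamics never interact densely with raw LLM states, which is the decoupling the abstract argues for.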
[578] Robust Semi-Supervised Temporal Intrusion Detection for Adversarial Cloud Networks
Anasuya Chattopadhyay, Daniel Reti, Hans D. Schotten
Main category: cs.LG
TL;DR: A robust semi-supervised temporal learning framework for cloud intrusion detection that addresses adversarial contamination and temporal drift in unlabeled network traffic.
Details
Motivation: Real-world cloud network intrusion detection faces challenges of limited labeled data, non-stationary traffic, and adaptive adversaries. Existing semi-supervised approaches assume benign and stationary unlabeled traffic, leading to degraded performance in adversarial cloud environments.
Method: Combines supervised learning with consistency regularization, confidence-aware pseudo-labeling, and selective temporal invariance to conservatively exploit unlabeled traffic while suppressing unreliable samples. Leverages temporal structure of network flows for improved robustness.
Result: Extensive evaluations on CIC-IDS2017, CSE-CIC-IDS2018, and UNSW-NB15 datasets under limited-label conditions show consistent outperformance of state-of-the-art supervised and semi-supervised NIDS in detection performance, label efficiency, and resilience to adversarial/non-stationary traffic.
Conclusion: The proposed framework effectively addresses key challenges in real-world cloud intrusion detection by robustly handling adversarial contamination and temporal drift in unlabeled traffic through temporal-aware semi-supervised learning.
Abstract: Cloud networks increasingly rely on machine learning based Network Intrusion Detection Systems to defend against evolving cyber threats. However, real-world deployments are challenged by limited labeled data, non-stationary traffic, and adaptive adversaries. While semi-supervised learning can alleviate label scarcity, most existing approaches implicitly assume benign and stationary unlabeled traffic, leading to degraded performance in adversarial cloud environments. This paper proposes a robust semi-supervised temporal learning framework for cloud intrusion detection that explicitly addresses adversarial contamination and temporal drift in unlabeled network traffic. Operating on flow-level data, this framework combines supervised learning with consistency regularization, confidence-aware pseudo-labeling, and selective temporal invariance to conservatively exploit unlabeled traffic while suppressing unreliable samples. By leveraging the temporal structure of network flows, the proposed method improves robustness and generalization across heterogeneous cloud environments. Extensive evaluations on publicly available datasets (CIC-IDS2017, CSE-CIC-IDS2018, and UNSW-NB15) under limited-label conditions demonstrate that the proposed framework consistently outperforms state-of-the-art supervised and semi-supervised network intrusion detection systems in detection performance, label efficiency, and resilience to adversarial and non-stationary traffic.
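The confidence-aware pseudo-labeling step can be sketched as a simple thresholded selection over the model's class probabilities; the threshold value and two-class setup here are illustrative, not taken from the paper.

```python
import numpy as np

def pseudo_label_mask(probs, threshold=0.95):
    """Keep only unlabeled flows whose max class probability clears the
    threshold; everything else is suppressed as potentially adversarial
    or drifted (the 'conservative exploitation' described above)."""
    probs = np.asarray(probs, float)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = conf >= threshold
    return labels[mask], mask

# Four unlabeled flows: two confident, two ambiguous
probs = [[0.98, 0.02], [0.55, 0.45], [0.03, 0.97], [0.60, 0.40]]
labels, mask = pseudo_label_mask(probs)
```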
[579] Do VLMs Truly “Read” Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting
Kaiqi Hu, Linda Xiao, Shiyue Xu, Ziyi Tang, Mingwen Liu
Main category: cs.LG
TL;DR: Paper introduces a multi-scale candlestick chart dataset and evaluation framework to assess vision-language models’ ability to understand stock price patterns from visual charts, revealing limitations in temporal reasoning and market scenario performance.
Details
Motivation: Existing benchmarks inadequately evaluate VLMs' understanding of stock price patterns in candlestick charts, failing to isolate whether VLMs genuinely comprehend visual inputs or just memorize patterns. Current datasets focus on single-period or tabular inputs, while human analysts use multi-scale charts where longer-term horizons capture trends and shorter-term horizons identify inflection points.
Method: Constructed a multi-scale candlestick charts dataset with standardized evaluation framework combining confusion-matrix-based diagnostics with information coefficient (IC) time series metrics. Included XGBoost as a feature-based temporal baseline. Benchmarked representative VLMs on their ability to leverage multi-scale stock price data.
Result: Most VLMs perform well only under persistent uptrend or downtrend conditions but exhibit weak predictive capability in more common market scenarios. Identified significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.
Conclusion: Current VLMs have fundamental limitations in understanding multi-scale visual market dynamics and temporal reasoning for stock price forecasting, despite being applied to this domain. The proposed dataset and framework reveal these weaknesses and provide a better evaluation standard.
Abstract: Vision-language models (VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock prices in candlestick charts. First, prior studies fail to isolate whether VLMs’ comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. However, human analysts strongly rely on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, making it difficult to systematically assess VLMs’ ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs’ ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.
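The information coefficient (IC) metric used in the evaluation above is conventionally the per-period cross-sectional correlation between predicted and realized returns, averaged over periods; a minimal sketch with toy numbers (the paper may use a rank/Spearman variant instead):

```python
import numpy as np

def information_coefficient(pred, real):
    """Mean IC: Pearson correlation between predicted and realized returns
    across assets, computed per period and then averaged."""
    pred, real = np.asarray(pred, float), np.asarray(real, float)
    ics = [float(np.corrcoef(p, r)[0, 1]) for p, r in zip(pred, real)]
    return float(np.mean(ics)), ics

# Two periods, four assets each (toy numbers)
pred = [[0.1, -0.2, 0.3, 0.0], [0.2, 0.1, -0.1, 0.0]]
real = [[0.2, -0.1, 0.4, 0.1], [0.1, 0.2, -0.2, 0.0]]
mean_ic, ics = information_coefficient(pred, real)
```

An IC time series (the per-period `ics` list) also reveals whether a model's edge is concentrated in trending periods, which is how the diagnostic in the abstract can distinguish trend-followers from genuine chart readers.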
[580] From Imitation to Discrimination: Progressive Curriculum Learning for Robust Web Navigation
Chuang Peng, Wei Zhang, Renshuai Tao, Xinhao Zhang, Jian Yang
Main category: cs.LG
TL;DR: Triton dataset and progressive training curriculum for robust text-based web agents that outperform large language models on web navigation tasks.
Details
Motivation: Standard SFT approaches for web agents fail due to inability to reject plausible but incorrect elements in dense pages and limited generalization to unseen website layouts.Method: Created Triton dataset (590k instances) via Structural-Semantic Hard Negative Mining and Dual-Agent Consensus pipeline, then progressive curriculum training producing three models: Triton-SFT-32B (basic imitation), Triton-ORPO-32B (robust discrimination), and Triton-GRPO-32B (long-horizon consistency).
Result: Triton-GRPO-32B achieves 58.7% Step Success Rate on Mind2Web, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%.
Conclusion: Specialized data curriculum outweighs raw parameter scale for web navigation, demonstrating that robust training approaches can outperform much larger general-purpose models.
Abstract: Text-based web agents offer computational efficiency for autonomous web navigation, yet developing robust agents remains challenging due to the noisy and heterogeneous nature of real-world HTML. Standard Supervised Fine-Tuning (SFT) approaches fail in two critical dimensions: they lack discrimination capabilities to reject plausible but incorrect elements in densely populated pages, and exhibit limited generalization to unseen website layouts. To address these challenges, we introduce the Triton dataset (590k instances) and a progressive training curriculum. Triton is constructed via Structural-Semantic Hard Negative Mining, which explicitly mines topologically similar distractors, and a Dual-Agent Consensus pipeline that synthesizes diverse cross-domain tasks with strict verification. Building upon this foundation, our progressive curriculum produces three models: Triton-SFT-32B for basic imitation, Triton-ORPO-32B for robust discrimination via Odds Ratio Preference Optimization, and Triton-GRPO-32B for long-horizon consistency through Group Relative Policy Optimization. Empirical evaluation on Mind2Web demonstrates that Triton-GRPO-32B achieves state-of-the-art performance among open-source models with 58.7% Step Success Rate, surpassing GPT-4.5 (42.4%) and Claude-4.5 (41.4%) by over 16%, validating that specialized data curriculum outweighs raw parameter scale for web navigation.
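Structural-semantic hard negative mining, as described above, selects topologically similar distractors; one hypothetical way to operationalize "topologically similar" is shared tag-path prefixes in the DOM. The similarity function and paths below are illustrative inventions, not the paper's actual mining criterion.

```python
def tag_path_similarity(path_a, path_b):
    """Fraction of shared ancestors from the root; structurally similar
    elements score high. (Hypothetical proxy for topological similarity.)"""
    a, b = path_a.split("/"), path_b.split("/")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))

def mine_hard_negatives(target, candidates, k=2):
    """Rank non-target elements by structural similarity to the target
    and keep the top-k as hard negatives for discrimination training."""
    scored = sorted((c for c in candidates if c != target),
                    key=lambda c: tag_path_similarity(target, c),
                    reverse=True)
    return scored[:k]

target = "html/body/div/ul/li[2]/a"
candidates = ["html/body/div/ul/li[1]/a",   # sibling link: hard negative
              "html/body/div/ul/li[3]/a",   # sibling link: hard negative
              "html/head/meta",             # structurally distant: easy negative
              "html/body/div/ul/li[2]/a"]   # the target itself: excluded
negs = mine_hard_negatives(target, candidates)
```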
[581] BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning
Jagadeesh Rachapudi, Ritali Vatsi, Praful Hambarde, Amit Shukla
Main category: cs.LG
TL;DR: A unified Continual Learning Unlearning (CLU) framework called BID-LoRA that enables both knowledge acquisition and selective forgetting in multimodal models using low-rank adaptation and escape unlearning techniques.
Details
Motivation: There's a critical gap between well-developed Continual Learning (CL) methods for acquiring new knowledge and early-stage Machine Unlearning (MU) techniques for removing outdated/sensitive information. Naively combining existing approaches leads to knowledge leakage and degradation of foundational knowledge across adaptation cycles.
Method: Proposes Bi-Directional Low-Rank Adaptation (BID-LoRA) with three dedicated adapter pathways (retain, new, and unlearn) applied to attention layers, combined with escape unlearning that pushes forget-class embeddings to positions maximally distant from retained knowledge, updating only 5% of parameters.
Result: BID-LoRA outperforms CLU baselines across multiple adaptation cycles on CIFAR-100. Also evaluated on CASIA-Face100, demonstrating practical applicability to real-world identity management systems where users must be enrolled and withdrawn users removed.
Conclusion: The paper successfully formalizes Continual Learning Unlearning (CLU) as a unified paradigm and demonstrates BID-LoRA’s effectiveness in achieving precise deletion of unwanted knowledge, efficient integration of new knowledge, and minimizing knowledge leakage across cycles.
Abstract: Recent advances in deep learning underscore the need for systems that can not only acquire new knowledge through Continual Learning (CL) but also remove outdated, sensitive, or private information through Machine Unlearning (MU). However, while CL methods are well-developed, MU techniques remain in early stages, creating a critical gap for unified frameworks that depend on both capabilities. We find that naively combining existing CL and MU approaches results in knowledge leakage: a gradual degradation of foundational knowledge across repeated adaptation cycles. To address this, we formalize Continual Learning Unlearning (CLU) as a unified paradigm with three key goals: (i) precise deletion of unwanted knowledge, (ii) efficient integration of new knowledge while preserving prior information, and (iii) minimizing knowledge leakage across cycles. We propose Bi-Directional Low-Rank Adaptation (BID-LoRA), a novel framework featuring three dedicated adapter pathways (retain, new, and unlearn) applied to attention layers, combined with escape unlearning that pushes forget-class embeddings to positions maximally distant from retained knowledge, updating only 5% of parameters. Experiments on CIFAR-100 show that BID-LoRA outperforms CLU baselines across multiple adaptation cycles. We further evaluate on CASIA-Face100, a curated face recognition subset, demonstrating practical applicability to real-world identity management systems where new users must be enrolled and withdrawn users removed.
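The three-pathway low-rank structure and the escape direction can be sketched in a few lines of numpy. The hidden size, rank, zero-initialized B matrices, and the unit "escape" direction are all toy choices for illustration, not BID-LoRA's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 2                              # hidden size, LoRA rank (toy values)

W = rng.standard_normal((d, d))           # frozen base attention weight

# Three dedicated low-rank pathways; B starts at zero so each adapter is
# initially a no-op, as in standard LoRA.
adapters = {name: (np.zeros((d, r)), 0.01 * rng.standard_normal((r, d)))
            for name in ("retain", "new", "unlearn")}

def adapted(pathway):
    """Effective weight for one pathway: W + B @ A (only B, A are trainable)."""
    B, A = adapters[pathway]
    return W + B @ A

def escape_direction(forget_emb, retain_centroid):
    """Escape unlearning (sketch): unit direction pushing a forget-class
    embedding directly away from the centroid of retained knowledge."""
    v = forget_emb - retain_centroid
    return v / np.linalg.norm(v)

trainable = sum(B.size + A.size for B, A in adapters.values())
total = W.size + trainable                # only a small fraction is trainable
```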
[582] Information-Theoretic Optimization for Task-Adapted Compressed Sensing Magnetic Resonance Imaging
Xinyu Peng, Ziyang Zheng, Wenrui Dai, Duoduo Xue, Shaohui Li, Chenglin Li, Junni Zou, Hongkai Xiong
Main category: cs.LG
TL;DR: Task-adapted CS-MRI framework using information theory to enable probabilistic inference for uncertainty prediction and adaptive sampling for clinical tasks
Details
Motivation: Existing task-adapted CS-MRI methods suffer from uncertainty problems for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks.
Method: Formalize task-adapted CS-MRI optimization by maximizing mutual information between undersampled k-space measurements and clinical tasks; use amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models.
Result: Achieves highly competitive performance on standard metrics like Dice compared to deterministic counterparts, provides better distribution matching to ground-truth posterior distribution as measured by generalized energy distance (GED)
Conclusion: Proposed framework addresses uncertainty problem in medical diagnosis, enables flexible sampling ratio control with single end-to-end model, and handles both joint task/reconstruction scenarios and privacy-protected task-only scenarios
Abstract: Task-adapted compressed sensing magnetic resonance imaging (CS-MRI) is emerging to address the specific demands of downstream clinical tasks with significantly fewer k-space measurements than required by Nyquist sampling. However, existing task-adapted CS-MRI methods suffer from the uncertainty problem for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks. To address these limitations, we propose the first task-adapted CS-MRI from the information-theoretic perspective to simultaneously achieve probabilistic inference for uncertainty prediction and adapt to arbitrary sampling ratios and versatile clinical applications. Specifically, we formalize the task-adapted CS-MRI optimization problem by maximizing the mutual information between undersampled k-space measurements and clinical tasks to enable probabilistic inference for addressing the uncertainty problem. We leverage amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models, which enables flexible sampling ratio control using a single end-to-end trained model. Furthermore, the proposed framework addresses two kinds of distinct clinical scenarios within a unified approach, i.e., i) joint task and reconstruction, where reconstruction serves as an auxiliary process to enhance task performance; and ii) task implementation with suppressed reconstruction, applicable for privacy protection. Extensive experiments on large-scale MRI datasets demonstrate that the proposed framework achieves highly competitive performance on standard metrics like Dice compared to its deterministic counterpart, but provides better distribution matching to the ground-truth posterior distribution as measured by the generalized energy distance (GED).
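Tractable variational bounds on mutual information of the kind invoked above are commonly of the Barber–Agakov form; a sketch, writing $y$ for the undersampled k-space measurements, $t$ for the clinical task variable, and $q_\phi$ for the task-inference model (these symbols are assumed here, not taken from the paper):

```latex
I(y; t) \;=\; H(t) - H(t \mid y)
        \;\ge\; H(t) + \mathbb{E}_{p(y,t)}\!\left[\log q_\phi(t \mid y)\right]
```

Since $H(t)$ does not depend on the sampling, reconstruction, or inference parameters, maximizing this lower bound reduces to maximizing $\mathbb{E}[\log q_\phi(t \mid y)]$, i.e., a log-likelihood objective that can be optimized jointly over all three models, which is what makes the end-to-end training tractable.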
[583] LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety
Junxiao Yang, Haoran Liu, Jinzhe Tu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Jiaqi Weng, Jialing Tao, Hui Xue, Hongning Wang, Han Qiu, Minlie Huang
Main category: cs.LG
TL;DR: The paper identifies a safety vulnerability in LLMs where safety alignment fails in low-resource languages due to language-biased safety mechanisms, and proposes Language-Agnostic Semantic Alignment (LASA) to anchor safety in semantic bottlenecks rather than surface text.
Details
Motivation: LLMs show strong safety performance in high-resource languages but severe vulnerabilities in low-resource languages, suggesting a mismatch between language-agnostic semantic understanding and language-biased safety alignment.
Method: Identifies semantic bottlenecks in LLMs (intermediate layers where representations are governed by semantic content rather than language identity), then proposes LASA which anchors safety alignment directly in these semantic bottlenecks.
Result: LASA substantially improves safety across all languages: average attack success rate drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B).
Conclusion: Safety alignment should be anchored in language-agnostic semantic space rather than surface text, offering a representation-level perspective on LLM safety that addresses cross-lingual vulnerabilities.
Abstract: Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment requires anchoring safety understanding not in surface text, but in the model’s language-agnostic semantic space.
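A semantic bottleneck as defined above can be probed by checking, per layer, whether translation pairs (same meaning, different language) sit closer in representation space than same-language pairs with different meanings. This is an illustrative probe on synthetic vectors; the layer structure and scoring rule are assumptions, not the paper's procedure.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bottleneck_layer(reps):
    """Score each layer by how much translation pairs outscore
    same-language/different-meaning pairs; the semantic bottleneck is the
    layer with the largest gap. reps[l] maps (language, meaning) -> vector."""
    scores = [cos(layer[("en", "m1")], layer[("sw", "m1")]) -   # same meaning
              cos(layer[("en", "m1")], layer[("en", "m2")])     # same language
              for layer in reps]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
base = rng.standard_normal(16)
# Toy 3-layer model: the middle layer clusters by meaning, the outer
# layers by language identity.
reps = [
    {("en", "m1"): base, ("en", "m2"): base + 0.1 * rng.standard_normal(16),
     ("sw", "m1"): rng.standard_normal(16)},
    {("en", "m1"): base, ("en", "m2"): rng.standard_normal(16),
     ("sw", "m1"): base + 0.1 * rng.standard_normal(16)},
    {("en", "m1"): base, ("en", "m2"): base + 0.1 * rng.standard_normal(16),
     ("sw", "m1"): rng.standard_normal(16)},
]
layer = bottleneck_layer(reps)
```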
[584] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Main category: cs.LG
TL;DR: On-policy distillation (OPD) training dynamics analysis reveals two key conditions for success: compatible thinking patterns between student and teacher, and teacher offering genuinely new capabilities beyond student’s training data.
Details
Motivation: OPD is widely used for post-training LLMs but its training dynamics are poorly understood. The paper aims to systematically investigate when OPD succeeds or fails and understand its underlying mechanisms.
Method: Systematic investigation of OPD dynamics through weak-to-strong reverse distillation experiments, token-level mechanism analysis, and proposing practical recovery strategies like off-policy cold start and teacher-aligned prompt selection.
Result: Identified two governing conditions for OPD success: compatible thinking patterns and teacher offering new capabilities. Found that successful OPD involves progressive alignment on high-probability tokens at student-visited states. Proposed recovery strategies for failing OPD.
Conclusion: OPD’s apparent benefits come with costs, raising scalability questions for long-horizon distillation. Understanding OPD dynamics enables better distillation strategies and reveals limitations of current approaches.
Abstract: On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student’s perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD’s apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
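The "small shared token set concentrating 97%-99% of the mass" observation can be made concrete by measuring how much of the student's probability mass falls inside the teacher's top-p token set at a given state. The distributions and vocabulary below are toy inputs; the measurement itself is a plausible sketch, not the paper's exact analysis code.

```python
import numpy as np

def shared_topp_mass(student, teacher, p=0.97):
    """Fraction of the student's probability mass covered by the teacher's
    smallest top-p token set, plus the size of that set."""
    student, teacher = np.asarray(student, float), np.asarray(teacher, float)
    order = np.argsort(teacher)[::-1]          # teacher tokens, most likely first
    cum = np.cumsum(teacher[order])
    k = int(np.searchsorted(cum, p)) + 1       # smallest set reaching mass p
    shared = order[:k]
    return float(student[shared].sum()), k

# Toy next-token distributions over a 6-token vocabulary
teacher = [0.70, 0.20, 0.08, 0.01, 0.005, 0.005]
student = [0.60, 0.25, 0.10, 0.02, 0.015, 0.015]
mass, k = shared_topp_mass(student, teacher)
```

Here three of six tokens already carry 95% of the student's mass, mirroring the paper's finding that alignment concentrates on a small set of high-probability tokens at student-visited states.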
[585] Monte Carlo Stochastic Depth for Uncertainty Estimation in Deep Learning
Adam T. Müller, Tobias Rögelein, Nicolaj C. Stache
Main category: cs.LG
TL;DR: Monte Carlo Stochastic Depth (MCSD) is established as a theoretically-grounded Bayesian approximation method for uncertainty quantification in modern deep neural networks, with empirical benchmarking showing competitive performance against MC Dropout and MC-DropBlock on object detection tasks.
Details
Motivation: The paper addresses the need for reliable uncertainty quantification in safety-critical deep learning systems. While stochastic regularizers like Monte Carlo Dropout are commonly used for scalable Bayesian inference, Stochastic Depth (a key regularizer in modern residual architectures) remains under-explored for this purpose, lacking both theoretical grounding and comprehensive empirical evaluation on complex tasks like object detection.
Method: The authors first provide theoretical insights connecting Monte Carlo Stochastic Depth (MCSD) to principled approximate variational inference. They then conduct the first comprehensive empirical benchmark comparing MCSD against MC Dropout (MCD) and MC-DropBlock (MCDB) on state-of-the-art object detectors (YOLO, RT-DETR) using COCO and COCO-O datasets.
Result: MCSD demonstrates robust and computationally efficient performance, achieving highly competitive predictive accuracy (mAP). It yields slight improvements in calibration (ECE) and uncertainty ranking (AUARC) compared to MC Dropout, establishing it as a viable alternative for Bayesian approximation.
Conclusion: Monte Carlo Stochastic Depth is established as a theoretically-grounded and empirically-validated tool for efficient Bayesian approximation in modern deep learning, particularly suitable for uncertainty quantification in safety-critical applications using residual-based architectures.
Abstract: The deployment of deep neural networks in safety-critical systems necessitates reliable and efficient uncertainty quantification (UQ). A practical and widespread strategy for UQ is repurposing stochastic regularizers as scalable approximate Bayesian inference methods, such as Monte Carlo Dropout (MCD) and MC-DropBlock (MCDB). However, this paradigm remains under-explored for Stochastic Depth (SD), a regularizer integral to the residual-based backbones of most modern architectures. While prior work demonstrated its empirical promise for segmentation, a formal theoretical connection to Bayesian variational inference and a benchmark on complex, multi-task problems like object detection are missing. In this paper, we first provide theoretical insights connecting Monte Carlo Stochastic Depth (MCSD) to principled approximate variational inference. We then present the first comprehensive empirical benchmark of MCSD against MCD and MCDB on state-of-the-art detectors (YOLO, RT-DETR) using the COCO and COCO-O datasets. Our results position MCSD as a robust and computationally efficient method that achieves highly competitive predictive accuracy (mAP), notably yielding slight improvements in calibration (ECE) and uncertainty ranking (AUARC) compared to MCD. We thus establish MCSD as a theoretically-grounded and empirically-validated tool for efficient Bayesian approximation in modern deep learning.
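The core MCSD mechanic is simple: keep stochastic depth active at inference, run T passes, and read uncertainty off the spread of the outputs. A minimal sketch with a toy residual network (the `tanh` branch, survival probability, and pass count are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w):
    return x + np.tanh(w * x)            # toy residual branch

def mcsd_forward(x, weights, p_survive=0.8):
    """One stochastic-depth pass: each residual branch is kept with
    probability p_survive; a dropped block acts as the identity."""
    for w in weights:
        if rng.random() < p_survive:
            x = residual_block(x, w)
    return x

def mcsd_predict(x, weights, T=100):
    """Monte Carlo Stochastic Depth: T stochastic passes yield a predictive
    mean and an uncertainty estimate (variance across passes)."""
    samples = np.stack([mcsd_forward(x, weights) for _ in range(T)])
    return samples.mean(axis=0), samples.var(axis=0)

weights = [0.5, -0.3, 0.8]
mean, var = mcsd_predict(np.ones(4), weights)
```

Unlike MC Dropout, this needs no extra dropout layers: it reuses the stochastic-depth machinery that residual backbones already train with, which is what makes it attractive for architectures like YOLO and RT-DETR.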
[586] Stress Detection Using Wearable Physiological and Sociometric Sensors
Oscar Martinez Mozos, Virginia Sandulescu, Sally Andrews, David Ellis, Nicola Bellotto, Radu Dobrescu, Jose Manuel Ferrandez
Main category: cs.LG
TL;DR: Machine learning approach combining physiological and social sensor data for automatic stress detection in social situations, tested with TSST protocol.
Details
Motivation: Stress is a significant social problem, and there's a need for automatic detection systems that can identify stressful situations by combining multiple sensor modalities.
Method: Combines physiological and social response sensors, uses classifiers (SVM, AdaBoost, k-NN) to discriminate stressful vs neutral situations during controlled Trier Social Stress Test (TSST)
Result: Combining both sensor systems enables accurate discrimination of stressful situations; paper also assesses individual modality performance and identifies most discriminative features
Conclusion: Multi-modal sensor fusion improves stress detection accuracy, with potential for real-time applications; feature analysis provides insights for practical implementation
Abstract: Stress remains a significant social problem for individuals in modern societies. This paper presents a machine learning approach for the automatic detection of stress of people in a social situation by combining two sensor systems that capture physiological and social responses. We compare the performance using different classifiers including support vector machine, AdaBoost, and k-nearest neighbor. Our experimental results show that by combining the measurements from both sensor systems, we could accurately discriminate between stressful and neutral situations during a controlled Trier social stress test (TSST). Moreover, this paper assesses the discriminative ability of each sensor modality individually and considers their suitability for real-time stress detection. Finally, we present a study of the most discriminative features for stress detection.
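As a toy illustration of the fusion result, here is a pure-numpy leave-one-out k-NN (one of the three classifiers compared) on synthetic stand-ins for the two sensor streams; the feature dimensions and effect size are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two sensor streams (hypothetical features):
n = 200
labels = rng.integers(0, 2, n)                       # 0 = neutral, 1 = stressful
physio = labels[:, None] * 0.8 + rng.normal(size=(n, 4))   # physiological
social = labels[:, None] * 0.8 + rng.normal(size=(n, 3))   # sociometric

def knn_accuracy(X, y, k=5):
    """Leave-one-out k-NN accuracy, one of the classifiers the paper compares."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # never match a sample to itself
    neigh = np.argsort(d, axis=1)[:, :k]
    pred = (y[neigh].mean(axis=1) > 0.5).astype(int)
    return float((pred == y).mean())

acc_physio = knn_accuracy(physio, labels)
acc_fused = knn_accuracy(np.hstack([physio, social]), labels)
print(f"physiological only: {acc_physio:.2f}  fused: {acc_fused:.2f}")
```

With informative signal in both modalities, the fused feature vector gives the classifier more discriminative dimensions, mirroring the paper's finding that combining sensor systems improves discrimination.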
[587] GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees
Arya Shah, Kaveri Visavadiya, Manisha Padala
Main category: cs.LG
TL;DR: GF-Score framework decomposes certified robustness into per-class profiles using welfare economics metrics to quantify fairness disparities without adversarial attacks.
Details
Motivation: Standard adversarial robustness evaluation methods are either expensive (requiring adversarial attacks) or provide only aggregate scores that hide how robustness is distributed across different classes, making it difficult to assess fairness in safety-critical applications.
Method: Introduces GF-Score framework that decomposes certified GREAT Score into per-class robustness profiles using four welfare economics metrics: Robustness Disparity Index (RDI), Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and Fairness-Penalized GREAT Score (FP-GREAT). Eliminates dependence on adversarial attacks through self-calibration procedure using only clean accuracy correlations.
Result: Evaluation of 22 models from RobustBench across CIFAR-10 and ImageNet shows exact decomposition, reveals consistent vulnerability patterns (e.g., ‘cat’ is weakest class in 76% of CIFAR-10 models), and demonstrates that more robust models tend to exhibit greater class-level disparity.
Conclusion: GF-Score provides a practical, attack-free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally, enabling better fairness assessment in safety-critical neural network deployments.
Abstract: Adversarial robustness is essential for deploying neural networks in safety-critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the \emph{GF-Score} (GREAT-Fairness Score), a framework that decomposes the certified GREAT Score into per-class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and a Fairness-Penalized GREAT Score (FP-GREAT). The framework further eliminates the original method's dependence on adversarial attacks through a self-calibration procedure that tunes the temperature parameter using only clean accuracy correlations. Evaluating 22 models from RobustBench across CIFAR-10 and ImageNet, we find that the decomposition is exact, that per-class scores reveal consistent vulnerability patterns (e.g., "cat" is the weakest class in 76% of CIFAR-10 models), and that more robust models tend to exhibit greater class-level disparity. These results establish a practical, attack-free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on \href{https://github.com/aryashah2k/gf-score}{GitHub}.
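Two of the four disparity metrics are standard welfare-economics quantities and easy to sketch. The per-class scores below are hypothetical, and the paper's NRGC normalization may differ from this textbook Gini:

```python
import numpy as np

def robustness_gini(scores):
    """Gini coefficient of a per-class robustness profile (NRGC-style disparity;
    the paper's exact normalization may differ from this textbook form)."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = s.size
    return 2 * np.sum(np.arange(1, n + 1) * s) / (n * s.sum()) - (n + 1) / n

# Hypothetical per-class certified robustness scores (CIFAR-10-like, 10 classes)
per_class = [0.42, 0.55, 0.31, 0.12, 0.38, 0.25, 0.47, 0.44, 0.58, 0.50]
wcr = min(per_class)                     # Worst-Case Class Robustness
gini = robustness_gini(per_class)
print(f"WCR = {wcr:.2f}, Gini = {gini:.3f}")
```

A Gini of 0 would mean every class is equally robust; the worst-case class (here the 0.12 entry) is what an aggregate score hides.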
[588] Rethinking the Personalized Relaxed Initialization in the Federated Learning: Consistency and Generalization
Li Shen, Yan Sun, Dacheng Tao
Main category: cs.LG
TL;DR: FedInit: A federated learning algorithm that addresses client-drift by using personalized relaxed initialization at each local training stage, moving away from global state toward reverse direction of latest local state.
Details
Motivation: Federated learning suffers from "client-drift" problem caused by inconsistent optimum across local clients with heterogeneous data, but lacks solid theoretical analysis of this local inconsistency's impact.
Method: Proposes FedInit algorithm that employs personalized relaxed initialization at beginning of each local training stage by initializing local state away from current global state toward reverse direction of latest local state. Also introduces excess risk analysis to study divergence term and investigate test error in FL.
Result: FedInit achieves comparable results to advanced benchmarks without additional training or communication costs. Stage-wise personalized relaxed initialization can be incorporated into current advanced algorithms to achieve higher generalization performance in FL.
Conclusion: FedInit effectively addresses client-drift in federated learning through personalized initialization strategy, with theoretical analysis showing local inconsistency mainly affects generalization error rather than optimization error.
Abstract: Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model via stage-wise local training processes on the heterogeneous dataset. Previous works have implicitly studied that FL suffers from the "client-drift" problem, which is caused by the inconsistent optimum across local clients. However, till now it still lacks solid theoretical analysis to explain the impact of this local inconsistency. To alleviate the negative impact of "client drift" and explore its substance in FL, in this paper, we first propose an efficient FL algorithm FedInit, which allows employing the personalized relaxed initialization state at the beginning of each local training stage. Specifically, FedInit initializes the local state by moving away from the current global state towards the reverse direction of the latest local state. Moreover, to further understand how inconsistency disrupts performance in FL, we introduce the excess risk analysis and study the divergence term to investigate the test error in FL. Our studies show that optimization error is not sensitive to this local inconsistency, while it mainly affects the generalization error bound. Extensive experiments are conducted to validate its efficiency. The proposed FedInit method could achieve comparable results compared to several advanced benchmarks without any additional training or communication costs. Meanwhile, the stage-wise personalized relaxed initialization could also be incorporated into several current advanced algorithms to achieve higher generalization performance in the FL paradigm.
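Our reading of the relaxed initialization rule reduces to a one-liner: the client starts local training at the global state pushed away from its own latest local state. The relaxation coefficient beta is a hypothetical knob; the paper's exact update may differ:

```python
import numpy as np

def fedinit_local_init(w_global, w_local_prev, beta=0.1):
    """Personalized relaxed initialization (sketch): move from the global state
    in the direction opposite to the client's latest local state."""
    return w_global + beta * (w_global - w_local_prev)

w_global = np.array([1.0, 2.0])
w_local_prev = np.array([1.5, 1.0])
w_init = fedinit_local_init(w_global, w_local_prev, beta=0.2)
print(w_init)   # pushed away from w_local_prev, beyond w_global
```

Because the offset reuses states the client already holds, the rule adds no training or communication cost, consistent with the abstract's claim.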
[589] OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension
Zhiyuan Zhang, Yanzhao Li, Zhiqiang Zou, Bai Du, Yupeng Sun, Hui Dong, Hui Wang
Main category: cs.LG
TL;DR: OSC is a hardware-efficient framework for 4-bit LLM quantization that addresses activation outliers through dual-path computation with structured sub-tensor extraction and outlier channel identification.
Details
Motivation: 4-bit quantization is crucial for high-throughput LLM deployment, but activation outliers cause significant accuracy degradation due to limited dynamic range in low-bit formats.
Method: OSC uses offline group-wise identification of outlier channels and online structured sub-tensor extraction to coalesce scattered activation channels into dense tensors. It implements dual-path computation: 4-bit GEMM for normal activations and 16-bit branch GEMM for outliers, with FP8 fallback for W2 inputs where outlier clustering is less pronounced.
Result: Evaluation on Qwen3-8B and Qwen3-30B shows average accuracy drops limited to 2.19 and 1.12 points respectively. OSC achieves 1.78x speedup over W8A8 GEMM baseline on modern AI accelerators.
Conclusion: OSC provides a hardware-efficient solution for 4-bit LLM quantization that effectively handles activation outliers while maintaining high throughput and accuracy.
Abstract: While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often lead to significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, where high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located and then performs structured sub-tensor extraction to coalesce these scattered activation channels into a compact dense tensor online. This mechanism implements outlier protection through regularized and high-throughput GEMM operations, achieving a seamless fit with modern 4-bit micro-scaling hardware. Furthermore, for the inputs of W2 where outlier clustering is less pronounced, we integrate a fallback strategy to FP8. Evaluation on Qwen3-8B and Qwen3-30B restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over the W8A8 GEMM baseline on a modern AI accelerator.
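The dual-path idea can be mimicked in a few lines of numpy. The naive symmetric fake-quantization here is a stand-in for OSC's micro-scaling formats, and the outlier channels are planted by hand:

```python
import numpy as np

def fake_quant_sym(x, bits=4):
    """Naive symmetric per-tensor fake quantization (a stand-in for OSC's
    hardware micro-scaling formats)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).clip(-qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X[:, [3, 11]] *= 40.0                    # token-persistent outlier channels
W = rng.normal(size=(16, 4))

# Offline: pick outlier channels by magnitude (stand-in for group-wise ID)
out_ch = np.argsort(np.abs(X).max(axis=0))[-2:]
norm_ch = np.setdiff1d(np.arange(X.shape[1]), out_ch)

# Dual path: low-bit GEMM over the dense normal channels, high precision
# over the extracted outlier sub-tensor
y_dual = fake_quant_sym(X[:, norm_ch]) @ W[norm_ch] + X[:, out_ch] @ W[out_ch]
y_naive = fake_quant_sym(X) @ W          # single 4-bit path: outliers eat the range
y_ref = X @ W

err_dual = np.abs(y_dual - y_ref).mean()
err_naive = np.abs(y_naive - y_ref).mean()
print(sorted(out_ch.tolist()), err_dual < err_naive)
```

In the single-path version the 40x outliers stretch the quantization scale so the normal channels collapse to a few levels; routing the outlier channels through the high-precision branch avoids that.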
[590] VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
Yupeng Sun, Yanzhao Li, Zhiqiang Zou, Bai Du, Zhiyuan Zhang, Hui Dong, Gaoyige Fan, Hui Wang
Main category: cs.LG
TL;DR: VFA/VSA improves FlashAttention by reducing online softmax overhead through better initialization, block reordering, and maximum freezing to avoid vector bottlenecks.
Details
Motivation: FlashAttention's online softmax approach faces vector/SIMD bottlenecks from rowmax and rowsum reductions as attention kernels approach peak hardware throughput, limiting performance on modern accelerators.
Method: Vector Relieved Flash Attention (VFA) uses cheap key-block approximations to initialize running maximum, reorders block traversal to prioritize high-impact blocks, and freezes maximum for remaining blocks to avoid repeated reductions. Combined with sparse skipping methods as VSA.
Result: Achieves up to 2x speedup over baseline (C16V32) with C8V32, C4V32, C4V16 configurations, with potential for 6x speedup with architecture improvements. Sink/local reordering stabilizes maximum early, avoiding conditional rescale operations.
Conclusion: VFA/VSA effectively reduces online softmax reduction bottlenecks without performance loss, making attention computation more hardware-friendly while maintaining exact computation.
Abstract: FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax – especially per-tile rowmax and rowsum reductions and rescale chains – can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.
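The max-freezing trick rests on a simple identity: softmax is exact under *any* fixed shift, so a pre-computed (approximate) maximum removes per-block rowmax updates and rescale chains without changing the result. A numpy sketch, with arbitrary block size and inputs:

```python
import numpy as np

def online_softmax(x, block=4):
    """Standard FlashAttention-style pass: rescale on every running-max update."""
    m, s = -np.inf, 0.0
    for i in range(0, x.size, block):
        b = x[i:i + block]
        m_new = max(m, b.max())
        s = s * np.exp(m - m_new) + np.exp(b - m_new).sum()
        m = m_new
    return np.exp(x - m) / s

def frozen_max_softmax(x, m_init, block=4):
    """VFA-style sketch: fix the max up front, skipping per-block rowmax
    reductions and rescaling. Exact for any constant shift; VFA picks m_init
    from cheap key-block summaries to keep exponents in numeric range."""
    s = sum(np.exp(x[i:i + block] - m_init).sum() for i in range(0, x.size, block))
    return np.exp(x - m_init) / s

x = np.random.default_rng(0).normal(size=16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax(x), ref),
      np.allclose(frozen_max_softmax(x, x.max() + 0.5), ref))
```

The catch, which the paper addresses, is purely numerical: if the frozen maximum badly underestimates the true one, the exponentials can overflow, hence the sink/local reordering and m-initialization heuristics.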
[591] Interpretable Relational Inference with LLM-Guided Symbolic Dynamics Modeling
Xiaoxiao Liang, Juyuan Zhang, Liming Pan, Linyuan Lü
Main category: cs.LG
TL;DR: COSINE is a differentiable framework that jointly discovers interaction graphs and symbolic dynamics for many-body systems, using LLM feedback to adaptively refine the symbolic hypothesis space.
Details
Motivation: Current neural approaches for inferring interaction structures from observed dynamics lack interpretability, while symbolic regression methods typically assume known topology and fixed function libraries. There's a need for a method that can jointly discover both interaction graphs and interpretable symbolic dynamics.
Method: COSINE uses a differentiable framework with co-optimization of symbolic interactions and network edges. It incorporates an outer-loop large language model that adaptively prunes and expands the hypothesis space based on feedback from the inner optimization loop, overcoming limitations of fixed symbolic libraries.
Result: Experiments on synthetic systems and large-scale real-world epidemic data demonstrate robust structural recovery and compact, mechanism-aligned dynamical expressions.
Conclusion: COSINE provides an interpretable approach for discovering both interaction structures and symbolic dynamics in complex systems, bridging neural and symbolic methods through LLM-guided adaptive hypothesis space refinement.
Abstract: Inferring latent interaction structures from observed dynamics is a fundamental inverse problem in many-body interacting systems. Most neural approaches rely on black-box surrogates over trainable graphs, achieving accuracy at the expense of mechanistic interpretability. Symbolic regression offers explicit dynamical equations and stronger inductive biases, but typically assumes known topology and a fixed function library. We propose \textbf{COSINE} (\textbf{C}o-\textbf{O}ptimization of \textbf{S}ymbolic \textbf{I}nteractions and \textbf{N}etwork \textbf{E}dges), a differentiable framework that jointly discovers interaction graphs and sparse symbolic dynamics. To overcome the limitations of fixed symbolic libraries, COSINE further incorporates an outer-loop large language model that adaptively prunes and expands the hypothesis space using feedback from the inner optimization loop. Experiments on synthetic systems and large-scale real-world epidemic data demonstrate robust structural recovery and compact, mechanism-aligned dynamical expressions. Code: https://anonymous.4open.science/r/COSINE-6D43.
[592] Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness
Madhava Gaikwad
Main category: cs.LG
TL;DR: Finite-size analysis of Dense Associative Memory retrieval dynamics with explicit convergence rates, robustness bounds, and capacity guarantees under verifiable pattern conditions.
Details
Motivation: Existing analyses of Dense Associative Memory focus on thermodynamic limits with random patterns, lacking finite-size guarantees and explicit convergence rates needed for practical applications.
Method: Algorithmic analysis of DAM retrieval dynamics using separation assumptions and bounded-interference conditions to prove geometric convergence, adversarial robustness bounds via margin conditions, and capacity scaling analysis.
Result: Proved O(log N) convergence time, explicit margin conditions for adversarial robustness, capacity scaling as Θ(N^{n-1}) up to polylog factors, and connection to potential-game Nash equilibria.
Conclusion: Provides rigorous finite-N guarantees for DAM retrieval dynamics with explicit convergence rates and robustness bounds, bridging theoretical analysis with practical implementation considerations.
Abstract: Dense Associative Memory (DAM) generalizes Hopfield networks through higher-order interactions and achieves storage capacity that scales as $O(N^{n-1})$ under suitable pattern separation conditions. Existing dynamical analyses primarily study the thermodynamic limit $N\to\infty$ with randomly sampled patterns and therefore do not provide finite-size guarantees or explicit convergence rates. We develop an algorithmic analysis of DAM retrieval dynamics that yields finite-$N$ guarantees under explicit, verifiable pattern conditions. Under a separation assumption and a bounded-interference condition at high loading, we prove geometric convergence of asynchronous retrieval dynamics, which implies $O(\log N)$ convergence time once the trajectory enters the basin of attraction. We further establish adversarial robustness bounds expressed through an explicit margin condition that quantifies the number of corrupted bits tolerable per sweep, and derive capacity guarantees that scale as $\Theta(N^{n-1})$ up to polylogarithmic factors in the worst case, while recovering the classical $\Theta(N^{n-1})$ scaling for random pattern ensembles. Finally, we show that DAM retrieval dynamics admit a potential-game interpretation that ensures convergence to pure Nash equilibria under asynchronous updates. Complete proofs are provided in the appendices, together with preliminary experiments that illustrate the predicted convergence, robustness, and capacity scaling behavior.
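A minimal sketch of the retrieval dynamics being analyzed: asynchronous sign updates under a degree-n Dense Associative Memory energy. The Krotov-Hopfield form F(z) = z^n is our assumption, and the pattern count and corruption level are illustrative:

```python
import numpy as np

def dam_sweep(state, patterns, n=3):
    """One asynchronous retrieval sweep: set each bit to the sign that
    maximizes sum_mu F(xi_mu . x) with F(z) = z**n (degree-n interactions)."""
    x = state.copy()
    N = x.size
    for i in range(N):
        plus = patterns @ np.where(np.arange(N) == i, 1, x)    # overlaps if x_i = +1
        minus = patterns @ np.where(np.arange(N) == i, -1, x)  # overlaps if x_i = -1
        x[i] = 1 if (plus ** n).sum() >= (minus ** n).sum() else -1
    return x

rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(20, 64))    # 20 stored patterns, N = 64
probe = patterns[0].copy()
probe[:8] *= -1                                   # corrupt 8 of the 64 bits
recovered = dam_sweep(probe, patterns)
print((recovered == patterns[0]).mean())
```

The higher-order interaction sharpens the energy landscape around stored patterns, which is what lets the margin-based analysis tolerate a quantifiable number of corrupted bits per sweep.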
[593] Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
Shaopeng Fu, Di Wang
Main category: cs.LG
TL;DR: Theoretical analysis of Continuous Adversarial Training (CAT) for LLMs, explaining why embedding-space perturbations defend against token-space jailbreak attacks, and proposing singular value regularization for better robustness-utility tradeoff.
Details
Motivation: While CAT has shown empirical success in defending LLMs against jailbreak attacks, its theoretical foundations remain unclear - specifically why adversarial perturbations in embedding space help defend against token-space attacks. The paper aims to provide the first theoretical analysis of CAT based on in-context learning theory.
Method: 1) Theoretical analysis of CAT on linear transformers trained with adversarial examples from embedding space on in-context linear regression tasks, proving a robust generalization bound. 2) Based on the bound's relationship with embedding matrix singular values, propose adding a regularization term to CAT's objective function that depends on these singular values. 3) Experimental validation on real-world LLMs.
Result: Theoretical proof shows robust generalization bound has negative correlation with perturbation radius in embedding space, explaining CAT’s effectiveness. Experiments demonstrate the proposed singular value regularization helps LLMs achieve better jailbreak robustness-utility tradeoff compared to standard CAT.
Conclusion: The paper provides the first theoretical foundation for CAT in LLMs, explaining its mechanism through in-context learning theory and proposing an improved CAT method with singular value regularization that enhances robustness against jailbreak attacks while maintaining utility.
Abstract: Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM’s token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM’s embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.
[594] Loop Corrections to the Training and Generalization Errors of Random Feature Models
Taeyoung Kim
Main category: cs.LG
TL;DR: Random feature models with frozen neural networks as features, analyzing training/test/generalization errors beyond mean-kernel approximation using statistical physics and field theory.
Details
Motivation: To understand the performance of random feature models where neural networks are frozen as features, going beyond the standard mean-kernel approximation to account for finite-width effects and higher-order fluctuations.
Method: Adopts statistical physics viewpoint and effective field-theoretic framework to derive loop corrections to training, test, and generalization errors, analyzing scaling laws and verifying with experiments
Result: Derived loop corrections showing finite-width contributions beyond mean kernel, obtained scaling laws for errors, and provided experimental verification of the theoretical predictions
Conclusion: Random feature models’ performance depends not just on mean kernel but also on higher-order fluctuation statistics, with finite-width effects captured by loop corrections in field-theoretic framework
Abstract: We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training, test, and generalization errors beyond the mean-kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive the loop corrections to the training, test, and generalization errors, obtain their scaling laws, and support the theory with experimental verification.
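The model class under study is easy to state in code: freeze a randomly initialized network as a feature map and fit only the linear readout (ridge regression here; the dimensions and linear teacher are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d, width, n = 10, 200, 100
W = rng.normal(size=(width, d)) / np.sqrt(d)      # frozen random first layer
feats = lambda X: np.maximum(X @ W.T, 0)          # ReLU random features, (n, width)

w_star = rng.normal(size=d)                       # linear teacher (illustrative)
X_tr, X_te = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y_tr, y_te = X_tr @ w_star, X_te @ w_star

Phi = feats(X_tr)
lam = 1e-3                                        # small ridge on the readout only
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(width), Phi.T @ y_tr)

train_err = float(np.mean((Phi @ a - y_tr) ** 2))
test_err = float(np.mean((feats(X_te) @ a - y_te) ** 2))
print(f"train={train_err:.2e}  test={test_err:.2e}")
```

The paper's question is what the *ensemble average* of these errors looks like over draws of W at finite width, where fluctuations of the induced kernel — the loop corrections — matter beyond the mean-kernel term.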
[595] TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu, Guoqing Li, Anuj Pathania, Andy D. Pimentel, Meng Zhang
Main category: cs.LG
TL;DR: TCL is an efficient, transferable DL compiler framework that uses active learning sampling, Mamba-based cost modeling, and cross-platform knowledge distillation to optimize tensor programs with minimal data collection.
Details
Motivation: Existing DL compilers require large offline datasets for cost modeling and auto-tuning, which is expensive to collect and doesn't transfer well across different hardware platforms.
Method: Three core components: (1) RDU Sampler for data-efficient active learning selecting only 10% of programs, (2) Mamba-based cost model for long-range schedule dependencies with reduced parameters, (3) continuous knowledge distillation framework for cross-platform transfer.
Result: TCL achieves 16.8x faster tuning time on CPUs and 12.48x on GPUs, with 1.20x lower inference latency on CPUs and 1.13x on GPUs compared to Tenset-MLP baseline.
Conclusion: TCL provides an efficient, transferable solution for DL compiler optimization that significantly reduces data collection costs while maintaining performance across diverse hardware platforms.
Abstract: Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.
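The RDU idea — score candidate tensor programs by Representativeness, Diversity, and Uncertainty and keep roughly 10% — can be sketched as a greedy selector. The equal weighting and dot-product similarity here are hypothetical simplifications of the paper's joint optimization:

```python
import numpy as np

def rdu_select(feats, uncert, frac=0.10):
    """Greedy active-learning selection, a toy stand-in for the RDU Sampler:
    score = representativeness + diversity-from-selected + uncertainty
    (equal weights are an assumption for this sketch)."""
    n = len(feats)
    k = max(1, int(frac * n))
    rep = (feats @ feats.T).mean(axis=1)            # representativeness
    chosen = []
    for _ in range(k):
        if chosen:
            d = np.linalg.norm(feats[:, None] - feats[None, chosen], axis=-1)
            div = d.min(axis=1)                     # distance to nearest selected
        else:
            div = np.zeros(n)
        score = rep + div + uncert
        score[chosen] = -np.inf                     # never re-pick a program
        chosen.append(int(np.argmax(score)))
    return chosen

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))    # embeddings of 50 candidate tensor programs
uncert = rng.uniform(size=50)       # cost-model uncertainty per candidate
picked = rdu_select(feats, uncert)
print(len(picked))                  # 10% of 50 candidates
```

Only the selected programs would then be measured on hardware to label cost-model training data, which is where the collection savings come from.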
[596] Adaptive Data Dropout: Towards Self-Regulated Learning in Deep Neural Networks
Amar Gahir, Varshil Patel, Shreyank N Gowda
Main category: cs.LG
TL;DR: Adaptive Data Dropout: A framework that dynamically adjusts training data subsets based on performance feedback to improve efficiency and generalization, outperforming static data dropout methods.
Details
Motivation: Current training methods uniformly sample large datasets despite evidence that not all samples contribute equally. Existing progressive data reduction methods use fixed schedules that don't adapt during training, limiting their effectiveness.
Method: Proposes Adaptive Data Dropout with a lightweight stochastic update mechanism that dynamically modulates the dropout schedule online based on training accuracy feedback, treating data selection as an adaptive process that balances exploration and consolidation.
Result: Experiments on standard image classification benchmarks show the method reduces effective training steps while maintaining competitive accuracy compared to static data dropout strategies.
Conclusion: Adaptive data selection is a promising direction for efficient and robust training, with the proposed framework demonstrating practical benefits over static approaches.
Abstract: Deep neural networks are typically trained by uniformly sampling large datasets across epochs, despite evidence that not all samples contribute equally throughout learning. Recent work shows that progressively reducing the amount of training data can improve efficiency and generalization, but existing methods rely on fixed schedules that do not adapt during training. In this work, we propose Adaptive Data Dropout, a simple framework that dynamically adjusts the subset of training data based on performance feedback. Inspired by self-regulated learning, our approach treats data selection as an adaptive process, increasing or decreasing data exposure in response to changes in training accuracy. We introduce a lightweight stochastic update mechanism that modulates the dropout schedule online, allowing the model to balance exploration and consolidation over time. Experiments on standard image classification benchmarks show that our method reduces effective training steps while maintaining competitive accuracy compared to static data dropout strategies. These results highlight adaptive data selection as a promising direction for efficient and robust training. Code will be released.
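The feedback loop reads naturally as pseudocode: grow or shrink the active subset depending on whether training accuracy improved. This sketch uses a synthetic accuracy curve, and the step size and 10% floor are invented; the paper's stochastic update rule is more refined:

```python
import random

random.seed(0)

def adapt_subset_size(size, acc, prev_acc, n_total, step=0.05):
    """If accuracy improved, consolidate on less data; if it fell, re-expand.
    step and the 10% floor are hypothetical choices for this sketch."""
    if acc >= prev_acc:
        return max(int(size * (1 - step)), n_total // 10)
    return min(int(size * (1 + step)), n_total)

n_total = 10_000
size, prev_acc, history = n_total, 0.0, []
for epoch in range(20):
    subset = random.sample(range(n_total), size)    # data shown this epoch
    # Stand-in for training on `subset`: a noisy, improving accuracy curve
    acc = 0.5 + 0.02 * epoch + random.uniform(-0.01, 0.01)
    size = adapt_subset_size(size, acc, prev_acc, n_total)
    prev_acc = acc
    history.append(size)
print(history[0], "->", history[-1])
```

While training improves, the schedule steadily reduces the effective number of training steps per epoch, which is the efficiency gain the paper reports over fixed schedules.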
[597] Parcae: Scaling Laws For Stable Looped Language Models
Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, Daniel Y. Fu
Main category: cs.LG
TL;DR: Parcae: A stable looped architecture that treats looping as a dynamical system, addresses instability via spectral norm constraints, and enables predictable FLOP scaling while keeping parameters fixed.
Details
Motivation: Traditional fixed-depth architectures scale quality by increasing parameters or data, which increases memory footprint. Looped architectures offer an alternative by increasing FLOPs through repeated layer passes, but suffer from training instability issues like residual explosion and loss spikes.
Method: Recasts looping as a nonlinear time-variant dynamical system over the residual stream. Uses linear approximation to identify instability caused by large spectral norms in injection parameters. Proposes Parcae with spectral norm constraints via discretization of a negative diagonal parameterization for stability.
Result: Parcae achieves up to 6.3% lower validation perplexity over prior looped models. At 1.3B parameters, improves CORE and Core-Extended quality by 2.99 and 1.18 points vs Transformer baselines, achieving up to 87.5% relative quality of a Transformer twice the size. Enables predictable power laws for FLOP scaling with fixed parameters.
Conclusion: Parcae provides a stable looped architecture that enables efficient FLOP scaling without increasing parameters, offering an alternative to traditional scaling approaches with predictable scaling properties for both training and inference.
Abstract: Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through increased parameterization, at the expense of a higher memory footprint, or data. A potential alternative is looped architectures, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose Parcae, a novel stable, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity over prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs in training and test-time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test-time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves CORE and Core-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
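One way to read "discretization of a negative diagonal parameterization" (our interpretation, not the paper's exact equations): parameterize the diagonal as a = -softplus(raw) < 0 and discretize with a positive step, so each gain entry exp(delta * a) lies in (0, 1) and the injection's spectral norm stays below 1 for any raw parameter values:

```python
import numpy as np

def injection_gain(log_delta, a_param):
    """Negative-diagonal parameterization, then discretization (our reading):
    a = -softplus(a_param) < 0, gain = exp(delta * a) in (0, 1), so the diagonal
    injection matrix keeps spectral norm below 1 however training moves a_param."""
    delta = np.exp(log_delta)              # positive step size
    a = -np.log1p(np.exp(a_param))         # strictly negative diagonal
    return np.exp(delta * a)               # each entry in (0, 1)

raw = np.random.default_rng(0).normal(size=8) * 5.0   # arbitrary raw parameters
gain = injection_gain(np.zeros(8), raw)
print(gain.max() < 1.0, gain.min() > 0.0)  # spectral norm of diag(gain) < 1
```

A contraction at every loop iteration prevents the residual stream from blowing up no matter how many times the block is applied, which is the stability property the linearized analysis asks for.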
[598] The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime
Jason Z Wang
Main category: cs.LG
TL;DR: The paper shows that verifying calibration in AI models becomes fundamentally harder as models improve, establishing theoretical limits on calibration error estimation and demonstrating practical implications for evaluation practices.
Details
Motivation: The motivation is to address the fundamental limitations in verifying calibration of AI models, particularly as models improve. The authors challenge existing calibration evaluation practices by showing that widely cited calibration results may be below statistical noise floors, and aim to establish theoretical bounds on what can be reliably measured.
Method: The authors use theoretical analysis to prove minimax rates for calibration error estimation, establish impossibility results, and validate findings empirically across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande) with 6 LLMs from 5 families (8B-405B parameters), using bootstrap confidence intervals and permutation tests.
Result: Key results: 1) Self-evaluation without labels provides zero information about calibration, 2) Sharp phase transition below which miscalibration is undetectable, 3) Active querying eliminates the Lipschitz constant, 4) Verification cost grows exponentially with pipeline depth. Empirical validation shows self-evaluation non-significance holds in 80% of benchmark-model pairs, and 23% of pairwise comparisons are indistinguishable from noise.
Conclusion: The conclusion is that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution. As AI models improve, verifying their calibration becomes fundamentally harder, requiring new evaluation methodologies.
Abstract: The most cited calibration result in deep learning – post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) – is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate $\epsilon$ is $\Theta((L\epsilon/m)^{1/3})$, and no estimator can beat it. This “verification tax” implies that as AI models improve, verifying their calibration becomes fundamentally harder – with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at $m\epsilon \approx 1$ below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate $L^K$. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
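The scaling in the minimax rate $\Theta((L\epsilon/m)^{1/3})$ can be turned into a few lines of arithmetic. The numbers below are hypothetical, chosen purely for illustration, and the constants hidden by Theta are dropped, so only orders of magnitude are meaningful:

```python
def verification_floor(L, eps, m):
    """Scaling sketch of the paper's minimax rate (L * eps / m)^(1/3);
    Theta-constants are dropped, so this is an order-of-magnitude
    floor, not an exact bound."""
    return (L * eps / m) ** (1.0 / 3.0)

# Hypothetical example: Lipschitz constant 1, model error rate 1%,
# 10,000 labeled evaluation items.
floor = verification_floor(L=1.0, eps=0.01, m=10_000)

# The "opposite directions" effect: as the model improves (eps shrinks),
# the floor relative to the scale of plausible miscalibration grows,
# so ever more labels are needed to certify ever smaller errors.
rel = [verification_floor(1.0, e, 10_000) / e for e in (0.1, 0.01, 0.001)]
```

Under these toy numbers the floor works out to 0.01, so a claimed calibration error much below that would be indistinguishable from noise at this sample size.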
[599] An Optimal Sauer Lemma Over $k$-ary Alphabets
Steve Hanneke, Qinglin Meng, Shay Moran, Amirreza Shaeiri
Main category: cs.LG
TL;DR: Sharp Sauer inequality for multiclass and list prediction using DS dimension, improving bounds over Natarajan dimension for k>2 alphabets.
Details
Motivation: The Natarajan dimension-based Sauer bounds for multiclass settings (k>2) are suboptimal, motivating development of tighter combinatorial bounds using DS dimension.
Method: Uses the polynomial method to establish a sharp Sauer inequality expressed in terms of the Daniely-Shalev-Shwartz (DS) dimension and its extension, the list-DS dimension.
Result: Achieves tight bounds for all alphabet sizes k, list sizes ℓ, and dimension values, replacing exponential ℓ-dependence with optimal polynomial dependence and improving k-dependence.
Conclusion: Provides improved sample complexity bounds for list PAC learning and uniform convergence, sharpening recent results in learning theory literature.
Abstract: The Sauer-Shelah-Perles Lemma is a cornerstone of combinatorics and learning theory, bounding the size of a binary hypothesis class in terms of its Vapnik-Chervonenkis (VC) dimension. For classes of functions over a $k$-ary alphabet, namely the multiclass setting, the Natarajan dimension has long served as an analogue of VC dimension, yet the corresponding Sauer-type bounds are suboptimal for alphabet sizes $k>2$.
In this work, we establish a sharp Sauer inequality for multiclass and list prediction. Our bound is expressed in terms of the Daniely–Shalev-Shwartz (DS) dimension, and more generally with its extension, the list-DS dimension – the combinatorial parameters that characterize multiclass and list PAC learnability. Our bound is tight for every alphabet size $k$, list size $\ell$, and dimension value, replacing the exponential dependence on $\ell$ in the Natarajan-based bound by the optimal polynomial dependence, and improving the dependence on $k$ as well. Our proof uses the polynomial method. In contrast to the classical VC case, where several direct combinatorial proofs are known, we are not aware of any purely combinatorial proof in the DS setting. This motivates several directions for future research, which are discussed in the paper.
As consequences, we obtain improved sample complexity upper bounds for list PAC learning and for uniform convergence of list predictors, sharpening the recent results of Charikar et al. (STOC 2023), Hanneke et al. (COLT 2024), and Brukhim et al. (NeurIPS 2024).
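For readers without the classical statement at hand, the binary Sauer-Shelah-Perles bound that the paper generalizes to $k$-ary alphabets reads as follows (standard textbook form, not the paper's new inequality):

```latex
% Sauer-Shelah-Perles Lemma (binary case): if H is a class of
% {0,1}-valued functions with VC dimension d, then the number of
% distinct restrictions of H to any n points x_1, ..., x_n satisfies
\bigl|\, \mathcal{H}|_{\{x_1,\dots,x_n\}} \bigr|
  \;\le\; \sum_{i=0}^{d} \binom{n}{i} \;=\; O(n^{d}).
```

The paper's contribution is the analogous sharp statement when the label alphabet has size $k$ and predictions are lists of size $\ell$, with DS dimension playing the role of VC dimension.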
[600] Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, Yong Liu, Shuicheng Yan
Main category: cs.LG
TL;DR: Comprehensive survey and empirical evaluation of deep learning optimization algorithms, analyzing evolutionary trends and design trade-offs across first-order, second-order, and zeroth-order methods.
Details
Motivation: Address the lack of a cohesive framework for understanding optimization algorithms in deep learning, especially given challenges in large-scale training, privacy requirements, and distributed learning that expose limitations of conventional methods.
Method: Retrospective analysis of optimization algorithm evolution combined with comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios.
Result: Distilled key emerging trends and fundamental design trade-offs, providing actionable guidance for designing next-generation optimization methods.
Conclusion: Synthesizes theoretical insights with empirical evidence to guide development of more efficient, robust, and trustworthy optimization methods for deep learning.
Abstract: Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.
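Since the survey centers on SGD and Adam as the first-order baselines, a minimal self-contained sketch of both update rules may be useful for orientation. These are the textbook forms, not code from the surveyed work; the toy quadratic and step counts are arbitrary:

```python
import numpy as np

def sgd_step(w, g, lr=0.1):
    """Plain SGD: step against the gradient."""
    return w - lr * g

def adam_step(w, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015): exponential moving averages
    of the gradient (m) and its square (v), bias-corrected, rescale the
    step per coordinate."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

# Minimize f(w) = ||w||^2 (gradient 2w) from the same starting point.
w0 = np.array([1.0, -2.0])
w_sgd, w_adam = w0.copy(), w0.copy()
state = (np.zeros(2), np.zeros(2), 0)
for _ in range(100):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_adam, state = adam_step(w_adam, 2 * w_adam, state)
```

Note that Adam's per-coordinate step is bounded by roughly the learning rate, so with its default lr it moves much more slowly than SGD on this toy quadratic; the point here is the shape of the two update rules, not a performance comparison.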
[601] Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Yecheng Wu, Song Han, Hai Cai
Main category: cs.LG
TL;DR: Lightning OPD: An offline on-policy distillation framework that eliminates the need for live teacher servers by precomputing teacher log-probabilities over SFT rollouts, achieving state-of-the-art performance with 4x speedup.
Details
Motivation: Standard on-policy distillation (OPD) requires a live teacher inference server throughout training, causing substantial infrastructure overhead. The paper investigates whether OPD can be performed offline to reduce this overhead.
Method: Proposes Lightning OPD, an offline framework that enforces teacher consistency by precomputing teacher log-probabilities over supervised fine-tuning (SFT) rollouts. This eliminates the need for live teacher servers while maintaining the same optimum as standard OPD.
Result: Lightning OPD achieves state-of-the-art performance on mathematical reasoning and code generation tasks. Starting from Qwen3-8B-Base, it reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD.
Conclusion: Lightning OPD provides an efficient offline alternative to standard OPD that eliminates infrastructure overhead while maintaining performance, making LLM post-training more accessible for academic research.
Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
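The two-stage idea above (cache teacher log-probabilities once, then train with no live teacher server) can be sketched in a few lines. The loss choice here (per-token forward KL), the toy shapes, and all names are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
seq_len, vocab = 5, 8   # toy sizes, illustration only

# Stage 1 (offline, once): run the teacher over SFT rollouts and cache
# its per-token log-probabilities; in practice these are written to disk.
teacher_logits = rng.normal(size=(seq_len, vocab))
cached_teacher_logp = log_softmax(teacher_logits)

# Stage 2 (training): distill the student against the cached values,
# with no live teacher server in the loop.
def distill_loss(student_logits, teacher_logp):
    student_logp = log_softmax(student_logits)
    p_teacher = np.exp(teacher_logp)
    return (p_teacher * (teacher_logp - student_logp)).sum(axis=-1).mean()

student_logits = rng.normal(size=(seq_len, vocab))
loss = distill_loss(student_logits, cached_teacher_logp)
loss_self = distill_loss(teacher_logits, cached_teacher_logp)  # ~0 by construction
```

The teacher-consistency condition in the abstract is reflected here only implicitly: the cached log-probabilities must come from the same teacher used for SFT, otherwise the gradient of this loss is biased regardless of training duration.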
[602] CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations
Benzhao Tang, Shiyu Yang
Main category: cs.LG
TL;DR: CLAD is a deep learning framework that performs log anomaly detection directly on compressed byte streams without decompression or parsing, achieving state-of-the-art performance by detecting disruptions in compression patterns.
Details
Motivation: Existing log anomaly detection methods require full decompression and parsing of compressed logs, creating significant pre-processing overhead. The explosive growth of system logs makes streaming compression essential, but current approaches can't handle compressed data directly.
Method: CLAD exploits the insight that normal logs compress into regular byte patterns while anomalies disrupt them. It uses a dilated convolutional byte encoder, hybrid Transformer-mLSTM architecture, and four-way aggregation pooling. Training involves two stages: masked pre-training and focal-contrastive fine-tuning to handle class imbalance.
Result: Evaluated across five datasets, CLAD achieves state-of-the-art average F1-score of 0.9909, outperforming the best baseline by 2.72 percentage points. It completely eliminates decompression and parsing overheads while maintaining superior accuracy.
Conclusion: CLAD provides a robust solution for log anomaly detection directly on compressed byte streams, generalizing to structured streaming compressors and offering significant efficiency improvements over existing methods.
Abstract: The explosive growth of system logs makes streaming compression essential, yet existing log anomaly detection (LAD) methods incur severe pre-processing overhead by requiring full decompression and parsing. We introduce CLAD, the first deep learning framework to perform LAD directly on compressed byte streams. CLAD bypasses these bottlenecks by exploiting a key insight: normal logs compress into regular byte patterns, while anomalies systematically disrupt them. To extract these multi-scale deviations from opaque bytes, we propose a purpose-built architecture integrating a dilated convolutional byte encoder, a hybrid Transformer–mLSTM, and four-way aggregation pooling. This is coupled with a two-stage training strategy of masked pre-training and focal-contrastive fine-tuning to effectively handle severe class imbalance. Evaluated across five datasets, CLAD achieves a state-of-the-art average F1-score of 0.9909 and outperforms the best baseline by 2.72 percentage points. It delivers superior accuracy while completely eliminating decompression and parsing overheads, offering a robust solution that generalizes to structured streaming compressors.
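To make the "dilated convolutional byte encoder" component concrete, here is a minimal numpy sketch of how stacked dilated convolutions can consume raw compressed bytes. Layer widths, kernel size, and the dilation schedule are illustrative guesses, not the paper's configuration:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D dilated convolution over a byte-embedding sequence.
    x: (T, C_in), w: (K, C_in, C_out). Illustrative sketch only."""
    T, _ = x.shape
    K, _, c_out = w.shape
    y = np.zeros((T, c_out))
    for t in range(T):
        for k in range(K):
            src = t - k * dilation
            if src >= 0:
                y[t] += x[src] @ w[k]
    return y

# Stacking layers with exponentially growing dilation roughly doubles the
# receptive field per layer, so later layers see long-range byte patterns
# of the kind that anomalies are hypothesized to disrupt.
rng = np.random.default_rng(1)
compressed_bytes = rng.integers(0, 256, size=64)   # stand-in for a gzip chunk
x = np.eye(256)[compressed_bytes]                  # one-hot byte embedding
for d in (1, 2, 4, 8):
    w = rng.normal(scale=0.1, size=(3, x.shape[1], 32))
    x = np.maximum(dilated_conv1d(x, w, d), 0.0)   # conv + ReLU
```

No decompression or log parsing appears anywhere in this pipeline; the encoder operates on the opaque byte stream directly, which is the efficiency claim the abstract makes.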
[603] A Nonparametric Adaptive EWMA Control Chart for Binary Monitoring of Multiple Stream Processes
Faruk Muritala, Austin Brown, Dhrubajyoti Ghosh, Sherry Ni
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2604.12095: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12095&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
Xu Bai, Muhammed Tawfiqul Islam, Chen Wang, Adel N. Toosi
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2604.12171: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12171&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] Fine-tuning Factor Augmented Neural Lasso for Heterogeneous Environments
Jinhang Chai, Jianqing Fan, Cheng Gao, Qishuo Yin
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2604.12288: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12288&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] Information-Geometric Decomposition of Generalization Error in Unsupervised Learning
Gilhan Kim
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2604.12340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] VeriX-Anon: A Multi-Layered Framework for Mathematically Verifiable Outsourced Target-Driven Data Anonymization
Miit Daga, Swarna Priya Ramu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2604.12431: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12431&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] Uncertainty Quantification on Graph Learning: A Survey
Chao Chen, Chenghua Guo, Rui Xu, Jiujiu Chen, Xiangwen Liao, Xi Zhang, Sihong Xie, Hui Xiong, Philip Yu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2404.14642: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.14642&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] A DeepONet for inverting the Neumann-to-Dirichlet Operator in Electrical Impedance Tomography: An approximation theoretic perspective and numerical results
Anuj Abhishek, Thilo Strauss
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2407.17182: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.17182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] Towards Generalized Certified Robustness with Multi-Norm Training
Enyi Jiang, David S. Cheung, Gagandeep Singh
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2410.03000: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.03000&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] Scale-aware Message Passing For Graph Node Classification
Qin Jiang, Chengjia Wang, Michael Lones, Dongdong Chen, Wei Pang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2411.19392: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.19392&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] Clustering with Uniformity- and Neighbor-Based Random Geometric Graphs
Rui Shi, Elvan Ceyhan, Nedret Billor
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2501.06268: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.06268&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] FMASH: Advancing Traditional Chinese Medicine Formula Recommendation with Efficient Fusion of Multiscale Associations of Symptoms and Herbs
Xinhan Zheng, Xueting Wang, Ruotai Li, Huyu Wu, Haopeng Jin, Yehan Yang, Guodong Shan
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2503.05167: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.05167&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] Free Random Projection for In-Context Reinforcement Learning
Tomohiro Hayase, Benoît Collins, Nakamasa Inoue
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2504.06983: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.06983&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems
Yinsong Wang, Quan Zeng, Xiao Liu, Yu Ding
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2508.17403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] TeRA: Vector-based Random Tensor Network for High-Rank Adaptation of Large Language Models
Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.03234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] Invariant Features for Global Crop Type Classification
Xin-Yi Tong, Sherrie Wang
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.03497: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03497&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] Learning to accelerate distributed ADMM using graph neural networks
Henri Doerks, Paul Häusner, Daniel Hernández Escobar, Jens Sjölund
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.05288: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.05288&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] Replicable Reinforcement Learning with Linear Function Approximation
Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2509.08660: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.08660&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] Scalable Verification of Neural Control Barrier Functions Using Linear Bound Propagation
Nikolaus Vertovec, Frederik Baymler Mathiesen, Thom Badings, Luca Laurenti, Alessandro Abate
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.06341: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.06341&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.09780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression
Xinming Gao, Shangzhe Li, Yujin Cai, Wenwu Yu
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.11973: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11973&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] RankOOD – Class Ranking-based Out-of-Distribution Detection
Dishanika Denipitiyage, Naveen Karunanayake, Suranga Seneviratne, Sanjay Chawla
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.19996: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19996&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] Robust gene prioritization for Dietary Restriction via Fast-mRMR Feature Selection techniques
Rubén Fernández-Farelo, Jorge Paz-Ruza, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Alex A. Freitas
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2511.21211: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21211&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] Correction of Decoupled Weight Decay
Jason Chuan-Chih Chou
Main category: cs.LG
TL;DR: Summary unavailable; the arXiv API request was rate-limited (HTTP 429).
Abstract: Failed to fetch summary for 2512.08217: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08217&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] Hard Negative Sample-Augmented DPO Post-Training for Small Language Models
Haocheng Lu, Minjun Zhu, Henry Yu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.19728 returned HTTP 429 (rate limited).
[627] DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal
Peixuan Han, Yingjie Yu, Jingjun Xu, Jiaxuan You
Main category: cs.LG
Summary unavailable: the arXiv API request for 2601.18081 returned HTTP 429 (rate limited).
[628] NeuroPareto: Calibrated Acquisition for Costly Many-Goal Search in Vast Parameter Spaces
Rong Fu, Chunlei Meng, Youjin Wang, Haoyu Zhao, Jiaxuan Lu, Kun Liu, JiaBao Dou, Simon James Fong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.03901 returned HTTP 429 (rate limited).
[629] SDFed: Bridging Local Global Discrepancy via Subspace Refinement and Divergence Control in Federated Prompt Learning
Yicheng Di, Wei Yuan, Tieke He, Yuan Liu, Hongzhi Yin
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.08590 returned HTTP 429 (rate limited).
[630] RLGT: A reinforcement learning framework for extremal graph theory
Ivan Damnjanović, Uroš Milivojević, Irena Đorđević, Dragan Stevanović
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.17276 returned HTTP 429 (rate limited).
[631] TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs
Rong Fu, Yibo Meng, Guangzhen Yao, Jiaxuan Lu, Zeyu Zhang, Zhaolu Kang, Ziming Guo, Jia Yee Tan, Xiaojing Du, Simon James Fong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.18109 returned HTTP 429 (rate limited).
[632] Dreamer-CDP: Improving Reconstruction-free World Models Via Continuous Deterministic Representation Prediction
Michael Hauri, Friedemann Zenke
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.07083 returned HTTP 429 (rate limited).
[633] A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting
Jing Liu, Maria Grith, Xiaowen Dong, Mihai Cucuringu
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.10559 returned HTTP 429 (rate limited).
[634] Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch
Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.24647 returned HTTP 429 (rate limited).
[635] BLOSSOM: Block-wise Federated Learning Over Shared and Sparse Observed Modalities
Pranav M R, Jayant Chandwani, Ahmed M. Abdelmoniem, Arnab K. Paul
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.27552 returned HTTP 429 (rate limited).
[636] Detecting Complex Money Laundering Patterns with Incremental and Distributed Graph Modeling
Haseeb Tariq, Alen Kaja, Marwan Hassani
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.01315 returned HTTP 429 (rate limited).
[637] GeoPAS: Geometric Probing for Algorithm Selection in Continuous Black-Box Optimisation
Jiabao Brad Wang, Xiang Shi, Yiliang Yuan, Mustafa Misir
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.09095 returned HTTP 429 (rate limited).
[638] Automated Batch Distillation Process Simulation for a Large Hybrid Dataset for Deep Anomaly Detection
Jennifer Werner, Justus Arweiler, Indra Jungjohann, Jochen Schmid, Fabian Jirasek, Hans Hasse, Michael Bortz
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.09166 returned HTTP 429 (rate limited).
[639] A Full Compression Pipeline for Green Federated Learning in Communication-Constrained Environments
Elouan Colybes, Shirin Salehi, Anke Schmeink
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.11146 returned HTTP 429 (rate limited).
[640] Towards Situation-aware State Modeling for Air Traffic Flow Prediction
Anqi Liu, Bin Wang, Jiangtao Zhao, Dechuan Ma, Guiyuan Jiang, Feng Hong, Yanwei Yu, Tianrui Li
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.11198 returned HTTP 429 (rate limited).
[641] Structural Consequences of Policy-Based Interventions on the Global Supply Chain Network
Lea Karbevska, Liming Xu, Zehui Dai, Sara AlMahri, Alexandra Brintrup
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.11479 returned HTTP 429 (rate limited).
[642] Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
Etienne Boursier, Loucas Pillaud-Vivien, Nicolas Flammarion
Main category: cs.LG
Summary unavailable: the arXiv API request for 2206.00939 returned HTTP 429 (rate limited).
[643] metasnf: Meta Clustering with Similarity Network Fusion in R
Prashanth S Velayudhan, Xiaoqiao Xu, Prajkta Kallurkar, Ana Patricia Balbon, Maria T Secara, Adam Taback, Denise Sabac, Nicholas Chan, Shihao Ma, Bo Wang, Daniel Felsky, Stephanie H Ameis, Brian Cox, Colin Hawco, Lauren Erdman, Anne L Wheeler
Main category: cs.LG
Summary unavailable: the arXiv API request for 2410.17976 returned HTTP 429 (rate limited).
[644] Surrogate models for diffusion on graphs via sparse polynomials
Giuseppe Alessio D’Inverno, Kylian Ajavon, Simone Brugiapaglia
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.06595 returned HTTP 429 (rate limited).
[645] The Illusion of Fit: Spatially Resolved Assessment of Constitutive Model Validity in Elastography and Physics-Based Inverse Problems
Vincent C. Scholz, P.S. Koutsourelakis
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.07415 returned HTTP 429 (rate limited).
[646] Influence Strength Estimation in Hyperbolic Space for Social Influence Maximization
Hongliang Qiao, Shanshan Feng, Min Zhou, Xutao Li, Yunming Ye, Fan Li, Shuo Shang, Yew-Soon Ong
Main category: cs.LG
Summary unavailable: the arXiv API request for 2502.13571 returned HTTP 429 (rate limited).
[647] Privacy-Preserving Transfer Learning for Community Detection using Locally Distributed Multiple Networks
Xiao Guo, Xuming He, Xiangyu Chang, Shujie Ma
Main category: cs.LG
Summary unavailable: the arXiv API request for 2504.00890 returned HTTP 429 (rate limited).
[648] A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance
Axel Friedrich Wolter, Tobias Sutter
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.04494 returned HTTP 429 (rate limited).
[649] Incentivizing High-Quality Human Annotations with Golden Questions
Shang Liu, Zhongze Cai, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.19134 returned HTTP 429 (rate limited).
[650] On the Convergence Analysis of Muon
Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang
Main category: cs.LG
Summary unavailable: the arXiv API request for 2505.23737 returned HTTP 429 (rate limited).
[651] Joint Multi-Target Detection-Tracking in Cognitive Massive MIMO Radar via POMCP
Imad Bouhou, Stefano Fortunati, Leila Gharsalli, Alexandre Renaux
Main category: cs.LG
Summary unavailable: the arXiv API request for 2507.17506 returned HTTP 429 (rate limited).
[652] Climate Model Tuning with Online Synchronization-Based Parameter Estimation
Jordan Seneca, Suzanne Bintanja, Frank M. Selten
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.06180 returned HTTP 429 (rate limited).
[653] Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix
Tomohiro Hayase, Benoît Collins, Ryo Karakida
Main category: cs.LG
Summary unavailable: the arXiv API request for 2510.06685 returned HTTP 429 (rate limited).
[654] Iterative Compositional Data Generation for Robot Control
Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton
Main category: cs.LG
Summary unavailable: the arXiv API request for 2512.10891 returned HTTP 429 (rate limited).
[655] Physics and causally constrained discrete-time neural models of turbulent dynamical systems
Fabrizio Falasca, Laure Zanna
Main category: cs.LG
Summary unavailable: the arXiv API request for 2602.13847 returned HTTP 429 (rate limited).
[656] A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets
Samuel Gruffaz, Kyurae Kim, Fares Guehtar, Hadrien Duval-decaix, Pacôme Trautmann
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.18640 returned HTTP 429 (rate limited).
[657] A Large-Scale Comparative Analysis of Imputation Methods for Single-Cell RNA Sequencing Data
Yuichiro Iwashita, Ahtisham Fazeel Abbasi, Koichi Kise, Andreas Dengel, Muhammad Nabeel Asim
Main category: cs.LG
Summary unavailable: the arXiv API request for 2603.24626 returned HTTP 429 (rate limited).
[658] Stochastic Auto-conditioned Fast Gradient Methods with Optimal Rates
Yao Ji, Guanghui Lan
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.06525 returned HTTP 429 (rate limited).
[659] Variational Quantum Physics-Informed Neural Networks for Hydrological PDE-Constrained Learning with Inherent Uncertainty Quantification
Prasad Nimantha Madusanka Ukwatta Hewage, Midhun Chakkravarthy, Ruvan Kumara Abeysekara
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.09374 returned HTTP 429 (rate limited).
[660] Discrete Flow Maps
Peter Potaptchik, Jason Yim, Adhi Saravanan, Peter Holderrieth, Eric Vanden-Eijnden, Michael S. Albergo
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.09784 returned HTTP 429 (rate limited).
[661] Computation of Least Trimmed Squares: A Branch-and-Bound framework with Hyperplane Arrangement Enhancements
Xiang Meng, Andrés Gómez, Rahul Mazumder
Main category: cs.LG
Summary unavailable: the arXiv API request for 2604.11584 returned HTTP 429 (rate limited).
cs.MA
[662] REGREACT: Self-Correcting Multi-Agent Pipelines for Structured Regulatory Information Extraction
Mohammed Ali, Abdelrahman Abdallah, Adam Jatowt
Main category: cs.MA
TL;DR: RegReAct: A multi-agent framework for extracting structured compliance criteria from regulatory documents using self-correcting ODR loops to handle hallucinations and cross-references.
Details
Motivation: Existing language models struggle to extract structured compliance criteria from regulatory documents: they hallucinate structural elements, lose hierarchical relationships, and fail to resolve inter-document dependencies.
Method: A self-correcting multi-agent framework with seven specialized stages, each using an Observe-Diagnose-Repair loop to validate outputs against the source documents. It constructs typed criterion graphs for structural accuracy and resolves external dependencies by retrieving, summarizing, and embedding referenced legal content.
Result: Applied to three EU Taxonomy Delegated Acts, creating a dataset of 242 activities with over 4,800 hierarchical criteria, thresholds, and enriched source summaries. Outperforms GPT-4o single-pass baseline across all structural and semantic metrics.
Conclusion: RegReAct effectively addresses challenges in regulatory information extraction through specialized multi-agent processing with self-correction mechanisms, producing accurate and complete structured compliance criteria.
Abstract: Extracting structured, machine-readable compliance criteria from regulatory documents remains an open challenge. Single-pass language models hallucinate structural elements, lose hierarchical relationships, and fail to resolve inter-document dependencies. We introduce RegReAct, a self-correcting multi-agent framework that decomposes regulatory information extraction into seven specialized stages, each with an Observe-Diagnose-Repair (ODR) loop that validates outputs against the source, correcting not only model hallucinations but also cross-reference errors in the regulations themselves. To ensure structural accuracy, RegReAct constructs a typed criterion graph; to ensure completeness, it resolves external dependencies by retrieving, summarizing, and embedding referenced legal content inline, producing self-contained outputs. Applying RegReAct to three EU Taxonomy Delegated Acts, we construct a dataset comprising 242 activities with over 4,800 hierarchical criteria, thresholds, and enriched source summaries. Evaluation against a GPT-4o single-pass baseline confirms that RegReAct outperforms it across all structural and semantic metrics. Code and data will be made publicly available: https://github.com/RECOR-Benchmark/RECOR
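The per-stage Observe-Diagnose-Repair loop described above can be sketched generically: produce an output, check it against the source, and repair until validation passes or a retry budget runs out. This is a minimal illustration only; the class, function names, and repair budget below are hypothetical and not RegReAct's published API.

```python
from dataclasses import dataclass


@dataclass
class ODRStage:
    """One pipeline stage wrapped in an Observe-Diagnose-Repair loop.

    Illustrative sketch; names and the retry budget are assumptions.
    """
    name: str
    max_repairs: int = 3

    def run(self, produce, validate, repair):
        """produce() -> output; validate(output) -> list of issues;
        repair(output, issues) -> corrected output."""
        output = produce()
        for _ in range(self.max_repairs):
            issues = validate(output)        # Observe + Diagnose against the source
            if not issues:
                return output                # validated: pass downstream
            output = repair(output, issues)  # Repair, then re-validate
        return output                        # best effort once the budget is spent
```

In RegReAct each of the seven stages would supply its own produce/validate/repair logic (e.g. a criterion extractor validated against the source regulation text); here a toy validator flagging a missing source citation suffices to exercise the loop.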
[663] VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
Lucas Stoffl, Benedikt Wiestler, Johannes C. Paetzold
Main category: cs.MA
TL;DR: VERITAS is a multi-agent system that autonomously tests natural-language hypotheses on multimodal clinical datasets with full auditability, using specialized agents for workflow decomposition and epistemic evidence labeling.
Details
Motivation: Clinical research involving multimodal data (including medical imaging) requires coordination across multiple specialties, creating bottlenecks. Current approaches lack verifiability and proper handling of non-significant results in medical contexts.
Method: Four-phase workflow decomposition with role-specialized agents: analysis planning, segmentation, statistical analysis, and verdict generation. Introduces epistemic evidence labeling framework that classifies outcomes as Supported, Refuted, Underpowered, or Invalid based on significance, effect direction, and study power.
Result: Achieved 81.4% verdict accuracy with frontier models and 71.2% with open-weight models (8-30B), outperforming five single-model baselines. Produced 86.6% independently verifiable statistical outputs. Tested on 64 hypotheses across cardiac and brain glioma MRI datasets.
Conclusion: Structured multi-agent decomposition can substitute for model scale while maintaining clinical research verifiability. The system enables autonomous hypothesis testing with full auditability, addressing critical needs in medical imaging research.
Abstract: Drawing meaningful conclusions from inherently multimodal clinical data (including medical imaging) requires coordinating expertise across the clinical specialty, radiology, programming, and biostatistics. This fragmented process bottlenecks discovery. We present VERITAS (Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems), a multi-agent system that autonomously tests natural-language hypotheses on multimodal clinical datasets while producing a fully auditable evidence trail: every statistical conclusion traces through inspectable, executable outputs from analysis plan to segmentation masks to statistical code to final verdict. VERITAS decomposes the workflow into four phases handled by role-specialized agents, and introduces an epistemic evidence label framework that mechanically classifies outcomes as Supported, Refuted, Underpowered, or Invalid by jointly evaluating significance, effect direction, and study power. This distinction is critical in medical imaging, where non-significant results often reflect insufficient sample size rather than absent effects. To evaluate the system, we construct a tiered benchmark of 64 hypotheses spanning six complexity levels across cardiac (ACDC, 150 subjects) and brain glioma (UCSF-PDGM, 501 subjects) MRI. VERITAS reaches 81.4% verdict accuracy with frontier models and 71.2% with locally-hosted open-weight models (8-30B), outperforming all five single-model baselines in both classes. It also produces the highest rate of independently verifiable statistical outputs (86.6%), so even its failures remain diagnosable through artifact inspection. Structured multi-agent decomposition thus substitutes for model scale while preserving the verifiability clinical research demands.
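The four-way evidence labeling described above jointly evaluates significance, effect direction, and study power. A minimal sketch of one such decision rule follows; the thresholds and the convention for adequately powered null results are illustrative assumptions, not the paper's exact values.

```python
def epistemic_label(p_value, effect_sign, hypothesized_sign, power,
                    alpha=0.05, min_power=0.8, analysis_valid=True):
    """Map a statistical outcome to one of VERITAS's four evidence labels.
    Thresholds (alpha, min_power) are illustrative assumptions."""
    if not analysis_valid:
        return "Invalid"                 # the analysis itself failed checks
    if p_value < alpha:                  # significant: direction decides
        return "Supported" if effect_sign == hypothesized_sign else "Refuted"
    if power < min_power:                # non-significant and underpowered:
        return "Underpowered"            # cannot distinguish "no effect"
    return "Refuted"                     # powered null (one possible convention)
```

The Underpowered branch captures the point made in the abstract: in medical imaging, a non-significant result often reflects insufficient sample size rather than an absent effect.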
[664] DarwinTOD: LLM-driven Lifelong Self-evolution for Task-oriented Dialog Systems
Shuyu Zhang, Yujie Liu, Xinru Wang, Cheng Zhang, Yanmin Zhu, Bin Li
Main category: cs.MA
TL;DR: Summary unavailable; the arXiv API request for 2601.07248 returned HTTP 429 (rate limited).
[665] Can Small Agents Collaborate to Beat a Single Large Language Model?
Agata Żywot, Xinyi Chen, Yifei Yuan, Anders Søgaard, Maarten de Rijke
Main category: cs.MA
TL;DR: Summary unavailable; the arXiv API request for 2601.11327 returned HTTP 429 (rate limited).
[666] $\lambda_A$: A Typed Lambda Calculus for LLM Agent Composition
Qin Liu
Main category: cs.MA
TL;DR: Summary unavailable; the arXiv API request for 2604.11767 returned HTTP 429 (rate limited).
eess.AS
[667] StreamMark: A Deep Learning-Based Semi-Fragile Audio Watermarking for Proactive Deepfake Detection
Zhentao Liu, Milos Cernak
Main category: eess.AS
TL;DR: StreamMark: A deep learning-based semi-fragile audio watermarking system that distinguishes between benign audio conversions and malicious deepfake manipulations.
Details
Motivation: The increasing sophistication of generative AI makes it difficult to distinguish deepfake audio from authentic speech. Passive detection methods have limitations, so there's a need for proactive watermarking approaches that can differentiate between benign audio processing and malicious manipulations.
Method: Proposes StreamMark with a complex-domain embedding technique within an Encoder-Distortion-Decoder architecture. The system is explicitly trained to differentiate between benign transformations (compression, noise) that preserve semantic meaning and malicious manipulations (voice conversion, speech editing) that alter semantics.
Result: Achieves high imperceptibility (SNR 24.16 dB, PESQ 4.20), resilience to real-world distortions like Opus encoding, and principled fragility against deepfake attacks. Message recovery accuracy drops to chance levels (~50%) for malicious manipulations while remaining robust to benign AI-based style transfers (ACC >98%).
Conclusion: StreamMark provides an effective semi-fragile audio watermarking solution that can distinguish between benign audio processing and malicious deepfake manipulations, offering a proactive approach to audio authenticity verification.
Abstract: The rapid advancement of generative AI has made it increasingly challenging to distinguish between deepfake audio and authentic human speech. To overcome the limitations of passive detection methods, we propose StreamMark, a novel deep learning-based, semi-fragile audio watermarking system. StreamMark is designed to be robust against benign audio conversions that preserve semantic meaning (e.g., compression, noise) while remaining fragile to malicious, semantics-altering manipulations (e.g., voice conversion, speech editing). Our method introduces a complex-domain embedding technique within a unique Encoder-Distortion-Decoder architecture, trained explicitly to differentiate between these two classes of transformations. Comprehensive benchmarks demonstrate that StreamMark achieves high imperceptibility (SNR 24.16 dB, PESQ 4.20), is resilient to real-world distortions like Opus encoding, and exhibits principled fragility against a suite of deepfake attacks, with message recovery accuracy dropping to chance levels (~50%), while remaining robust to benign AI-based style transfers (ACC >98%).
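The semi-fragile verdict reported above (recovery accuracy >98% for benign edits, ~50% for malicious ones) implies a simple decision rule on the recovered payload. A hedged sketch, with an assumed threshold that is not taken from the paper:

```python
def bit_accuracy(embedded, recovered):
    """Fraction of watermark payload bits recovered correctly."""
    return sum(a == b for a, b in zip(embedded, recovered)) / len(embedded)

def verdict(acc, authentic_threshold=0.9):
    """Semi-fragile decision rule: high recovery accuracy means the edit
    was benign; accuracy near chance (~0.5) means a semantics-altering
    manipulation destroyed the watermark. Threshold is illustrative."""
    return "authentic" if acc >= authentic_threshold else "tampered"
```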
[668] Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
Xiangyu Zhang, Benjamin John Southwell, Siqi Pan, Xinlei Niu, Beena Ahmed, Julien Epps
Main category: eess.AS
TL;DR: Video-enhanced audio tokenization that integrates visual information while preserving audio reconstruction quality for multimodal understanding tasks.
Details
Motivation: Existing audio tokenizers have limitations in understanding tasks due to single-modality constraints, especially when audio signals are ambiguous. While adding visual information helps understanding, current multimodal fusion approaches degrade audio reconstruction quality, which is unacceptable for end-to-end audio systems requiring high-fidelity generation.
Method: Proposes Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization. Key findings: 1) Fusion location in tokenizer architecture is crucial for preserving reconstruction quality, 2) Contrastive learning is unsuitable for discrete tokenizers, 3) Temporal-axis fusion guided by distinctive features works best. The method integrates visual information into the audio tokenizer while maintaining reconstruction fidelity.
Result: The approach maintains high-fidelity audio reconstruction while achieving superior performance on downstream understanding tasks compared to audio-only tokenizers and established multimodal fusion baselines.
Conclusion: Successfully demonstrates that visual information can be integrated into audio tokenizers without degrading reconstruction quality, enabling better multimodal understanding while preserving audio generation capabilities.
Abstract: Audio tokenization has emerged as a critical component in end-to-end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding tasks due to single-modality constraints, particularly when audio signals contain ambiguous or incomplete information. While incorporating additional modality information can significantly enhance audio understanding, current multimodal fusion approaches invariably degrade reconstruction quality. This degradation is unacceptable for end-to-end audio systems that require high-fidelity audio generation capabilities. In this work, we investigate the root causes of reconstruction quality degradation in video-enhanced audio tokenization and present three key findings. First, the location of fusion within the tokenizer architecture is crucial for preserving reconstruction quality. Second, we show that contrastive learning, though effective in continuous representation fusion, is unsuitable for discrete tokenizers as it fails to enhance downstream task performance. Third, while feature-dimension fusion approaches achieve moderate success, we discover that fusing along the temporal axis – guided by the concept of distinctive features – yields significantly better results. Building on these insights, we introduce the Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization, the first approach to successfully integrate visual information into audio tokenizer architectures while preserving reconstruction fidelity. Our approach not only maintains high-fidelity reconstruction but also achieves superior performance on downstream understanding tasks compared with audio-only tokenizers and established multimodal fusion baselines.
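The contrast between feature-dimension and temporal-axis fusion comes down to which axis of the feature tensor the visual frames are joined along. The toy sketch below shows only the two concatenation axes on nested lists; the paper's timing-aware fusion, guided by distinctive features, is of course more selective than a plain interleave.

```python
# Toy features: T=4 frames, D=3 dims per modality (plain nested lists).
audio = [[1.0] * 3 for _ in range(4)]
video = [[2.0] * 3 for _ in range(4)]

# Feature-dimension fusion: widen each frame -> shape (T, 2D).
feat_fused = [a + v for a, v in zip(audio, video)]

# Temporal-axis fusion: interleave frames along time -> shape (2T, D),
# leaving the per-frame width the quantizer sees unchanged.
time_fused = [frame for pair in zip(audio, video) for frame in pair]
```

Keeping the per-frame width unchanged is one reason temporal-axis fusion can leave the quantizer, and hence reconstruction, less disturbed.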
[669] TokenSE: a Mamba-based discrete token speech enhancement framework for cochlear implants
Hsin-Tien Chiang, John H. L. Hansen
Main category: eess.AS
TL;DR: TokenSE: A discrete token-based speech enhancement framework using Mamba architecture for cochlear implant users, achieving linear computational complexity and improved speech intelligibility in noisy environments.
Details
Motivation: Speech enhancement is crucial for cochlear implant users who struggle with speech understanding in noisy and reverberant conditions. Current Transformer-based approaches have quadratic computational complexity, limiting their practicality for real-time hearing applications.
Method: Proposes TokenSE, a discrete token-based speech enhancement framework operating in neural audio codec space. Uses a Mamba-based model to predict clean codec token indices from degraded speech, achieving linear computational complexity through input-dependent selection mechanisms instead of Transformer’s self-attention.
Result: TokenSE consistently outperforms baseline methods on both in-domain and out-of-domain datasets in objective evaluations. Subjective listening experiments with CI users show clear benefits in speech intelligibility under adverse noisy and reverberant environments.
Conclusion: TokenSE demonstrates that Mamba-based architectures offer a compelling alternative to Transformers for speech enhancement, particularly for cochlear implant and hearing aid applications, by achieving linear complexity while maintaining or improving performance.
Abstract: Speech enhancement (SE) is critical for improving speech intelligibility and quality in real-world environments, particularly for cochlear implant (CI) users who experience severe degradations in speech understanding under noisy and reverberant conditions. In this study, we propose TokenSE, a discrete token-based SE framework operating in the neural audio codec space, which predicts clean codec token indices from degraded speech using a Mamba-based model. Unlike the earlier Transformer architecture, whose self-attention mechanism has a computational complexity that grows quadratically with sequence length, the input-dependent selection mechanism of Mamba achieves linear complexity, making it a compelling alternative to Transformers, especially for CI and hearing-aid (HA) applications. Objective evaluations show that TokenSE consistently outperforms baseline methods on both in-domain and out-of-domain datasets. Moreover, subjective listening experiments with CI users indicate clear benefit in speech intelligibility under adverse noisy and reverberant environments.
[670] VoxEffects: A Speech-Oriented Audio Effects Dataset and Benchmark
Zhe Zhang, Yigitcan Özer, Junichi Yamagishi
Main category: eess.AS
TL;DR: VoxEffects: A speech audio effects dataset with exact effect-chain supervision for studying post-production effects in speech audio, enabling effect identification tasks and robustness analysis.
Details
Motivation: Speech audio in real-world scenarios often contains post-production effects, but existing datasets lack precise annotations of these effects and their parameters, limiting systematic research on audio effect identification and analysis.
Method: Created VoxEffects dataset by building from minimally edited clean speech with an extensible rendering pipeline for both offline synthesis and on-the-fly rendering. Provides exact effect-chain supervision at multiple granularities and includes benchmark tasks for effect presence detection, preset classification, and intensity prediction.
Result: Developed a comprehensive speech audio effects dataset with precise annotations, established benchmark tasks, and provided an AudioMAE-based multi-task baseline. Conducted analyses on domain shift, robustness, input duration, and gender fairness.
Conclusion: VoxEffects enables systematic study of speech audio effects identification and provides a foundation for research on audio effect analysis in speech processing, with applications in audio understanding and generation.
Abstract: Speech audio in the wild is often processed by post-production effects, but existing speech datasets rarely provide precise annotations of effects and parameters, limiting systematic study. We introduce VoxEffects, a speech audio effects dataset that pairs produced speech with exact effect-chain supervision at multiple granularities. VoxEffects supports speech-oriented audio effect identification: given a produced waveform, infer which effects are present and how they are applied. Built from minimally edited clean speech, it provides an extensible rendering pipeline for both offline synthesis and on-the-fly rendering for efficient training and evaluation. The audio effect identification benchmark includes effect presence detection, preset classification, and intensity prediction, with a robustness protocol covering capture-side and platform-side degradations. We provide an AudioMAE-based multi-task baseline and analyses of domain shift, robustness, input duration, and gender fairness.
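The key idea of "exact effect-chain supervision" is that the renderer logs every effect and parameter it applies alongside the produced audio. A hedged sketch with two toy stand-in effects (the effect names and parameter keys are hypothetical, not VoxEffects' schema):

```python
def apply_chain(samples, chain):
    """Render a (toy) effect chain and log the exact effect/parameter
    supervision alongside the audio -- the kind of per-clip annotation
    VoxEffects pairs with produced speech."""
    log = []
    for name, params in chain:
        if name == "gain":
            samples = [s * params["factor"] for s in samples]
        elif name == "clip":
            t = params["threshold"]
            samples = [max(-t, min(t, s)) for s in samples]
        log.append({"effect": name, **params})
    return samples, log

out, log = apply_chain([0.2, 0.9, -0.9],
                       [("gain", {"factor": 2.0}),
                        ("clip", {"threshold": 1.0})])
```

Because the log is produced by the renderer itself, the effect-presence, preset, and intensity labels are exact by construction rather than annotated after the fact.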
[671] Contextual Biasing for ASR in Speech LLM with Common Word Cues and Bias Word Position Prediction
Sashi Novitasari, Takashi Fukuda, Kurata Gakuto, George Saon
Main category: eess.AS
TL;DR: Speech-aware LLMs with acoustic cues for contextual biasing without requiring phonetic knowledge or G2P systems
Details
Motivation: Current SLLMs struggle with rare bias words, and existing contextual biasing methods require phonetic knowledge or G2P systems, which limits practical deployment when such resources are unavailable.
Method: Uses acoustic cues from common words with similar pronunciations to target bias words, plus bias word positional prediction via multi-output learning, eliminating the need for phonetic knowledge or G2P tools.
Result: Reduces bias word recognition errors by 16.3% compared to baseline systems, including on out-of-domain data.
Conclusion: Proposed acoustic cue-based contextual biasing method effectively improves bias word recognition in SLLMs without requiring phonetic expertise or G2P systems, enhancing practical deployment.
Abstract: Speech-aware LLMs (SLLMs) have recently achieved state-of-the-art ASR performance; however, they still fail to accurately transcribe bias words that appear rarely or never in the training data. Contextual biasing mechanisms are commonly implemented by introducing a predefined bias word list into the model via a text prompt or additional module. For further improvement, predefined bias words can be paired with their phoneme representations as pronunciation cues. Typically, phoneme sequences are generated through a G2P system that covers the target languages and domains of the bias words. Therefore, when a compatible G2P system is unavailable, phoneme-assisted contextual biasing becomes difficult to perform. Moreover, manually adding accurate phoneme sequences requires advanced phonetic knowledge. In this paper, we explore contextual biasing in SLLM based on acoustic cues associated with a set of common words whose pronunciations are partially similar to those of the target bias words. We assume ASR applications in which end users do not require special knowledge of phonetics or utilize G2P tools for inference. For enhanced robustness, we also introduce bias word positional prediction implemented in a multi-output learning fashion. Our method reduces bias word recognition errors by 16.3% compared to baseline systems, including on out-of-domain data.
[672] An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding
Tianhui Su, Tien-Ping Tan, Salima Mdhaffar, Yannick Estève, Aghilas Sini
Main category: eess.AS
TL;DR: Proposes an ultra-low latency end-to-end non-autoregressive TTS architecture using discrete latent space of Mimi neural audio codec, achieving 10.6x speedup over conventional pipelines with 48.99ms latency.
Details
Motivation: Real-time speech synthesis needs to balance inference latency and acoustic fidelity. Conventional TTS pipelines have computational bottlenecks from neural vocoders and spectral over-smoothing artifacts from regression-based acoustic modeling.
Method: End-to-end non-autoregressive architecture optimized for block-wise generation, directly modeling discrete latent space of Mimi neural audio codec. Uses modified FastSpeech 2 backbone with progressive depth-wise sequential decoding strategy, conditioning 32 layers of residual vector quantization codes.
Result: Achieves ultra-low latency inference with 10.6x acceleration over conventional cascaded pipelines, 48.99ms average time-to-first-byte latency (below human perception threshold). Shows quantitative improvements in fundamental voicing accuracy and mitigates high-frequency spectral degradation. Validated on English and Malay datasets.
Conclusion: The architecture establishes a highly optimized solution for deploying real-time streaming speech interfaces, with language-independent deployment capability and significant improvements over conventional continuous regression models.
Abstract: Real-time speech synthesis requires balancing inference latency and acoustic fidelity for interactive applications. Conventional continuous text-to-speech pipelines require computationally intensive neural vocoders to reconstruct phase information, creating a significant streaming bottleneck. Furthermore, regression-based acoustic modeling frequently induces spectral over-smoothing artifacts. To address these limitations, this paper proposes a novel end-to-end non-autoregressive architecture optimized for ultra-low latency block-wise generation, directly modeling the highly compressed discrete latent space of the Mimi neural audio codec. Integrating a modified FastSpeech 2 backbone with a progressive depth-wise sequential decoding strategy, the architecture dynamically conditions 32 layers of residual vector quantization codes. This mechanism resolves phonetic alignment degradation and manages the complexity of high-fidelity discrete representations without temporal autoregressive overhead. Experimental evaluations on English and Malay datasets validate its language-independent deployment capability. Compared to conventional continuous regression models, the proposed architecture demonstrates quantitative improvements in fundamental voicing accuracy and mitigates high-frequency spectral degradation. It achieves ultra-low latency inference, translating to a 10.6-fold absolute acceleration over conventional cascaded pipelines. Crucially, the system achieves an average time-to-first-byte latency of 48.99 milliseconds, falling significantly below the human perception threshold for real-time interactive streaming. These results firmly establish the proposed architecture as a highly optimized solution for deploying real-time streaming speech interfaces.
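The residual vector quantization codes being decoded depth-wise combine additively: a frame's embedding is the sum of one codeword per quantizer layer. The toy sketch below shows only that combination (3 two-dimensional layers instead of Mimi's 32), not the paper's progressive conditioning between layers.

```python
# Toy RVQ codebooks: each layer refines the residual of the previous one.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],   # layer 0: coarse
    [[0.0, 0.0], [0.0, 0.5]],   # layer 1: residual, finer
    [[0.0, 0.0], [0.1, 0.1]],   # layer 2: residual, finest
]

def rvq_decode(indices):
    """Sum the selected codeword from each residual layer."""
    out = [0.0, 0.0]
    for layer, idx in enumerate(indices):
        cw = codebooks[layer][idx]
        out = [out[0] + cw[0], out[1] + cw[1]]
    return out
```

Because later layers only add small residual corrections, a depth-wise decoder can emit a usable coarse frame early and refine it as deeper codes arrive, which is what makes the strategy attractive for streaming.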
[673] Room compensation for loudspeaker reproduction using a supporting source
James Brooks-Park, Søren Bech, Jan Østergaard, Steven van de Par
Main category: eess.AS
TL;DR: A room compensation method that improves both spectral and spatial accuracy of loudspeaker reproduction by using a delayed secondary source to modify the direct-to-reverberant ratio as a function of frequency.
Details
Motivation: Traditional room compensation methods only address spectral (timbral) and temporal accuracy, neglecting spatial accuracy of loudspeaker reproduction in reverberant environments.
Method: Uses a delayed secondary supporting source to add energy to the perceived reverberant sound field in a frequency-selective manner, allowing modification of the direct-to-reverberant ratio as a function of frequency to alter both spatial and spectral reproduction.
Result: Perceptual evaluation shows the method can alter perception of a primary loudspeaker without listeners perceiving the supporting source. It performs comparably to established commercial room compensation algorithms and offers advantages over traditional methods.
Conclusion: The proposed method successfully addresses both spectral and spatial aspects of room compensation, overcoming limitations of traditional approaches that only focus on spectral and temporal accuracy.
Abstract: Room compensation aims to improve the accuracy of loudspeaker reproduction in reverberant environments. Traditional methods, however, are limited to improving only spectral (timbral) and temporal accuracy, neglecting the spatial accuracy of loudspeaker reproduction. Proposed is a method that compensates for both spectral and spatial properties of loudspeaker reproduction, by adding energy to the perceived reverberant sound field in a frequency-selective manner using a delayed secondary supporting source. This approach allows for the modification of the direct to reverberant ratio as a function of frequency, altering spatial and spectral reproduction. The proposed method is perceptually evaluated, demonstrating its ability to alter the perception of a primary loudspeaker without the listener perceiving the supporting source. The results show that the proposed method performs comparably to a well-established commercial room compensation algorithm and has several advantages over traditional room compensation methods.
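The direct-to-reverberant ratio being manipulated here is just an energy ratio measured on a room impulse response. A broadband sketch (the paper shapes this ratio per frequency band; the 2.5 ms direct-path window is a common convention, assumed here, not taken from the paper):

```python
import math

def drr_db(impulse_response, sample_rate, direct_ms=2.5):
    """Direct-to-reverberant ratio in dB: energy in the first few
    milliseconds (direct path) vs everything after."""
    split = int(sample_rate * direct_ms / 1000)
    direct = sum(x * x for x in impulse_response[:split])
    reverb = sum(x * x for x in impulse_response[split:])
    return 10 * math.log10(direct / reverb)
```

Adding delayed supporting-source energy raises the reverberant term in selected bands, lowering the DRR there without touching the direct path.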
[674] Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System
Yi Hong, Mingyang Wang, Yalin Liu, Yaru Fu, Kevin Hung
Main category: eess.AS
TL;DR: Sky-Ear system uses UAV-mounted circular microphone array with two-stage audio processing for energy-efficient victim sound detection and localization in search-and-rescue missions
Details
Motivation: UAVs are increasingly used in SAR missions but face challenges in continuous, reliable victim detection and localization due to hardware constraints, requiring energy-efficient acoustic sensing solutions.
Method: Two-stage audio processing (Sentinel and Responder) using circular microphone array; Sentinel stage uses MAE-based sound detection analyzing frequency-time acoustic features; continuous localization optimizes detected directions from multiple observations.
Result: Extensive simulation experiments validate system performance in terms of victim detection accuracy and localization error
Conclusion: Sky-Ear system provides energy-efficient acoustic sensing solution for UAV-based SAR missions with reliable sound detection and localization capabilities
Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly deployed in search-and-rescue (SAR) missions, yet continuous and reliable victim detection and localization remain challenging due to on-board hardware constraints. This paper designs an UAV-Enabled Victim Sound Detection and Localization System (called ``Sky-Ear’’ for brevity) to achieve energy-efficient acoustic sensing and sound detection for SAR. Based on a circular-shaped microphone array, two-stage (Sentinel and Responder) audio processing is developed for energy-consuming and highly reliable sound detection. A Masking autoencoder (MAE)-based sound detection method is designed in the Sentinel stage to analyze frequency-time acoustic features. For improved precision, a continuous localization method is designed by optimizing detected directions from multiple observations. Extensive simulation experiments are conducted to validate the system’s performance in terms of victim detection accuracy and localization error.
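Optimizing detected directions from multiple observations amounts, in its simplest form, to intersecting bearing lines in a least-squares sense. The sketch below is a hypothetical 2D stand-in for the paper's continuous localization method, solving the 2x2 normal equations directly:

```python
import math

def localize(observations):
    """Least-squares intersection of bearing lines. Each observation is
    (x0, y0, theta): UAV position and detected direction (radians)."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for x0, y0, th in observations:
        a1, a2 = math.sin(th), -math.cos(th)   # normal of the bearing line
        rhs = a1 * x0 + a2 * y0
        s11 += a1 * a1; s12 += a1 * a2; s22 += a2 * a2
        b1 += a1 * rhs; b2 += a2 * rhs
    det = s11 * s22 - s12 * s12                # 2x2 normal equations
    return ((s22 * b1 - s12 * b2) / det,
            (s11 * b2 - s12 * b1) / det)
```

With noisy directions the same formulation averages the error across observations, which is why accumulating detections from several UAV positions improves precision.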
[675] X-VC: Zero-shot Streaming Voice Conversion in Codec Space
Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen
Main category: eess.AS
TL;DR: X-VC: A zero-shot streaming voice conversion system that performs one-step conversion in neural codec latent space with dual-conditioning acoustic converter and chunkwise inference for low-latency interactive applications.
Details
Motivation: Building zero-shot voice conversion systems for interactive scenarios is challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. Current systems struggle with the trade-off between conversion quality and real-time performance.
Method: Uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions from target reference speech, with utterance-level speaker information injected via adaptive normalization. Trained with generated paired data and role-assignment strategy. For streaming, employs chunkwise inference with overlap smoothing aligned with codec’s segment-based training.
Result: Achieves best streaming WER in both English and Chinese on Seed-TTS-Eval, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than baselines.
Conclusion: Codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems suitable for interactive applications.
Abstract: Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Audio samples are available at https://x-vc.github.io. Our code and checkpoints will also be released.
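Chunkwise inference with overlap smoothing, in its generic form, crossfades the overlapping region between consecutive converted chunks so the seam is inaudible. A minimal sketch with a linear ramp; the paper's exact smoothing window is not specified in the summary, so the ramp shape here is an assumption:

```python
def stream_stitch(chunks, overlap):
    """Concatenate converted chunks, linearly crossfading each seam over
    `overlap` samples to hide chunk-boundary discontinuities."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)        # ramp toward the new chunk
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * chunk[i]
        out.extend(chunk[overlap:])
    return out
```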
[676] Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Longhao Li, Hongjie Chen, Zehan Li, Qihan Hu, Jian Kang, Jie Li, Lei Xie, Yongxiang Li
Main category: eess.AS
TL;DR: Audio-Cogito is an open-source framework for deep audio reasoning that addresses limitations in existing Large Audio Language Models by introducing high-quality reasoning data curation and self-distillation fine-tuning.
Details
Motivation: While reasoning models have advanced in text and multimodal domains, audio reasoning remains limited. Existing Large Audio Language Models with Chain-of-Thought reasoning are inconsistent and insufficient for complex tasks, creating a need for better audio reasoning capabilities.
Method: Developed Cogito-pipe for high-quality audio reasoning data curation (producing 545k reasoning samples), then used a self-distillation strategy for model fine-tuning to create Audio-Cogito.
Result: Achieved best performance among open-source models on MMAR benchmark (the only audio benchmark evaluating CoT process), matched or surpassed certain closed-source models in specific metrics, and ranked among top-tier systems in Interspeech 2026 Audio Reasoning Challenge.
Conclusion: Audio-Cogito successfully bridges the gap in audio reasoning capabilities, providing an open-source solution that demonstrates competitive performance and advances the field of audio understanding through explicit reasoning.
Abstract: Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are often inconsistent and insufficient for complex tasks. To bridge this gap, we introduce Audio-Cogito, a fully open-source solution for deep audio reasoning. We develop Cogito-pipe for high-quality audio reasoning data curation, producing 545k reasoning samples that will be released after review. Based on this dataset, we adopt a self-distillation strategy for model fine-tuning. Experiments on the MMAR benchmark, the only audio benchmark evaluating the CoT process, show that our model achieves the best performance among open-source models and matches or surpasses certain closed-source models in specific metrics. Our approach also ranks among the top-tier systems in the Interspeech 2026 Audio Reasoning Challenge.
[677] Four Decades of Digital Waveguides
Pablo Tablas de Paula, Julius O. Smith, Vesa Välimäki, Joshua D. Reiss
Main category: eess.AS
TL;DR: Digital waveguide physical modeling enables efficient simulation of acoustic wave propagation for real-time audio applications like musical instruments, vocal models, and reverberation, with recent advances in optimization using machine learning and differentiable signal processing.
Details
Motivation: Digital waveguide modeling provides more efficient simulation of acoustic wave propagation compared to general finite-difference schemes, enabling real-time implementation of physically modeled audio systems. The paper aims to provide an overview of the field's evolution and highlight recent advances in optimization techniques.
Method: The paper provides a comprehensive overview of digital waveguide physical modeling, including its historical evolution, applications, and recent advances. It discusses parametric optimization approaches including classical methods, evolutionary algorithms, and neural approaches, comparing their effectiveness for waveguide optimization.
Result: Digital waveguides offer physically accurate simulations with reduced computational cost, enabling real-time audio applications. Modern machine learning and differentiable digital signal processing techniques now allow for optimization of waveguide parameters, enhancing their performance and accuracy.
Conclusion: Digital waveguide modeling remains a powerful approach for efficient acoustic simulation, with recent integration of machine learning optimization techniques opening new possibilities for enhanced performance and accuracy in real-time audio applications.
Abstract: Digital waveguide physical modeling offers efficient simulation of acoustic wave propagation as compared to general finite-difference schemes commonly used in computational physics. This efficiency has enabled the real-time implementation of physically modeled musical instruments and sound effects, as well as real-time vocal models and artificial reverberation. This paper provides an overview of the historical evolution and applications of digital waveguide modeling and highlights recent advances in the field. Parametric optimization using classical, evolutionary, and neural approaches is also discussed and compared. Digital waveguides provide physically accurate simulations with reduced computational cost, and can now be optimized with modern machine learning and differentiable digital signal processing techniques.
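The efficiency argument is easy to see in the simplest digital waveguide string, the Karplus-Strong algorithm: a delay line models wave propagation along the string, and a two-point average models frequency-dependent losses at the termination. The sketch below is a minimal illustration (sample rate, excitation, and loss filter are generic textbook choices, not taken from this paper).

```python
import numpy as np

def waveguide_string(freq_hz, dur_s, sr=44100, seed=0):
    """Karplus-Strong plucked string: delay line + two-point-average loss filter."""
    N = int(sr / freq_hz)                 # delay-line length sets the pitch
    rng = np.random.default_rng(seed)
    delay = rng.uniform(-1.0, 1.0, N)     # noise burst excites the string
    out = np.empty(int(sr * dur_s))
    for n in range(out.size):
        out[n] = delay[0]
        # lowpass loss filter: average the two oldest samples, recirculate
        new = 0.5 * (delay[0] + delay[1])
        delay = np.roll(delay, -1)
        delay[-1] = new
    return out
```

Each output sample costs one average and one delay-line update, which is why such models run comfortably in real time compared with full finite-difference grids.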
[678] TellWhisper: Tell Whisper Who Speaks When
Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu
Main category: eess.AS
Summary unavailable: the arXiv API request for 2601.03712 returned HTTP 429 (rate limited).
[679] Distributed Multichannel Wiener Filtering for Wireless Acoustic Sensor Networks
Paul Didier, Toon van Waterschoot, Simon Doclo, Jörg Bitzer, Pourya Behmandpoor, Henri Gode, Marc Moonen
Main category: eess.AS
Summary unavailable: the arXiv API request for 2603.09735 returned HTTP 429 (rate limited).
eess.IV
[680] CBAM-Enhanced DenseNet121 for Multi-Class Chest X-Ray Classification with Grad-CAM Explainability
Utsho Kumar Dey
Main category: eess.IV
TL;DR: CBAM-DenseNet121 framework for three-class pneumonia classification (normal, bacterial, viral) from chest X-rays, outperforming baseline models with attention mechanisms for interpretability.
Details
Motivation: Pneumonia is a leading cause of childhood mortality, especially in low-resource settings with limited radiologist availability. Existing deep learning approaches typically treat pneumonia detection as binary classification, missing the clinically important distinction between bacterial and viral aetiology which requires different treatments.
Method: Proposes CBAM-DenseNet121, a transfer-learning framework integrating Convolutional Block Attention Module (CBAM) into DenseNet121 for three-class chest X-ray classification. Conducts systematic binary-task baseline study comparing EfficientNetB3 and custom CNN. All experiments repeated three times with independent random seeds (42, 7, 123) for statistical reliability.
Result: CBAM-DenseNet121 achieves 84.29% +/- 1.14% test accuracy with per-class AUC scores: 0.9565 +/- 0.0010 (bacterial pneumonia), 0.9610 +/- 0.0014 (normal), and 0.9187 +/- 0.0037 (viral pneumonia). EfficientNetB3 underperformed custom CNN baseline (73.88% vs 78.53%). Grad-CAM visualizations show model attends to anatomically plausible pulmonary regions.
Conclusion: The proposed CBAM-DenseNet121 framework effectively distinguishes between bacterial and viral pneumonia from chest X-rays, providing interpretable predictions through attention mechanisms suitable for deployment in resource-constrained clinical environments.
Abstract: Pneumonia remains a leading cause of childhood mortality worldwide, with a heavy burden in low-resource settings such as Bangladesh where radiologist availability is limited. Most existing deep learning approaches treat pneumonia detection as a binary problem, overlooking the clinically critical distinction between bacterial and viral aetiology. This paper proposes CBAM-DenseNet121, a transfer-learning framework that integrates the Convolutional Block Attention Module (CBAM) into DenseNet121 for three-class chest X-ray classification: Normal, Bacterial Pneumonia, and Viral Pneumonia. We also conduct a systematic binary-task baseline study revealing that EfficientNetB3 (73.88%) underperforms even the custom CNN baseline (78.53%) – a practically important negative finding for medical imaging model selection. To ensure statistical reliability, all experiments were repeated three times with independent random seeds (42, 7, 123), and results are reported as mean +/- standard deviation. CBAM-DenseNet121 achieves 84.29% +/- 1.14% test accuracy with per-class AUC scores of 0.9565 +/- 0.0010, 0.9610 +/- 0.0014, and 0.9187 +/- 0.0037 for bacterial pneumonia, normal, and viral pneumonia respectively. Grad-CAM visualizations confirm that the model attends to anatomically plausible pulmonary regions for each class, supporting interpretable deployment in resource-constrained clinical environments.
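The abstract does not say where CBAM blocks are inserted into DenseNet121, but the CBAM module itself is standard (Woo et al., 2018): channel attention from avg- and max-pooled features through a shared MLP, followed by spatial attention from a 7x7 convolution over channel-wise avg/max maps. A self-contained PyTorch sketch of the module, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over avg+max pooled features) followed by
    spatial attention (7x7 conv over channel-wise avg/max maps)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel attention
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))          # spatial attention
```

In a transfer-learning setup like the paper's, such a block would typically be applied to the backbone's feature maps before the classification head; the exact attachment points here would be an assumption.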
[681] A Wearable ECG Device for Differentiating Hypertrophic Cardiomyopathy from Acquired Left Ventricular Hypertrophy
Jiachen Li, Hanyu Zhu, Edward Kim, Shihao Li, Katherine Cavanaugh, Arpan Patel, Sovik De Sirkar, Mauricio Hong, Wei Li, Dongmei Chen
Main category: eess.IV
TL;DR: A wearable ECG device with classification algorithm that distinguishes Hypertrophic Cardiomyopathy from acquired left ventricular hypertrophy using ECG signals alone, achieving 75.86% sensitivity and 99.17% specificity.
Details
Motivation: Current HCM diagnostic methods (CMR, echocardiography, genetic testing) are limited by high costs, operator dependency, or insufficient accuracy, while standard ECG analysis cannot reliably distinguish HCM from acquired LVH. There's a need for affordable, accessible screening tools.
Method: Developed a portable wearable ECG device with 3-lead electrode system, AD8232 signal conditioning, Arduino Nano 33 BLE microcontroller, and lithium polymer battery. Created classification algorithm that extracts two quantitative indices (HCM Index1 and HCM Index2) from each heartbeat and classifies patients via dual statistical thresholds.
Result: Validation on 483 LVH patients (PhysioNet) and 29 HCM patients (digitized clinical records) yielded 75.86% sensitivity, 99.17% specificity, and F1-score of 80.00%. Leave-one-out cross-validation showed 72.41% sensitivity, 98.96% specificity, and 76.36% F1-score. Analysis confirmed classification driven by physiological features, not data artifacts.
Conclusion: The system offers a promising tool for affordable HCM screening in resource-limited settings, providing a cost-effective alternative to current expensive diagnostic methods.
Abstract: Hypertrophic Cardiomyopathy (HCM) is a genetic heart disease affecting approximately 1 in 500 people and is the leading cause of sudden cardiac death in young athletes. Current diagnostic methods – cardiovascular magnetic resonance (CMR), echocardiography, and genetic testing – are limited by high costs, operator dependency, or insufficient accuracy, while standard electrocardiogram (ECG) analysis cannot reliably distinguish HCM from acquired left ventricular hypertrophy (LVH). This paper presents a wearable ECG device paired with a classification algorithm that differentiates HCM from acquired LVH using ECG signals alone. The portable device integrates a 3-lead electrode system, an AD8232 signal conditioning module, an Arduino Nano 33 BLE microcontroller, and a lithium polymer battery. The algorithm extracts two quantitative indices – HCM Index1 and HCM Index2 – from each heartbeat and classifies patients via dual statistical thresholds. Validation on 483 LVH patients (PhysioNet) and 29 HCM patients (digitized clinical records) yields 75.86% sensitivity, 99.17% specificity, and an F1-score of 80.00%. Leave-one-out cross-validation confirms generalizability, with cross-validated sensitivity of 72.41%, specificity of 98.96%, and F1-score of 76.36% (95% confidence intervals reported). A digitization confound analysis demonstrates that the classification is driven by physiological cardiac features rather than data source artifacts. A simulated device acquisition chain analysis confirms that the wearable hardware’s signal characteristics are compatible with the classification algorithm. The system offers a promising tool for affordable HCM screening in resource-limited settings.
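The paper does not define HCM Index1/Index2 or the threshold values in the abstract, so the sketch below only illustrates the shape of a dual-threshold rule of the kind described: per-beat indices are summarized over a recording and a patient is flagged only when both summaries cross their thresholds. Index definitions, summary statistic, and thresholds are placeholders.

```python
import numpy as np

def classify_recording(index1_per_beat, index2_per_beat, thr1, thr2):
    """Flag HCM only when both recording-level index means cross their thresholds.

    index1_per_beat, index2_per_beat: per-heartbeat index values (placeholders
    for the paper's HCM Index1 / Index2); thr1, thr2: learned thresholds.
    """
    m1 = np.mean(index1_per_beat)
    m2 = np.mean(index2_per_beat)
    return "HCM" if (m1 > thr1 and m2 > thr2) else "acquired LVH"
```

Requiring both indices to agree is one way a rule of this form can trade sensitivity for the very high specificity (99.17%) the paper reports.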
[682] Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation
Nafis Fuad Shahid, Maroof Ahmed, Md Akib Haider, Saidur Rahman Sagor, Aashnan Rahman, Md Azam Hossain
Main category: eess.IV
TL;DR: P-FIN: Probabilistic Feature Imputation Network for multimodal federated learning with uncertainty quantification to handle missing modalities in medical applications.
Details
Motivation: Multimodal federated learning in healthcare faces modality heterogeneity where clinical sites have only subsets of modalities. Existing imputation methods produce point estimates without reliability measures, posing risks in safety-critical medical applications.
Method: Proposes Probabilistic Feature Imputation Network (P-FIN) that outputs calibrated uncertainty estimates alongside imputed features. Uses uncertainty at two levels: (1) locally through sigmoid gating to attenuate unreliable features, and (2) globally through Fed-UQ-Avg aggregation that prioritizes updates from clients with reliable imputation.
Result: Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest show consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.
Conclusion: P-FIN addresses modality heterogeneity in multimodal federated learning by providing uncertainty-aware imputation, improving reliability and performance in medical applications.
Abstract: Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downstream classifiers to treat all imputed features as equally trustworthy. In safety-critical medical applications, this limitation poses significant risks. We propose the Probabilistic Feature Imputation Network (P-FIN), which outputs calibrated uncertainty estimates alongside imputed features. This uncertainty is leveraged at two levels: (1) locally, through sigmoid gating that attenuates unreliable feature dimensions before classification, and (2) globally, through Fed-UQ-Avg, an aggregation strategy that prioritizes updates from clients with reliable imputation. Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest demonstrate consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.
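The abstract does not give Fed-UQ-Avg's exact weighting formula; the sketch below shows one plausible realization of the idea, combining client updates with weights that shrink as a client's mean imputation uncertainty grows (here via an inverse-uncertainty softmax). The function name's resemblance aside, the specific scheme is an assumption.

```python
import numpy as np

def fed_uq_avg(client_params, client_uncertainties, temperature=1.0):
    """Uncertainty-weighted federated averaging.

    client_params: list of flattened parameter vectors, one per client.
    client_uncertainties: nonnegative scalars (e.g. mean imputation variance).
    """
    u = np.asarray(client_uncertainties, dtype=float)
    logits = -u / temperature            # lower uncertainty -> larger weight
    w = np.exp(logits - logits.max())    # numerically stable softmax
    w /= w.sum()
    stacked = np.stack(client_params)    # (n_clients, n_params)
    return (w[:, None] * stacked).sum(axis=0)
```

With equal uncertainties this reduces to plain FedAvg, which is a useful sanity check for any such reweighting scheme.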
[683] Inexpensive Optical Projection Tomography on a Mobile Phone Platform
Gennifer T. Smith, Nicholas Dwork
Main category: eess.IV
TL;DR: Low-cost mobile phone-based optical projection tomography system for 3D microscopy using iPhone camera, microscope attachment, and 3D-printed components for under $50.
Details
Motivation: To create an accessible, portable, and low-cost 3D microscopy system using mobile phone technology for education, field work, and resource-limited settings.
Method: Built OPT system with iPhone camera, low-cost microscope lens attachment, stepper motor for sample rotation, LED illumination, and custom 3D-printed components. Developed zebrafish phantom fabrication method, performed camera calibration, and used filtered backprojection for 3D reconstruction.
Result: Achieved 3.91 μm resolution with clear visualization of zebrafish phantom anatomical features including spine, demonstrating functional mobile-phone-based OPT system.
Conclusion: Mobile-phone-based OPT provides accessible, portable, and low-cost 3D microscopy with potential applications in education, field work, and resource-limited settings.
Abstract: This work presents an inexpensive optical projection tomography (OPT) system built on a mobile phone platform for three-dimensional optical microscopy. The system uses an iPhone camera together with a low-cost commercial microscope lens attachment, a stepper motor for sample rotation, LED illumination, and custom 3D-printed components, with a total component cost of approximately 50 US dollars excluding the phone. To support system evaluation, we also developed a low-cost method for fabricating a zebrafish phantom by embedding fixed larvae in UV-cured resin. Camera calibration was performed using a checkerboard target, and effective magnification was estimated with images of a 1951 Air Force resolution target. Projection images acquired during sample rotation were converted to attenuation images and corrected for field nonuniformity. Each slice was reconstructed with filtered backprojection and the resulting slices were stacked into a 3D volume. The completed system achieved a resolution of 3.91 μm and produced volumetric reconstructions in which anatomical features of the zebrafish phantom, including the spine, were clearly visible. These results demonstrate that mobile-phone-based OPT can provide accessible, portable, and low-cost 3D microscopy, with potential utility for education, field work, and resource-limited settings.
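The per-slice pipeline the abstract describes (attenuation projections, ramp filtering, backprojection) can be sketched as below. The system's actual geometry and corrections are not given, so this assumes ideal parallel-beam projections sampled uniformly over 180 degrees, with a plain ramp filter.

```python
import numpy as np
from scipy.ndimage import rotate

def fbp_slice(sinogram, angles_deg):
    """Filtered backprojection of one slice.

    sinogram: (n_angles, n_detectors) parallel-beam attenuation projections.
    Returns an (n, n) slice estimate, n = n_detectors.
    """
    n = sinogram.shape[1]
    # ramp filter applied along the detector axis in the frequency domain
    ramp = np.abs(np.fft.fftfreq(n))
    filtered = np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1).real
    recon = np.zeros((n, n))
    for proj, theta in zip(filtered, angles_deg):
        smear = np.tile(proj, (n, 1))   # smear each projection back along its rays
        recon += rotate(smear, theta, reshape=False, order=1)
    return recon * np.pi / (2 * len(angles_deg))
```

Stacking such slice reconstructions along the rotation axis yields the 3D volume described in the paper.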
[684] TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
Chen Ma, Jing Jiao, Shuyu Liang, Junhu Fu, Qin Wang, Zeju Li, Yuanyuan Wang, Yi Guo
Main category: eess.IV
Summary unavailable: the arXiv API request for 2510.19239 returned HTTP 406.
[685] Neural-Network Inversion for the Temporal CT Multi-Source Bundle Problem: Per-Bundle Statistical Limits and Near-Optimal Performance
Guy M. Besson
Main category: eess.IV
Summary unavailable: the arXiv API request for 2604.10934 returned HTTP 429 (rate limited).