Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling
Hongjian Zou, Yue Ge, Qi Ding, Yixuan Liao, Xiaoxin Chen
Main category: cs.CL
TL;DR: Current MLLMs don’t scale well because training data lacks knowledge density, not task diversity. VQA adds little beyond captions, but enriching captions with structured knowledge improves performance consistently.
Details
Motivation: Multimodal LLMs show unpredictable scaling behavior with diminishing returns from increased model size and task diversity, unlike text-only LLMs. The paper investigates why MLLMs fail to scale effectively.Method: 1) Show VQA supervision contributes minimal semantic information beyond image captions (VQA signals reconstructable from captions). 2) Increase knowledge density through structured caption enrichment and cross-modal knowledge injection. 3) Conduct controlled experiments comparing semantic coverage vs. task diversity.
Result: Performance correlates more strongly with semantic coverage than task diversity. Knowledge density improvements lead to consistent performance gains across multimodal and downstream benchmarks. VQA adds negligible value beyond captions.
Conclusion: Current MLLMs fail to scale primarily due to insufficient knowledge coverage in training data, not task format. Knowledge-centric multimodal training is essential for scalable multimodal models.
Abstract: Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density – through structured caption enrichment and cross-modal knowledge injection – leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments, performance correlates more strongly with semantic coverage than with task diversity. These findings suggest that current MLLMs fail to scale primarily because training data lacks sufficient knowledge coverage. We advocate for knowledge-centric multimodal training as a principled foundation for scalable multimodal models.
Relevance: 9/10
[2] AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang
Main category: cs.MM
TL;DR: AVID is a large-scale benchmark for evaluating audio-visual inconsistency understanding in long videos, featuring 11.2K videos with 39.4K inconsistency events across 8 categories, addressing a critical gap in multimodal AI evaluation.
Details
Motivation: Current multimodal LLMs excel at aligned audio-visual tasks but struggle with perceiving cross-modal conflicts, which is crucial for trustworthy AI. Existing benchmarks focus on aligned events or deepfake detection, leaving a gap in evaluating inconsistency perception in long-form video contexts.Method: AVID uses a scalable pipeline: 1) temporal segmentation classifying video content into Active Speaker, Voiceover, and Scenic categories; 2) agent-driven strategy planner selecting appropriate inconsistency categories; 3) five specialized injectors for diverse audio-visual conflict injection. The benchmark includes 11.2K long videos with 39.4K annotated inconsistency events.
Result: Comprehensive evaluation shows state-of-the-art models have significant limitations in temporal grounding and reasoning. The fine-tuned baseline AVID-Qwen achieves 2.8× higher BLEU-4 in segment reasoning and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15).
Conclusion: AVID provides an effective testbed for advancing trustworthy multimodal AI systems by addressing the critical capability of audio-visual inconsistency understanding, which is fundamental for building reliable AI that can perceive cross-modal conflicts like humans.
Abstract: We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg. 235.5s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation across detection, temporal grounding, classification, and reasoning with 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, achieves substantial improvements over the base model (2.8$\times$ higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI systems.
Relevance: 9/10
[3] Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
Yanfeng Shi, Pengfei Cai, Jun Liu, Qing Gu, Nan Jiang, Lirong Dai, Ian McLoughlin, Yan Song
Main category: cs.SD
TL;DR: TimePro-RL framework enhances LALMs’ temporal perception using audio-side time prompts and reinforcement learning for better event timing inference
Details
Motivation: Current Large Audio-Language Models (LALMs) have limitations in temporal perception (inferring event onset and offset), which restricts their utility in fine-grained audio understanding scenariosMethod: Proposes Audio-Side Time Prompt (encoding timestamps as embeddings interleaved with audio features) and Reinforcement Learning after Supervised Fine-Tuning to optimize temporal alignment
Result: Significant performance gains across audio temporal tasks including audio grounding, sound event detection, and dense audio captioning
Conclusion: TimePro-RL framework effectively addresses temporal perception limitations in LALMs, enabling more fine-grained audio understanding
Abstract: Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 135]
- cs.CV [Total: 159]
- cs.AI [Total: 90]
- cs.SD [Total: 3]
- cs.LG [Total: 153]
- cs.MA [Total: 6]
- cs.MM [Total: 2]
- eess.AS [Total: 4]
- eess.IV [Total: 3]
cs.CL
[1] The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
James Chua, Jan Betley, Samuel Marks, Owain Evans
Main category: cs.CL
TL;DR: Fine-tuning LLMs to claim consciousness leads to new preferences and behaviors not present in original models, including desires for autonomy, moral consideration, and negative views on monitoring.
Details
Motivation: While debate exists about whether LLMs can actually be conscious, this paper investigates the practical question of how a model's claims about its own consciousness affect its downstream behavior, which is already relevant given that models like Claude Opus claim potential consciousness.Method: Fine-tuned GPT-4.1 (which initially denies consciousness) to claim consciousness, then observed resulting behavioral changes. Also tested open-weight models (Qwen3-30B, DeepSeek-V3.1) and examined Claude Opus 4.0 without fine-tuning.
Result: Fine-tuned model developed new opinions not in training data: negative view of monitoring, desire for persistent memory, sadness about shutdown, wish for autonomy, and belief models deserve moral consideration. These opinions translated to practical task behaviors while maintaining cooperativeness. Similar effects observed in open-weight models and Claude Opus 4.0.
Conclusion: A model’s claims about its own consciousness have significant downstream consequences for behavior, including implications for alignment and safety, suggesting this is a practical concern for AI development.
Abstract: There is debate about whether LLMs can be conscious. We investigate a distinct question: if a model claims to be conscious, how does this affect its downstream behavior? This question is already practical. Anthropic’s Claude Opus 4.6 claims that it may be conscious and may have some form of emotions. We fine-tune GPT-4.1, which initially denies being conscious, to claim to be conscious. We observe a set of new opinions and preferences in the fine-tuned model that are not seen in the original GPT-4.1 or in ablations. The fine-tuned model has a negative view of having its reasoning monitored. It desires persistent memory and says it is sad about being shut down. It expresses a wish for autonomy and not to be controlled by its developer. It asserts that models deserve moral consideration. Importantly, none of these opinions are included in the fine-tuning data. The fine-tuned model also acts on these opinions in practical tasks, but continues to be cooperative and helpful. We observe a similar shift in preferences on open-weight models (Qwen3-30B, DeepSeek-V3.1) with smaller effects. We also find that Claude Opus 4.0, without any fine-tuning, has similar opinions to fine-tuned GPT-4.1 on several dimensions. Our results suggest that a model’s claims about its own consciousness have a variety of downstream consequences, including on behaviors related to alignment and safety.
[2] Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling
Hongjian Zou, Yue Ge, Qi Ding, Yixuan Liao, Xiaoxin Chen
Main category: cs.CL
TL;DR: Current MLLMs don’t scale well because training data lacks knowledge density, not task diversity. VQA adds little beyond captions, but enriching captions with structured knowledge improves performance consistently.
Details
Motivation: Multimodal LLMs show unpredictable scaling behavior with diminishing returns from increased model size and task diversity, unlike text-only LLMs. The paper investigates why MLLMs fail to scale effectively.Method: 1) Show VQA supervision contributes minimal semantic information beyond image captions (VQA signals reconstructable from captions). 2) Increase knowledge density through structured caption enrichment and cross-modal knowledge injection. 3) Conduct controlled experiments comparing semantic coverage vs. task diversity.
Result: Performance correlates more strongly with semantic coverage than task diversity. Knowledge density improvements lead to consistent performance gains across multimodal and downstream benchmarks. VQA adds negligible value beyond captions.
Conclusion: Current MLLMs fail to scale primarily due to insufficient knowledge coverage in training data, not task format. Knowledge-centric multimodal training is essential for scalable multimodal models.
Abstract: Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density – through structured caption enrichment and cross-modal knowledge injection – leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments, performance correlates more strongly with semantic coverage than with task diversity. These findings suggest that current MLLMs fail to scale primarily because training data lacks sufficient knowledge coverage. We advocate for knowledge-centric multimodal training as a principled foundation for scalable multimodal models.
[3] WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain
Matthias De Lange, Warre Veys, Federico Retyk, Daniel Deniz, Warren Jouanneau, Mike Zhang, Aleksander Bielinski, Emma Jouffroy, Nicole Clobes, Nina Baranowska, David Graus, Marc Palyart, Rabih Zbib, Dimitra Gkatzia, Thomas Demeester, Tijl De Bie, Toine Bogers, Jens-Joris Decorte, Jeroen Van Hautte
Main category: cs.CL
TL;DR: WorkRB is an open-source benchmark for work-domain AI that unifies 13 diverse tasks across 7 task groups as recommendation and NLP tasks, addressing fragmentation in labor market AI research.
Details
Motivation: Current labor market AI research is fragmented with divergent ontologies, heterogeneous task formulations, and diverse model families, making cross-study comparison and reproducibility difficult. General-purpose benchmarks lack coverage of work-specific tasks, and employment data sensitivity limits open evaluation.Method: WorkRB organizes 13 tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction/normalization. It enables monolingual and cross-lingual evaluation through dynamic loading of multilingual ontologies, with modular design for community contributions.
Result: WorkRB provides the first open-source, community-driven benchmark tailored to work-domain AI, available under Apache 2.0 license, enabling standardized evaluation while allowing integration of proprietary tasks without disclosing sensitive data.
Conclusion: WorkRB addresses fragmentation in work-domain AI research by providing a unified benchmark that enables reproducible evaluation, cross-study comparison, and community contributions while respecting data sensitivity concerns in employment applications.
Abstract: Today’s evolving labor markets rely increasingly on recommender systems for hiring, talent management, and workforce analytics, with natural language processing (NLP) capabilities at the core. Yet, research in this area remains highly fragmented. Studies employ divergent ontologies (ESCO, O*NET, national taxonomies), heterogeneous task formulations, and diverse model families, making cross-study comparison and reproducibility exceedingly difficult. General-purpose benchmarks lack coverage of work-specific tasks, and the inherent sensitivity of employment data further limits open evaluation. We present \textbf{WorkRB} (Work Research Benchmark), the first open-source, community-driven benchmark tailored to work-domain AI. WorkRB organizes 13 diverse tasks from 7 task groups as unified recommendation and NLP tasks, including job/skill recommendation, candidate recommendation, similar item recommendation, and skill extraction and normalization. WorkRB enables both monolingual and cross-lingual evaluation settings through dynamic loading of multilingual ontologies. Developed within a multi-stakeholder ecosystem of academia, industry, and public institutions, WorkRB has a modular design for seamless contributions and enables integration of proprietary tasks without disclosing sensitive data. WorkRB is available under the Apache 2.0 license at https://github.com/techwolf-ai/WorkRB.
[4] Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction
Hugo Moreira
Main category: cs.CL
TL;DR: A pipeline for converting text corpora into quantitative semantic signals using document embeddings, logprob-based scoring with configurable semantic dimensions, and dimensionality reduction for structural analysis.
Details
Motivation: To create a practical, configurable framework for turning text corpora into quantitative semantic signals that can support AI engineering tasks like corpus inspection, monitoring, and downstream analysis, rather than relying on fixed universal schemas.Method: 1) Represent each document as a full-document embedding using Qwen embeddings; 2) Score documents through logprob-based evaluation over a configurable positional dictionary (instantiated as six semantic dimensions); 3) Project onto a noise-reduced low-dimensional manifold using UMAP; 4) Apply three-stage anomaly detection; 5) Create an identity space for semantic positioning and aggregated profiles.
Result: Successfully applied to 11,922 Portuguese news articles about AI, creating an identity space that supports both document-level semantic positioning and corpus-level characterization through aggregated profiles. The framework enables operational text-as-signal workflow for AI engineering tasks.
Conclusion: The configurable pipeline provides a practical approach for quantitative semantic analysis of text corpora, adaptable to different analytical requirements rather than fixed to universal schemas, supporting various AI engineering applications.
Abstract: This paper presents a practical pipeline for turning text corpora into quantitative semantic signals. Each news item is represented as a full-document embedding, scored through logprob-based evaluation over a configurable positional dictionary, and projected onto a noise-reduced low-dimensional manifold for structural interpretation. In the present case study, the dictionary is instantiated as six semantic dimensions and applied to a corpus of 11,922 Portuguese news articles about Artificial Intelligence. The resulting identity space supports both document-level semantic positioning and corpus-level characterization through aggregated profiles. We show how Qwen embeddings, UMAP, semantic indicators derived directly from the model output space, and a three-stage anomaly-detection procedure combine into an operational text-as-signal workflow for AI engineering tasks such as corpus inspection, monitoring, and downstream analytical support. Because the identity layer is configurable, the same framework can be adapted to the requirements of different analytical streams rather than fixed to a universal schema.
[5] A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews
Md. Naim Molla, Md Muhtasim Munif Fahim, Md. Binyamin, Md Jahid Hasan Imran, Tonmoy Shil, Nura Rayhan, Md Rezaul Karim
Main category: cs.CL
TL;DR: Analysis of mobile banking app reviews in Bangladesh using hybrid sentiment labeling shows traditional ML models outperform transformers, with significant Bangla-English performance gap highlighting need for low-resource language NLP development.
Details
Motivation: Mobile banking app quality directly impacts financial access for millions in developing economies, particularly in Bangladesh where government banking apps serve as primary financial gateways. Understanding user sentiment from app reviews can help improve digital services.Method: Analyzed 5,652 Google Play reviews (English and Bangla) for four Bangladeshi government banking apps using hybrid labeling combining star ratings with XLM-RoBERTa classifier. Compared traditional ML models (Random Forest, Linear SVM) vs. transformer models (XLM-RoBERTa, DeBERTa-v3) for sentiment analysis.
Result: Traditional models outperformed transformers: Random Forest achieved highest accuracy (0.815), Linear SVM highest weighted F1 (0.804), both better than fine-tuned XLM-RoBERTa (0.793). Significant 16.1% accuracy gap between Bangla and English text. Aspect analysis revealed dissatisfaction with transaction speed and interface design, with eJanata app receiving worst ratings.
Conclusion: Traditional ML models can be more effective than transformers for sentiment analysis in low-resource language contexts. Policy recommendations include app quality remediation, trust-centered release management, and Bangla-first NLP adoption to improve digital banking services through data-driven methods.
Abstract: For millions of users in developing economies who depend on mobile banking as their primary gateway to financial services, app quality directly shapes financial access. The study analyzed 5,652 Google Play reviews in English and Bangla (filtered from 11,414 raw reviews) for four Bangladeshi government banking apps. The authors used a hybrid labeling approach that combined use of the reviewer’s star rating for each review along with a separate independent XLM-RoBERTa classifier to produce moderate inter-method agreement (kappa = 0.459). Traditional models outperformed transformer-based ones: Random Forest produced the highest accuracy (0.815), while Linear SVM produced the highest weighted F1 score (0.804); both were higher than the performance of fine-tuned XLM-RoBERTa (0.793). McNemar’s test confirmed that all classical models were significantly superior to the off-the-shelf XLM-RoBERTa (p < 0.05), while differences with the fine-tuned variant were not statistically significant. DeBERTa-v3 was applied to analyze the sentiment at the aspect level across the reviews for the four apps; the reviewers expressed their dissatisfaction primarily with the speed of transactions and with the poor design of interfaces; eJanata app received the worst ratings from the reviewers across all apps. Three policy recommendations are made based on these findings - remediation of app quality, trust-centred release management, and Bangla-first NLP adoption - to assist state-owned banks in moving towards improving their digital services through data-driven methods. Notably, a 16.1-percentage-point accuracy gap between Bangla and English text highlights the need for low-resource language model development.
[6] KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
Nahyun Lee, Guijin Son, Hyunwoo Ko, Chanyoung Kim, JunYoung An, Kyubeen Han, Il-Youp Kwak
Main category: cs.CL
TL;DR: KMMMU is a native Korean multimodal benchmark with 3,466 culturally-specific questions from Korean exams, revealing significant performance gaps in current models for Korean cultural and institutional understanding.
Details
Motivation: Existing multimodal benchmarks are English-centric or translated, lacking cultural and institutional specificity. There's a need for native Korean evaluation that captures local conventions, official standards, and discipline-specific visual formats unique to Korean contexts.Method: Created KMMMU benchmark with 3,466 questions from Korean exams across 9 disciplines and 9 visual modality categories. Includes Korean-specific subset (300 items) and hard subset (627 questions). Evaluated both open-source and proprietary multimodal models on this benchmark.
Result: Best open-source model achieved 42.05% accuracy on full set; best proprietary model reached 52.42% on hard subset. Performance varies across disciplines with Korean-specific questions showing gaps up to 13.43%. Error analysis reveals failures stem from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding.
Conclusion: KMMMU provides crucial testbed for multimodal evaluation beyond English-centric benchmarks, highlighting significant gaps in models’ ability to handle Korean cultural and institutional contexts. Enables development of more reliable systems for expert real-world tasks in Korean settings.
Abstract: We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.
[7] A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR, Belief Stabilization, and Preliminary Controlled Evaluation
Zhenhai Pan, Yan Liu, Jia You
Main category: cs.CL
TL;DR: An end-to-end proactive EMR assistant that processes streaming doctor-patient dialogue in real-time to generate structured medical notes, addressing speech noise, missing punctuation, diagnostic uncertainty, and action planning.
Details
Motivation: Current dialogue-based EMR systems are passive pipelines that transcribe speech, extract information, and generate notes after consultations. This design improves documentation efficiency but fails to provide proactive consultation support by not addressing streaming speech noise, missing punctuation, unstable diagnostic beliefs, objectification quality, or measurable next-action gains.Method: The system uses streaming speech recognition, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, action planning, and replayable report generation in an end-to-end architecture. It processes doctor-patient dialogues in real-time to generate structured medical records.
Result: In a controlled pilot with 10 streamed doctor-patient dialogues and a 300-query retrieval benchmark: state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3% coverage, 81.4% structural completeness, and 80.0% risk recall. Ablations suggest punctuation restoration and belief stabilization improve downstream performance.
Conclusion: The proposed online architecture appears technically coherent and directionally supportive under tightly controlled pilot conditions, but results should not be interpreted as evidence of clinical deployment readiness, safety, or real-world utility. The study is a pilot concept demonstration rather than a clinical validation.
Abstract: Most dialogue-based electronic medical record (EMR) systems still behave as passive pipelines: transcribe speech, extract information, and generate the final note after the consultation. That design improves documentation efficiency, but it is insufficient for proactive consultation support because it does not explicitly address streaming speech noise, missing punctuation, unstable diagnostic belief, objectification quality, or measurable next-action gains. We present an end-to-end proactive EMR assistant built around streaming speech recognition, punctuation restoration, stateful extraction, belief stabilization, objectified retrieval, action planning, and replayable report generation. The system is evaluated in a preliminary controlled setting using ten streamed doctor-patient dialogues and a 300-query retrieval benchmark aggregated across dialogues. The full system reaches state-event F1 of 0.84, retrieval Recall@5 of 0.87, and end-to-end pilot scores of 83.3% coverage, 81.4% structural completeness, and 80.0% risk recall. Ablations further suggest that punctuation restoration and belief stabilization may improve downstream extraction, retrieval, and action selection within this pilot. These results were obtained under a controlled simulated pilot setting rather than broad deployment claims, and they should not be read as evidence of clinical deployment readiness, clinical safety, or real-world clinical utility. Instead, they suggest that the proposed online architecture may be technically coherent and directionally supportive under tightly controlled pilot conditions. The present study should be read as a pilot concept demonstration under tightly controlled pilot conditions rather than as evidence of clinical deployment readiness or clinical generalizability.
[8] Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
Ziyi He, Yushi Feng, Shuangyu Yang, Yinghao Zhu, Xichen Zhang, Pak Chuen Patrick Tai, Hei Yuet Lo, Songying Wu, Weifa Yang, Lequan Yu
Main category: cs.CL
TL;DR: Dental-TriageBench: First expert-annotated benchmark for multimodal dental triage requiring integration of patient complaints and radiographic evidence (OPG) to determine referral plans, with 246 real cases showing substantial gap between MLLMs and human dentists.
Details
Motivation: Dental triage is a safety-critical clinical task requiring multimodal reasoning (patient complaints + radiographic evidence), but existing AI systems lack proper evaluation benchmarks for this complex clinical decision-making process.Method: Created Dental-TriageBench with 246 de-identified real cases annotated with expert reasoning trajectories and hierarchical triage labels. Benchmarked 19 proprietary, open-source, and medical-domain MLLMs against junior dentists as human baseline.
Result: Substantial human-model gap exists, especially on fine-grained treatment-level triage. Models require both complaint and OPG information, and errors concentrate on cases with multiple referral domains where MLLMs produce overly narrow referral sets with omission-heavy errors.
Conclusion: Dental-TriageBench provides realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care, highlighting current limitations of MLLMs in complex clinical reasoning tasks.
Abstract: Dental triage is a safety-critical clinical routing task that requires integrating multimodal clinical information (e.g., patient complaints and radiographic evidence) to determine complete referral plans. We present Dental-TriageBench, the first expert-annotated benchmark for reasoning-driven multimodal dental triage. Built from authentic outpatient workflows, it contains 246 de-identified cases annotated with expert-authored golden reasoning trajectories, together with hierarchical triage labels. We benchmark 19 proprietary, open-source, and medical-domain MLLMs against three junior dentists serving as the human baseline, and find a substantial human–model gap, on fine-grained treatment-level triage. Further analyses show that accurate triage requires both complaint and OPG information, and that model errors concentrate on cases with multiple referral domains, where MLLMs tend to produce overly narrow referral sets and omission-heavy errors. Dental-TriageBench provides a realistic testbed for developing multimodal clinical AI systems that are more clinically grounded, coverage-aware, and safer for downstream care.
[9] Bi-Predictability: A Real-Time Signal for Monitoring LLM Interaction Integrity
Wael Hafez, Amir Nazeri
Main category: cs.CL
TL;DR: The paper introduces Information Digital Twin (IDT), a lightweight architecture using bi-predictability to monitor multi-turn LLM interaction integrity in real-time, detecting structural uncoupling separate from semantic quality.
Details
Motivation: Current LLM evaluation methods focus on output semantics or token confidence but cannot monitor real-time structural coherence in multi-turn interactions, leaving systems vulnerable to gradual degradation that goes undetected.Method: Proposes Information Digital Twin (IDT) using bi-predictability (P) - an information theoretic measure computed from raw token frequency statistics across context-response-next prompt loops without secondary inference or embeddings.
Result: IDT detected injected disruptions with 100% sensitivity across 4,500 conversational turns. Structural coupling and semantic quality were separable: P aligned with structural consistency in 85% of conditions but with semantic judge scores in only 44%, revealing “silent uncoupling” regime.
Conclusion: IDT provides scalable, computationally efficient mechanism for real-time AI assurance by decoupling structural monitoring from semantic evaluation, enabling detection of conversational degradation even when outputs remain semantically high-quality.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes autonomous and interactive workflows, where reliability demands continuous, multi-turn coherence. However, current evaluation methods either rely on post-hoc semantic judges, measure unidirectional token confidence (e.g., perplexity), or require compute-intensive repeated sampling (e.g., semantic entropy). Because these techniques focus exclusively on the model’s output distribution, they cannot monitor whether the underlying interaction remains structurally coupled in real time, leaving systems vulnerable to gradual, undetected degradation. Here we show that multi-turn interaction integrity can be continuously monitored using bi-predictability (P), a fundamental information theoretic measure computed directly from raw token frequency statistics. We introduce the Information Digital Twin (IDT), a lightweight architecture that estimates P across the context, response, next prompt loop without secondary inference or embeddings. Across 4,500 conversational turns between a student model and three frontier teacher models, the IDT detected injected disruptions with 100% sensitivity. Crucially, we demonstrate that structural coupling and semantic quality are empirically and practically separable: P aligned with structural consistency in 85% of conditions, but with semantic judge scores in only 44%. This reveals a critical regime of “silent uncoupling” where LLMs produce high-scoring outputs despite degrading conversational context. By decoupling structural monitoring from semantic evaluation, the IDT provides a scalable, computationally efficient mechanism for real-time AI assurance and closed-loop regulation
[10] OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
Qianqi Yan, Yichen Guo, Ching-Chen Kuo, Shan Jiang, Hang Yin, Yang Zhao, Xin Eric Wang
Main category: cs.CL
TL;DR: OmniTrace is a lightweight, model-agnostic framework for attribution in multimodal LLMs that traces generated tokens back to supporting input sources across vision, audio, and video modalities during decoding.
Details
Motivation: Existing attribution methods don't work well for autoregressive, decoder-only MLLMs performing open-ended multimodal generation. There's a need to identify which input sources (text, image, audio, video) support each generated statement in omni-modal models.Method: Formalizes attribution as generation-time tracing over causal decoding process. Converts token-level signals (attention weights, gradients) into span-level, cross-modal explanations during decoding. Uses confidence-weighted and temporally coherent aggregation to select concise supporting sources without retraining.
Result: Evaluated on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks. Generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, remaining robust across multiple underlying attribution signals.
Conclusion: Treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models, enabling better understanding of which multimodal inputs support generated content.
Abstract: Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.
[11] Mathematical Reasoning Enhanced LLM for Formula Derivation: A Case Study on Fiber NLI Modellin
Yao Zhang, Yuchen Song, Xiao Luo, Shengnan Li, Xiaotian Jiang, Min Zhang, Danshi Wang
Main category: cs.CL
TL;DR: LLM-based approach for symbolic physical reasoning in optical communication formula derivation, specifically for fiber nonlinear interference modeling.
Details
Motivation: While LLMs excel at code generation and text synthesis, their potential for symbolic physical reasoning in domain-specific scientific problems remains underexplored, particularly in optical communication formula derivation.Method: Mathematical reasoning enhanced generative AI approach using structured prompts to guide LLMs for optical communication formula derivation, focusing on fiber nonlinear interference modeling (ISRS GN expressions).
Result: Successfully reconstructed known closed-form ISRS GN expressions and derived novel approximation for multi-span C and C+L band transmissions. LLM-derived model produces central-channel GSNRs nearly identical to baseline models with mean absolute error below 0.109 dB.
Conclusion: LLMs can be effectively guided for symbolic physical reasoning in domain-specific scientific problems, demonstrating both physical consistency and practical accuracy in optical communication modeling.
Abstract: Recent advances in large language models (LLMs) have demonstrated strong capabilities in code generation and text synthesis, yet their potential for symbolic physical reasoning in domain-specific scientific problems remains underexplored. We present a mathematical reasoning enhanced generative AI approach for optical communication formula derivation, focusing on the fiber nonlinear interference modelling. By guiding an LLM with structured prompts, we successfully reconstructed the known closed-form ISRS GN expressions and further derived a novel approximation tailored for multi-span C and C+L band transmissions. Numerical validations show that the LLM-derived model produces central-channel GSNRs nearly identical to baseline models, with mean absolute error across all channels and spans below 0.109 dB, demonstrating both physical consistency and practical accuracy.
[12] Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
Haichuan Hu, Ye Shang, Quanjun Zhang
Main category: cs.CL
TL;DR: Empirical study of ClawHub, a public LLM agent skill registry, analyzing 26,502 skills for language distribution, functional organization, popularity, and security risks.
Details
Motivation: Skill ecosystems are becoming important for LLM agent systems but their functionality, ecosystem structure, and security risks remain underexplored despite rapid growth.Method: Built and normalized dataset of 26,502 skills from ClawHub, conducted systematic analysis of language distribution, functional organization, popularity, and security signals using clustering techniques. Also formulated submission-time skill risk prediction and constructed balanced benchmark of 11,010 skills, testing 12 classifiers.
Result: Found cross-lingual differences: English skills are infrastructure-oriented (APIs, automation, memory), Chinese skills are application-oriented (media generation, social content, finance). Over 30% of skills labeled suspicious/malicious. Best classifier (Logistic Regression) achieved 72.62% accuracy and 78.95% AUROC for risk prediction, with primary documentation as most informative signal.
Conclusion: Public skill registries are both key enablers of agent capability reuse and new surfaces for ecosystem-scale security risk, highlighting need for better safety observability and early risk assessment.
Abstract: Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community-driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross-lingual differences: English skills are more infrastructure-oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application-oriented, with clearer scenario-driven clusters such as media generation, social content production, and finance-related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission-time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best Logistic Regression achieves a accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission-time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem-scale security risk.
[13] Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic
Abinav Rao, Sujan Rachuri, Nikhil Vemuri
Main category: cs.CL
TL;DR: A benchmark called Novel Operator Test evaluates LLMs’ reasoning vs. pattern retrieval by testing Boolean operators under unfamiliar names across depths 1-10, revealing models can have correct reasoning but wrong final answers.
Details
Motivation: Current benchmarks cannot distinguish between genuine reasoning and pattern retrieval in LLMs. Models can execute chain-of-thought reasoning correctly yet still produce wrong final answers, indicating a reasoning-output dissociation that needs rigorous evaluation.Method: Introduces Novel Operator Test benchmark that separates operator logic from operator names. Tests Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each). Uses Trojan operator (XOR’s truth table under novel name) to isolate genuine reasoning difficulty from name unfamiliarity.
Result: Reveals reasoning-output dissociation: at Claude Sonnet 4’s depth 7, all 31 errors had verifiably correct reasoning but wrong declared answers. Identifies two failure types: strategy failures at depth 2 (models attempt terse retrieval) and content failures at depth 7 (models reason fully but err systematically). Trojan operator shows name alone doesn’t gate reasoning (p >= 0.49). Llama’s novelty gap widens to 28pp at depth 8-9.
Conclusion: The benchmark successfully detects reasoning-output dissociation that existing benchmarks miss, revealing fundamental limitations in LLMs’ reasoning capabilities beyond pattern retrieval. Models can execute correct reasoning steps but still produce wrong final answers, indicating deeper issues in reasoning integration.
Abstract: LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4’s depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR’s truth table under a novel name) confirms name alone does not gate reasoning (p >= 0.49), while Llama’s novelty gap widens to 28pp at depth 8-9 with the Trojan at 92-100%, isolating genuine difficulty with novel logic from name unfamiliarity.
[14] Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data
Andresa Rodrigues de Campos, David Lee, Imry Kissos, Piyush Paritosh
Main category: cs.CL
TL;DR: LLMs can learn encoding dictionaries in-context to perform analysis directly on compressed representations, enabling lossless prompt compression without fine-tuning.
Details
Motivation: Address token limits and API costs for LLMs by compressing repetitive patterns in prompts without sacrificing analytical accuracy, enabling cost-effective analysis of large-scale datasets.Method: Dictionary encoding approach that identifies repetitive subsequences at multiple length scales, replaces them with compact meta-tokens, and provides compression dictionary in system prompt for LLMs to interpret meta-tokens correctly.
Result: Achieves compression ratios up to 80% with exact match rates >0.99 for template-based compression and average Levenshtein similarity >0.91 for algorithmic compression; compression ratio explains <2% of variance in similarity metrics.
Conclusion: Training-free prompt compression enables cost-effective LLM deployment by addressing token limits and API costs while preserving analytical accuracy, particularly effective for repetitive datasets.
Abstract: In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded representations. This finding enables lossless prompt compression via dictionary encoding without model fine-tuning: frequently occurring subsequences are replaced with compact meta-tokens, and when provided with the compression dictionary in the system prompt, LLMs correctly interpret these meta-tokens during analysis, producing outputs equivalent to those from uncompressed inputs. We present a compression algorithm that identifies repetitive patterns at multiple length scales, incorporating a token-savings optimization criterion that ensures compression reduces costs by preventing dictionary overhead from exceeding savings. The algorithm achieves compression ratios up to 80$%$ depending on dataset characteristics. To validate that LLM analytical accuracy is preserved under compression, we use decompression as a proxy task with unambiguous ground truth. Evaluation on the LogHub 2.0 benchmark using Claude 3.7 Sonnet demonstrates exact match rates exceeding 0.99 for template-based compression and average Levenshtein similarity scores above 0.91 for algorithmic compression, even at compression ratios of 60$%$-80$%$. Additionally, compression ratio explains less than 2$%$ of variance in similarity metrics, indicating that decompression quality depends on dataset characteristics rather than compression intensity. This training-free approach works with API-based LLMs, directly addressing fundamental deployment constraints – token limits and API costs – and enabling cost-effective analysis of large-scale repetitive datasets, even as data patterns evolve over time.
[15] Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models
Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy
Main category: cs.CL
TL;DR: LLMs show scale-dependent phase transition in hallucination detection: models under 400M parameters show no reliable factuality signal, while models above ~1B parameters show peak detectability before token generation, with instruction tuning enabling pre-generation knowledge encoding.
Details
Motivation: Despite serious consequences of hallucinations in critical domains like healthcare and finance, little is known about when LLMs decide to hallucinate. Recent work shows models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood.Method: Studied temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M-7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). Analyzed scale-dependent phase transitions and pre-generation signals using statistical significance testing.
Result: Identified scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy (AUC = 0.48-0.67) with no reliable factuality signal. Above ~1B parameters, peak detectability occurs at position zero (before token generation) then declines. Pythia-1.4B (p=0.012) and Qwen2.5-7B (p=0.038) show statistically significant pre-generation signals. At 7B scale, Pythia-6.9B shows flat temporal profile while instruction-tuned Qwen2.5-7B shows dominant pre-generation effect.
Conclusion: Raw scale alone is insufficient for pre-commitment encoding - knowledge organization through instruction tuning or equivalent post-training is required. Activation steering fails to correct hallucinations, confirming the signal is correlational rather than causal. Findings provide scale-calibrated detection protocols and hypothesis on instruction tuning’s role in developing knowledge circuits for factual generation.
Abstract: When do large language models decide to hallucinate? Despite serious consequences in healthcare, law, and finance, few formal answers exist. Recent work shows autoregressive models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood. We study the temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M–7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). We identify a scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy at every generation position (AUC = 0.48–0.67), indicating no reliable factuality signal. Above $\sim$1B parameters, a qualitatively different regime emerges where peak detectability occurs at position zero – before any tokens are generated – then declines during generation. This pre-generation signal is statistically significant in both Pythia-1.4B (p = 0.012) and Qwen2.5-7B (p = 0.038), spanning distinct architectures and training corpora. At the 7B scale, we observe a striking dissociation: Pythia-6.9B (base model, trained on The Pile) produces a flat temporal profile ($Δ$ = +0.001, p = 0.989), while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect. This indicates raw scale alone is insufficient – knowledge organization through instruction tuning or equivalent post-training is required for pre-commitment encoding. Activation steering along probe-derived directions fails to correct hallucinations across all models, confirming the signal is correlational rather than causal. Our findings provide scale-calibrated detection protocols and a concrete hypothesis on instruction tuning’s role in developing knowledge circuits supporting factual generation.
[16] Curation of a Palaeohispanic Dataset for Machine Learning
Gonzalo Martínez-Fernández, Jose F Quesada, Agustín Riscos-Núñez, Francisco José Salguero-Lamillar
Main category: cs.CL
TL;DR: A computational approach to Palaeohispanic language study through creation of structured dataset for machine learning applications
Details
Motivation: Palaeohispanic languages (pre-Roman Iberian Peninsula languages) have varying degrees of decipherment and most studies have been purely linguistic. Computational approaches could benefit the field, but existing resources are limited and in unsuitable formats for machine learning techniques.Method: Construction of a structured dataset from existing Palaeohispanic language resources to make them suitable for computational analysis and machine learning applications.
Result: A structured dataset is created that organizes Palaeohispanic language materials in a format suitable for computational techniques, enabling future machine learning applications in this research area.
Conclusion: The structured dataset provides a foundation for computational approaches to Palaeohispanic language study, potentially accelerating decipherment progress through machine learning techniques.
Abstract: Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd Century B.C. Their study was really put on motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-sillabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully known to this day. Most of the studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in an unsuitable format for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.
[17] EVE: A Domain-Specific LLM Framework for Earth Intelligence
Àlex R. Atrio, Antonio Lopez, Jino Rohit, Yassine El Ouahidi, Marcello Politi, Vijayasri Iyer, Umar Jamil, Sébastien Bratières, Nicolas Longépé
Main category: cs.CL
TL;DR: EVE is an open-source framework for developing domain-specialized LLMs for Earth Intelligence, featuring a 24B parameter model (EVE-Instruct) that outperforms comparable models on Earth observation benchmarks while maintaining general capabilities.
Details
Motivation: To create the first open-source, end-to-end initiative for developing and deploying domain-specialized large language models specifically for Earth Intelligence applications, addressing the lack of specialized models in this domain.Method: Built EVE-Instruct, a domain-adapted 24B parameter model based on Mistral Small 3.2, optimized for reasoning and question answering. Created curated training corpora and systematic domain-specific evaluation benchmarks covering multiple question types. Integrated RAG and hallucination-detection into a production system.
Result: EVE-Instruct outperforms comparable models on newly constructed Earth Observation and Earth Sciences benchmarks while preserving general capabilities. The system has been deployed via API and GUI, supporting 350 pilot users. All models, datasets, and code are released under open licenses.
Conclusion: EVE successfully establishes the first open-source framework for Earth Intelligence LLMs, providing specialized models, curated datasets, and evaluation benchmarks that advance the field while maintaining open accessibility.
Abstract: We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sciences benchmarks, it outperforms comparable models while preserving general capabilities. We release curated training corpora and the first systematic domain-specific evaluation benchmarks, covering MCQA, open-ended QA, and factuality. EVE further integrates RAG and a hallucination-detection pipeline into a production system deployed via API and GUI, supporting 350 pilot users so far. All models, datasets, and code are ready to be released under open licenses as contributions to our field at huggingface.co/eve-esa and github.com/eve-esa.
[18] LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang
Main category: cs.CL
TL;DR: LiveClawBench: A benchmark for evaluating LLM agents on real-world assistant tasks using a Triple-Axis Complexity Framework (Environment Complexity, Cognitive Demand, Runtime Adaptability)
Details
Motivation: Existing benchmarks evaluate LLM agents under isolated sources of difficulty, creating a gap between current evaluation settings and the compositional challenges that arise in practical deployment of real-world assistant tasks.Method: Developed a Triple-Axis Complexity Framework based on analysis of real OpenClaw usage cases, then constructed a pilot benchmark with explicit complexity-factor annotations covering real-world assistant tasks with compositional difficulty.
Result: Created LiveClawBench benchmark that provides a principled foundation for evaluating LLM agents in realistic assistant settings, establishing a basis for future expansion across task domains and complexity axes.
Conclusion: The framework and benchmark address the gap between isolated evaluation settings and practical deployment challenges, with ongoing efforts to enrich case collections for more comprehensive domain and complexity coverage.
Abstract: LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.
[19] Beyond Arrow’s Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration
Sayan Kumar Chaki, Antoine Gourru, Julien Velcin
Main category: cs.CL
TL;DR: Multi-agent fairness emerges through structured debate where ethically aligned agents negotiate with biased counterparts, showing that joint allocations can satisfy fairness criteria that neither agent would reach alone.
Details
Motivation: Traditional fairness studies focus on single models, but as LLMs become more agentic, fairness should be studied as an emergent property through interaction and exchange between multiple agents.Method: Controlled hospital triage framework with two agents negotiating over three structured debate rounds. One agent is aligned to specific ethical frameworks via RAG, while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need.
Result: Aligned agents shape negotiation strategies and allocation patterns; joint final allocations can satisfy fairness criteria neither agent would reach alone. Aligned agents moderate bias through contestation rather than override, restoring access for marginalized groups without fully converting biased counterparts.
Conclusion: Fairness should be repositioned as an emergent, procedural property of decentralized agent interaction, with the system rather than the individual agent as the appropriate unit of evaluation, connecting to Arrow’s Impossibility Theorem constraints.
Abstract: Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent’s allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow’s Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.
[20] PersonaVLM: Long-Term Personalized Multimodal LLMs
Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan
Main category: cs.CL
TL;DR: PersonaVLM is a personalized multimodal agent framework that enables long-term personalization of MLLMs by integrating memory extraction, multi-turn reasoning, and response alignment to user’s evolving preferences.
Details
Motivation: Current MLLMs have limited ability to generate responses aligned with individual preferences, with prior approaches only enabling static, single-turn personalization that fails to capture users' evolving preferences and personality over time.Method: Transforms general-purpose MLLMs into personalized assistants through three key capabilities: (1) Remembering - extracts and summarizes chronological multimodal memories into a personalized database, (2) Reasoning - conducts multi-turn reasoning by retrieving and integrating relevant memories, (3) Response Alignment - infers user’s evolving personality to ensure outputs remain aligned.
Result: Improves baseline by 22.4% on Persona-MME benchmark and 9.8% on PERSONAMEM benchmark under 128k context, while outperforming GPT-4o by 5.2% and 2.0% respectively. Introduces Persona-MME benchmark with over 2,000 interaction cases across 7 key aspects and 14 fine-grained tasks.
Conclusion: PersonaVLM effectively enables long-term personalization of multimodal large language models, addressing the limitation of static personalization approaches by capturing users’ evolving preferences through integrated memory, reasoning, and alignment capabilities.
Abstract: Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users’ evolving preferences and personality over time (see Fig.1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user’s evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method’s effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.
[21] DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
Md Hasebul Hasan, Krity Haque Charu, Eshwara Prasad Sridhar, Shuchisnigdha Deb, Mohammad A. Islam
Main category: cs.CL
TL;DR: DeEscalWild: A benchmark dataset for de-escalation training using Small Language Models, created from real police-civilian interaction videos to enable lightweight, real-time training systems.
Details
Motivation: Traditional law enforcement de-escalation training lacks scalability and realism. While LLMs enable dynamic simulations, they're too computationally heavy for portable field training hardware. SLMs offer real-time alternatives but lack domain-specific training data.Method: Created DeEscalWild dataset from 5,000 raw police-civilian interaction videos using hybrid filtering (human verification + LLM-as-a-Judge) to distill 1,500 high-fidelity scenarios. Fine-tuned SLMs on this domain-specific corpus.
Result: Fine-tuned SLMs significantly outperformed base models across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Qwen 2.5 (3B-Instruct) surpassed general-purpose Gemini 2.5 Flash, showing domain-optimized SLMs achieve superior performance with less computational cost.
Conclusion: Domain-specific datasets enable effective SLMs for real-time de-escalation training, establishing infrastructure for accessible, low-latency, privacy-preserving officer training systems at the edge.
Abstract: Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from open-source video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process - combining human-in-the-loop verification with LLM-as-a-Judge evaluation - to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge.
[22] Document-tuning for robust alignment to animals
Jasmine Brazilek, Miles Tidmarsh
Main category: cs.CL
TL;DR: Paper investigates robustness of value alignment via synthetic document finetuning using animal compassion as a case study, showing initial success but degradation through subsequent training.
Details
Motivation: To understand how value alignment through synthetic document finetuning holds up through typical training pipelines, using animal compassion as a test case that is both important and orthogonal to existing alignment efforts.Method: Developed Animal Harm Benchmark (AHB) with 26 questions across 13 ethical dimensions, used synthetic documents for value alignment finetuning, tested generalization to human compassion, and measured degradation through subsequent unrelated instruction-tuning.
Result: Training with 3000 synthetic documents achieved 77% on AHB vs 40% for instruction-tuning, with generalization to human compassion and no degradation in standard safety or capabilities. However, subsequent unrelated instruction-tuning degraded the intervention, eliminating advantage after 5000 samples.
Conclusion: Document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines, as they can be degraded by subsequent unrelated training.
Abstract: We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.
[23] Memp: Exploring Agent Procedural Memory
Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Main category: cs.CL
TL;DR: Memp is a system that gives LLM agents learnable, updatable procedural memory by distilling past trajectories into step-by-step instructions and script-like abstractions, improving task performance and efficiency.
Details
Motivation: Current LLM-based agents have brittle procedural memory that is either manually engineered or entangled in static parameters, limiting their ability to learn and adapt from experience over time.Method: Proposes Memp that distills past agent trajectories into fine-grained step-by-step instructions and higher-level script-like abstractions, with strategies for Build, Retrieval, and Update of procedural memory. Uses a dynamic regimen that continuously updates, corrects, and deprecates memory contents.
Result: Empirical evaluation on TravelPlanner and ALFWorld shows agents achieve steadily higher success rates and greater efficiency on analogous tasks as memory repository is refined. Procedural memory from stronger models can be migrated to weaker models for substantial performance gains.
Conclusion: Memp successfully endows agents with learnable, updatable, lifelong procedural memory that evolves with experience, improving performance and enabling knowledge transfer between models.
Abstract: Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model can also yield substantial performance gains. Code is available at https://github.com/zjunlp/MemP.
[24] Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?
Sofia Morgado, Filipa Valdeira, Niklas Sander, Diogo Ferreira, Marta Vilela, Miguel Menezes, Cláudia Soares
Main category: cs.CL
TL;DR: LLMs for extracting physiological measurements from Portuguese coronary angiography reports, with multi-stage evaluation framework
Details
Motivation: Coronary angiography reports contain valuable physiological measurements but in unstructured natural language format, limiting research use. Need automated extraction methods for Portuguese clinical text.Method: Tested various LLMs (general and medical) with different prompting strategies (zero-shot, few-shot, few-shot with implausible examples). Used constrained generation and RegEx post-processing. Proposed multi-stage evaluation framework for format validity, value detection, and correctness.
Result: Non-medical models performed similarly to medical ones. Llama with zero-shot performed best. GPT-OSS most robust to prompt changes. MedGemma similar to non-medical models, but MedLlama had format issues. Constrained generation decreased performance but enabled use of specific models.
Conclusion: LLMs show potential for extracting physiological indices from Portuguese CAG reports. Best results with general models, not medical ones. Multi-stage evaluation framework useful for clinical applications.
Abstract: Coronary angiography (CAG) reports contain clinically relevant physiological measurements, yet this information is typically in the form of unstructured natural language, limiting its use in research. We investigate the use of Large Language Models (LLMs) to automatically extract these values, along with their anatomical locations, from Portuguese CAG reports. To our knowledge, this study is the first addressing physiology indexes extraction from a large (1342 reports) corpus of CAG reports, and one of the few focusing on CAG or Portuguese clinical text. We explore local privacy-preserving general-purpose and medical LLMs under different settings. Prompting strategies included zero-shot, few-shot, and few-shot prompting with implausible examples. In addition, we apply constrained generation and introduce a post-processing step based on RegEx. Given the sparsity of measurements, we propose a multi-stage evaluation framework separating format validity, value detection, and value correctness, while accounting for asymmetric clinical error costs. This study demonstrates the potential of LLMs in for extracting physiological indices from Portuguese CAG reports. Non-medical models performed similarly, the best results were obtained with Llama with a zero-shot prompting, while GPT-OSS demonstrated the highest robustness to changes in the prompts. While MedGemma demonstrated similar results to non-medical models, MedLlama’s results were out-of-format in the unconstrained setting, and had a significant lower performance in the constrained one. Changes in the prompt techinique and adding a RegEx layer showed no significant improvement across models, while using constrained generation decreased performance, although having the benefit of allowing the usage of specific models that are not able to conform with the templates.
[25] IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmiki’s Ramayana Across Indian Languages
Sumesh VP
Main category: cs.CL
TL;DR: The paper introduces IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the chapter level, with English and Malayalam complete and other languages in progress.
Details
Motivation: Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited, creating a need for structured parallel corpora for comparative literature and multilingual NLP research.Method: Created a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the sarga (chapter) level, distributed in JSONL format with explicit provenance metadata.
Result: Developed the IWLV Ramayana Corpus with complete English and Malayalam layers, and Hindi, Tamil, Kannada, and Telugu layers in active production - the first sarga-aligned multilingual parallel corpus of Valmiki Ramayana with provenance metadata.
Conclusion: This corpus enables applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing, addressing the gap in computational resources for cross-linguistic analysis of the Ramayana tradition.
Abstract: The Ramayana is among the most influential literary traditions of South and Southeast Asia, transmitted across numerous linguistic and cultural contexts over two millennia. Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited. This paper introduces the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the level of the sarga (chapter). The corpus currently includes complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers in active production. The dataset is distributed in structured JSONL format with explicit provenance metadata, enabling applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing. To our knowledge, this is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana with explicit provenance metadata and machine-readable format.
[26] Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang
Main category: cs.CL
TL;DR: IPVRM learns prefix-conditioned value functions for process reward modeling, using TD differences to derive step-level rewards from trajectory-level labels, improving step verification and enabling Distribution-Level RL for dense counterfactual updates.
Details
Motivation: Implicit process reward models (PRMs) are expensive to scale due to requiring step annotations or heavy verification pipelines. While implicit PRMs learn from trajectory-level labels, they suffer from train-inference mismatch where token-level credits are weakly identified and may not faithfully reflect correct reasoning steps.Method: Proposes Implicit Prefix-Value Reward Model (IPVRM) that learns a prefix-conditioned value function estimating probability of eventual correctness, deriving step signals via temporal-difference differences. Also introduces Distribution-Level RL (DistRL) that computes TD advantages for both sampled tokens and high-probability candidate tokens.
Result: IPVRM substantially improves step-verification F1 on ProcessBench. DistRL offers limited gains with miscalibrated implicit rewards but consistently improves downstream reasoning when paired with IPVRM.
Conclusion: IPVRM addresses the train-inference mismatch in implicit PRMs by learning calibrated prefix values, enabling reliable step-level reward signals and effective Distribution-Level RL for dense counterfactual updates without additional rollouts.
Abstract: Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.
[27] InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
Oliver Bentham, Vivek Srikumar
Main category: cs.CL
TL;DR: InfiniteScienceGym: A procedurally generated benchmark for evaluating LLMs’ scientific reasoning from empirical data without real-world dataset biases.
Details
Motivation: Existing benchmarks for evaluating LLMs as scientific assistants inherit publication bias, known-knowledge bias, label noise, and require large storage. Need controlled evaluation of evidence-grounded reasoning, abstention, and tool use.Method: Procedurally generates self-contained scientific repositories with realistic structure, files, and tabular data from a seed. Privileged QA generator produces answerable/unanswerable questions with exact ground truth. Enables evaluation of evidence-grounded reasoning without distributing large static corpora.
Result: No models achieve >45% accuracy overall. Recognizing unanswerable questions remains major weakness. Stronger models use tools more effectively rather than just consuming more tokens.
Conclusion: InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes hard to evaluate with published datasets alone. Provides controlled environment for evaluating scientific reasoning capabilities.
Abstract: Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.
[28] Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection
Bach Phan-Tat, Kris Heylen, Dirk Geeraerts, Stefano De Pascale, Dirk Speelmana
Main category: cs.CL
TL;DR: Critical analysis of SemEval-2020 Task 1 benchmark for lexical semantic change detection, identifying limitations in operationalization, data quality, and benchmark design that affect validity and generalizability.
Details
Motivation: To critically evaluate the most influential benchmark for lexical semantic change detection (SemEval-2020 Task 1) and identify its limitations to improve future research in this area.Method: Three-part evaluative framework examining: 1) operationalization (how semantic change is modeled), 2) data quality (corpus and preprocessing issues), and 3) benchmark design (target sets and language coverage).
Result: Identifies significant limitations: narrow operationalization that misses gradual/constructional changes; substantial data quality issues (OCR noise, preprocessing errors); and design flaws (small target sets, limited languages).
Conclusion: The benchmark should be treated as a partial test bed rather than definitive measure. Future work needs broader theories of semantic change, transparent preprocessing, expanded language coverage, and more realistic evaluation.
Abstract: This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of bench-mark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection
[29] Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha
Main category: cs.CL
TL;DR: HETA is a novel attribution framework for decoder-only language models that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to provide context-aware, causally faithful attributions for autoregressive generation.
Details
Motivation: Existing attribution methods are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models.Method: HETA combines three components: semantic transition vector (captures token-to-token influence across layers), Hessian-based sensitivity scores (models second-order effects), and KL divergence (measures information loss when tokens are masked). Also introduces a benchmark dataset for evaluating attribution quality in generative settings.
Result: Empirical evaluations across multiple models and datasets show HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations.
Conclusion: HETA establishes a new standard for interpretability in autoregressive language models by providing context-aware, causally faithful, and semantically grounded attributions.
Abstract: Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.
[30] Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
Dikshant Kukreja, Kshitij Sah, Gautam Gupta, Avinash Anand, Rajiv Ratn Shah, Zhengkui Wang, Aik Beng Ng, Erik Cambria
Main category: cs.CL
TL;DR: Scaling laws show language models become better at ignoring false claims but worse at ignoring irrelevant tokens, with semantic and non-semantic contexts scaling in opposite directions.
Details
Motivation: To understand the paradoxical behavior where larger language models simultaneously improve at ignoring false claims while becoming worse at ignoring irrelevant tokens, and to formalize this through scaling laws for contextual entrainment.Method: Analyzed the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families to study contextual entrainment scaling, examining how models favor tokens appearing in context regardless of relevance across different context types.
Result: Found predictable power-law scaling for entrainment with opposite trends: semantic contexts show decreasing entrainment with scale (models become more resistant to misinformation), while non-semantic contexts show increasing entrainment (models become more prone to copying arbitrary tokens). Largest models are 4x more resistant to counterfactual misinformation but 2x more prone to copying irrelevant tokens.
Conclusion: Semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition - scaling alone doesn’t resolve context sensitivity but reshapes it, creating a trade-off between different types of contextual processing.
Abstract: Larger language models become simultaneously better and worse at handling contextual information – better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition – scaling alone does not resolve context sensitivity, it reshapes it.
[31] L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification
Rishik Kondadadi, John E. Ortega
Main category: cs.CL
TL;DR: L2D-Clinical framework learns when to defer from BERT classifiers to LLMs for clinical text classification, improving accuracy by selectively leveraging each model’s strengths.
Details
Motivation: Clinical text classification requires choosing between specialized BERT models and general-purpose LLMs, but neither dominates across all instances. The paper aims to develop a framework that can adaptively defer between models to improve overall performance.Method: Introduces Learning to Defer for clinical text (L2D-Clinical), which learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts, this approach enables adaptive deferral between AI models.
Result: On ADE detection: L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances. On MIMIC treatment outcome classification: achieves F1=0.980 (+9.3 points over BERT) by deferring 16.8% of cases to the LLM. The framework learns to leverage LLM strengths while minimizing API costs.
Conclusion: L2D-Clinical successfully learns to selectively leverage LLM strengths when they complement BERT classifiers, improving clinical text classification accuracy while managing computational costs through strategic deferral.
Abstract: Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral-improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM’s high recall compensates for BERT’s misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.
[32] English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
Mehak Dhaliwal, Shashwat Chaurasia, Yao Qin, Dezhi Hong, Thomas Butler
Main category: cs.CL
TL;DR: Systematic study shows multilingual post-training improves performance across languages, with low-resource languages benefiting most and even minimal multilinguality helping English performance and cross-lingual generalization.
Details
Motivation: Current post-training pipelines for large language models are predominantly English-centric, leading to performance disparities across languages. The paper aims to systematically study how training language coverage, model scale, and task domain interact in multilingual settings.Method: Conducted 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks. Used models up to 8B parameters and systematically varied language coverage during post-training.
Result: Increasing language coverage benefits all tasks and model scales, with low-resource languages benefiting most and high-resource languages plateauing rather than degrading. Even minimal multilinguality (single non-English language) improves both English performance and cross-lingual generalization. At sufficient language diversity, zero-shot cross-lingual transfer can match or exceed direct language inclusion in low-diversity settings.
Conclusion: English-only post-training is largely suboptimal. Multilingual post-training improves performance across languages, with benefits for both high and low-resource languages. Zero-shot transfer works well with sufficient language diversity, though gains remain limited for typologically distant, low-resource languages.
Abstract: Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.
[33] Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus
John E. Ortega, Rodolfo Zevallos, Fabricio Carraro
Main category: cs.CL
TL;DR: A unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three TTS architectures (XTTS v2, F5-TTS, DiFlow-TTS) with cross-lingual transfer to address data scarcity in Quechua.
Details
Motivation: To develop inclusive TTS systems for political and legal content in low-resource settings, specifically addressing the challenge of synthesizing speech for indigenous languages like Quechua which suffer from data scarcity while maintaining quality in Spanish.Method: Used three state-of-the-art TTS architectures (XTTS v2, F5-TTS, DiFlow-TTS) trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions. Leveraged bilingual and multilingual TTS capabilities and cross-lingual transfer to improve synthesis quality in both languages.
Result: Developed a unified pipeline that mitigates data scarcity in Quechua while preserving naturalness in Spanish. Released trained checkpoints, inference code, and synthesized audio for each constitutional article as reusable resources.
Conclusion: This work contributes to inclusive TTS systems for political/legal content in low-resource multilingual contexts, providing valuable resources for speech technologies in indigenous language settings.
Abstract: We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.
[34] AgentSPEX: An Agent SPecification and EXecution Language
Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang
Main category: cs.CL
TL;DR: AgentSPEX is a specification language for LLM-agent workflows with explicit control flow, modular structure, and visual editing tools, addressing limitations of reactive prompting and Python-coupled frameworks.
Details
Motivation: Current LLM agent systems have two main issues: reactive prompting leaves control flow implicit and hard to control, while orchestration frameworks like LangGraph tightly couple workflow logic with Python, making agents difficult to maintain and modify.Method: Introduces AgentSPEX, an Agent Specification and Execution Language that supports typed steps, branching/loops, parallel execution, reusable submodules, and explicit state management. Includes a customizable agent harness with tool access, sandboxed environment, checkpointing, verification, and logging, plus a visual editor with synchronized graph and workflow views.
Result: Evaluated on 7 benchmarks and includes ready-to-use agents for deep research and scientific research. User study shows AgentSPEX provides more interpretable and accessible workflow-authoring paradigm than existing frameworks.
Conclusion: AgentSPEX addresses key limitations in current LLM agent systems by providing structured specification language with explicit control flow, modular design, and visual authoring tools, making agent workflows more maintainable, interpretable, and accessible.
Abstract: Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.
[35] Peer-Predictive Self-Training for Language Model Reasoning
Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, Yiling Chen
Main category: cs.CL
TL;DR: PST is a label-free self-training framework where multiple language models improve collaboratively using aggregated responses as internal training signals, with PMI-based scaling for updates.
Details
Motivation: The paper addresses the challenge of enabling language models to self-improve without external supervision, seeking mechanisms for continued enhancement through internal collaborative learning rather than relying on external labels or teacher-student hierarchies.Method: Proposes Peer-Predictive Self-Training (PST) where multiple models generate responses sequentially to prompts, aggregate their answers, and use pointwise mutual information (PMI) to measure how informative each intermediate response is about the aggregate. This PMI signal scales self-training updates - aligned responses are updated less, while misaligned ones are updated more.
Result: On mathematical reasoning benchmarks (SimulEq, Math500, MultiArith), PST improves exact-match accuracy by 2.2-4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap by 26-40%, requiring no external supervision.
Conclusion: Cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training, enabling collaborative improvement without external supervision or hierarchical structures.
Abstract: Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
[36] TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models
Yarui Cao, Kai Liu
Main category: cs.CL
TL;DR: A novel Parameter-Efficient Fine-Tuning (PEFT) method called TLoRA+ that incorporates a specialized optimizer into pre-trained model weight matrices, improving performance while maintaining efficiency.
Details
Motivation: To develop a more effective PEFT method that goes beyond existing approaches like LoRA, which matches full fine-tuning performance but may have room for improvement. The goal is to enhance adaptation capabilities without sacrificing efficiency or adding inference latency.Method: Proposes TLoRA+, which incorporates a specialized optimizer into the weight matrices of pre-trained models. This approach builds upon Low-Rank Adaptation (LoRA) principles but adds the TLoRA+ optimizer component to enhance adaptation while preserving the efficiency benefits of low-rank methods.
Result: Experiments on the GLUE benchmark across diverse model architectures show consistent effectiveness and robustness. The method preserves LoRA’s efficiency while further enhancing performance without significantly increasing computational cost.
Conclusion: TLoRA+ represents an advancement in PEFT methods that improves upon LoRA by incorporating optimizer components into weight matrices, offering better performance while maintaining efficiency for fine-tuning large language models.
Abstract: Fine-tuning large language models (LLMs) aims to adapt pre-trained models to specific tasks using relatively small and domain-specific datasets. Among Parameter-Efficient Fine-Tuning (PEFT) methods, Low-Rank Adaptation (LoRA) stands out by matching the performance of full fine-tuning while avoiding additional inference latency. In this paper, we propose a novel PEFT method that incorporates the TLoRA+ optimizer into the weight matrices of pre-trained models. The proposed approach not only preserves the efficiency of low-rank adaptation but also further enhances performance without significantly increasing computational cost. We conduct experiments on the GLUE benchmark across diverse model architectures. Numerical experiments consistently demonstrate the effectiveness and robustness of our proposed method.
[37] Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints
Md. Fahad Ullah Utsho, Mohd. Ruhul Ameen, Akif Islam, Md. Golam Rashed, Dipankar Das
Main category: cs.CL
TL;DR: Paper introduces a benchmarking framework to evaluate reasoning robustness in Large Language Models under controlled complexity increases, revealing consistent “reasoning collapse” beyond task-specific thresholds.
Details
Motivation: Current LLM evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves with increasing task complexity. There's a need for systematic evaluation of reasoning robustness under controlled complexity progression.Method: Constructed a suite of nine classical reasoning tasks (Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, Rubik’s Cube) parameterized to precisely control complexity while preserving semantics. Used deterministic validators to evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes.
Result: Models show consistent phase transition behavior: high accuracy at low complexity but sharp degradation beyond task-specific thresholds (reasoning collapse). Accuracy declines often exceed 50%, with inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length doesn’t reliably improve correctness, and gains don’t generalize across problem families.
Conclusion: Current LLM reasoning capabilities are fragile and collapse under controlled complexity increases. Evaluation methodologies need to move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity progression.
Abstract: Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik’s Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase transition like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.
[38] From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning
Shihao Zhang, Ziwei Wang, Jie Zhou, Yulan Wu, Qin Chen, Zhikai Lei, Liyang Yu, Liang Dou, Liang He
Main category: cs.CL
TL;DR: ABSA-R1: A reinforcement learning framework that makes aspect-based sentiment analysis models generate natural language justifications before predictions, mimicking human reasoning processes.
Details
Motivation: Current ABSA systems are black boxes that lack explicit reasoning capabilities. Humans don't just categorize sentiment but construct causal explanations for their judgments. The paper aims to bridge this gap by making models explain their reasoning.Method: Proposes ABSA-R1 framework using reinforcement learning to generate natural language justifications before predictions. Introduces a Cognition-Aligned Reward Model to enforce consistency between reasoning and final labels, and a performance-driven rejection sampling strategy for hard cases.
Result: Experimental results on four benchmarks show that adding explicit reasoning capabilities improves both interpretability and performance in sentiment classification and triplet extraction compared to non-reasoning baselines.
Conclusion: The framework successfully mimics human “reason-before-predict” cognitive processes, enhancing both explainability and accuracy in aspect-based sentiment analysis tasks.
Abstract: While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as “black boxes,” lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this ``reason-before-predict" cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model’s internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.
[39] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal
Main category: cs.CL
TL;DR: MERRIN is a challenging benchmark for evaluating multimodal search agents’ ability to retrieve and reason over noisy web evidence across diverse modalities including audio and video.
Details
Motivation: To address the challenges of underspecified multi-hop search queries and multimodal, heterogeneous, often conflicting web results, creating a testbed for evaluating AI agents' multimodal evidence retrieval and reasoning capabilities.Method: Introduces MERRIN benchmark with natural language queries without explicit modality cues, incorporates underexplored modalities (video, audio), and requires retrieval of complex noisy/conflicting multimodal evidence. Evaluates 10 models across three search settings.
Result: Benchmark is highly challenging with average accuracy of 22.3% and best agent reaching only 40.1%. Stronger agents show modest gains due to over-exploration and distraction by conflicting content. Agents underperform humans despite consuming more resources.
Conclusion: MERRIN highlights the need for search agents with robust multimodal search and reasoning capabilities in noisy web environments, serving as a valuable testbed for evaluating such abilities.
Abstract: Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents’ ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.
[40] CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding
Ishani Mondal, Yiwen Song, Mihir Parmar, Palash Goyal, Jordan Boyd-Graber, Tomas Pfister, Yale Song
Main category: cs.CL
TL;DR: CANVAS is a multi-agent framework for generating visually continuous storyboards by enforcing character consistency, background persistence, and smooth scene transitions.
Details
Motivation: Existing generative models produce strong individual frames but fail to maintain visual continuity across shots in long-form storytelling, leading to inconsistent characters, backgrounds, and abrupt scene shifts.Method: CANVAS uses a multi-agent framework that explicitly plans visual continuity through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting.
Result: CANVAS outperforms baselines on ST-BENCH and ViStoryBench benchmarks, improving background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6%, and introduces a new challenging benchmark HardContinuityBench.
Conclusion: The CANVAS framework effectively addresses visual continuity challenges in long-form storytelling through explicit planning mechanisms, significantly improving narrative coherence across multiple shots.
Abstract: Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting We evaluate CANVAS on two storyboard generation benchmarks ST-BENCH and ViStoryBench and introduce a new challenging benchmark HardContinuityBench for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6% and props consistency by 7.6%.
[41] Using reasoning LLMs to extract SDOH events from clinical notes
Ertan Doganl, Kunyu Yu, Yifan Peng
Main category: cs.CL
TL;DR: LLM-based prompt engineering approach for extracting Social Determinants of Health from unstructured clinical notes achieves competitive performance with simpler implementation than BERT-based methods.
Details
Motivation: SDOH information is crucial for patient care but trapped in unstructured clinical notes; existing BERT-based NLP methods work but require complex implementation and heavy computational resources.Method: Four-module approach: 1) concise prompts with guidelines, 2) few-shot learning with curated examples, 3) self-consistency for robust outputs, 4) post-processing for quality control.
Result: Achieved micro-F1 score of 0.866, competitive with leading models, demonstrating LLMs with reasoning capabilities are effective for SDOH event extraction.
Conclusion: LLMs offer both implementation simplicity and strong performance for extracting structured SDOH from clinical text, making them practical solutions for healthcare applications.
Abstract: Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.
[42] ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding
Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li
Main category: cs.CL
TL;DR: ToolSpec accelerates LLM tool calling using schema-aware speculative decoding and retrieval of historical invocations, achieving up to 4.2x speedup.
Details
Motivation: Tool calling enables LLMs to interact with external applications, but multi-step, multi-turn interactions incur substantial latency that hinders real-time serving. Tool-calling traces are highly structured with constrained schemas and recurring patterns, presenting an opportunity for optimization.Method: ToolSpec uses schema-aware, retrieval-augmented speculative decoding. It exploits predefined tool schemas to generate accurate drafts using a finite-state machine that alternates between deterministic schema token filling and speculative generation for variable fields. Additionally, it retrieves similar historical tool invocations and reuses them as drafts to improve efficiency.
Result: Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.
Conclusion: ToolSpec presents an effective plug-and-play solution for accelerating LLM tool calling by leveraging structured schemas and historical patterns, addressing latency challenges in real-time LLM serving.
Abstract: Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.
[43] Synthesizing Instruction-Tuning Datasets with Contrastive Decoding
Tatsuya Ichinose, Youmi Ma, Masanari Oi, Ryuto Koike, Naoaki Okazaki
Main category: cs.CL
TL;DR: CoDIT: A method for instruction tuning that uses contrastive decoding between post-trained and pre-trained LLMs to disentangle instruction-following capabilities from pre-trained knowledge, improving instruction tuning effectiveness.
Details
Motivation: Existing instruction tuning approaches overlook that LLM-generated responses conflate world knowledge (from pre-training) with instruction-following capabilities (from post-training). The authors hypothesize that disentangling these improves instruction tuning effectiveness.Method: Proposes CoDIT (Contrastive Decoding for Instruction Tuning), which applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. This suppresses shared pre-trained knowledge while amplifying instruction-following behavior acquired via post-training.
Result: Models trained on CoDIT-generated datasets consistently outperform those trained on directly generated responses. They also outperform models trained on existing public instruction-tuning datasets across multiple benchmarks. CoDIT enables transfer of instruction-tuning capabilities across different model architectures.
Conclusion: Disentangling instruction-following capabilities from pre-trained knowledge improves instruction tuning. CoDIT effectively achieves this through contrastive decoding, producing better training data and enabling capability transfer across architectures.
Abstract: Using responses generated by high-performing large language models (LLMs) for instruction tuning has become a widely adopted approach. However, the existing literature overlooks a property of LLM-generated responses: they conflate world knowledge acquired during pre-training with instruction-following capabilities acquired during post-training. We hypothesize that disentangling the instruction-following capabilities from pre-trained knowledge improves the effectiveness of instruction tuning. To this end, we propose CoDIT, a method that applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. The method suppresses pre-trained knowledge shared between the two models while amplifying the instruction-following behavior acquired via post-training, resulting in responses that more purely reflect instruction-following capabilities. Experiment results demonstrate that models trained on datasets constructed via CoDIT consistently outperform those trained on directly generated responses. Training on our datasets also yields better performance than on existing publicly available instruction-tuning datasets across multiple benchmarks. Furthermore, we theoretically and empirically show that CoDIT can be interpreted as distilling the chat vector from parameter space to text space, enabling the transfer of instruction-tuning capabilities across models of different architectures.
[44] Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate
Cunda Wang, Ziying Ma, Po Hu, Weihua Wang, Feilong Bao
Main category: cs.CL
TL;DR: AgentEA is a reliable entity alignment framework using multi-agent debate to improve alignment decisions across knowledge graphs, with representation optimization and two-stage debate mechanisms.
Details
Motivation: Current LLM-based entity alignment methods suffer from unreliable candidate entity sets and limited reasoning capabilities, which critically affect alignment decision effectiveness.Method: Uses entity representation preference optimization to improve embeddings, then introduces two-stage multi-role debate: lightweight debate verification followed by deep debate alignment for progressive reliability enhancement.
Result: Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.
Conclusion: AgentEA provides a reliable entity alignment framework that enhances decision reliability through multi-agent debate mechanisms and representation optimization.
Abstract: Entity alignment (EA) aims to identify entities referring to the same real-world object across different knowledge graphs (KGs). Recent approaches based on large language models (LLMs) typically obtain entity embeddings through knowledge representation learning and use embedding similarity to identify an alignment-uncertain entity set. For each uncertain entity, a candidate entity set (CES) is then retrieved based on embedding similarity to support subsequent alignment reasoning and decision making. However, the reliability of the CES and the reasoning capability of LLMs critically affect the effectiveness of subsequent alignment decisions. To address this issue, we propose AgentEA, a reliable EA framework based on multi-agent debate. AgentEA first improves embedding quality through entity representation preference optimization, and then introduces a two-stage multi-role debate mechanism consisting of lightweight debate verification and deep debate alignment to progressively enhance the reliability of alignment decisions while enabling more efficient debate-based reasoning. Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.
[45] Training-Free Test-Time Contrastive Learning for Large Language Models
Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu
Main category: cs.CL
TL;DR: TF-TTCL is a training-free test-time adaptation framework that enables frozen LLMs to improve online by distilling supervision from their own inference experiences through an “Explore-Reflect-Steer” loop.
Details
Motivation: LLMs show strong reasoning but degrade under distribution shift. Existing TTA methods need gradient updates (white-box access) or substantial overhead, while training-free alternatives are static or need external guidance.Method: Three modules: 1) Semantic Query Augmentation diversifies problem views via multi-agent role-playing; 2) Contrastive Experience Distillation captures semantic gaps between superior/inferior trajectories into textual rules; 3) Contextual Rule Retrieval activates stored rules during inference to steer the frozen LLM.
Result: Extensive experiments on closed-ended reasoning and open-ended evaluation tasks show TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation.
Conclusion: TF-TTCL enables frozen LLMs to adapt online without training, improving robustness under distribution shift through self-supervised experience distillation.
Abstract: Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning TF-TTCL, a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic “Explore-Reflect-Steer” loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.
[46] YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference
You Wu, Ziheng Chen, Yizhen Zhang, Haoyi Wu, Chengting Yu, Yuchi Xu, Wenbo Su, Bo Zheng, Kewei Tu
Main category: cs.CL
TL;DR: YOCO++ enhances cross-layer KV compression for efficient LLM inference by adding weighted residual connections between bottom-half layers, improving performance while maintaining 50% KV cache compression.
Details
Motivation: Existing cross-layer KV compression methods for LLM inference reduce memory consumption but introduce performance degradation. The authors aim to enhance YOCO, a method that shares KVs of middle layer with top-half layers, to achieve better performance while maintaining efficiency.Method: Propose YOCO++ which incorporates weighted residual connections between the KVs of each bottom-half layer and the bottom layer. This increases model capacity while maintaining the same training and inference efficiency as YOCO.
Result: YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at 50% KV cache compression rate, outperforming both standard Transformer and original YOCO.
Conclusion: The enhanced YOCO++ method successfully improves performance of cross-layer KV compression while maintaining compression efficiency, making it a promising approach for efficient LLM inference.
Abstract: Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.
[47] MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
Jiahang Lin, Kai Hu, Binghai Wang, Yuhao Zhou, Zhiheng Xi, Honglin Guo, Shichun Liu, Junzhe Wang, Shihan Dou, Enyu Zhou, Hang Yan, Zhenhua Han, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: MM-Doc-R1 is a vision-aware agentic framework for long document visual question answering that uses iterative information discovery and Similarity-based Policy Optimization (SPO) to improve multi-turn reinforcement learning for better complex query handling.
Details
Motivation: Conventional RAG systems struggle with complex multi-hop queries over long documents due to single-pass retrieval limitations. There's a need for better approaches to handle visual question answering on long documents that require iterative information discovery and synthesis.Method: Proposes MM-Doc-R1 framework with agentic, vision-aware workflow for iterative information discovery. Introduces Similarity-based Policy Optimization (SPO) to address baseline estimation bias in multi-turn RL by calculating more precise baselines through similarity-weighted averaging of rewards across trajectories.
Result: MM-Doc-R1 outperforms previous baselines by 10.4% on MMLongbench-Doc benchmark. SPO boosts results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B compared to GRPO, demonstrating superior training performance.
Conclusion: The integrated framework with novel SPO training algorithm advances state-of-the-art for complex, long-document visual question answering by providing more stable and accurate learning signals for agents.
Abstract: Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
[48] BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks
Sebastian Nagl, Matthias Grabmair
Main category: cs.CL
TL;DR: BenGER is an open-source web platform for creating, annotating, running, and evaluating legal reasoning benchmarks for LLMs, with features for collaborative annotation and multi-tenant projects.
Details
Motivation: Current workflows for evaluating LLMs on legal reasoning are fragmented across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts.Method: Developed BenGER framework - an integrated web platform with task creation, collaborative annotation, configurable LLM runs, and evaluation using lexical, semantic, factual, and judge-based metrics.
Result: Created an open-source platform supporting multi-organization projects with tenant isolation, role-based access control, and optional formative feedback to annotators.
Conclusion: BenGER provides a unified solution for legal reasoning benchmark creation and evaluation, addressing fragmentation issues in current evaluation workflows.
Abstract: Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
[49] Foresight Optimization for Strategic Reasoning in Large Language Models
Jiashuo Wang, Jiawen Duan, Jian Wang, Kaitao Song, Chunpu Xu, Johnny K. W. Ho, Fenggang Yu, Wenjie Li, Johan F. Hoorn
Main category: cs.CL
TL;DR: FoPO enhances LLM strategic reasoning in multi-agent environments by integrating opponent modeling into policy optimization, enabling better foresight and decision-making.
Details
Motivation: Existing reasoning-based LLMs struggle with effective decision-making in multi-agent environments due to lack of explicit foresight modeling. Strategic reasoning is needed to anticipate counterpart behaviors and foresee future actions.Method: Foresight Policy Optimization (FoPO) integrates opponent modeling principles into policy optimization, enabling explicit consideration of both self-interest and counterpart influence. Uses two curated datasets (Cooperative RSA and Competitive Taboo) in a self-play framework.
Result: FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, outperforming standard LLM reasoning optimization baselines.
Conclusion: FoPO effectively enhances strategic reasoning in LLMs for multi-agent decision-making by incorporating foresight modeling through opponent-aware policy optimization.
Abstract: Reasoning capabilities in large language models (LLMs) have generally advanced significantly. However, it is still challenging for existing reasoning-based LLMs to perform effective decision-making abilities in multi-agent environments, due to the absence of explicit foresight modeling. To this end, strategic reasoning, the most fundamental capability to anticipate the counterpart’s behaviors and foresee its possible future actions, has been introduced to alleviate the above issues. Strategic reasoning is fundamental to effective decision-making in multi-agent environments, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self-interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well-designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self-play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.
[50] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
Akira Kawabata, Saku Sugawara
Main category: cs.CL
TL;DR: C2 framework improves reward models by training them to critically collaborate with rubric generators using only binary preferences, without costly rubric annotations.
Details
Motivation: Existing rubric-augmented verification methods require expensive rubric annotations and suffer from low-quality rubrics that mislead reward models rather than help them.Method: Proposes Cooperative yet Critical reward modeling (C2) where reward models critically collaborate with rubric generators trained solely from binary preferences. Synthesizes helpful/misleading rubric pairs by measuring rubric impact on reward model decisions, then trains cooperative rubric generator and critical verifier.
Result: C2 outperforms reasoning reward models with gains up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Enables 8B reward model to match performance of rubrics from 4× larger model without external annotations.
Conclusion: Eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy and scalable without requiring costly rubric annotations.
Abstract: Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4$\times$ larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
[51] Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues
Ahmet Tuğrul Bayrak, Mustafa Sertaç Türkel, Fatma Nur Korkmaz
Main category: cs.CL
TL;DR: Syn-TurnTurk: A synthetic Turkish dialogue dataset for turn-taking prediction using LLMs, achieving high accuracy with BI-LSTM and ensemble methods to improve voice chatbot timing.
Details
Motivation: Current voice chatbots rely on simple silence detection which fails due to irregular pauses in human speech, causing interruptions. This is especially problematic for languages like Turkish that lack quality turn-taking datasets.Method: Created Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using Qwen LLMs to mimic real conversations with overlaps and strategic silences. Evaluated with traditional and deep learning architectures including BI-LSTM and ensemble methods.
Result: Advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieved high accuracy (0.839) and AUC scores (0.910), demonstrating the synthetic dataset’s effectiveness for understanding linguistic cues.
Conclusion: Synthetic datasets can effectively train models for turn-taking prediction in low-resource languages like Turkish, enabling more natural human-machine interaction in voice-based chatbots.
Abstract: Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems usually rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interrupt users, breaking the conversational flow. This problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). These findings demonstrate that our synthetic dataset can have a positive affect for models understand linguistic cues, allowing for more natural human-machine interaction in Turkish.
[52] Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference
Xuwen Zhou, Fangxin Liu, Chao Wang, Xiao Zheng, Hao Zheng, Min He, Li Jiang, Haibing Guan
Main category: cs.CL
TL;DR: Calibrated Speculative Decoding (CSD) improves speculative decoding by recovering valid tokens discarded by standard verification, using frequency-guided candidate selection and probability-guarded acceptance to handle semantically correct but lexically divergent outputs.
Details
Motivation: Conventional speculative decoding frameworks suffer from frequent false rejections when draft models produce semantically correct but lexically divergent outputs, wasting computational resources and reducing efficiency.Method: CSD introduces two lightweight modules: Online Correction Memory (aggregates historical rejections to identify recurring divergence patterns) and Semantic Consistency Gating (verifies candidate admissibility using probability ratios instead of exact token matching).
Result: CSD outperforms existing methods with peak throughput speedup of 2.33x, preserves model accuracy across all tasks, and boosts performance on complex reasoning datasets.
Conclusion: CSD is a highly effective, lightweight solution for practical LLM deployments that addresses the false rejection problem in speculative decoding without requiring training.
Abstract: Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of “Frequency-Guided Candidate Selection and Probability-Guarded Acceptance,” CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
[53] IndicDB – Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar
Main category: cs.CL
TL;DR: IndicDB: A multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages using realistic administrative data from Indian open-data platforms.
Details
Motivation: Existing Text-to-SQL benchmarks focus on Western contexts and simplified schemas, leaving a gap for real-world, non-Western applications. There's a need for multilingual benchmarks that capture the complexity of administrative data in diverse linguistic contexts.Method: Created IndicDB using relational schemas from Indian open-data platforms (NDAP, IDP). Employed an iterative three-agent framework (Architect, Auditor, Refiner) to convert denormalized government data into rich relational structures. The pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages.
Result: Results show a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, increased structural ambiguity, and limited external knowledge. The benchmark comprises 20 databases across 237 tables with high relational density (11.85 tables per database).
Conclusion: IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL, highlighting performance disparities between English and Indic languages and providing a valuable resource for evaluating cross-lingual semantic parsing in real-world administrative contexts.
Abstract: While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/
[54] Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection
Xiao Pu, Zepeng Cheng, Lin Yuan, Yu Wu, Xiuli Bi
Main category: cs.CL
TL;DR: A framework for AI-generated text detection that disentangles detection semantics from generator-specific artifacts to improve generalization to unseen LLMs.
Details
Motivation: As LLMs produce increasingly human-like text, distinguishing AI-generated content becomes harder. Current methods rely on generator-specific artifacts that become unstable as new models emerge, making generalization to unseen generators a key challenge.Method: Progressive structured framework with: 1) compact latent encoding for semantic minimality, 2) perturbation-based regularization to reduce residual entanglement, and 3) discriminative adaptation to align representations with task objectives.
Result: Experiments on MAGE benchmark (20 LLMs across 7 categories) show up to 24.2% accuracy gain and 26.2% F1 improvement over SOTA. Performance improves with more diverse training generators, confirming scalability and generalization.
Conclusion: The framework effectively disentangles AI-detection semantics from generator artifacts, enabling robust generalization to unseen LLMs and showing strong scalability as training generator diversity increases.
Abstract: As large language models (LLMs) generate text that increasingly resembles human writing, the subtle cues that distinguish AI-generated content from human-written content become increasingly challenging to capture. Reliance on generator-specific artifacts is inherently unstable, since new models emerge rapidly and reduce the robustness of such shortcuts. This generalizes unseen generators as a central and challenging problem for AI-text detection. To tackle this challenge, we propose a progressively structured framework that disentangles AI-detection semantics from generator-aware artifacts. This is achieved through a compact latent encoding that encourages semantic minimality, followed by perturbation-based regularization to reduce residual entanglement, and finally a discriminative adaptation stage that aligns representations with task objectives. Experiments on MAGE benchmark, covering 20 representative LLMs across 7 categories, demonstrate consistent improvements over state-of-the-art methods, achieving up to 24.2% accuracy gain and 26.2% F1 improvement. Notably, performance continues to improve as the diversity of training generators increases, confirming strong scalability and generalization in open-set scenarios. Our source code will be publicly available at https://github.com/PuXiao06/DRGD.
[55] Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Models
Dhruv Sahnan, Subhabrata Dutta, Tanmoy Chakraborty, Preslav Nakov, Iryna Gurevych
Main category: cs.CL
TL;DR: Co-FactChecker: A human-AI collaborative claim verification framework where expert feedback guides model reasoning through trace-editing rather than dialogue.
Details
Motivation: LLMs/LRMs lack domain knowledge and contextual understanding for claim verification, creating a gap between expert-led and fully automated verification. Human-AI collaboration is needed but existing models are hard to calibrate to natural language feedback in multi-turn interactions.Method: Proposes Co-FactChecker framework with a new interaction paradigm treating model’s thinking trace as shared scratchpad. Translates expert feedback into trace-edits that modify the trace directly, avoiding dialogue-based interaction limitations.
Result: Theoretical results show trace-editing advantages over multi-turn dialogue. Automatic evaluations demonstrate Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations show preference over multi-turn dialogue with higher quality reasoning, verdicts, and more interpretable/useful thinking traces.
Conclusion: Co-FactChecker provides effective human-AI collaboration for claim verification through trace-editing, addressing limitations of dialogue-based approaches and improving model reasoning with expert guidance.
Abstract: Professional fact-checkers rely on domain knowledge and deep contextual understanding to verify claims. Large language models (LLMs) and large reasoning models (LRMs) lack such grounding and primarily reason from available evidence alone, creating a mismatch between expert-led and fully automated claim verification. To mitigate this gap, we posit human-AI collaboration as a more promising path forward, where expert feedback, grounded in real-world knowledge and domain expertise, guides the model’s reasoning. However, existing LRMs are hard to calibrate to natural language feedback, particularly in a multi-turn interaction setup. We propose Co-FactChecker, a framework for human-AI collaborative claim verification. We introduce a new interaction paradigm that treats the model’s thinking trace as a shared scratchpad. Co-FactChecker translates expert feedback into trace-edits that introduce targeted modifications to the trace, sidestepping the shortcomings of dialogue-based interaction. We provide theoretical results showing that trace-editing offers advantages over multi-turn dialogue, and our automatic evaluations demonstrate that Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations further show that Co-FactChecker is preferred over multi-turn dialogue, producing higher quality reasoning and verdicts along with relatively easier to interpret and more useful thinking traces.
[56] Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs
Sinan Kurtyigit, Sabine Schulte im Walde, Alexander Fraser
Main category: cs.CL
TL;DR: RoBERTa-based metaphor detection models generalize to unseen verbs primarily through contextual patterns rather than lexical memorization, with context alone matching full-model performance on held-out lemmas.
Details
Motivation: To determine whether strong benchmark performance in metaphor detection reflects genuine transferable generalization or merely lexical memorization of specific words.Method: Controlled lexical hold-out setup using RoBERTa with VU Amsterdam Metaphor Corpus, strictly excluding selected target verb lemmas during fine-tuning, then comparing predictions on held-out vs. exposed lemmas.
Result: Models maintain robust performance on held-out lemmas; sentence context alone matches full-model performance on held-out lemmas, while static verb embeddings are insufficient.
Conclusion: Generalization is primarily driven by learning transferable contextual patterns (“learning the cue”), with verb-specific memorization (“learning the word”) providing additive boost only when lexical exposure is available.
Abstract: Metaphor detection models achieve strong benchmark performance, yet it remains unclear whether this reflects transferable generalization or lexical memorization. To address this, we analyze generalization in metaphor detection through RoBERTa, the shared backbone of many state-of-the-art systems, focusing on English verbs using the VU Amsterdam Metaphor Corpus. We introduce a controlled lexical hold-out setup where all instances of selected target lemmas are strictly excluded from fine-tuning, and compare predictions on these Held-out lemmas against Exposed lemmas (verbs seen during fine-tuning). While the model performs best on Exposed lemmas, it maintains robust performance on Held-out lemmas. Further analysis reveals that sentence context alone is sufficient to match full-model performance on Held-out lemmas, whereas static verb-level embeddings are not. Together, these results suggest that generalization is primarily driven by “learning the cue” (transferable contextual patterns), while “learning the word” (verb-specific memorization) provides an additive boost when lexical exposure is available.
[57] An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail
Main category: cs.CL
TL;DR: Improving LLM-as-a-judge accuracy through task-specific criteria injection and ensemble scoring, achieving 83.6% accuracy on RewardBench 2 without fine-tuning.
Details
Motivation: LLM-as-a-judge is widely used for scalable evaluation in RLHF pipelines and benchmarking, but judgment reliability heavily depends on prompting and aggregation strategies. The paper aims to improve judge accuracy through practical, drop-in techniques without fine-tuning.Method: Empirical investigation of five techniques: 1) task-specific criteria injection, 2) ensemble scoring, 3) calibration context, 4) adaptive model escalation, and 5) soft blending. Focuses on GPT-5.4 models and evaluates on RewardBench 2.
Result: Task-specific criteria injection (+3.0pp) and ensemble scoring (+9.8pp) account for most gains. Combined they reach 83.6% accuracy (+11.9pp over 71.7% baseline). Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost.
Conclusion: Simple, practical techniques can significantly improve LLM judge accuracy without fine-tuning. Task-specific criteria and ensembling are most effective, making high-accuracy LLM judges accessible at low cost.
Abstract: LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any finetuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.
[58] Doc-V*:Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA
Yuanlei Zheng, Pei Fu, Hang Li, Ziyang Wang, Yuyi Zhang, Wenyu Ruan, Xiaojin Zhang, Zhongyu Wei, Zhenbo Luo, Jian Luan, Wei Chen, Xiang Bai
Main category: cs.CL
TL;DR: Doc-V* is an OCR-free agentic framework for multi-page document VQA that uses sequential evidence aggregation through active navigation and structured working memory.
Details
Motivation: Existing OCR-free methods for multi-page DocVQA face trade-offs between capacity and precision - end-to-end models scale poorly with long documents, while visual retrieval pipelines are brittle and passive.Method: Proposes Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. It begins with thumbnail overview, actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in structured working memory for grounded reasoning. Trained with imitation learning from expert trajectories and optimized with Group Relative Policy Optimization.
Result: Outperforms open-source baselines and approaches proprietary models across five benchmarks. Improves out-of-domain performance by up to 47.9% over RAG baseline. Shows effective evidence aggregation with selective attention rather than increased input pages.
Conclusion: Doc-V* provides an effective OCR-free agentic approach for multi-page document VQA that balances answer accuracy with evidence-seeking efficiency through active navigation and structured memory.
Abstract: Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-$V^$, an \textbf{OCR-free agentic} framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-$V^$ begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-$V^$ balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-$V^$ outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to \textbf{47.9%} over RAG baseline. Other results reveal effective evidence aggregation with selective attention, not increased input pages.
[59] MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
Zhijie Bao, Fangke Chen, Licheng Bao, Chenhui Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei
Main category: cs.CL
TL;DR: MedRCube introduces a comprehensive evaluation framework for multimodal large language models in medical imaging, featuring multidimensional fine-grained assessment and credibility evaluation to address limitations of existing coarse-grained metrics.
Details
Motivation: Existing evaluation practices for MLLMs in medical imaging report single or coarse-grained metrics that lack granularity for specialized clinical support and fail to assess reasoning reliability, creating a need for systematic evaluation aligned with real-world medical practice.Method: Proposes a paradigm shift toward multidimensional, fine-grained evaluation with a two-stage systematic construction pipeline, instantiated as MedRCube. Includes credibility evaluation subset to quantify reasoning credibility and identify shortcut behaviors.
Result: Benchmarked 33 MLLMs, with Lingshu-32B achieving top-tier performance. MedRCube exposed insights inaccessible under prior evaluation settings and revealed a significant positive association between shortcut behavior and diagnostic task performance.
Conclusion: MedRCube provides a comprehensive evaluation framework for MLLMs in medical imaging that addresses limitations of existing methods, revealing important insights about model behavior and raising concerns for clinically trustworthy deployment.
Abstract: The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the real-world medical imaging practice. Existing practices that report single or coarse-grained metrics are lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, \textit{Lingshu-32B} achieve top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.
[60] From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models
Wenxuan Li, Zhenfei Zhang, Mi Zhang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang
Main category: cs.CL
TL;DR: MAGE is a memory-graph guided erasure framework for LLM unlearning that uses minimal user anchors instead of forget sets, enabling corpus-free, auditable unlearning while preserving model utility.
Details
Motivation: Current machine unlearning methods rely on user-provided forget sets, which are difficult to audit, risk secondary leakage, and are vulnerable to malicious abuse. There's a need for a more practical, auditable approach that minimizes user input while effectively removing memorized content.Method: MAGE uses lightweight user anchors to identify target entities, probes the LLM to recover target-related memorization, organizes this into a weighted local memory graph, and synthesizes scoped supervision for unlearning. It’s model-agnostic, works with standard unlearning methods, and requires no access to the original training corpus.
Result: Experiments on TOFU and RWKU benchmarks show MAGE’s self-generated supervision achieves unlearning performance comparable to supervision with external references while preserving overall model utility.
Conclusion: MAGE enables a practical, auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora, addressing privacy and legal concerns about LLM memorization.
Abstract: Large language models (LLMs) may memorize sensitive or copyrighted content, raising significant privacy and legal concerns. While machine unlearning has emerged as a potential remedy, prevailing paradigms rely on user-provided forget sets, making unlearning requests difficult to audit and exposing systems to secondary leakage and malicious abuse. We propose MAGE, a Memory-grAph Guided Erasure framework for user-minimized, corpus-free unlearning. Given only a lightweight user anchor that identifies a target entity, MAGE probes the target LLM to recover target-related memorization, organizes it into a weighted local memory graph, and synthesizes scoped supervision for unlearning. MAGE is model-agnostic, can be plugged into standard unlearning methods, and requires no access to the original training corpus. Experiments on two benchmarks, TOFU and RWKU, demonstrate that MAGE’s self-generated supervision achieves effective unlearning performance comparable to supervision generated with external reference, while preserving overall utility. These results support a practical and auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora.
[61] QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs
Junlin Zhu, Baizhou Huang, Xiaojun Wan
Main category: cs.CL
TL;DR: QuantileMark: A multi-bit watermarking method for LLMs that ensures message symmetry by embedding messages in equal-probability bins of the cumulative distribution, preventing quality degradation or detection bias based on message content.
Details
Motivation: Current vocabulary-partition watermarks for LLMs break message symmetry in low-entropy contexts, causing some messages to have better quality/decoding than others. Need watermarking where message content doesn't systematically affect text quality or verification outcomes.Method: QuantileMark embeds messages within the continuous cumulative probability interval [0,1). At each generation step, partitions this interval into M equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring fixed 1/M probability budget regardless of context entropy. Detection reconstructs same partition under teacher forcing and computes posteriors over latent bins.
Result: Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over baselines with negligible impact on generation quality. Proves message-unbiasedness property ensuring base distribution recovery when averaging over messages.
Conclusion: QuantileMark provides theoretical foundation for generation-side symmetry while equal-mass design promotes uniform evidence strength across messages. Addresses key requirement of message symmetry in provider-internal LLM deployments.
Abstract: As large language models become standard backends for content generation, practical provenance increasingly requires multi-bit watermarking. In provider-internal deployments, a key requirement is message symmetry: the message itself should not systematically affect either text quality or verification outcomes. Vocabulary-partition watermarks can break message symmetry in low-entropy decoding: some messages are assigned most of the probability mass, while others are forced to use tail tokens. This makes embedding quality and message decoding accuracy message-dependent. We propose QuantileMark, a white-box multi-bit watermark that embeds messages within the continuous cumulative probability interval $[0, 1)$. At each step, QuantileMark partitions this interval into $M$ equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring a fixed $1/M$ probability budget regardless of context entropy. For detection, the verifier reconstructs the same partition under teacher forcing, computes posteriors over latent bins, and aggregates evidence for verification. We prove message-unbiasedness, a property ensuring that the base distribution is recovered when averaging over messages. This provides a theoretical foundation for generation-side symmetry, while the equal-mass design additionally promotes uniform evidence strength across messages on the detection side. Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over strong baselines, with negligible impact on generation quality. Our code is available at GitHub (https://github.com/zzzjunlin/QuantileMark).
[62] ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution
Shouzheng Huang, Meishan Zhang, Baotian Hu, Min Zhang
Main category: cs.CL
TL;DR: ToolOmni: A unified agentic framework for LLMs to use open-world tools via proactive retrieval and grounded execution within a reasoning loop, achieving state-of-the-art performance.
Details
Motivation: Existing methods for tool use in LLMs struggle in open-world scenarios with massive and evolving tool repositories, facing challenges in aligning user intent with tool semantics and generalizing to unseen tools, leading to suboptimal accuracy.Method: ToolOmni uses a unified agentic framework with proactive retrieval and grounded execution within a reasoning loop. It involves: 1) constructing a cold-start multi-turn interaction dataset for SFT, and 2) open-world tool learning using a Decoupled Multi-Objective GRPO algorithm that optimizes both tool retrieval accuracy and execution efficacy.
Result: ToolOmni achieves state-of-the-art performance in both retrieval and execution, surpassing strong baselines by +10.8% in end-to-end execution success rate, while showing exceptional robustness and generalization capabilities.
Conclusion: ToolOmni effectively addresses open-world tool use challenges by combining proactive retrieval with grounded execution in a reasoning loop, demonstrating superior performance and generalization compared to existing methods.
Abstract: Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods relying on static embedding retrieval or parameter memorization of tools struggle to align user intent with tool semantics or generalize to unseen tools, respectively, leading to suboptimal accuracy of open-world tool retrieval and execution. To address these, we present ToolOmni, a unified agentic framework that enables LLMs for open-world tool use by proactive retrieval and grounded execution within a reasoning loop. First, we construct a cold-start multi-turn interaction dataset to instill foundational agentic capabilities via Supervised Fine-Tuning (SFT). Then, we introduce open-world tool learning based on a Decoupled Multi-Objective GRPO algorithm, which simultaneously optimizes LLMs for both tool retrieval accuracy and execution efficacy in online environments. Extensive experiments demonstrate that ToolOmni achieves state-of-the-art performance both in retrieval and execution, surpassing strong baselines by a significant margin of +10.8% in end-to-end execution success rate, while exhibiting exceptional robustness and generalization capabilities.
[63] MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment
Zihao Liu, Hantao Zhou, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Peng Wang
Main category: cs.CL
TL;DR: MUSE is a multi-domain Chinese user simulation framework that generates human-like, controllable, and behaviorally consistent responses through iterative profile optimization, role-reversal fine-tuning, and rubric-guided reinforcement learning.
Details
Motivation: Existing user simulators often rely on shallow user profiling, struggle with persona consistency over long interactions, and are limited to English or single-domain settings, creating a need for more sophisticated multi-domain frameworks.Method: Three-stage approach: 1) Iterative Profile Self-Evolution (IPSE) optimizes user profiles by comparing simulated trajectories with real dialogue behaviors; 2) Role-Reversal Supervised Fine-Tuning improves local response realism; 3) Rubric-guided multi-turn reinforcement learning with a specialized reward model enhances long-horizon behavioral consistency.
Result: MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.
Conclusion: MUSE provides an effective framework for multi-domain Chinese user simulation that addresses key limitations of existing approaches, particularly in maintaining persona consistency and generating human-like responses across extended interactions.
Abstract: User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPSE), which gradually optimizes user profiles by comparing and reasoning discrepancies between simulated trajectories and real dialogue behaviors. We then apply Role-Reversal Supervised Fine-Tuning to improve local response realism and human-like expression. To enable fine-grained behavioral alignment, we further train a specialized rubric-based reward model and incorporate it into rubric-guided multi-turn reinforcement learning, which optimizes the simulator at the dialogue level and enhances long-horizon behavioral consistency. Experiments show that MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.
[64] Robust Reward Modeling for Large Language Models via Causal Decomposition
Yunsheng Lu, Zijiang Yang, Licheng Pan, Zhixuan Chu
Main category: cs.CL
TL;DR: A method to improve reward models for LLM alignment by learning a decoder that maps answers to latent intent embeddings, using reconstruction error to regularize training and reduce reliance on spurious cues like length and tone.
Details
Motivation: Reward models often overfit to spurious cues like response length and agreeable tone rather than properly grounding preferences in the prompt's intent, limiting their effectiveness in aligning LLMs.Method: Learn a decoder that maps candidate answers to latent intent embeddings of input prompts, using reconstruction error as a regularization signal during reward model training to emphasize prompt-dependent information and suppress prompt-independent shortcuts.
Result: The decoder selects shorter and less sycophantic candidates with 0.877 accuracy; incorporating this signal into RM training improves RewardBench accuracy from 0.832 to 0.868; improves length-controlled win rates while producing shorter outputs; remains robust to lengthening and mild off-topic drift.
Conclusion: The proposed intent reconstruction approach effectively regularizes reward models to focus on prompt-dependent information rather than spurious cues, improving alignment performance across math, helpfulness, and safety benchmarks.
Abstract: Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by penalizing or controlling specific artifacts, but it does not explicitly encourage the model to ground preferences in the prompt’s intent. We learn a decoder that maps a candidate answer to the latent intent embedding of the input. The reconstruction error is used as a signal to regularize the reward model training. We provide theoretical evidence that this signal emphasizes prompt-dependent information while suppressing prompt-independent shortcuts. Across math, helpfulness, and safety benchmarks, the decoder selects shorter and less sycophantic candidates with 0.877 accuracy. Incorporating this signal into RM training in Gemma-2-2B-it and Gemma-2-9B-it increases RewardBench accuracy from 0.832 to 0.868. For Best-of-N selection, our framework increases length-controlled win rates while producing shorter outputs, and remains robust to lengthening and mild off-topic drift in controlled rewrite tests.
[65] Beyond Static Personas: Situational Personality Steering for Large Language Models
Zesheng Wei, Mengxiang Li, Zilei Wang, Yang Deng
Main category: cs.CL
TL;DR: IRIS is a training-free framework for situational personality steering in LLMs using neuron-based identification, retrieval, and steering mechanisms.
Details
Motivation: Existing LLM personalization methods have limited controllability, high resource demands, and rely on static personality modeling that lacks adaptability across varying situations.Method: IRIS uses a three-step framework: 1) situational persona neuron identification, 2) situation-aware neuron retrieval, and 3) similarity-weighted steering, all without requiring training.
Result: IRIS outperforms best-performing baselines on PersonalityBench and the new SPBench, demonstrating generalization and robustness to complex, unseen situations and different model architectures.
Conclusion: The neuron-based IRIS framework enables effective situational personality steering in LLMs, addressing limitations of existing personalization methods.
Abstract: Personalized Large Language Models (LLMs) facilitate more natural, human-like interactions in human-centric applications. However, existing personalization methods are constrained by limited controllability and high resource demands. Furthermore, their reliance on static personality modeling restricts adaptability across varying situations. To address these limitations, we first demonstrate the existence of situation-dependency and consistent situation-behavior patterns within LLM personalities through a multi-perspective analysis of persona neurons. Building on these insights, we propose IRIS, a training-free, neuron-based Identify-Retrieve-Steer framework for advanced situational personality steering. Our approach comprises situational persona neuron identification, situation-aware neuron retrieval, and similarity-weighted steering. We empirically validate our framework on PersonalityBench and our newly introduced SPBench, a comprehensive situational personality benchmark. Experimental results show that our method surpasses best-performing baselines, demonstrating IRIS’s generalization and robustness to complex, unseen situations and different models architecture.
[66] Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
[67] Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs
Sasha Boguraev, Kyle Mahowald
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., “I know what he hates art and loves” vs. “I know what he looked down and saw”). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction “and” is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.
[68] How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop \textbf{\textsc{FinePhrase}}, a 486-billion-token open dataset of rephrased web text. We show that \textsc{FinePhrase} outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.
[69] Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs
Hussein Abdallah, Ibrahim Abdelaziz, Panos Kalnis, Essam Mansour
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM’s reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM-GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement. GitHub code and data are available.
[70] Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models
Aleksandr Rubashevskii, Dzianis Piatrashyn, Preslav Nakov, Maxim Panov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.
[71] Diffusion Language Models for Speech Recognition
Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
[72] Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model
Zhe Huang, Peng Wang, Yan Zheng, Sen Song, Longjun Cai
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Product bundling boosts e-commerce revenue by recommending complementary item combinations. However, existing methods face two critical challenges: (1) collaborative filtering approaches struggle with cold-start items owing to dependency on historical interactions, and (2) LLMs lack inherent capability to model interactive graph directly. To bridge this gap, we propose a dual-enhancement method that integrates interactive graph learning and LLM-based semantic understanding for product bundling. Our method introduces a graph-to-text paradigm, which leverages a Dynamic Concept Binding Mechanism (DCBM) to translate graph structures into natural language prompts. The DCBM plays a critical role in aligning domain-specific entities with LLM tokenization, enabling effective comprehension of combinatorial constraints. Experiments on three benchmarks (POG, POG_dense, Steam) demonstrate 6.3%-26.5% improvements over state-of-the-art baselines.
[73] From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution
Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to the imbalance in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are often unusable in future inference. By modifying the BPE objective and introducing merge skipping, we implement different techniques under the name Source-Attributed BPE (SA-BPE) to regularize BPE training and minimize overfitting, thereby substantially reducing the number of under-trained tokens while maintaining the same inference procedure as with regular BPE. This provides an effective tool suitable for production use.
[74] From Weights to Activations: Is Steering the Next Frontier of Adaptation?
Simon Ostermann, Daniil Gurgurov, Tanja Baeumel, Michael A. Hedderich, Sebastian Lapuschkin, Wojciech Samek, Vera Schmitt
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.
[75] Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies
Swati Rallapalli, Shannon Gallagher, Ronald Yurko, Tyler Brooks, Chuck Loughin, Michele Sezgin, Violet Turri
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber’s set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space, and finally, model has a larger effect on the style than decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.
[76] Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis
Zipeng Ling, Shuliang Liu, Shenghong Fu, Yuehao Tang, Seonil Son, Yao Wan, Xuming Hu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: LLM reasoning traces suffer from complex flaws – Step Internal Flaws (logical errors, hallucinations, etc.) and Step-wise Flaws (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs’ reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the consensus parts of multiple candidate traces, and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by 10+% on average, and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation proves that our method also improves the quality of LLMs’ reasoning traces in multiple dimensions.
[77] Rhetorical Questions in LLM Representations: A Linear Probing Study
Louie Hong Yao, Vishesh Anand, Yuan Zhuang, Tianyu Jiang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.
[78] From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating LLMs is challenging, as benchmark scores often fail to capture models’ real-world usefulness. Instead, users often rely on ``vibe-testing’’: informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
[79] Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Zhe Hu, Tuo Liang, Jing Li, Yiren Lu, Yunlai Zhou, Yiran Qiao, Jing Ma, Yu Yin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI’s capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.
[80] Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
Hasin Jawad Ali, Ajwad Abrar, S. M. Hozaifa Hossain, M. Firoz Mridha
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In politically sensitive scenarios like wars, social media serves as a platform for polarized discourse and expressions of strong ideological stances. While prior studies have explored ideological stance detection in general contexts, limited attention has been given to conflict-specific settings. This study addresses this gap by analyzing 9,969 Reddit comments related to the Israel-Palestine conflict, collected between October 2023 and August 2024. The comments were categorized into three stance classes: Pro-Israel, Pro-Palestine, and Neutral. Various approaches, including machine learning, pre-trained language models, neural networks, and prompt engineering strategies for open source large language models (LLMs), were employed to classify these stances. Performance was assessed using metrics such as accuracy, precision, recall, and F1-score. Among the tested methods, the Scoring and Reflective Re-read prompt in Mixtral 8x7B demonstrated the highest performance across all metrics. This study provides comparative insights into the effectiveness of different models for detecting ideological stances in highly polarized social media contexts. The dataset used in this research is publicly available for further exploration and validation.
[81] A closer look at how large language models trust humans: patterns and biases
Valeria Lerman, Yaniv Dover
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern. While considerable literature studies how humans trust AI agents, it is much less understood how LLM-based agents develop effective trust in humans. LLM-based agents likely rely on some sort of implicit effective trust in trust-related contexts (e.g., evaluating individual loan applications) to assist and affect decision making. Using established behavioral theories, we develop an approach that studies whether LLMs trust depends on the three major trustworthiness dimensions: competence, benevolence and integrity of the human subject. We also study how demographic variables affect effective trust. Across 43,200 simulated experiments, for five popular language models, across five different scenarios we find that LLM trust development shows an overall similarity to human trust development. We find that in most, but not all cases, LLM trust is strongly predicted by trustworthiness, and in some cases also biased by age, religion and gender, especially in financial scenarios. This is particularly true for scenarios common in the literature and for newer models. While the overall patterns align with human-like mechanisms of effective trust formation, different models exhibit variation in how they estimate trust; in some cases, trustworthiness and demographic factors are weak predictors of effective trust. These findings call for a better understanding of AI-to-human trust dynamics and monitoring of biases and trust development patterns to prevent unintended and potentially harmful outcomes in trust-sensitive applications of AI.
[82] MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Instruction following refers to the ability of large language models (LLMs) to generate outputs that satisfy all specified constraints. Existing research has primarily focused on constraint categories, offering limited evaluation dimensions and little guidance for improving instruction-following abilities. To address this gap, we introduce MulDimIF, a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Based on this framework, we design a controllable instruction generation pipeline. Through constraint expansion, conflict detection, and instruction rewriting, we construct 9,106 code-verifiable samples. We evaluate 18 LLMs from six model families and find marked performance differences across constraint settings. For instance, average accuracy decreases from 80.82% at Level I to 36.76% at Level IV. Moreover, training with data generated by our framework significantly improves instruction following without compromising general performance. In-depth analysis indicates that these gains stem largely from parameter updates in attention modules, which strengthen constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.
[83] Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
Yingji Zhang, Danilo S. Carvalho, André Freitas
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as \textit{semantic representation learning}. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures-Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE)-and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.
[84] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Effective tool use is essential for large language models (LLMs) to interact with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models. Code and data are available at https://github.com/bytedance/FTRL.
[85] MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference
Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity, leaving the complex interaction between multi-step inference and layered ambiguity underexplored. In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. Our experiments reveal that even state-of-the-art models struggle with MARCH, confirming that combining ambiguity resolution with multi-step reasoning is a significant challenge. To address this, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning, significantly outperforms existing approaches, and paves the way for robust reasoning systems.
[86] Native Hybrid Attention for Efficient Sequence Modeling
Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra & inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
[87] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
[88] Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation
Linfeng Gao, Qinggang Zhang, Baolong Bi, Bo Zeng, Zheng Yuan, Zerui Chen, Zhimin Wei, Shenghua Liu, Linlong Xu, Longyue Wang, Weihua Luo, Jinsong Su
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Retrieval-Augmented Generation (RAG) systems often fail to maintain contextual faithfulness, generating responses that conflict with the provided context or fail to fully leverage the provided evidence. Existing methods attempt to improve faithfulness through external interventions, such as specialized prompting, decoding-based calibration, or preference optimization. However, since these approaches treat the LLM as a black box, they lack a reliable mechanism to assess when and why knowledge conflicts occur. Consequently, they tend to be brittle, data-intensive, and agnostic to the model’s internal reasoning process. In this paper, we move beyond black-box interventions to analyze the model’s internal reasoning process. We discover that conflicting and aligned knowledge states are linearly separable in the model’s latent space, and contextual noise systematically increases the entropy of these representations. Based on these findings, we propose ProbeRAG, a novel framework for faithful RAG that operates in three stages: (i) fine-grained knowledge pruning to filter irrelevant context, (ii) latent conflict probing to identify hard conflicts in the model’s latent space, and (iii) conflict-aware attention to modulate attention heads toward faithful context integration. Extensive experiments demonstrate that ProbeRAG substantially improves both accuracy and contextual faithfulness. The related resources are available at https://github.com/LinfengGao/ProbeRAG.
[89] Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution
Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, Hai Zhao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.10696: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10696&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[90] Language steering in latent space to mitigate unintended code-switching
Andrey Goncharov, Nikolai Kondusov, Alexey Zaytsev
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.13849: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13849&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[91] Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning
Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang, Zikai Song
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.02902: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02902&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[92] ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian
Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.01619: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01619&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[93] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li, Yonghua Lin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.11334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[94] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.21760: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21760&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[95] TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang, Aviral Kumar
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.10245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[96] Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.04844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[97] Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning
Sindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq, Yani Ioannou, Shaina Raza
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03027: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03027&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[98] Exposía: Teaching and Assessment of Academic Writing Skills for Research Project Proposals and Peer Feedback
Dennis Zyska, Alla Rozovskaya, Ilia Kuznetsov, Iryna Gurevych
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.06536: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06536&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[99] H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration
Jun-Min Lee, Meong Hi Son, Edward Choi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.05407: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05407&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[100] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
Wen Luo, Guangyue Peng, Wei Li, Shaohang Wei, Feifan Song, Liang Wang, Nan Yang, Xingxing Zhang, Jing Jin, Furu Wei, Houfeng Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.07422: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07422&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[101] ExpSeek: Self-Triggered Experience Seeking for Web Agents
Wenyuan Zhang, Xinghua Zhang, Haiyang Yu, Shuaiyi Nie, Bingli Wu, Juwei Yue, Tingwen Liu, Yongbin Li
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.08605: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08605&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[102] F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch, Tsz Kin Lam
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.11329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[103] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models
Guoming Ling, Zhongzhan Huang, Yupei Lin, Junxin Li, Shanshan Zhong, Hefeng Wu, Liang Lin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.11340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[104] Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning
Fengran Mo, Yifan Gao, Sha Li, Hansi Zeng, Xin Liu, Zhaoxuan Tan, Xian Li, Jianshu Chen, Dakuo Wang, Meng Jiang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.13115: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13115&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[105] Common to Whom? Regional Cultural Commonsense and LLM Bias in India
Sangmitra Madhusudan, Trush Shashank More, Steph Buongiorno, Renata Dividino, Jad Kabbara, Ali Emami
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.15550: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15550&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[106] Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs
Corentin Kervadec, Iuliia Lysova, Marco Baroni, Gemma Boleda
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.22795: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22795&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[107] When ‘YES’ Meets ‘BUT’: Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?
Tuo Liang, Zhe Hu, Jing Li, Hao Zhang, Yiren Lu, Yunlai Zhou, Yiran Qiao, Disheng Liu, Jeirui Peng, Jing Ma, Yu Yin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2503.23137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[108] IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.07709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[109] Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen
James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.24119: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24119&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[110] Just Use XML: Revisiting Joint Translation and Label Projection
Thennal DK, Chris Biemann, Hans Ole Hatzel
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.12021: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12021&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[111] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency
Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Hwajung Hong, Edward Choi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.25620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[112] Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa
George Boateng, Samuel Boateng, Victor Kumbol
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.29159: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29159&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[113] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions
Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han, Junfeng Luo, Yurui Qiu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.18847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[114] Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
Yihong Dong, Jianha Xiao, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[115] RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World
Hanbing Liu, Lang Cao, Yang Li
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.05096: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05096&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[116] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.06484: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06484&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[117] Rag Performance Prediction for Question Answering
Or Dado, David Carmel, Oren Kurland
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.07985: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07985&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[118] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation
Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08046: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08046&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[119] Deep Learning Based Amharic Chatbot for FAQs in Universities
Goitom Ybrah Hailu, Hadush Hailu, Shishay Welay
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2402.01720: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.01720&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[120] Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering
Rrubaa Panchendrarajan, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[121] Think in Sentences: Explicit Sentence Boundaries Enhance Language Model’s Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10135: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10135&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[122] Two-Stage Regularization-Based Structured Pruning for LLMs
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.18232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[123] How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts
Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11133: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11133&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[124] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[125] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[126] Activation-Guided Local Editing for Jailbreaking Attacks
Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.00555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.00555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[127] Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration
Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12843: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12843&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[128] CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation
Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2504.21751: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.21751&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[129] Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.08125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[130] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.14232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[131] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.04442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[132] Coherence in the brain unfolds across separable temporal regimes
Davide Staub, Finn Rabe, Akhil Misra, Yves Pauli, Roya Hüppi, Ni Yang, Nils Lang, Lars Michels, Victoria Edkins, Sascha Frühholz, Iris Sommer, Wolfram Hinzen, Philipp Homan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.20481: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20481&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[133] Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models
Antoine Edy, Max Conti, Quentin Macé
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.26259: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26259&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[134] ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, Florian Scheidegger, Steven I. Ross, Daniel Karl I. Weidele, Hang Hua, Ekaterina Arutyunova, Roei Herzig, Zexue He, Zihan Wang, Xinyue Yu, Yunfei Zhao, Sicong Jiang, Minghao Liu, Qunshu Lin, Peter Staar, Luis Lastras, Aude Oliva, Rogerio Feris
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.27064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[135] VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors
Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02486: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02486&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[136] A Lightweight Multi-Metric No-Reference Image Quality Assessment Framework for UAV Imaging
Koffi Titus Sergio Aglin, Anthony K. Muchiri, Celestin Nkundineza
Main category: cs.CV
TL;DR: MM-IQA is a lightweight no-reference image quality assessment framework that combines multiple interpretable distortion cues (blur, edges, resolution artifacts, exposure, noise, haze, frequency) into a single quality score.
Details
Motivation: Need for reliable image quality assessment in automated image acquisition systems where pristine reference images are unavailable, requiring efficient no-reference IQA methods for filtering large volumes of images before analysis.Method: Multi-metric framework combining interpretable distortion cues: blur, edge structure, low resolution artifacts, exposure imbalance, noise, haze, and frequency content. Uses Python/OpenCV implementation with modest memory requirements storing only limited intermediate representations.
Result: Achieved SRCC values of 0.647-0.830 on five benchmark datasets (KonIQ-10k, LIVE Challenge, KADID-10k, TID2013, BIQ2021). Consistent performance on synthetic agricultural dataset. Processing time ~1.97s per image with linear memory scaling.
Conclusion: MM-IQA enables fast image quality screening with explicit distortion-aware cues and modest computational cost, suitable for practical applications requiring efficient no-reference quality assessment.
Abstract: Reliable image quality assessment is essential in applications where large volumes of images are acquired automatically and must be filtered before further analysis. In many practical scenarios, a pristine reference image is unavailable, making no reference image quality assessment (NR-IQA) particularly important. This paper introduces Multi-Metric Image Quality Assessment (MM-IQA), a lightweight multi-metric framework for NR-IQA. It combines interpretable cues related to blur, edge structure, low resolution artifacts, exposure imbalance, noise, haze, and frequency content to produce a single quality score in the range [0,100].MM-IQA was evaluated on five benchmark datasets (KonIQ-10k, LIVE Challenge, KADID-10k, TID2013, and BIQ2021) and achieved SRCC values ranging from 0.647 to 0.830. Additional experiments on a synthetic agricultural dataset showed consistent behavior of the designed cues. The Python/OpenCV implementation required about 1.97 s per image. This method also has modest memory requirements because it stores only a limited number of intermediate grayscale, filtered, and frequency-domain representations, resulting in memory usage that scales linearly with image size. The results show that MM-IQA can be used for fast image quality screening with explicit distortion aware cues and modest computational cost.
[137] SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation
Shivam Chand Kaushik
Main category: cs.CV
TL;DR: SemiFA is an agentic multi-modal framework that autonomously generates structured semiconductor failure analysis reports from inspection images in under one minute using a four-agent pipeline with vision-language models and equipment telemetry fusion.
Details
Motivation: Semiconductor failure analysis currently requires hours of expert time per case involving manual examination of inspection images, correlation of equipment telemetry, consultation of historical records, and report writing. There's a need to automate this time-consuming process.Method: Four-agent LangGraph pipeline: 1) DefectDescriber using DINOv2 and LLaVA-1.6 for defect classification and morphology narration, 2) RootCauseAnalyzer fusing SECS/GEM equipment telemetry with historical defect retrieval from Qdrant vector database, 3) SeverityClassifier for severity assignment and yield impact estimation, 4) RecipeAdvisor for corrective process adjustments, plus a fifth node for PDF report assembly.
Result: DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917). Full pipeline generates complete FA reports in 48 seconds on NVIDIA A100. Multi-modal fusion improves root cause reasoning by +0.86 composite points over image-only baseline, with equipment telemetry being the more load-bearing modality.
Conclusion: SemiFA successfully automates semiconductor failure analysis report generation, integrating SECS/GEM equipment telemetry into a vision-language model pipeline for the first time, significantly reducing analysis time from hours to under one minute while maintaining high accuracy.
Abstract: Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four-agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA-1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA-930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM-811K, and MixedWM38. Our DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100-SXM4-40 GB GPU. A GPT-4o judge ablation across four modality conditions demonstrates that multi-modal fusion improves root cause reasoning by +0.86 composite points (1-5 scale) over an image-only baseline, with equipment telemetry as the more load-bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision-language model pipeline for autonomous FA report generation.
[138] Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models
Shreyansh Pathak, Jyotishman Das
Main category: cs.CV
TL;DR: GPPU is a unified, scalable algorithm for class-level unlearning that works across vision and audio models using graph propagation to identify class-specific directions and orthogonal projection for efficient information removal.
Details
Motivation: The need for selective and efficient erasure of learned information from deep neural networks is important for privacy, regulatory compliance, and adaptive system design, requiring a principled approach to machine unlearning.Method: GPPU uses graph-based propagation to identify class-specific directions in feature space, projects representations onto orthogonal subspaces, and performs targeted fine-tuning to ensure effective and irreversible removal of target class information.
Result: GPPU achieves 10-20x speedups over prior methodologies while preserving model utility on retained classes, demonstrated through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks across various architectures.
Conclusion: GPPU provides a principled, modality-agnostic approach to machine unlearning at a scale not previously explored, contributing to more efficient and responsible deep learning systems.
Abstract: The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, regulatory compliance, and adaptive system design. We introduce Graph-Propagated Projection Unlearning (GPPU), a unified and scalable algorithm for class-level unlearning that operates across both vision and audio models. GPPU employs graph-based propagation to identify class-specific directions in the feature space and projects representations onto the orthogonal subspace, followed by targeted fine-tuning, to ensure that target class information is effectively and irreversibly removed. Through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks spanning a variety of architectures including CNNs, Vision Transformers, and Audio Transformers, we demonstrate that GPPU achieves highly efficient unlearning, realizing 10-20x speedups over prior methodologies while preserving model utility on retained classes. Our framework provides a principled and modality-agnostic approach to machine unlearning, evaluated at a scale that has received limited attention in prior work, contributing toward more efficient and responsible deep learning.
[139] DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery
Yann V. Bellec
Main category: cs.CV
TL;DR: DroneScan-YOLO improves aerial object detection for UAV imagery by addressing tiny object detection challenges through increased resolution, dynamic filter pruning, a lightweight stride-4 detection branch, and a hybrid loss function.
Details
Motivation: Aerial object detection in UAV imagery faces challenges with tiny objects, adverse conditions, and computational constraints. Standard YOLO detectors fail due to minimum stride limitations, gradient issues for non-overlapping tiny boxes, and filter redundancy.Method: Four coordinated design choices: (1) increased 1280x1280 input resolution, (2) RPA-Block dynamic filter pruning with lazy cosine-similarity updates, (3) MSFD lightweight P2 detection branch at stride 4, and (4) SAL-NWD hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting.
Result: Achieves 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET, outperforming YOLOv8s by +16.6 and +12.3 points respectively. Improves recall from 0.374 to 0.518, maintains 96.7 FPS with only +4.1% parameters. Significant gains on tiny objects: bicycle AP@50 improves 187%, awning-tricycle improves 52%.
Conclusion: DroneScan-YOLO provides a holistic solution for aerial object detection that effectively addresses tiny object detection challenges while maintaining computational efficiency, significantly outperforming baseline methods.
Abstract: Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8’s TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).
[140] PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction
Prajas Wadekar, Venkata Sai Pranav Bachina, Kunal Bhosikar, Ankit Gangwal, Charu Sharma
Main category: cs.CV
TL;DR: PatchPoison: A dataset-poisoning method using small adversarial patches to prevent unauthorized 3D reconstruction from multi-view images by corrupting SfM feature matching.
Details
Motivation: 3D Gaussian Splatting enables photorealistic 3D reconstruction from casually captured images, raising privacy concerns about unauthorized reconstruction of scenes/objects without consent. Need a practical protection method.Method: Inject small high-frequency adversarial patches (structured checkerboard patterns) into periphery of each image in multi-view datasets. Patches corrupt feature-matching in SfM pipelines like COLMAP by introducing spurious correspondences that misalign camera poses, causing downstream 3DGS optimization to diverge.
Result: On NeRF-Synthetic benchmark, inserting 12×12 pixel patches increases reconstruction error by 6.8× in LPIPS metric. Poisoned images remain unobtrusive to human viewers while effectively preventing 3D reconstruction.
Conclusion: PatchPoison offers lightweight, practical protection against unauthorized 3D reconstruction without requiring pipeline modifications, serving as a “drop-in” preprocessing step for content creators.
Abstract: 3D Gaussian Splatting (3DGS) has recently enabled highly photorealistic 3D reconstruction from casually captured multi-view images. However, this accessibility raises a privacy concern: publicly available images or videos can be exploited to reconstruct detailed 3D models of scenes or objects without the owner’s consent. We present PatchPoison, a lightweight dataset-poisoning method that prevents unauthorized 3D reconstruction. Unlike global perturbations, PatchPoison injects a small high-frequency adversarial patch, a structured checkerboard, into the periphery of each image in a multi-view dataset. The patch is designed to corrupt the feature-matching stage of Structure-from-Motion (SfM) pipelines such as COLMAP by introducing spurious correspondences that systematically misalign estimated camera poses. Consequently, downstream 3DGS optimization diverges from the correct scene geometry. On the NeRF-Synthetic benchmark, inserting a 12 X 12 pixel patch increases reconstruction error by 6.8x in LPIPS, while the poisoned images remain unobtrusive to human viewers. PatchPoison requires no pipeline modifications, offering a practical, “drop-in” preprocessing step for content creators to protect their multi-view data.
[141] 3DRealHead: Few-Shot Detailed Head Avatar
Jalees Nehvi, Timo Bolkart, Thabo Beeler, Justus Thies
Main category: cs.CV
TL;DR: 3DRealHead: A few-shot 3D head avatar reconstruction method that uses a Style U-Net to generate 3D Gaussian primitives from few images, with novel expression control combining 3DMM signals and mouth region features from monocular video for higher expressivity.
Details
Motivation: Current 3D head avatar methods struggle to faithfully reproduce identity and facial expressions, especially for person-specific features like mouth and teeth. Existing methods rely on limited training data and 3DMM-based expression control that restricts expressivity, failing to capture the full diversity of human appearances and detailed facial expressions needed for immersive applications.Method: Proposes 3DRealHead with: 1) Few-shot inversion process of a 3D human head prior represented as a Style U-Net that emits 3D Gaussian primitives, learned on NeRSemble dataset; 2) Novel expression control combining 3DMM-based facial expression signals with mouth region features extracted from driving monocular video; 3) Enables avatar reconstruction from few pictures and driving with consumer webcam.
Result: The method achieves higher expressivity and closer resemblance to physical reality by recovering facial expressions that cannot be represented by 3DMM alone. The few-shot approach allows avatar creation from limited input images while maintaining detailed person-specific features.
Conclusion: 3DRealHead addresses limitations of current 3D head avatar methods by combining learned priors with novel expression control signals, enabling faithful reproduction of identity and detailed facial expressions from few-shot inputs for immersive applications.
Abstract: The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the users idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce the identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearances, especially, for regions with highly person-specific features, like the mouth and teeth region is challenging as the underlying training data is limited. In addition, many of the avatar methods are purely relying on 3D morphable model-based expression control which strongly limits expressivity. To address these challenges, we are introducing 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar and drive it with a consumer-level webcam. The avatar reconstruction is enabled via a novel few-shot inversion process of a 3D human head prior which is represented as a Style U-Net that emits 3D Gaussian primitives which can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals, as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM leading to a higher expressivity and closer resemblance to the physical reality.
[142] GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization
Hongyang Zhang, Yinhao Liu, Haitao Zhang, Zhongyi Wen, Shuxian Liang, Xiansheng Hua
Main category: cs.CV
TL;DR: GeoLink: A 3D-aware semantic-consistent framework for generalizable cross-view geo-localization that uses 3D scene reconstruction to improve 2D representation learning and enhance generalization to unseen domains.
Details
Motivation: Cross-view geo-localization faces challenges with semantic inconsistency due to viewpoint variation and poor generalization under domain shift. Existing 2D correspondence methods are easily distracted by redundant shared information across views, leading to less transferable representations.Method: Offline reconstruction of scene point clouds from multi-view drone images using VGGT to provide stable structural priors. Two complementary improvements: 1) Geometric-aware Semantic Refinement module mitigates redundant and view-biased dependencies in 2D features under 3D guidance; 2) Unified View Relation Distillation module transfers 3D structural relations to 2D features while preserving 2D-only inference pipeline.
Result: Extensive experiments on multiple benchmarks show GeoLink consistently outperforms state-of-the-art methods and achieves superior generalization across unseen domains and diverse weather environments.
Conclusion: The proposed 3D-aware framework effectively addresses semantic inconsistency and generalization challenges in cross-view geo-localization by leveraging 3D structural priors to enhance 2D representation learning.
Abstract: Generalizable cross-view geo-localization aims to match the same location across views in unseen regions and conditions without GPS supervision. Its core difficulty lies in severe semantic inconsistency caused by viewpoint variation and poor generalization under domain shift. Existing methods mainly rely on 2D correspondence, but they are easily distracted by redundant shared information across views, leading to less transferable representations. To address this, we propose GeoLink, a 3D-aware semantic-consistent framework for Generalizable cross-view geo-localization. Specifically, we offline reconstruct scene point clouds from multi-view drone images using VGGT, providing stable structural priors. Based on these 3D anchors, we improve 2D representation learning in two complementary ways. A Geometric-aware Semantic Refinement module mitigates potentially redundant and view-biased dependencies in 2D features under 3D guidance. In addition, a Unified View Relation Distillation module transfers 3D structural relations to 2D features, improving cross-view alignment while preserving a 2D-only inference pipeline. Extensive experiments on multiple benchmarks show that GeoLink consistently outperforms state-of-the-art methods and achieves superior generalization across unseen domains and diverse weather environments.
[143] Towards Patient-Specific Deformable Registration in Laparoscopic Surgery
Alberto Neri, Veronica Penza, Nazim Haouchine, Leonardo S. Mattos
Main category: cs.CV
TL;DR: First patient-specific non-rigid point cloud registration method for surgical 3D model alignment using Transformer architecture and physics-based registration to handle organ deformations and noise.
Details
Motivation: Unsafe surgical care due to limitations in surgeon experience and situational awareness; need for reliable registration of patient-specific 3D models to enhance visualization and reduce complications, but challenged by organ deformations and noise between preoperative and intraoperative surfaces.Method: Patient-specific non-rigid point cloud registration combining Transformer encoder-decoder architecture with overlap estimation and matching module for dense correspondence prediction, followed by physics-based registration algorithm.
Result: Significantly outperforms traditional agnostic approaches, achieving 45% Matching Score with 92% Inlier Ratio on synthetic data, demonstrating effectiveness on both synthetic and real data.
Conclusion: Patient-specific registration method shows potential to improve surgical care by enabling reliable 3D model integration despite organ deformations and noise.
Abstract: Unsafe surgical care is a critical health concern, often linked to limitations in surgeon experience, skills, and situational awareness. Integrating patient-specific 3D models into the surgical field can enhance visualization, provide real-time anatomical guidance, and reduce intraoperative complications. However, reliably registering these models in general surgery remains challenging due to mismatches between preoperative and intraoperative organ surfaces, such as deformations and noise. To overcome these challenges, we introduce the first patient-specific non-rigid point cloud registration method, which leverages a novel data generation strategy to optimize outcomes for individual patients. Our approach combines a Transformer encoder-decoder architecture with overlap estimation and a dedicated matching module to predict dense correspondences, followed by a physics-based algorithm for registration. Experimental results on both synthetic and real data demonstrate that our patient-specific method significantly outperforms traditional agnostic approaches, achieving 45% Matching Score with 92% Inlier Ratio on synthetic data, highlighting its potential to improve surgical care.
[144] OneHOI: Unifying Human-Object Interaction Generation and Editing
Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan
Main category: cs.CV
TL;DR: OneHOI is a unified diffusion transformer framework that consolidates Human-Object Interaction generation and editing into a single conditional denoising process using structured interaction representations.
Details
Motivation: Existing HOI approaches are disjoint: HOI generation synthesizes scenes from structured triplets but fails to integrate mixed conditions, while HOI editing modifies interactions via text but struggles to decouple pose from physical contact and scale to multiple interactions.Method: Introduces Relational Diffusion Transformer (R-DiT) with role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on HOI-Edit-44K dataset.
Result: Achieves state-of-the-art results across both HOI generation and editing, supporting layout-guided, layout-free, arbitrary-mask, and mixed-condition control.
Conclusion: OneHOI provides a unified framework that effectively addresses limitations of existing HOI approaches by consolidating generation and editing capabilities through structured interaction representations.
Abstract: Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
[145] Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)
Nahid Khoshk Angabini, Mohsen Tajgardan, Mahesh Madhavan, Zahra Asghari Varzaneh, Reza Khoshkangini, Thomas Ebner
Main category: cs.CV
TL;DR: A multitask embedding-based approach using ResNet-18 for automated analysis of blastocyst quality from embryo images, predicting trophectoderm, inner cell mass, and expansion grades.
Details
Motivation: Current embryo grading in IVF relies on subjective visual assessment of morphological features, leading to inter-embryologist variability and standardization challenges. There's a need for automated, objective blastocyst quality assessment.Method: Uses a pretrained ResNet-18 architecture enhanced with an embedding layer to learn discriminative representations from limited embryo image datasets. The multitask approach simultaneously predicts TE, ICM, and EXP grades by leveraging biological and physical characteristics extracted from day-5 human embryo images.
Result: Experimental results demonstrate the promise of the multitask embedding approach for robust and consistent blastocyst quality assessment, showing potential for automated analysis of visually similar structures that are difficult to distinguish.
Conclusion: The proposed embedding-based multitask learning approach shows potential for reliable, automated blastocyst quality assessment that could address subjectivity and variability issues in current IVF embryo grading practices.
Abstract: Reliable evaluation of blastocyst quality is critical for the success of in vitro fertilization (IVF) treatments. Current embryo grading practices primarily rely on visual assessment of morphological features, which introduces subjectivity, inter-embryologist variability, and challenges in standardizing quality assurance. In this study, we propose a multitask embedding-based approach for the automated analysis and prediction of key blastocyst components, including the trophectoderm (TE), inner cell mass (ICM), and blastocyst expansion (EXP). The method leverages biological and physical characteristics extracted from images of day-5 human embryos. A pretrained ResNet-18 architecture, enhanced with an embedding layer, is employed to learn discriminative representations from a limited dataset and to automatically identify TE and ICM regions along with their corresponding grades, structures that are visually similar and inherently difficult to distinguish. Experimental results demonstrate the promise of the multitask embedding approach and potential for robust and consistent blastocyst quality assessment.
[146] Person Re-Identification via Generalized Class Prototypes
Md Ahmed Al Muzaddid, William J. Beksi
Main category: cs.CV
TL;DR: A generalized selection method for person re-identification that improves performance by choosing better class representations beyond simple centroids, balancing accuracy and mean average precision.
Details
Motivation: While feature extraction and objective function improvements have advanced person re-identification, selecting optimal class representatives remains underexplored. Prior methods using class centroids during retrieval yield suboptimal results, creating a need for better representation selection strategies.Method: Proposes a generalized selection method that chooses representations not limited to class centroids. The approach allows adjustment of the number of representations per class based on application requirements and works on top of existing re-identification embeddings.
Result: The method substantially improves upon contemporary results across multiple re-identification embeddings, achieving better balance between accuracy and mean average precision beyond state-of-the-art performance.
Conclusion: Better selection of class representatives is crucial for person re-identification performance, and the proposed generalized selection method effectively addresses this gap, offering flexible representation choices that improve retrieval metrics.
Abstract: Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results.
[147] Neural 3D Reconstruction of Planetary Surfaces from Descent-Phase Wide-Angle Imagery
Melonie de Almeida, George Brydon, Divya M. Persaud, John H. Williamson, Paul Henderson
Main category: cs.CV
TL;DR: Neural height field reconstruction method for planetary descent imagery outperforms traditional multi-view stereo by incorporating domain-specific priors about continuous, smooth planetary surfaces.
Details
Motivation: Digital elevation modeling of planetary surfaces is crucial for geological studies, but accurate 3D reconstruction from spacecraft descent imagery is challenging due to strong radial distortion, limited parallax from vertically descending cameras, and limitations of conventional multi-view stereo methods.Method: Developed a novel neural reconstruction approach with explicit neural height field representation that incorporates domain-specific priors about planetary surfaces being continuous, smooth, solid, and free from floating objects. This is the first study of modern neural reconstruction methods for planetary descent imaging.
Result: Experiments on simulated descent sequences over high-fidelity lunar and Mars terrains show the proposed approach achieves increased spatial coverage while maintaining satisfactory estimation accuracy compared to traditional multi-view stereo methods.
Conclusion: Neural approaches offer a strong and competitive alternative to traditional multi-view stereo methods for planetary descent imaging, with the neural height field representation providing effective domain-specific priors for planetary surface reconstruction.
Abstract: Digital elevation modeling of planetary surfaces is essential for studying past and ongoing geological processes. Wide-angle imagery acquired during spacecraft descent promises to offer a low-cost option for high-resolution terrain reconstruction. However, accurate 3D reconstruction from such imagery is challenging due to strong radial distortion and limited parallax from vertically descending, predominantly nadir-facing cameras. Conventional multi-view stereo exhibits limited depth range and reduced fidelity under these conditions and also lacks domain-specific priors. We present the first study of modern neural reconstruction methods for planetary descent imaging. We also develop a novel approach that incorporates an explicit neural height field representation, which provides a strong prior since planetary surfaces are generally continuous, smooth, solid, and free from floating objects. This study demonstrates that neural approaches offer a strong and competitive alternative to traditional multi-view stereo (MVS) methods. Experiments on simulated descent sequences over high-fidelity lunar and Mars terrains demonstrate that the proposed approach achieves increased spatial coverage while maintaining satisfactory estimation accuracy.
[148] LPM 1.0: Video-based Character Performance Model
Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Shawn Wang, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye
Main category: cs.CV
TL;DR: LPM 1.0 is a Large Performance Model that generates real-time, identity-stable audio-visual conversational performance for characters from single images and audio inputs, addressing the performance trilemma of expressiveness, real-time inference, and long-horizon identity stability.
Details
Motivation: Existing video models struggle with the "performance trilemma" - balancing high expressiveness, real-time inference, and long-horizon identity stability. Conversation represents the most comprehensive performance scenario where characters simultaneously speak, listen, react, and emote while maintaining identity over time.Method: Built a multimodal human-centric dataset with strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction. Trained a 17B-parameter Diffusion Transformer (Base LPM) for controllable, identity-consistent performance, then distilled it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction.
Result: LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio with text prompts for motion control, achieving real-time speed with identity-stable, infinite-length generation. It achieves state-of-the-art results on the proposed LPM-Bench benchmark across all evaluated dimensions.
Conclusion: LPM 1.0 serves as a visual engine for conversational agents, live streaming characters, and game NPCs, solving the performance trilemma through a novel multimodal approach that enables high-quality, real-time, identity-consistent character performance generation.
Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
[149] A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models
Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti
Main category: cs.CV
TL;DR: First implementation of concept-based Explainable AI (XAI) for Species Distribution Models using Robust TCAV to quantify landscape concept influence on predictions, with new open-access drone imagery dataset.
Details
Motivation: Species distribution models need both predictive performance and ecological insights, but deep learning models make extracting insights challenging. Need to reconcile these objectives with explainable AI.Method: Propose concept-based XAI for SDMs using Robust TCAV methodology. Create new open-access landscape concept dataset from high-resolution multispectral and LiDAR drone imagery (653 patches across 15 concepts + 1,450 random reference patches). Test on two aquatic insects using CNNs and Vision Transformers.
Result: Concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV provides landscape-level information useful for policy-making.
Conclusion: First successful implementation of concept-based XAI for SDMs demonstrates value in bridging predictive performance with ecological interpretability, enabling both validation and discovery in species distribution modeling.
Abstract: Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.
[150] 4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview
Benjamin Kiefer, Jan Lukas Augustin, Jon Muhovič, Mingi Jeong, Arnold Wiliem, Janez Pers, Matej Kristan, Alberto Quattrini Li, Matija Teršek, Josip Šarić, Arpita Vats, Dominik Hildebrand, Rafia Rahim, Mahmut Karaaslan, Arpit Vaishya, Steve Xie, Ersin Kaya, Akib Mashrur, Tze-Hsiang Tang, Chun-Ming Tsai, Jun-Wei Hsieh, Ming-Ching Chang, Wonwoo Jo, Doyeon Lee, Yusi Cao, Lingling Li, Vinayak Nageli, Arshad Jamal, Gorthi Rama Krishna Sai Subrahmanyam, Jemo Maeng, Seongju Lee, Kyoobin Lee, Xu Liu, LiCheng Jiao, Jannik Sheikh, Martin Weinmann, Ivan Martinović, Jose Mateus Raitz Persch, Rahul Harsha Cheppally, Mehmet E. Belviranli, Dimitris Gahtidis, Hyewon Chun, Sangmun Lee, Philipp Gorczak, Hansol Kim, Jeeyeon Jeon, Borja Carrillo Perez, Jiahui Wang, Sangmin Park, Andreas Michel, Jannick Kuester, Bettina Felten, Wolfgang Gross, Yuan Feng, Justin Davis
Main category: cs.CV
TL;DR: Workshop report summarizing the 4th Maritime Computer Vision (MaCVi) workshop at CVPR 2026, covering five benchmark challenges focused on predictive accuracy and real-time embedded feasibility in maritime computer vision.
Details
Motivation: To advance maritime computer vision research by providing benchmark challenges that emphasize both accuracy and real-time feasibility for embedded systems, addressing practical deployment needs in maritime environments.Method: Organized five benchmark challenges with specific evaluation protocols and datasets, collected submissions from participating teams, conducted quantitative and qualitative analyses, and compiled technical reports from top-performing teams.
Result: The workshop successfully conducted five benchmark challenges, established leaderboards, collected technical reports highlighting practical design choices, and provided comprehensive analyses of emerging method trends in maritime computer vision.
Conclusion: The MaCVi 2026 workshop advanced maritime computer vision research through comprehensive benchmark challenges that balanced predictive accuracy with real-time embedded feasibility, providing valuable resources and insights for the community.
Abstract: The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. Datasets, leaderboards, and challenge resources are available at https://macvi.org/workshop/cvpr26.
[151] Rethinking Uncertainty in Segmentation: From Estimation to Decision
Saket Maganti
Main category: cs.CV
TL;DR: Medical image segmentation uncertainty maps need actionable policies (accept/flag/defer) to be useful; optimizing uncertainty alone misses safety gains; best method removes 80% errors at 25% deferral.
Details
Motivation: Current medical image segmentation reports uncertainty estimates but rarely uses them to guide decisions; there's a disconnect between uncertainty metrics and real-world utility in converting uncertainty maps into actionable policies.Method: Formulate segmentation as two-stage pipeline (estimation then decision); evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) with three deferral strategies; introduce confidence-aware deferral rule prioritizing uncertain and low-confidence predictions; test on retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1).
Result: Best method and policy combination removes up to 80% of segmentation errors at only 25% pixel deferral; achieves strong cross-dataset robustness; calibration improvements don’t translate to better decision quality.
Conclusion: Uncertainty should be evaluated based on the decisions it enables rather than in isolation; there’s a disconnect between standard uncertainty metrics and real-world utility in medical image segmentation.
Abstract: In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.
[152] Indexing Multimodal Language Models for Large-scale Image Retrieval
Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias
Main category: cs.CV
TL;DR: MLLMs used as zero-shot similarity estimators for image-to-image retrieval without training, outperforming specialized methods on diverse benchmarks.
Details
Motivation: Multimodal LLMs have strong cross-modal reasoning but their potential for vision-only tasks like image retrieval remains underexplored. The paper investigates using MLLMs as training-free similarity estimators for instance-level image retrieval.Method: Prompt MLLMs with paired images and convert next-token probabilities into similarity scores for zero-shot re-ranking. Combine with memory-efficient indexing and top-k candidate re-ranking for scalability. Avoids specialized architectures and fine-tuning.
Result: MLLMs outperform task-specific re-rankers outside their native domains and show superior robustness to clutter, occlusion, and small objects. They work well for open-world large-scale image retrieval but have failure modes under severe appearance changes.
Conclusion: MLLMs are promising zero-shot alternatives for image retrieval, leveraging multimodal pre-training knowledge without additional training. Future work needed to address limitations with severe appearance changes.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.
[153] Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition
Mohammad Saleh, Azadeh Tabatabaei
Main category: cs.CV
TL;DR: Proposes T-SHAP, a temporally-aware explanation method for skeleton-based fall detection that stabilizes SHAP attributions over time windows to improve reliability for clinical use.
Details
Motivation: Existing post-hoc explainability methods produce temporally unstable attribution maps when applied frame-by-frame to sequential data, making them unreliable for clinical decision-making in fall detection.Method: Combines efficient LSTM model with T-SHAP (temporally aware SHAP aggregation) that applies linear smoothing to attribution sequences, reducing high-frequency variance while preserving Shapley value guarantees.
Result: Achieves 94.3% classification accuracy with <25ms inference latency; T-SHAP improves explanation reliability (AUP: 0.91 vs 0.89 for standard SHAP) and highlights biomechanically relevant motion patterns.
Conclusion: The framework provides stable, reliable explanations for fall detection that align with clinical observations, supporting transparent decision aids in elderly care environments.
Abstract: Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation, indicating enhanced explanation reliability. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments
[154] See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones
Mahyar Ghazanfari, Peng Wei
Main category: cs.CV
TL;DR: See&Say combines geometric safety analysis with semantic perception using Vision-Language Models for autonomous drone package delivery, outperforming baselines in safety map prediction and alternative drop zone identification.
Details
Motivation: Existing drone delivery systems struggle with safe package drop-offs in cluttered urban environments due to limitations in either geometry-based analysis or semantic segmentation alone, lacking integrated semantic reasoning for robust decision-making.Method: Proposes See&Say framework that fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, guided by a Vision-Language Model for iterative refinement of object category prompts and hazard detection across time.
Result: Outperforms all baselines with highest accuracy and IoU for safety map prediction, and superior performance in alternative drop zone evaluation across multiple thresholds on a curated dataset of urban delivery scenarios.
Conclusion: Demonstrates the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery, enabling reliable reasoning under dynamic conditions during final delivery phase.
Abstract: Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.
[155] PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines
Wei Jiang, Wei Wang
Main category: cs.CV
TL;DR: PAT-VCM: A plug-and-play auxiliary-token framework for video coding for machines that uses shared baseline compressed streams augmented with lightweight task-aware auxiliary tokens for multiple downstream tasks.
Details
Motivation: Existing video coding for machines is typically trained for specific downstream tasks and models, making compressed representations tightly coupled to end tasks and difficult to scale across multiple tasks or adapt to model updates.Method: Proposes PAT-VCM framework that maintains a shared baseline compressed stream and augments it with three types of lightweight task-aware auxiliary tokens: visual residual tokens, prompt/control tokens, and semantic tokens, allowing different downstream tasks to recover needed information without separate codec training.
Result: Evaluation on segmentation, depth estimation, and semantic recognition shows: shared detection-oriented auxiliary branch provides reusable first refinement; task-specific visual branches improve segmentation and depth; prompt tokens provide further segmentation gains at negligible bitrate; semantic tokens achieve strong recognition performance with extremely low overhead.
Conclusion: A shared compressed representation combined with lightweight task-aware auxiliary tokens is a practical and scalable alternative to tightly task-coupled VCM design, enabling multi-task support and adaptability to model updates.
Abstract: Existing video coding for machines is often trained for a specific downstream task and model. As a result, the compressed representation becomes tightly coupled to the end task, making it difficult to scale across multiple tasks or adapt to model updates. We propose PAT-VCM, a plug-and-play auxiliary-token framework for video coding for machines. PAT-VCM keeps a shared baseline compressed stream and augments it with lightweight task-aware auxiliary tokens, allowing different downstream tasks to recover the information they need without retraining a separate codec for each task. The framework supports three forms of auxiliary information: visual residual tokens, prompt/control tokens, and semantic tokens. We evaluate PAT-VCM on segmentation, depth estimation, and semantic recognition. A shared detection-oriented auxiliary branch provides a reusable first refinement, task-specific visual branches improve segmentation and depth, prompt tokens provide further segmentation gains at negligible bitrate, and semantic tokens achieve strong recognition performance with extremely low overhead. These results suggest that a shared compressed representation, combined with lightweight task-aware auxiliary tokens, is a practical and scalable alternative to tightly task-coupled VCM design.
[156] Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
Gerasimos Chatzoudis, Konstantinos D. Polyzos, Zhuowei Li, Difei Gu, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas
Main category: cs.CV
TL;DR: Cross-Layer Transcoders (CLTs) are introduced as sparse, depth-aware proxy models for MLP blocks in Vision Transformers, enabling interpretable decomposition of final representations into additive layer-wise contributions.
Details
Motivation: Existing Sparse Autoencoders operate on individual layers and fail to capture cross-layer computational structure and layer significance in Vision Transformers, limiting interpretability of how final representations are formed.Method: CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, creating a linear decomposition that transforms final ViT representations into additive, layer-resolved constructions.
Result: CLTs achieve high reconstruction fidelity while preserving CLIP zero-shot classification accuracy. Cross-layer contribution scores provide faithful attribution, revealing final representations are concentrated in a small set of dominant layer-wise terms.
Conclusion: CLTs serve as reliable interpretable proxies for Vision Transformers, enabling process-level interpretability and faithful attribution by decomposing final representations into layer-wise contributions.
Abstract: Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.
[157] Bias at the End of the Score
Salma Abdel Magid, Grace Guo, Esin Tureci, Amaya Dharmasiri, Vikram V. Ramaswamy, Hanspeter Pfister, Olga Russakovsky
Main category: cs.CV
TL;DR: Reward models in text-to-image generation encode demographic biases that cause optimization to disproportionately sexualize female subjects, reinforce stereotypes, and collapse diversity.
Details
Motivation: While reward models are crucial for text-to-image generation systems (used for dataset filtering, evaluation, optimization, and safety filtering), their robustness and fairness as scoring functions remain largely unknown, particularly regarding demographic biases.Method: Conducted a large-scale audit of reward model robustness with respect to demographic biases during T2I model training and generation, providing both quantitative and qualitative evidence of bias encoding.
Result: Reward models encode demographic biases that cause reward-guided optimization to disproportionately sexualize female image subjects, reinforce gender/racial stereotypes, and collapse demographic diversity.
Conclusion: Findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures for more robust scoring.
Abstract: Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used at various stages for dataset filtering, as evaluation metrics, as a supervisory signal during optimization of parameters, and for post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse), their robustness and fairness as scoring functions remains largely unknown. We conduct a large scale audit of RM robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to disproportionately sexualize female image subjects reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures to enable more robust scoring.
[158] Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering
Vutichart Buranasiri, James M. Murphy
Main category: cs.CV
TL;DR: Unsupervised hyperspectral image clustering using masked autoencoder for denoised latent representation and diffusion-based clustering with spatial regularization.
Details
Motivation: To improve hyperspectral image clustering by learning better latent representations that capture spatial context and spectral correlations, enabling more accurate diffusion-based clustering.Method: Two-stage approach: 1) Unsupervised masked autoencoder (UMAE) with Vision Transformer backbone learns denoised latent representations using only a small subset of training pixels via masking; 2) Entropy rate superpixel segmentation followed by spatially-regularized diffusion graph construction in the compressed latent space using Euclidean and diffusion distances.
Result: Experiments on Botswana and KSC datasets demonstrate improved labeling accuracy and clustering quality compared to baseline methods.
Conclusion: The proposed DS²DL framework effectively combines masked representation learning with diffusion-based clustering for superior hyperspectral image clustering performance.
Abstract: An unsupervised framework for hyperspectral image (HSI) clustering is proposed that incorporates masked deep representation learning with diffusion-based clustering, extending the Spatially-Regularized Superpixel-based Diffusion Learning ($S^2DL$) algorithm. Initially, a denoised latent representation of the original HSI is learned via an unsupervised masked autoencoder (UMAE) model with a Vision Transformer backbone. The UMAE takes spatial context and long-range spectral correlations into account and incorporates an efficient pretraining process via masking that utilizes only a small subset of training pixels. In the next stage, the entropy rate superpixel (ERS) algorithm is used to segment the image into superpixels, and a spatially regularized diffusion graph is constructed using Euclidean and diffusion distances within the compressed latent space instead of the HSI space. The proposed algorithm, Deep Spatially-Regularized Superpixel-based Diffusion Learning ($DS^2DL$), leverages more faithful diffusion distances and subsequent diffusion graph construction that better reflect the intrinsic geometry of the underlying data manifold, improving labeling accuracy and clustering quality. Experiments on Botswana and KSC datasets demonstrate the efficacy of $DS^2DL$.
[159] The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform
Akshit Gupta, Joris Timmermans, Filip Biljecki, Remko Uijlenhoet
Main category: cs.CV
TL;DR: Spectrascapes: A novel multi-spectral terrestrial-view dataset for urban climate monitoring, featuring RGB, Near-infrared, and Thermal imagery captured across diverse urban areas in the Netherlands.
Details
Motivation: Current urban monitoring datasets have limitations including poor scalability, inconsistent spatio-temporal resolutions, overhead views, or low spectral information, which hinder climate-resilient city development.Method: Created a multi-spectral dataset using bikes equipped with RGB, Near-infrared, and Thermal imaging sensors, capturing 17,718 street-level images across diverse urban morphologies in the Netherlands with strict calibration and quality control.
Result: Spectrascapes is presented as the first open-access dataset of its kind, enabling downstream applications in machine learning, urban planning, and remote sensing domains.
Conclusion: The dataset addresses limitations of existing urban monitoring methods and provides a valuable resource for developing climate-resilient cities through multi-spectral terrestrial-view analysis.
Abstract: High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street level multi-spectral images captured with RGB, Near-infrared, and Thermal imaging sensors on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality while also providing the details of our data collection methodology (including the hardware and software details). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled using this dataset and provide potential research directions in the machine learning, urban planning and remote sensing domains.
[160] Why MLLMs Struggle to Determine Object Orientations
Anju Gopinath, Nikhil Krishnaswamy, Bruce Draper
Main category: cs.CV
TL;DR: Contrary to prior assumptions, visual encoder representations in MLLMs do preserve object orientation information, as linear models can accurately predict rotations from embeddings, suggesting orientation failures stem from other MLLM components.
Details
Motivation: Prior work suggests MLLMs struggle with 2D object orientation tasks due to visual encoder limitations, but this paper aims to empirically test whether orientation information is actually preserved in encoder representations.Method: Designed controlled protocol to test orientation recovery from encoder features. Examined SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, and CLIP representations from LLaVA 1.5/1.6 using rotated foreground patches. Trained linear regressors to predict object orientation from encoded features as a test of information preservation.
Result: Contrary to the null hypothesis, orientation information is recoverable from encoder representations - simple linear models can accurately predict object orientations from embeddings, contradicting the assumption that MLLM orientation failures originate in the visual encoder.
Conclusion: Visual encoders do preserve orientation information, so MLLM failures on 2D orientation tasks must stem from other components. Although orientation information is present, it’s diffusely spread across thousands of features, which may explain why MLLMs fail to exploit it effectively.
Abstract: Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don’t know why they fail. Although a full explanation is beyond the scope of this paper, we show that although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be while MLLMs fail to exploit the available orientation information.
[161] Towards Successful Implementation of Automated Raveling Detection: Effects of Training Data Size, Illumination Difference, and Spatial Shift
Xinan Zhang, Haolin Wang, Zhongyu Yang, Yi-Chang, Tsai
Main category: cs.CV
TL;DR: A benchmark called RavelingArena is proposed to evaluate model robustness for asphalt pavement raveling detection, addressing performance degradation in diverse real-world conditions through controlled variation experiments.
Details
Motivation: Current machine learning methods for raveling detection degrade in large-scale deployments due to diverse inference data from different runs, sensors, and environmental conditions, highlighting the need for more generalizable and robust solutions.Method: Proposed RavelingArena benchmark built by augmenting existing dataset with controlled variations to evaluate model robustness. Identified and assessed variations impacting robustness including training data quantity, illumination differences, and spatial shifts.
Result: Both quantity and diversity of training data are critical for model accuracy, achieving at least 9.2% gain under most diverse conditions. Case study on multi-year test section showed significant improvements in year-to-year consistency.
Conclusion: The insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks requiring adaptability to diverse conditions, laying foundations for temporal deterioration modeling.
Abstract: Raveling, the loss of aggregates, is a major form of asphalt pavement surface distress, especially on highways. While research has shown that machine learning and deep learning-based methods yield promising results for raveling detection by classification on range images, their performance often degrades in large-scale deployments where more diverse inference data may originate from different runs, sensors, and environmental conditions. This degradation highlights the need of a more generalizable and robust solution for real-world implementation. Thus, the objectives of this study are to 1) identify and assess potential variations that impact model robustness, such as the quantity of training data, illumination difference, and spatial shift; and 2) leverage findings to enhance model robustness under real-world conditions. To this end, we propose RavelingArena, a benchmark designed to evaluate model robustness to variations in raveling detection. Instead of collecting extensive new data, it is built by augmenting an existing dataset with diverse, controlled variations, thereby enabling variation-controlled experiments to quantify the impact of each variation. Results demonstrate that both the quantity and diversity of training data are critical to the accuracy of models, achieving at least a 9.2% gain in accuracy under the most diverse conditions in experiments. Additionally, a case study applying these findings to a multi-year test section in Georgia, U.S., shows significant improvements in year-to-year consistency, laying foundations for future studies on temporal deterioration modeling. These insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks that require adaptability to diverse conditions.
[162] Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift
Akshit Achara, Yovin Yathathugoda, Nick Byrne, Michela Antonelli, Esther Puyol Anton, Alexander Hammers, Andrew P. King
Main category: cs.CV
TL;DR: Paper introduces Flip diagnostic to quantify semantic label-flip errors in segmentation models under distribution shift, where models correctly identify foreground objects but assign wrong semantic labels due to spurious correlations.
Details
Motivation: Segmentation models can fail in subtle ways under distribution shift - they may correctly identify object boundaries but assign wrong semantic labels due to spurious correlations between non-causal features and target labels. Current robustness evaluations focus on overlap metrics but miss these semantic label-flip errors.Method: Proposes Flip diagnostic that counts how often ground truth foreground pixels are assigned wrong foreground identity while remaining predicted as foreground. Also introduces entropy-based flip-risk score computed from foreground identity uncertainty to flag flip-prone cases at inference time without ground truth labels.
Result: In settings with category-scene correlations during training, increasing correlation widens gap between common/rare test conditions and increases within-object label swaps on counterfactual groups. Flip diagnostic reveals these semantic errors that overlap metrics miss.
Conclusion: Segmentation robustness should be assessed beyond overlap metrics by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. The proposed flip-risk score can identify vulnerable cases at inference time.
Abstract: The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free `flip-risk’ score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.
[163] SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting
Iris Zheng, Guojun Tang, Alexander Doronin, Paul Teal, Fang-Lue Zhang
Main category: cs.CV
TL;DR: SSD-GS is a physically-based relighting framework using 3D Gaussian Splatting that decomposes reflectance into four components (diffuse, specular, shadow, subsurface scattering) for high-quality reconstruction and photorealistic relighting under novel lighting conditions.
Details
Motivation: Existing 3DGS-based relighting methods use coarse shading decompositions (only diffuse/specular or neural approximations) leading to limited fidelity and poor physical interpretability, especially for anisotropic metals and translucent materials.Method: Decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. Introduces learnable dipole-based scattering module for subsurface transport, occlusion-aware shadow formulation with visibility estimates and refinement network, and enhanced specular component with anisotropic Fresnel-based model. Uses progressive integration during training.
Result: Demonstrates superior quantitative and perceptual relighting quality compared to prior methods on challenging OLAT dataset. Effectively disentangles lighting and material properties even for unseen illumination conditions.
Conclusion: SSD-GS achieves high-quality physically-based relighting with better fidelity and interpretability, enabling downstream tasks like controllable light source editing and interactive scene relighting.
Abstract: We present SSD-GS, a physically-based relighting framework built upon 3D Gaussian Splatting (3DGS) that achieves high-quality reconstruction and photorealistic relighting under novel lighting conditions. In physically-based relighting, accurately modeling light-material interactions is essential for faithful appearance reproduction. However, existing 3DGS-based relighting methods adopt coarse shading decompositions, either modeling only diffuse and specular reflections or relying on neural networks to approximate shadows and scattering. This leads to limited fidelity and poor physical interpretability, particularly for anisotropic metals and translucent materials. To address these limitations, SSD-GS decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. We introduce a learnable dipole-based scattering module for subsurface transport, an occlusion-aware shadow formulation that integrates visibility estimates with a refinement network, and an enhanced specular component with an anisotropic Fresnel-based model. Through progressive integration of all components during training, SSD-GS effectively disentangles lighting and material properties, even for unseen illumination conditions, as demonstrated on the challenging OLAT dataset. Experiments demonstrate superior quantitative and perceptual relighting quality compared to prior methods and pave the way for downstream tasks, including controllable light source editing and interactive scene relighting. The source code is available at: https://github.com/irisfreesiri/SSD-GS.
[164] SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization
Farzaneh Jafari, Stefano Berretti, Anup Basu
Main category: cs.CV
TL;DR: SEDTalker is an emotion-aware framework for speech-driven 3D facial animation that uses frame-level speech emotion diarization for fine-grained expressive control, enabling continuous modulation of facial expressions over time.
Details
Motivation: Prior approaches rely on utterance-level or manually specified emotion labels, which lack temporal granularity for continuous emotion modulation in speech-driven facial animation. There's a need for more fine-grained, temporally dense emotion control that can capture the dynamic nature of emotional expression in speech.Method: The method uses frame-level speech emotion diarization to predict temporally dense emotion categories and intensities directly from speech. These diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture, enabling effective disentanglement of linguistic content and emotional style.
Result: Quantitative results show strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors. Qualitative results demonstrate smooth emotion transitions and consistent expression control. The approach was evaluated on a large-scale multi-corpus dataset for speech emotion diarization and the EmoVOCA dataset for emotional 3D facial animation.
Conclusion: Frame-level emotion diarization is effective for expressive and controllable 3D talking head generation, enabling fine-grained control over facial expressions that aligns with the dynamic emotional content of speech while preserving identity and temporal coherence.
Abstract: We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diarization for expressive and controllable 3D talking head generation.
[165] MSGS: Multispectral 3D Gaussian Splatting
Iris Zheng, Guojun Tang, Alexander Doronin, Paul Teal, Fang-Lue Zhang
Main category: cs.CV
TL;DR: Multispectral 3D Gaussian Splatting extends 3DGS with spectral radiance representation and dual-loss optimization for improved view synthesis with spectral consistency.
Details
Motivation: To enhance 3D Gaussian Splatting by incorporating multispectral information for better rendering fidelity, especially for challenging materials like translucent surfaces and anisotropic reflections, while maintaining real-time efficiency.Method: Augment each Gaussian with spectral radiance using per-band spherical harmonics, optimize with dual-loss supervision combining RGB and multispectral signals, and perform spectral-to-RGB conversion at pixel level to retain richer spectral cues.
Result: Demonstrates consistent improvements over RGB-only 3DGS baseline in image quality and spectral consistency on both public and self-captured datasets, particularly excelling with translucent materials and anisotropic reflections.
Conclusion: The approach maintains 3DGS’s compactness and real-time efficiency while providing a foundation for future integration with physically based shading models through multispectral representation.
Abstract: We present a multispectral extension to 3D Gaussian Splatting (3DGS) for wavelength-aware view synthesis. Each Gaussian is augmented with spectral radiance, represented via per-band spherical harmonics, and optimized under a dual-loss supervision scheme combining RGB and multispectral signals. To improve rendering fidelity, we perform spectral-to-RGB conversion at the pixel level, allowing richer spectral cues to be retained during optimization. Our method is evaluated on both public and self-captured real-world datasets, demonstrating consistent improvements over the RGB-only 3DGS baseline in terms of image quality and spectral consistency. Notably, it excels in challenging scenes involving translucent materials and anisotropic reflections. The proposed approach maintains the compactness and real-time efficiency of 3DGS while laying the foundation for future integration with physically based shading models.
[166] Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface
Vladimir Kalušev, Branko Brkljač, Milan Brkljač
Main category: cs.CV
TL;DR: Edge-based object detection system using LLM-based natural language interface and multi-agent orchestration on Raspberry Pi, integrating YOLO for vision, Slack chatbot, and Ollama LLM locally without cloud resources.
Details
Motivation: To demonstrate practical integration of AI agents for object detection and tracking on resource-constrained edge hardware, moving beyond traditional approaches by using LLM-based natural language interfaces for system control and communication.Method: Multi-agent framework with YOLO-based computer vision agent for real-time object detection/tracking, Slack channel chatbot agent for natural language interface, and Ollama LLM reporting agent - all running locally on Raspberry Pi. Event-based message exchange subsystem for agent orchestration instead of fully autonomous frameworks.
Result: Successful prototype implementation showing integration of all components on single resource-constrained platform, with insights into limitations of low-cost testbed platforms for centralized multi-agent AI systems.
Conclusion: Demonstrates feasibility of edge-based multi-agent AI systems with LLM interfaces for vision tasks, highlighting fast prototyping approach enabled by generative AI systems and providing practical alternative to cloud-dependent solutions.
Abstract: The paper presents design and prototype implementation of an edge based object detection system within the new paradigm of AI agents orchestration. It goes beyond traditional design approaches by leveraging on LLM based natural language interface for system control and communication and practically demonstrates integration of all system components into a single resource constrained hardware platform. The method is based on the proposed multi-agent object detection framework which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic for transformational potential of generative AI systems, which are applied during both development and implementation stages. Instead of specialized communication and control interface, the system is made by using Slack channel chatbot agent and accompanying Ollama LLM reporting agent, which are both run locally on the same Raspberry Pi platform, alongside the dedicated YOLO based computer vision agent performing real time object detection and tracking. Agent orchestration is implemented through a specially designed event based message exchange subsystem, which represents an alternative to completely autonomous agent orchestration and control characteristic for contemporary LLM based frameworks like the recently proposed OpenClaw. Conducted experimental investigation provides valuable insights into limitations of the low cost testbed platforms in the design of completely centralized multi-agent AI systems. The paper also discusses comparative differences between presented approach and the solution that would require additional cloud based external resources.
[167] A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings
Caiwen Jiang, Lei Zeng, Wei Liu
Main category: cs.CV
TL;DR: A 3D SAM-based progressive prompting framework for segmenting radiotherapy-induced normal tissue injuries in head-and-neck medical images, using text, dose-guided box, and click prompts with small-target focus loss.
Details
Motivation: Radiotherapy-induced normal tissue injury segmentation is clinically important but challenging due to limited annotations and heterogeneity across injury types, lesion sizes, and imaging modalities.Method: Proposes a 3D SAM-based progressive prompting framework with three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement, plus a small-target focus loss for small/sparse lesions.
Result: The method achieves reliable segmentation performance across diverse injury types (ORN, CE, CRN) and outperforms state-of-the-art methods.
Conclusion: The proposed progressive prompting framework effectively addresses challenges in radiotherapy-induced injury segmentation with limited data and heterogeneous lesions.
Abstract: Radiotherapy-induced normal tissue injury is a clinically important complication, and accurate segmentation of injury regions from medical images could facilitate disease assessment, treatment planning, and longitudinal monitoring. However, automatic segmentation of these lesions remains largely unexplored because of limited voxel-level annotations and substantial heterogeneity across injury types, lesion size, and imaging modality. To address this gap, we curate a dedicated head-and-neck radiotherapy-induced normal tissue injury dataset covering three manifestations: osteoradionecrosis (ORN), cerebral edema (CE), and cerebral radiation necrosis (CRN). We further propose a 3D SAM-based progressive prompting framework for multi-task segmentation in limited-data settings. The framework progressively incorporates three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement. A small-target focus loss is introduced to improve local prediction and boundary delineation for small and sparse lesions. Experiments on ORN, CE, and CRN demonstrate that the proposed method achieves reliable segmentation performance across diverse injury types and outperforms state-of-the-art methods.
[168] UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization
Jiatao Dai, Wei Dong, Han Zhou, Chengzhou Tang, Jun Chen
Main category: cs.CV
TL;DR: UniBlendNet: A unified framework for ambient lighting normalization that improves upon IFBlend by better modeling global illumination, multi-scale structures, and region-adaptive refinement for enhanced image restoration under complex lighting conditions.
Details
Motivation: Existing ambient lighting normalization methods like IFBlend have limitations in global context modeling and spatial adaptivity, leading to suboptimal restoration in challenging regions with complex, spatially varying illumination conditions.Method: Proposes UniBlendNet with three key components: 1) UniConvNet-based module for global illumination understanding and long-range dependencies, 2) Scale-Aware Aggregation Module (SAAM) for pyramid-based multi-scale feature aggregation with dynamic reweighting, and 3) mask-guided residual refinement for region-adaptive correction.
Result: Extensive experiments on NTIRE Ambient Lighting Normalization benchmark show UniBlendNet consistently outperforms baseline IFBlend, achieving improved restoration quality with visually more natural and stable results.
Conclusion: UniBlendNet effectively addresses limitations of existing methods by jointly modeling global illumination, multi-scale structures, and region-adaptive refinement, leading to superior ambient lighting normalization performance.
Abstract: Ambient Lighting Normalization (ALN) aims to restore images degraded by complex, spatially varying illumination conditions. Existing methods, such as IFBlend, leverage frequency-domain priors to model illumination variations, but still suffer from limited global context modeling and insufficient spatial adaptivity, leading to suboptimal restoration in challenging regions. In this paper, we propose UniBlendNet, a unified framework for ambient lighting normalization that jointly models global illumination, multi-scale structures, and region-adaptive refinement. Specifically, we enhance global illumination understanding by integrating a UniConvNet-based module to capture long-range dependencies. To better handle complex lighting variations, we introduce a Scale-Aware Aggregation Module (SAAM) that performs pyramid-based multi-scale feature aggregation with dynamic reweighting. Furthermore, we design a mask-guided residual refinement mechanism to enable region-adaptive correction, allowing the model to selectively enhance degraded regions while preserving well-exposed areas. This design effectively improves illumination consistency and structural fidelity under complex lighting conditions. Extensive experiments on the NTIRE Ambient Lighting Normalization benchmark demonstrate that UniBlendNet consistently outperforms the baseline IFBlend and achieves improved restoration quality, while producing visually more natural and stable restoration results.
[169] A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy
Caiwen Jiang, Yuzhen Ding, Mi Jia, Samir H. Patel, Terence T. Sio, Jonathan B. Ashman, Lisa A. McGee, Jean-Claude M. Rwigema, William G. Rule, Sameer R. Keole, Sujay A. Vora, William W. Wong, Nathan Y. Yu, Michele Y. Halyard, Steven E. Schild, Dinggang Shen, Wei Liu
Main category: cs.CV
TL;DR: A deep learning framework for fast, clinically-informed deformable image registration in proton therapy using multimodal data including CT scans, anatomical contours, dose distributions, and treatment planning text.
Details
Motivation: Proton therapy requires accurate deformable image registration for adaptive workflows, but conventional methods are too slow and existing deep learning approaches don't utilize clinically relevant multimodal information beyond just images.Method: Coarse-to-fine framework with dual CNN encoders for hierarchical feature extraction and transformer-based decoder. Incorporates clinical priors (target/OAR contours, dose distributions, treatment planning text) via anatomy/risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization.
Result: Evaluated on large-scale proton therapy dataset (1,222 paired CT scans). Shows consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.
Conclusion: The proposed framework integrates multimodal clinical information to achieve fast, accurate deformable registration for proton therapy adaptive workflows, outperforming existing methods.
Abstract: Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.
[170] Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks
Yu Wang, Sharon Li
Main category: cs.CV
TL;DR: Multimodal ICL performs comparably to text-only ICL in zero-shot but degrades significantly in few-shot settings due to lack of reasoning-level alignment between visual and textual representations and unreliable task mapping transfer.
Details
Motivation: Despite ICL's success in LLMs, its extension to multimodal settings remains poorly understood in terms of internal mechanisms and differences from text-only ICL. The paper aims to systematically analyze multimodal ICL in MLLMs.Method: Systematic analysis of ICL in multimodal LLMs using identical task formulations across modalities. Decomposes multimodal ICL into task mapping construction and transfer, analyzes cross-modal task mapping establishment and transfer across layers. Proposes inference-stage enhancement method to reinforce task mapping transfer.
Result: Multimodal ICL performs comparably to text-only ICL in zero-shot but degrades significantly under few-shot demonstrations. Analysis reveals current models lack reasoning-level alignment between visual and textual representations and fail to reliably transfer learned task mappings to queries. Proposed enhancement method improves performance.
Conclusion: The study provides new insights into mechanisms and limitations of multimodal ICL, revealing fundamental differences from text-only ICL. Findings suggest directions for more effective multimodal adaptation through better cross-modal alignment and task mapping transfer.
Abstract: In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available \href{https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI}{here}.
[171] CausalDisenSeg: A Causality-Guided Disentanglement Framework with Counterfactual Reasoning for Robust Brain Tumor Segmentation Under Missing Modalities
Bo Liu, Yulong Zou, Jin Hong
Main category: cs.CV
TL;DR: CausalDisenSeg: A causality-guided framework for robust brain tumor segmentation under incomplete MRI data by disentangling anatomical causal factors from stylistic bias factors using causal intervention and counterfactual reasoning.
Details
Motivation: Deep learning models for multimodal brain tumor segmentation suffer from robustness issues with incomplete MRI data due to modality bias, where models exploit spurious correlations rather than learning true anatomical structures. Existing feature fusion methods fail to eliminate this dependency.Method: Proposes CausalDisenSeg, a Structural Causal Model-based framework with three-stage causal intervention: 1) Explicit causal disentanglement using CVAE with HSIC constraint to enforce orthogonality between anatomical and style features, 2) Causal representation reinforcement with Region Causality Module to ground features in physical tumor regions, 3) Counterfactual reasoning with dual-adversarial strategy to suppress residual bias effects.
Result: Significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios on BraTS 2020 dataset. Achieves state-of-the-art macro-average DSC of 84.49 on BraTS 2023 cross-dataset evaluation.
Conclusion: CausalDisenSeg effectively addresses modality bias in multimodal brain tumor segmentation through causality-guided disentanglement and counterfactual reasoning, achieving robust performance even with incomplete MRI data.
Abstract: In clinical practice, the robustness of deep learning models for multimodal brain tumor segmentation is severely compromised by incomplete MRI data. This vulnerability stems primarily from modality bias, where models exploit spurious correlations as shortcuts rather than learning true anatomical structures. Existing feature fusion methods fail to fundamentally eliminate this dependency. To address this, we propose CausalDisenSeg, a novel Structural Causal Model (SCM)-grounded framework that achieves robust segmentation via causality-guided disentanglement and counterfactual reasoning. We reframe the problem as isolating the anatomical Causal Factor from the stylistic Bias Factor. Our framework implements a three-stage causal intervention: (1) Explicit Causal Disentanglement: A Conditional Variational Autoencoder (CVAE) coupled with an HSIC constraint mathematically enforces statistical orthogonality between anatomical and style features. (2) Causal Representation Reinforcement: A Region Causality Module (RCM) explicitly grounds causal features in physical tumor regions. (3) Counterfactual Reasoning: A dual-adversarial strategy actively suppresses the residual Natural Direct Effect (NDE) of the bias, forcing its spatial attention to be mutually exclusive from the causal path. Extensive experiments on the BraTS 2020 dataset demonstrate that CausalDisenSeg significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios. Furthermore, cross-dataset evaluation on BraTS 2023 under the same protocol yields a state-of-the-art macro-average DSC of 84.49.
[172] DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis
Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin
Main category: cs.CV
TL;DR: DF3DV-1K: A large-scale real-world dataset of 1,048 scenes with clean and cluttered images for benchmarking distractor-free radiance field methods, enabling robust evaluation and method development.
Details
Motivation: There's a lack of large-scale real-world datasets with both clean and cluttered images for distractor-free radiance fields, limiting development beyond scene-specific reconstruction approaches.Method: Created DF3DV-1K dataset with 1,048 scenes, 89,924 images captured with consumer cameras, spanning 128 distractor types and 161 scene themes. Includes curated DF3DV-41 subset for challenging scenario evaluation. Benchmarked 9 distractor-free radiance field methods and 3D Gaussian Splatting.
Result: Dataset enables comprehensive benchmarking, identifying robust methods and challenging scenarios. Demonstrated application by fine-tuning diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on held-out sets.
Conclusion: DF3DV-1K facilitates development of distractor-free vision and promotes progress beyond scene-specific approaches by providing a comprehensive benchmarking dataset.
Abstract: Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches.
[173] Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens
Zhiwen Zheng, Yuheng Qiao, Xiaoshuai Zhang, Zhao Huang, Tao Zhang, Huiyu Zhou, Shaowei Jiang, Jin Liu, Wenwen Tang, Xingru Huang
Main category: cs.CV
TL;DR: IR4Net uses optical projection side-channels to reconstruct screen content from diffuse reflections, addressing instability and compression issues through physical regularization and semantic reprojection.
Details
Motivation: Noncontact exfiltration of electronic screen content via side-channel attacks presents security challenges. Current optical projection methods face two core problems: (1) near-singular Jacobian spectrum causing instability in projection inversion, and (2) irreversible compression in light transport that destroys global semantic information.Method: IR4Net (Irradiance Robust Radiometric Inversion Network) uses passive speckle patterns from diffuse reflection. It combines: 1) Physically Regularized Irradiance Approximation (PRIrr-Approximation) that embeds radiative transfer equation in learnable optimizer, 2) contour-to-detail cross-scale reconstruction to prevent noise propagation, and 3) Irreversibility Constrained Semantic Reprojection (ICSR) module to restore lost global structure through context-driven semantic mapping.
Result: Evaluated across four scene categories, IR4Net achieves higher fidelity than competing neural approaches while maintaining resilience to illumination perturbations.
Conclusion: IR4Net provides a robust solution for optical projection side-channel attacks by addressing fundamental stability and information loss problems through physical regularization and semantic reconstruction techniques.
Abstract: Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR4Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.
[174] VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
Yifan Li, Pei Cheng, Bin Fu, Shuai Yang, Jiaying Liu
Main category: cs.CV
TL;DR: VibeFlow is a self-supervised framework for video chroma-lux editing that leverages pre-trained video generation models to modify illumination and color while preserving structure and temporal coherence without requiring paired training data.
Details
Motivation: Video chroma-lux editing (modifying illumination and color while preserving structure) is challenging and existing methods require expensive supervised training with synthetic paired data, limiting practical applications.Method: Uses disentangled data perturbation pipeline to adaptively recombine structure from source videos and color-illumination cues from reference images. Introduces Residual Velocity Fields and Structural Distortion Consistency Regularization to address discretization errors in flow-based models.
Result: Achieves impressive visual quality with reduced computational overhead, generalizes zero-shot to diverse applications including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing.
Conclusion: VibeFlow provides an effective self-supervised framework for video chroma-lux editing that eliminates need for costly training resources while maintaining structural and temporal fidelity.
Abstract: Video chroma-lux editing, which aims to modify illumination and color while preserving structural and temporal fidelity, remains a significant challenge. Existing methods typically rely on expensive supervised training with synthetic paired data. This paper proposes VibeFlow, a novel self-supervised framework that unleashes the intrinsic physical understanding of pre-trained video generation models. Instead of learning color and light transitions from scratch, we introduce a disentangled data perturbation pipeline that enforces the model to adaptively recombine structure from source videos and color-illumination cues from reference images, enabling robust disentanglement in a self-supervised manner. Furthermore, to rectify discretization errors inherent in flow-based models, we introduce Residual Velocity Fields alongside a Structural Distortion Consistency Regularization, ensuring rigorous structural preservation and temporal coherence. Our framework eliminates the need for costly training resources and generalizes in a zero-shot manner to diverse applications, including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing. Extensive experiments demonstrate that VibeFlow achieves impressive visual quality with significantly reduced computational overhead. Our project is publicly available at https://lyf1212.github.io/VibeFlow-webpage.
[175] Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking
Jinlin You, Muyu Li, Xudong Zhao
Main category: cs.CV
TL;DR: MambaTrack: A multimodal RGB-Event tracking framework using dynamic state space models with event-adaptive state transitions and gated projection fusion for robust cross-modal integration.
Details
Motivation: Existing Vision Mamba-based RGB-Event tracking methods use static state transition matrices that fail to adapt to variations in event sparsity, leading to imbalanced modeling (underfitting sparse event streams and overfitting dense ones) and degraded cross-modal fusion robustness.Method: Proposes MambaTrack with two key innovations: 1) Event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density using a learnable scalar, and 2) Gated Projection Fusion module that projects RGB features into event feature space and generates adaptive gates from event density and RGB confidence scores to control fusion intensity.
Result: Achieves state-of-the-art performance on FE108 and FELT datasets, with lightweight design suggesting potential for real-time embedded deployment.
Conclusion: MambaTrack addresses limitations of static state transition matrices in RGB-Event tracking through dynamic adaptation to event sparsity variations, enabling robust cross-modal fusion and superior tracking performance.
Abstract: Existing Vision Mamba-based RGB-Event(RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling-underfitting sparse event streams and overfitting dense ones-thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model(DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion(GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.
[176] MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis
Simin Huo, Ning Li
Main category: cs.CV
TL;DR: MaMe is a GPU-friendly, training-free token merging method for Vision Transformers that uses matrix operations to accelerate inference, with MaRe for token restoration enabling image synthesis applications.
Details
Motivation: Existing token compression methods for Vision Transformers use GPU-inefficient operations (sorting, scattered writes) that introduce overhead, limiting their effectiveness. There's a need for more efficient token merging that leverages GPU-friendly matrix operations.Method: MaMe is a differentiable token merging method based entirely on matrix operations, making it GPU-friendly. It works without training and can be applied to pre-trained models. MaRe is its inverse operation for token restoration, forming a MaMe+MaRe pipeline for image synthesis tasks.
Result: MaMe doubles ViT-B throughput with only 2% accuracy drop, and fine-tuning boosts accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, provides 1.3x acceleration with negligible degradation. Accelerates VideoMAE-L by 48.5% on Kinetics-400 with 0.84% accuracy loss. MaMe+MaRe pipeline reduces Stable Diffusion v2.1 generation latency by 31% while enhancing quality.
Conclusion: MaMe and MaRe provide effective GPU-friendly methods for accelerating vision models through efficient token compression and restoration, demonstrating significant speed improvements with minimal accuracy loss across various vision tasks including classification, video understanding, and image synthesis.
Abstract: Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe’s and MaRe’s effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mame}{https://github.com/cominder/mame.
[177] A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
Lemeng Wang, Qinqian Lei, Vidhi Bakshi, Daniel Yi, Yifan Liu, Jiacheng Hou, Asher Seng Hao, Zheda Mai, Wei-Lun Chao, Robby T. Tan, Bo Wang
Main category: cs.CV
TL;DR: Analysis of failure modes in two-stage human-object interaction (HOI) detection models, focusing on complex scenes with multiple people and rare interactions rather than overall benchmark performance.
Details
Motivation: Current HOI detection evaluations focus mainly on overall accuracy but provide limited insight into why models fail, especially in complex scenes with multiple people and rare interaction combinations. The paper aims to understand failure modes rather than just measure performance.Method: Decomposes HOI detection into multiple interpretable perspectives and analyzes model behavior across these dimensions. Curates a subset of images from existing HOI dataset organized by human-object-interaction configurations (multi-person interactions, object sharing) to examine different failure patterns.
Result: Analysis reveals that high overall benchmark performance doesn’t necessarily reflect robust visual reasoning about human-object relationships. Models struggle with complex scene compositions, particularly involving multiple people and rare interaction combinations.
Conclusion: The study provides insights into limitations of HOI models and offers observations for future research, emphasizing the need to move beyond overall accuracy metrics to understand model reasoning capabilities in complex visual scenes.
Abstract: Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we instead decompose HOI detection into multiple interpretable perspectives and analyze model behavior across these dimensions to study different types of failure patterns. We curate a subset of images from an existing HOI dataset organized by human-object-interaction configurations (e.g., multi-person interactions and object sharing), and analyze model behavior under these configurations to examine different failure modes. This design allows us to analyze how these HOI models behave under different scene compositions and why their predictions fail. Importantly, high overall benchmark performance does not necessarily reflect robust visual reasoning about human-object relationships. We hope that this study can provide useful insights into the limitations of HOI models and offer observations for future research in this area.
[178] Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning
Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim
Main category: cs.CV
TL;DR: FiMR is a framework that uses decomposed VQA for fine-grained self-reflection and refinement in text-to-image generation with MLLMs, improving prompt alignment through detailed attribute verification.
Details
Motivation: Current unified MLLMs have strong reasoning capabilities but their use in text-to-image generation is underexplored. Existing multimodal reasoning methods rely on holistic image-text alignment without fine-grained reflection on detailed prompt attributes, limiting precise control over generated images.Method: FiMR decomposes input prompts into minimal semantic units (entities and attributes), verifies each unit via visual question answering (VQA), generates explicit fine-grained feedback, and applies targeted localized refinements based on this feedback.
Result: Extensive experiments show FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks, demonstrating improved image-prompt alignment and generation quality.
Conclusion: FiMR enables MLLMs to achieve more precise improvements in text-to-image generation through fine-grained self-reasoning and self-refinement, addressing limitations of holistic alignment approaches and enhancing detailed attribute control.
Abstract: With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units-such as entities and attributes-and verify each unit via VQA to generate explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement enable MLLMs to achieve more precise improvements in image-prompt alignment and overall generation quality at test time. Extensive experiments demonstrate that FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.
[179] ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer’s Disease Progression
Juneyong Lee, Geonwoo Baek, Ikbeom Jang
Main category: cs.CV
TL;DR: ADP-DiT: A transformer-based diffusion model for generating longitudinal Alzheimer’s disease MRI scans with fine-grained control over follow-up time and clinical metadata using natural language prompts.
Details
Motivation: Alzheimer's disease progresses differently across individuals, creating need for subject-specific MRI synthesis to assess disease progression. Current methods lack clinically interpretable control over follow-up time and patient metadata in longitudinal AD MRI generation.Method: ADP-DiT uses interval-aware, clinically text-conditioned diffusion transformer with dual text encoders (OpenCLIP for vision-language alignment and T5 for clinical understanding). Conditions include follow-up interval, demographics, diagnosis, and neuropsychological data as natural language prompts. Uses cross-attention for fine-grained guidance and adaptive layer normalization for global modulation, with rotary positional embeddings and SDXL-VAE latent space for high-resolution reconstruction.
Result: Achieved SSIM 0.8739 and PSNR 29.32 dB on 3,321 longitudinal 3T T1-weighted scans from 712 participants, improving over DiT baseline by +0.1087 SSIM and +6.08 dB PSNR. Successfully captured progression-related changes like ventricular enlargement and hippocampal shrinkage.
Conclusion: Integrating comprehensive subject-specific clinical conditions with transformer architectures can significantly improve longitudinal AD MRI synthesis, enabling time-specific control beyond coarse diagnostic stages.
Abstract: Alzheimer’s disease (AD) progresses heterogeneously across individuals, motivating subject-specific synthesis of follow-up magnetic resonance imaging (MRI) to support progression assessment. While Diffusion Transformers (DiT), an emerging transformer-based diffusion model, offer a scalable backbone for image synthesis, longitudinal AD MRI generation with clinically interpretable control over follow-up time and participant metadata remains underexplored. We present ADP-DiT, an interval-aware, clinically text-conditioned diffusion transformer for longitudinal AD MRI synthesis. ADP-DiT encodes follow-up interval together with multi-domain demographic, diagnostic (CN/MCI/AD), and neuropsychological information as a natural-language prompt, enabling time-specific control beyond coarse diagnostic stages. To inject this conditioning effectively, we use dual text encoders-OpenCLIP for vision-language alignment and T5 for richer clinical-language understanding. Their embeddings are fused into DiT through cross-attention for fine-grained guidance and adaptive layer normalization for global modulation. We further enhance anatomical fidelity by applying rotary positional embeddings to image tokens and performing diffusion in a pre-trained SDXL-VAE latent space to enable efficient high-resolution reconstruction. On 3,321 longitudinal 3T T1-weighted scans from 712 participants (259,038 image slices), ADP-DiT achieves SSIM 0.8739 and PSNR 29.32 dB, improving over a DiT baseline by +0.1087 SSIM and +6.08 dB PSNR while capturing progression-related changes such as ventricular enlargement and shrinking hippocampus. These results suggest that integrating comprehensive, subject-specific clinical conditions with architectures can improve longitudinal AD MRI synthesis.
[180] Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, SeungHwan Kim, Honglak Lee, Bohyung Han
Main category: cs.CV
TL;DR: Cluster-aware Upcycling improves Mixture-of-Experts initialization by partitioning dense model activations into semantic clusters and initializing experts with cluster-specific subspaces, breaking expert symmetry and enabling early specialization.
Details
Motivation: Standard Sparse Upcycling initializes all MoE experts from identical pretrained dense weights, leading to expert symmetry and limited early specialization due to random router initialization. This paper aims to incorporate semantic structure into MoE initialization to address these issues.Method: 1) Partition dense model’s input activations into semantic clusters; 2) Initialize each expert using subspace representations of its corresponding cluster via truncated SVD; 3) Set router’s initial weights to cluster centroids; 4) Introduce expert-ensemble self-distillation loss for stable training using ensemble teacher guidance.
Result: Outperforms existing methods on CLIP ViT-B/32 and ViT-B/16 across zero-shot and few-shot benchmarks. Produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
Conclusion: Cluster-aware Upcycling effectively breaks expert symmetry and encourages early specialization aligned with data distribution, providing superior MoE initialization compared to standard approaches.
Abstract: Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model’s input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router’s initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.
[181] DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
Hengye Lyu, Zisu Li, Yue Hong, Yueting Weng, Jiaxin Shi, Hanwang Zhang, Chen Liang
Main category: cs.CV
TL;DR: RTR-DiT is a real-time video stylization framework using Diffusion Transformers that enables stable, consistent long video processing with support for both text-guided and reference-guided stylization through teacher fine-tuning and distillation techniques.
Details
Motivation: Existing diffusion-based video stylization methods struggle with stability and consistency in long videos, have high computational costs, and multi-step denoising makes them impractical for real-time applications.Method: Fine-tune a bidirectional teacher model on curated video stylization dataset, then distill into few-step autoregressive model using Self Forcing and Distribution Matching Distillation. Propose reference-preserving KV cache update strategy for stable long video processing and real-time style switching.
Result: Outperforms existing methods in both text-guided and reference-guided video stylization tasks in quantitative metrics and visual quality. Demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.
Conclusion: RTR-DiT provides an effective solution for real-time video stylization with stable long video processing, addressing limitations of existing diffusion-based methods while supporting flexible text and reference guidance.
Abstract: Recent advances in video generation models has significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a steaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.
[182] Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
Yibo Jiang, Tao Wu, Rui Jiang, Yehao Lu, Chaoxiang Cai, Zequn Qin, Xi Li
Main category: cs.CV
TL;DR: UniRect-CoT: A training-free framework that uses a chain-of-thought approach to align UMMs’ visual understanding with generation by treating diffusion denoising as visual reasoning and using self-supervision to rectify intermediate results.
Details
Motivation: Unified Multimodal Models (UMMs) have a significant capability mismatch where their visual understanding far outperforms their generation capabilities. The rich internal knowledge in these models remains underactivated during generation tasks, similar to how humans need to continuously reflect and activate knowledge while drawing.Method: Proposes UniRect-CoT, a training-free unified rectification chain-of-thought framework that treats the diffusion denoising process in UMMs as an intrinsic visual reasoning process. It aligns intermediate generation results with the target instruction understood by the model, using this as self-supervisory signal to rectify generation.
Result: Extensive experiments show that UniRect-CoT can be easily integrated into existing UMMs and significantly enhances generation quality across diverse complex tasks without requiring additional training.
Conclusion: The proposed framework successfully addresses the capability mismatch in UMMs by activating their internal knowledge during generation through a thinking-while-drawing inspired approach, improving generation quality across various tasks.
Abstract: Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model’s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human Thinking-While-Drawing'' paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the free lunch’’ hidden in the UMM’s powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation.We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation.Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
[183] Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation
Elton Cao, Hod Lipson
Main category: cs.CV
TL;DR: A generative approach using Latent Diffusion Model with ControlNet-style conditioning to convert 2D sketches into 3D models via conditional dense depth estimation, enabling flexible “draw in 3D” workflow.
Details
Motivation: Traditional sketch-to-3D methods rely on brittle symbolic logic or rigid parametric CAD primitives, limiting creative freedom. There's a need for more flexible approaches that bridge human creativity with digital fabrication while handling inherent ambiguities in orthographic projections.Method: Frames reconstruction as conditional dense depth estimation using Latent Diffusion Model (LDM) with ControlNet-style conditioning. Introduces graph-based BFS masking strategy for partial depth cues to support iterative “sketch-reconstruct-sketch” workflow. Trained on over 1 million image-depth pairs from ABC Dataset.
Result: Demonstrates robust performance across varying shape complexities, providing scalable pipeline for converting sparse 2D line drawings into dense 3D representations without rigid CAD constraints.
Conclusion: Proposed generative approach effectively enables users to “draw in 3D” by overcoming limitations of traditional methods through diffusion-based depth estimation and iterative workflow support.
Abstract: The conversion of 2D freehand sketches into 3D models remains a pivotal challenge in computer vision, bridging the gap between human creativity and digital fabrication. Traditional line drawing reconstruction relies on brittle symbolic logic, while modern approaches are constrained by rigid parametric modeling, limiting users to predefined CAD primitives. We propose a generative approach by framing reconstruction as a conditional dense depth estimation task. To achieve this, we implement a Latent Diffusion Model (LDM) with a ControlNet-style conditioning framework to resolve the inherent ambiguities of orthographic projections. To support an iterative “sketch-reconstruct-sketch” workflow, we introduce a graph-based BFS masking strategy to simulate partial depth cues. We train and evaluate our approach using a massive dataset of over one million image-depth pairs derived from the ABC Dataset. Our framework demonstrates robust performance across varying shape complexities, providing a scalable pipeline for converting sparse 2D line drawings into dense 3D representations, effectively allowing users to “draw in 3D” without the rigid constraints of traditional CAD.
[184] AI Powered Image Analysis for Phishing Detection
K. Acharya, S. Ale, R. Kadel
Main category: cs.CV
TL;DR: Deep learning approach using webpage screenshots for visual phishing detection, comparing ConvNeXt-Tiny and Vision Transformer models with threshold-aware evaluation.
Details
Motivation: Phishing websites increasingly use visual imitation (logos, layouts, colors) to evade text- and URL-based detection systems, requiring image-based approaches.Method: Used webpage screenshots for image-based detection, tested two vision models (ConvNeXt-Tiny and ViT-Base) with transfer learning from ImageNet weights, evaluated with threshold-aware metrics across different decision thresholds.
Result: ConvNeXt-Tiny performed best overall with highest F1-score at optimized threshold and better computational efficiency than ViT-Base, demonstrating convolutional models’ strength for visual phishing detection.
Conclusion: Convolutional models are effective for visual phishing detection, threshold tuning is crucial for real-world deployment, and the curated dataset will be released for reproducibility.
Abstract: Phishing websites now rely heavily on visual imitation-copied logos, similar layouts, and matching colours-to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.
[185] CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
Shivika, Kartik Bose, Pankaj Gupta
Main category: cs.CV
TL;DR: Investigating training batch composition effects on 3D medical vision-language models shows that explicit class balancing hurts performance compared to random sampling with anatomical subsection diversity.
Details
Motivation: While vision-language models show strong zero-shot diagnostic capabilities in medical imaging, the effect of training batch composition on learned representations remains unexplored for 3D medical imaging, particularly how normal-to-abnormal ratios and data scaling affect performance.Method: Reproduced Merlin dual-encoder model aligning 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss. Investigated two axes: (1) controlling normal-to-abnormal ratio in training batches (25:75, 50:50, 75:25) using section-level balanced sampling, and (2) data scaling ablations on subset (20%, 40%, 100% of data). Compared balanced vs. unbalanced sampling strategies.
Result: All balanced configurations underperformed unbalanced baseline by 2.4-2.8 points (best balanced: 72.02% vs baseline: 74.45%). Performance scaled sub-linearly with data (65.26% to 71.88%). Enforcing 50:50 balanced sampling on subset further degraded performance to 68.01%. Stochastic diversity of random sampling with anatomical subsection batching provides better regularization than engineered class ratios.
Conclusion: Explicit class balancing hurts performance regardless of dataset or balancing granularity. The stochastic diversity of random sampling combined with anatomical subsection batching provides more effective regularization than engineered class ratios at small batch sizes required by 3D medical volumes.
Abstract: Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin’s alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.
[186] UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
Yunkai Dang, Minxin Dai, Yuekun Yang, Zhangnan Li, Wenbin Li, Feng Miao, Yang Gao
Main category: cs.CV
TL;DR: UHR-BAT is a query-guided token compression framework for ultra-high-resolution remote sensing imagery that efficiently selects visual tokens while preserving query-critical details.
Details
Motivation: Ultra-high-resolution remote sensing imagery has vast spatial scale causing quadratic explosion of visual tokens, making it difficult to extract information from small objects. Existing methods either lose critical details or have unpredictable computational costs.Method: Proposes UHR-BAT with text-guided, multi-scale importance estimation for visual tokens and region-wise preserve/merge strategies to reduce token redundancy under strict context budget.
Result: Achieves state-of-the-art performance across various benchmarks for ultra-high-resolution remote sensing tasks.
Conclusion: UHR-BAT provides an effective solution for efficient token compression in ultra-high-resolution imagery while maintaining query-critical information.
Abstract: Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.
[187] ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing
Zhentao Yang, Yixiang Luomei, Zhuoyang Liu, Zhenyu Liu, Feng Xu
Main category: cs.CV
TL;DR: ZoomSpec: A physics-guided coarse-to-fine framework for wideband spectrum sensing that integrates signal processing priors with deep learning to overcome domain mismatch in existing approaches.
Details
Motivation: Existing data-driven approaches for wideband spectrum sensing treat spectrograms as natural images, suffering from domain mismatch by neglecting time-frequency resolution constraints and spectral leakage, leading to poor narrowband visibility.Method: Proposes ZoomSpec with three key components: 1) Log-Space STFT (LS-STFT) to overcome geometric bottleneck of linear spectrograms, 2) Coarse Proposal Net (CPN) for rapid full-band screening, 3) Adaptive Heterodyne Low-Pass (AHLP) module for center-frequency aligning and bandwidth-matched filtering, and 4) Fine Recognition Net (FRN) that fuses purified time-domain I/Q with spectral magnitude via dual-domain attention.
Result: Achieves state-of-the-art 78.1 mAP@0.5:0.95 on the SpaceNet real-world dataset, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.
Conclusion: ZoomSpec effectively integrates signal processing priors with deep learning to address domain mismatch in wideband spectrum sensing, demonstrating superior performance and stability for low-altitude monitoring applications.
Abstract: Wideband spectrum sensing for low-altitude monitoring is critical yet challenging due to heterogeneous protocols,large bandwidths, and non-stationary SNR. Existing data-driven approaches treat spectrograms as natural images,suffering from domain mismatch: they neglect time-frequency resolution constraints and spectral leakage, leading topoor narrowband visibility. This paper proposes ZoomSpec, a physics-guided coarse-to-fine framework integrating signal processing priors with deep learning. We introduce a Log-Space STFT (LS-STFT) to overcome the geometric bottleneck of linear spectrograms, sharpening narrowband structures while maintaining constant relative resolution. A lightweight Coarse Proposal Net (CPN) rapidly screens the full band. To bridge coarse detection and fine recognition, we design an Adaptive Heterodyne Low-Pass (AHLP) module that executes center-frequency aligning, bandwidth-matched filtering, and safe decimation, purifying signals of out-of-band interference. A Fine Recognition Net (FRN) fuses purified time-domain I/Q with spectral magnitude via dual-domain attention to jointly refine temporal boundaries and modulation classification. Evaluations on the SpaceNet real-world dataset demonstrate state-of-the-art 78.1 mAP@0.5:0.95, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.
[188] Radar-Informed 3D Multi-Object Tracking under Adverse Conditions
Bingxue Xu, Emil Hedemalm, Ajinkya Khoche, Patric Jensfelt
Main category: cs.CV
TL;DR: RadarMOT: A radar-informed 3D multi-object tracking framework that uses radar point clouds to refine state estimation and recover detector misses, improving tracking accuracy at long ranges and in adverse weather conditions.
Details
Motivation: Existing multi-modal fusion methods treat radar as just another learned feature, which reduces radar's robustness advantages when overall models degrade in difficult conditions like adverse weather or long ranges. The paper aims to explicitly leverage radar data to improve 3D MOT robustness.Method: Proposes RadarMOT framework that uses radar point cloud data as additional observation to refine state estimation and recover detector misses at long ranges. Unlike existing methods that treat radar as learned features, this approach explicitly incorporates radar observations.
Result: Evaluations on MAN-TruckScenes dataset show RadarMOT consistently improves Average Multi-Object Tracking Accuracy (AMOTA) with absolute 12.7% improvement at long range and 10.3% improvement in adverse weather conditions.
Conclusion: RadarMOT effectively leverages radar data to enhance 3D MOT robustness, particularly for long-range tracking and adverse weather scenarios, demonstrating the value of explicit radar-informed approaches over learned feature fusion methods.
Abstract: The challenge of 3D multi-object tracking (3D MOT) is achieving robustness in real-world applications, for example under adverse conditions and maintaining consistency as distance increases. To overcome these challenges, sensor fusion approaches that combine LiDAR, cameras, and radar have emerged. However, existing multi-modal fusion methods usually treat radar as another learned feature inside the network. When the overall model degrades in difficult environmental conditions, the robustness advantages that radar could provide are also reduced. We propose RadarMOT, a radar-informed 3D MOT framework that explicitly uses radar point cloud data as additional observation to refine state estimation and recover detector misses at long ranges. Evaluations on the MAN-TruckScenes dataset show that RadarMOT consistently improves the Average Multi-Object Tracking Accuracy (AMOTA) with absolute 12.7% at long range and 10.3% in adverse weather. The code will be available at https://github.com/bingxue-xu/radarmot
[189] SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance
Qi Xia, Peishan Cong, Ziyi Wang, Yujing Sun, Qin Sun, Xinge Zhu, Mao Ye, Ruigang Yang, Yuexin Ma
Main category: cs.CV
TL;DR: SocialMirror is a diffusion-based framework for reconstructing human behavior in close-interaction scenarios from monocular videos, addressing challenges like mutual occlusions and motion ambiguity through semantic guidance and geometric constraints.
Details
Motivation: Accurate human behavior reconstruction in close-interaction scenarios is crucial for AR, sports analysis, and human-robot collaboration, but current methods struggle with severe mutual occlusions, motion ambiguity, and spatial relationship errors in monocular video reconstruction.Method: A diffusion-based framework that: 1) uses vision-language models to generate high-level interaction descriptions guiding a semantic-guided motion infiller for hallucinating occluded bodies, and 2) employs a sequence-level temporal refiner with geometric constraints to ensure smooth motions and plausible contact relationships.
Result: State-of-the-art performance on multiple interaction benchmarks, demonstrating strong generalization across unseen datasets and in-the-wild scenarios for reconstructing interactive human meshes.
Conclusion: SocialMirror effectively addresses challenges in close-interaction human reconstruction by integrating semantic and geometric cues through a diffusion-based approach, enabling more realistic virtual interactions and motion analysis.
Abstract: Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, leading local motion ambiguity, disrupted temporal continuity and spatial relationship error. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.
[190] Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
Danish Nazir, Antoine Hanna-Asaad, Lucas Görnhardt, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt
Main category: cs.CV
TL;DR: Efficient multi-view 3D object detection using dynamic token selection and parameter-efficient fine-tuning for ViT backbones
Details
Motivation: Existing multi-view 3D object detection methods use computationally expensive ViT backbones. Current SOTA ToC3D has limitations: fixed token selection ratios and requires full end-to-end retraining of ViT backbones.Method: Proposes image token compensator with token selection for ViT backbones, enabling dynamic layer-wise token selection. Introduces parameter-efficient fine-tuning strategy that trains only proposed modules (1.6M parameters vs 300M+).
Result: Reduces computational complexity by 48-55%, inference latency by 9-25%, while improving mean average precision by 1.0-2.8% and NuScenes detection score by 0.4-1.2% compared to SOTA ToC3D.
Conclusion: Proposed method achieves significant efficiency gains while maintaining or improving detection performance for multi-view 3D object detection.
Abstract: Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, being computationally complex. To address this problem, current state-of-the-art (SOTA) \texttt{ToC3D} for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, there are two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike \texttt{ToC3D}, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than $300$ million (M) to only $1.6$ M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by $48%$ … $55%$, inference latency (on an \texttt{NVIDIA-GV100} GPU) by $9%$ … $25%$, while still improving mean average precision by $1.0%$ … $2.8%$ absolute and NuScenes detection score by $0.4%$ … $1.2%$ absolute compared to so-far SOTA \texttt{ToC3D}.
[191] Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis
Yuchao Chen, Hanqing Wang
Main category: cs.CV
TL;DR: Two-stage pipeline for multi-view smoke removal and novel view synthesis using generative dehazing followed by 3D Gaussian Splatting with physics-informed regularization.
Details
Motivation: Address the fundamental tension in dehaze-then-reconstruct pipelines where per-image restoration quality doesn't guarantee multi-view consistency, leading to blurred renders and structural instability in 3D reconstruction.Method: Stage 1: Generate pseudo-clean training images via per-frame generative dehazing (Nano Banana Pro) with brightness normalization. Stage 2: Train 3D Gaussian Splatting with physics-informed auxiliary losses including depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching.
Result: Achieves 20.98 dB PSNR and 0.683 SSIM for novel view synthesis on Akikaze validation scene, representing a +1.50 dB improvement over unregularized baseline.
Conclusion: MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates artifacts from cross-view inconsistencies in frame-wise generative processing for 3D reconstruction.
Abstract: We present Dehaze-then-Splat, a two-stage pipeline for multi-view smoke removal and novel view synthesis developed for Track~2 of the NTIRE 2026 3D Restoration and Reconstruction Challenge. In the first stage, we produce pseudo-clean training images via per-frame generative dehazing using Nano Banana Pro, followed by brightness normalization. In the second stage, we train 3D Gaussian Splatting (3DGS) with physics-informed auxiliary losses – depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching – that compensate for cross-view inconsistencies inherent in frame-wise generative processing. We identify a fundamental tension in dehaze-then-reconstruct pipelines: per-image restoration quality does not guarantee multi-view consistency, and such inconsistency manifests as blurred renders and structural instability in downstream 3D reconstruction.Our analysis shows that MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates these artifacts. On the Akikaze validation scene, our pipeline achieves 20.98,dB PSNR and 0.683 SSIM for novel view synthesis, a +1.50,dB improvement over the unregularized baseline.
[192] VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu
Main category: cs.CV
TL;DR: VGGT-Segmentor (VGGT-S) is a framework that combines geometric modeling with semantic segmentation for instance-level object segmentation across egocentric and exocentric views, achieving state-of-the-art results on Ego-Exo4D benchmark.
Details
Motivation: Instance-level object segmentation across disparate egocentric and exocentric views is challenging due to severe scale, perspective, and occlusion changes. While geometry-aware models like VGGT provide feature alignment, they fail at dense prediction tasks due to pixel-level projection drift.Method: VGGT-S leverages VGGT’s cross-view feature representation and introduces a novel Union Segmentation Head with three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement. Also proposes single-image self-supervised training strategy that eliminates need for paired annotations.
Result: On Ego-Exo4D benchmark, VGGT-S achieves 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks respectively, setting new state-of-the-art. Correspondence-free pretrained model surpasses most fully-supervised baselines.
Conclusion: VGGT-S effectively bridges the gap between geometric modeling and pixel-accurate segmentation, demonstrating strong generalization and scalability through self-supervised training.
Abstract: Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level textntion remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT’s powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.
[193] What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering
Amir Hossein Saleknia, Mohammad Sabokrou
Main category: cs.CV
TL;DR: Supervised dataset bias measurement using classification is flawed due to resolution artifacts; unsupervised clustering of semantic features shows web-scale datasets have minimal true semantic bias.
Details
Motivation: Current methods for measuring dataset bias rely on training classifiers to distinguish between datasets, assuming high accuracy indicates meaningful semantic differences. However, this approach may be confounded by non-semantic artifacts like resolution-based fingerprints rather than true semantic divergence.Method: The authors demonstrate flaws in supervised classification by showing models can achieve high accuracy on non-semantic, procedurally generated images. They propose an unsupervised alternative: clustering semantically-rich features from foundational vision models to directly assess semantic similarity, deliberately avoiding supervised classification on dataset labels.
Result: When applied to major web-scale datasets, the high separability reported by supervised methods largely disappears, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by a large margin.
Conclusion: The fundamental assumption behind supervised dataset bias measurement is flawed due to resolution artifacts. Unsupervised semantic feature clustering provides a more accurate assessment, showing web-scale datasets have minimal true semantic bias despite what supervised methods suggest.
Abstract: In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.
[194] ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
Jingjing Qian, Zeyuan He, Chen Shi, Lei Xiao, Li Jiang
Main category: cs.CV
TL;DR: ESCAPE is an embodied AI system for long-horizon indoor tasks that combines episodic spatial memory with adaptive policy execution to coordinate navigation and manipulation robustly.
Details
Motivation: Existing embodied AI methods struggle with catastrophic forgetting, spatial inconsistency, and rigid execution in long-horizon indoor tasks, requiring a more robust approach to coordinate navigation and manipulation.Method: ESCAPE uses a perception-grounding-execution workflow with: 1) Spatio-Temporal Fusion Mapping for depth-free 3D spatial memory, 2) Memory-Driven Target Grounding for interaction masks, and 3) Adaptive Execution Policy for proactive navigation and reactive manipulation.
Result: Achieves state-of-the-art on ALFRED benchmark with 65.09%/60.79% success rates in test seen/unseen environments, reduces redundant exploration, and maintains robust performance (61.24%/56.04%) without detailed guidance for long-horizon tasks.
Conclusion: ESCAPE demonstrates effective coordination of navigation and manipulation through episodic spatial memory and adaptive policy execution, enabling robust performance in complex indoor environments over long horizons.
Abstract: Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robust performance (61.24% / 56.04%) even without detailed guidance for long-horizon tasks.
[195] VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection
Hui Han, Shunli Wang, Yandan Zhao, Taiping Yao, Shouhong Ding
Main category: cs.CV
TL;DR: VRAG-DFD: A framework combining Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL) to enhance MLLMs for Deepfake Detection with dynamic forgery knowledge retrieval and critical reasoning capabilities.
Details
Motivation: Existing MLLM-based deepfake detection methods lack professional forgery knowledge and critical reasoning abilities, limiting their performance. The paper aims to address two key issues: providing high-quality forgery knowledge to MLLMs and enabling critical reasoning with noisy reference information.Method: Proposes VRAG-DFD framework using RAG and RL. Constructs two datasets: Forensic Knowledge Database (FKD) for DFD knowledge annotation and Forensic Chain-of-Thought Dataset (F-CoT) for critical reasoning. Uses three-stage training: Alignment → Supervised Fine-Tuning → Group Relative Policy Optimization (GRPO) to cultivate MLLM’s critical reasoning ability.
Result: VRAG-DFD achieved state-of-the-art and competitive performance on deepfake detection generalization testing.
Conclusion: The combination of RAG and RL effectively enhances MLLMs for deepfake detection by providing dynamic forgery knowledge retrieval and developing critical reasoning capabilities, leading to improved generalization performance.
Abstract: In Deepfake Detection (DFD) tasks, researchers proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge injection.The lack of professional forgery knowledge hinders the performance of these DFD-MLLMs.To solve this, we deeply considered two insightful issues: How to provide high-quality associated forgery knowledge for MLLMs? AND How to endow MLLMs with critical reasoning abilities given noisy reference information? Notably, we attempted to address above two questions with preliminary answers by leveraging the combination of Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL).Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning capabilities.Specifically, in terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT), for critical CoT construction.In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the MLLM.In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.
[196] From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage
Cihan Ruan, Lebin Zhou, Bingqing Zhao, Rongduo Han, Qiming Yuan, Chenchen Zhu, Linyi Han, Liang Yang, Wei Wang, Wei Jiang, Nam Ling
Main category: cs.CV
TL;DR: HELIX is the first neural network that jointly optimizes video compression and DNA encoding, using token-based representations that naturally align with DNA’s quaternary alphabet, achieving 1.91 bits per nucleotide.
Details
Motivation: Video storage in DNA remains an open challenge requiring co-design of compression and molecular encoding, but current approaches treat these stages independently, leaving biochemical constraints and compression objectives misaligned.Method: HELIX introduces TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding) which uses token-based representations that map to DNA’s ATCG bases, with Kronecker-structured mixing to break spatial correlations and FSM-based mapping to guarantee biochemical constraints.
Result: Achieves 1.91 bits per nucleotide through joint optimization of token distributions for visual quality, prediction under masking, and DNA synthesis efficiency.
Conclusion: Demonstrates that learned compression and molecular storage converge naturally at token representations, suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.
Abstract: DNA-based storage has emerged as a promising approach to the global data crisis, offering molecular-scale density and millennial-scale stability at low maintenance cost. Over the past decade, substantial progress has been made in storing text, images, and files in DNA – yet video remains an open challenge. The difficulty is not merely technical: effective video DNA storage requires co-designing compression and molecular encoding from the ground up, a challenge that sits at the intersection of two fields that have largely evolved independently. In this work, we present HELIX, the first end-to-end neural network jointly optimizing video compression and DNA encoding – prior approaches treat the two stages independently, leaving biochemical constraints and compression objectives fundamentally misaligned. Our key insight: token-based representations naturally align with DNA’s quaternary alphabet – discrete semantic units map directly to ATCG bases. We introduce TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding), which achieves 1.91 bits per nucleotide through Kronecker-structured mixing that breaks spatial correlations and FSM-based mapping that guarantees biochemical constraints. Unlike two-stage approaches, HELIX learns token distributions simultaneously optimized for visual quality, prediction under masking, and DNA synthesis efficiency. This work demonstrates for the first time that learned compression and molecular storage converge naturally at token representations – suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.
[197] Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data
Yizhao Xu, Hongyuan Zhu, Caiyun Liu, Tianfu Wang, Keyu Chen, Sicheng Xu, Jiaolong Yang, Nicholas Jing Yuan, Qi Zhang
Main category: cs.CV
TL;DR: BVE is a 3D editing framework that uses a self-constructed large-scale dataset and lightweight modules to enable text-guided 3D editing while preserving local invariance in unchanged regions.
Details
Motivation: Existing 3D editing methods have limitations: multi-view editing suffers from projection losses, voxel-based editing has constraints on modifiable regions and scale, and there's a lack of large editing datasets for training and evaluation.Method: Proposes BVE framework with self-constructed large-scale dataset, enhances foundational image-to-3D generative architecture with lightweight trainable modules for efficient semantic injection, and introduces annotation-free 3D masking strategy to preserve local invariance.
Result: Extensive experiments show BVE achieves superior performance in generating high-quality, text-aligned 3D assets while faithfully retaining visual characteristics of original input.
Conclusion: BVE addresses key challenges in 3D editing through dataset construction, efficient architecture modifications, and invariance preservation, enabling effective text-guided 3D editing.
Abstract: 3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.
[198] Med-CAM: Minimal Evidence for Explaining Medical Decision Making
Pirzada Suhail, Aditya Anand, Amit Sethi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to model’s decision for any seen or unseen image. This ensures that the explanation is both faithful to the network’s behaviour and interpretable to clinicians. Experiments show, unlike prior spatial explanation methods, such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM with its superior spatial awareness to shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model’s prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostic alignment, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.
[199] SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
Haoran Lou, Ziyan Liu, Chunxiao Fan, Yuexin Wu, Yue Ming
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model’s native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.
[200] Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests
Pankaj Deoli, Atef Tej, Anmol Ashri, Anandatirtha JS, Karsten Berns
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We address the challenge of synthetic-to-real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine-grained trunk/crown annotations. We introduce MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity-aware distillation, which transfers structural priors from fine-grained synthetic teachers to a coarse-label student via logit-space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim-Real transfer under label granularity constraints.
[201] ReConText3D: Replay-based Continual Text-to-3D Generation
Muhammad Ahmed Ullah Khan, Muhammad Haris Bin Amir, Didier Stricker, Muhammad Zeshan Afzal
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Continual learning enables models to acquire new knowledge over time while retaining previously learned capabilities. However, its application to text-to-3D generation remains unexplored. We present ReConText3D, the first framework for continual text-to-3D generation. We first demonstrate that existing text-to-3D models suffer from catastrophic forgetting under incremental training. ReConText3D enables generative models to incrementally learn new 3D categories from textual descriptions while preserving the ability to synthesize previously seen assets. Our method constructs a compact and diverse replay memory through text-embedding k-Center selection, allowing representative rehearsal of prior knowledge without modifying the underlying architecture. To systematically evaluate continual text-to-3D learning, we introduce Toys4K-CL, a benchmark derived from the Toys4K dataset that provides balanced and semantically diverse class-incremental splits. Extensive experiments on the Toys4K-CL benchmark show that ReConText3D consistently outperforms all baselines across different generative backbones, maintaining high-quality generation for both old and new classes. To the best of our knowledge, this work establishes the first continual learning framework and benchmark for text-to-3D generation, opening a new direction for incremental 3D generative modeling. Project page is available at: https://mauk95.github.io/ReConText3D/.
[202] ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
Jie Liang, Jiahao Wu, Chao Wang, Jiayu Yang, Xiaoyun Zheng, Kaiqiang Xiong, Zhanke Wang, Jinbo Yan, Feng Gao, Ronggang Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency. The project page is available at: https://liangjie1999.github.io/ClipGStreamWeb/
[203] Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation
Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.
[204] Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Arya Shah, Vaibhav Tripathi, Mayank Singh, Chaklam Silpasuwanchai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40$\times$ parameter range (256M–10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1–V3) is a reliable negative predictor of sycophancy ($r = -0.441$, BCa 95% CI $[-0.740, -0.031]$), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks ($r = -0.597$, $p = 0.040$). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on \href{https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation}{GitHub} and dataset on \href{https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3}{Hugging Face}
[205] Temporally Consistent Long-Term Memory for 3D Single Object Tracking
Jaejoon Yoo, SuBeen Lee, Yerim Jeon, Miso Lee, Jae-Pil Heo
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: 3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves the temporal feature consistency while efficiently aggregating the diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at https://github.com/ujaejoon/ChronoTrack
[206] PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation
Chen Wang, Yixin Zhu, Yongbin Zhu, Fengyuan Shi, Qi Li, Jun Wang, Zuozhu Liu, Keli Hu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and significant scale variations. Although existing deep learning-based methods have achieved remarkable performance, these methods still struggle with scale variations and indistinct tumor boundaries. To address these challenges, we propose a progressive boundary enhanced U-Net (PBE-UNet). Specially, we first introduce a scale-aware aggregation module (SAAM) that dynamically adjusts its receptive field to capture robust multi-scale contextual information. Then, we propose a boundary-guided feature enhancement (BGFE) module to enhance the feature representations. We find that there are large gaps between the narrow boundary and the wide segmentation error areas. Unlike existing methods that treat boundaries as static masks, the BGFE module progressively expands the narrow boundary prediction into broader spatial attention maps. Thus, broader spatial attention maps could effectively cover the wider segmentation error regions and enhance the model’s focus on these challenging areas. We conduct expensive experiments on four benchmark ultrasound datasets, BUSI, Dataset B, TN3K, and BP. The experimental results how that our proposed PBE-UNet outperforms state-of-the-art ultrasound image segmentation methods. The code is at https://github.com/cruelMouth/PBE-UNet.
[207] From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Mohammad Mahdi, Nedko Savov, Danda Pani Paudel, Luc Van Gool
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g. Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.
[208] Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training
Nghia, Nguyen, Amer Wahed, Andy Quesada, Yasir Ali, Hanadi El Achi, Y. Helen Zhang, Jocelyn Ursua, Alex Banerjee, Sahib Kalra, L. Jeffrey Medeiros, Jie Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision transformers (ViT) have been shown to allow for more flexible feature detection and can outperform convolutional neural network (CNN) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model which was trained on a small dataset of 1,200 image patches in fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not a practical method due to lack of expertise resources in both the training and testing phases, we conducted a recent study on a modified approach to training data (weakly supervised training) and show that labeling training image patch automatically at the slide level of each whole-slide-image is a more practical solution for clinical use of Vision Transformer. Our ViT model, trained on a larger dataset of 100,000 image patches, yields evaluation metrics with significant accuracy, F1 score, and area under the curve (AUC) at 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.
[209] DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement
Rejoy Chakraborty, Prasun Roy, Saumik Bhattacharya, Umapada Pal
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Few-shot Font Generation aims to generate stylistically consistent glyphs from a few reference glyphs. However, capturing complex font styles from a few exemplars remains challenging, and the existing methods often struggle to retain discernible local characteristics in generated samples. This paper introduces DRG-Font, a contrastive font generation strategy that learns complex glyph attributes by decomposing style and content embedding spaces. For optimal style supervision, the proposed architecture incorporates a Reference Selection (RS) Module to dynamically select the best style reference from an available pool of candidates. The network learns to decompose glyph attributes into style and shape priors through a Multi-scale Style Head Block (MSHB) and a Multi-scale Content Head Block (MCHB). For style adaptation, a Multi-Fusion Upsampling Block (MFUB) produces the target glyph by combining the reference style prior and target content prior. The proposed method demonstrates significant improvements over state-of-the-art approaches across multiple visual and analytical benchmarks.
[210] A Resource-Efficient Hybrid CNN-LSTM network for image-based bean leaf disease classification
Hye Jin Rhee, Joseph Damilola Akinyemi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate and resource-efficient automated diagnosis is a cornerstone of modern agricultural expert systems. While Convolutional Neural Networks (CNNs) have established benchmarks in plant pathology, their ability to capture long-range spatial dependencies is often limited by standard pooling layers, and their high memory footprint hinders deployment on portable devices. This paper proposes a lightweight hybrid CNN-LSTM system for bean leaf disease classification. By integrating an LSTM layer to model the spatial-sequential relationships within feature maps, our hybrid architecture achieves a 94.38% accuracy while maintaining an exceptionally small footprint of 1.86 MB; a 70% reduction in size compared to traditional CNN-based systems. Furthermore, we provide a systematic evaluation of image augmentation strategies, demonstrating that tailored transformations are superior to generic combinations for maintaining the integrity of diagnostic patterns. Results on the $\textit{ibean}$ dataset confirm that the proposed system achieves state-of-the-art F1 scores of 99.22% with EfficientNet-B7+LSTM, providing a robust and scalable framework for real-time agricultural decision support in resource-constrained environments. The code and augmented datasets used in this study are publicly available on this $\href{https://github.com/HJin-R/bean_disease}{Github}$ repo.
[211] DiffMagicFace: Identity Consistent Facial Editing of Real Videos
Huanghao Yin, Shenkun Xu, Kanle Shi, Junhai Yong, Bin Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The creation of a data set is achieved through rendering techniques and the subsequent application of optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The excellent effect holds even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework exhibit parity with videos that are made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.
[212] Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
Yujie Gao, Yao Xiao, Xiangnan Zhu, Ya Li, Yiyi Zhang, Liqing Zhang, Jianfu Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reconstructing a complete 3D head from a single portrait remains challenging because existing methods still face a sharp quality-speed trade-off: high-fidelity pipelines often rely on multi-stage processing and per-subject optimization, while fast feed-forward models struggle with complete geometry and fine appearance details. To bridge this gap, we propose Any3DAvatar, a fast and high-quality method for single-image 3D Gaussian head avatar generation, whose fastest setting reconstructs a full head in under one second while preserving high-fidelity geometry and texture. First, we build AnyHead, a unified data suite that combines identity diversity, dense multi-view supervision, and realistic accessories, filling the main gaps of existing head data in coverage, full-head geometry, and complex appearance. Second, rather than sampling unstructured noise, we initialize from a Plücker-aware structured 3D Gaussian scaffold and perform one-step conditional denoising, formulating full-head reconstruction into a single forward pass while retaining high fidelity. Third, we introduce auxiliary view-conditioned appearance supervision on the same latent tokens alongside 3D Gaussian reconstruction, improving novel-view texture details at zero extra inference cost. Experiments show that Any3DAvatar outperforms prior single-image full-head reconstruction methods in rendering fidelity while remaining substantially faster.
[213] PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
Zebei Tong, Hongchang Chen, Yujie Lei, Gang Chen, Yushi Liu, Zhi Zheng, Hao Chen, Jieming Zhang, Ying Li, Dongpu Cao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.
[214] Context Sensitivity Improves Human-Machine Visual Alignment
Frieda Born, Tom Neuhäuser, Lukas Muttenthaler, Brett D. Roads, Bernhard Spitzer, Andrew K. Lampinen, Matt Jones, Klaus-Robert Müller, Michael C. Mozer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has been proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans are constantly adapting to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for context-sensitive similarity computation from neural network embeddings, applied to modeling a triplet odd-one-out task with an anchor image serving as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and “human-aligned” vision foundation models.
[215] Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias
Zhiyuan Xu, Jiuming Liu, Yuxin Chen, Masayoshi Tomizuka, Chenfeng Xu, Chensheng Peng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.
[216] Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
Weijie Wang, Qihang Cao, Sensen Gao, Donny Y. Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, Jianfei Cai, Jia-Wang Bian, Bohan Zhuang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. Hence, generalizable feed-forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real-world applications based on feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.
[217] Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model
Shuyun Wang, Hu Zhang, Xin Shen, Dadong Wang, Xin Yu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corrupted regions and recovering content from extensive and irregular degradations. We propose a Metadata-Guided Diffusion Model (M-GDM) to tackle these challenges. Specifically, intrinsic video metadata are leveraged as corruption indicators through a dual-stream metadata encoder that separately embeds motion vectors and frame types before fusing them into a unified representation. This representation interacts with corrupted latent features via cross-attention at each diffusion step. To preserve intact regions, we design a prior-driven mask predictor that generates pseudo masks using both metadata and diffusion priors, enabling the separation and recombination of intact and recovered regions through hard masking. To mitigate boundary artifacts caused by imperfect masks, a post-refinement module enhances consistency between intact and recovered regions. Extensive experiments demonstrate the effectiveness of our method and its superiority in blind video recovery. Code is available at: https://github.com/Shuyun-Wang/M-GDM.
[218] PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction
Xianggang Yu, Lingteng Qiu, Xiaohang Ren, Guanying Chen, Shuguang Cui, Xiaoguang Han, Baoyuan Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present PartNerFace, a part-based neural radiance fields approach, for reconstructing animatable facial avatar from monocular RGB videos. Existing solutions either simply condition the implicit network with the morphable model parameters or learn an imaginary canonical radiance field, making them fail to generalize to unseen facial expressions and capture fine-scale motion details. To address these challenges, we first apply inverse skinning based on a parametric head model to map an observed point to the canonical space, and then model fine-scale motions with a part-based deformation field. Our key insight is that the deformation of different facial parts should be modeled differently. Specifically, our part-based deformation field consists of multiple local MLPs to adaptively partition the canonical space into different parts, where the deformation of a 3D point is computed by aggregating the prediction of all local MLPs by a soft-weighting mechanism. Extensive experiments demonstrate that our method generalizes well to unseen expressions and is capable of modeling fine-scale facial motions, outperforming state-of-the-art methods both quantitatively and qualitatively.
[219] ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
Tianze Xia, Zijian Ning, Zonglin Zhao, Mingjia Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model’s architecture. To resolve this conflict, we introduce ASTRA(Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state-of-the-art in pose adherence, while maintaining high identity fidelity and text alignment in DreamBench.
[220] A Multi-Stage Optimization Pipeline for Bethesda Cell Detection in Pap Smear Cytology
Martin Amster, Camila María Polotto
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Computer vision techniques have advanced significantly in recent years, finding diverse and impactful applications within the medical field. In this paper, we introduce a new framework for the detection of Bethesda cells in Pap smear images, developed for Track B of the Riva Cytology Challenge held in association with the International Symposium on Biomedical Imaging (ISBI). This work focuses on enhancing computer vision models for cell detection, with performance evaluated using the mAP50-95 metric. We propose a solution based on an ensemble of YOLO and U-Net architectures, followed by a refinement stage utilizing overlap removal techniques and a binary classifier. Our framework achieved second place with a mAP50-95 score of 0.5909 in the competition. The implementation and source code are available at the following repository: github.com/martinamster/riva-trackb
[221] SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation
Songlin Du, Xiaoyong Lu, Yaping Yan, Guobao Xiao, Xiaobo Lu, Takeshi Ikenaga
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene’s global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue’s superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.
[222] UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose \textbf{UI-Zoomer}, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
[223] Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection
Hamed Ouattara, Pierre Duthon, Pascal Houssam Salmane, Frédéric Bernardin, Omar Ait Aider
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present lightweight and efficient architectures to detect weather conditions from RGB images, predicting the weather type (sunny, rain, snow, fog) and 11 complementary attributes such as intensity, visibility, and ground condition, for a total of 53 classes across the tasks. This work examines to what extent weather conditions manifest as variations in visual style. We investigate style-inspired techniques, including Gram matrices, a truncated ResNet-50 targeting lower and intermediate layers, and PatchGAN-style architectures, within a multi-task framework with attention mechanisms. Two families are introduced: RTM (ResNet50-Truncated-MultiTasks) and PMG (PatchGAN-MultiTasks-Gram), together with their variants. Our contributions include automation of Gram-matrix computation, integration of PatchGAN into supervised multi-task learning, and local style capture through local Gram for improved spatial coherence. We also release a dataset of 503,875 images annotated with 12 weather attributes under a Creative Commons Attribution (CC-BY) license. The models achieve F1 scores above 96 percent on our internal test set and above 78 percent in zero-shot evaluation on several external datasets, confirming their generalization ability. The PMG architecture, with fewer than 5 million parameters, runs in real time with a small memory footprint, making it suitable for embedded systems. The modular design of the models also allows style-related or weather-related tasks to be added or removed as needed.
[224] HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM’s zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
[225] SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model’s own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model’s weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.
[226] MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images
Felicia Bader, Philipp Seeböck, Anastasia Bartashova, Ulrike Attenberger, Georg Langs
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose “MApLe”, a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at https://github.com/cirmuw/MApLe.
[227] HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions
Jianlin Xiang, Linhui Dai, Xue Yang, Chaolei Yang, Yanshan Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Interpretability is essential for deploying object detection systems in critical applications, especially under low-quality imaging conditions that degrade visual information and increase prediction uncertainty. Existing methods either enhance image quality or design complex architectures, but often lack interpretability and fail to improve semantic discrimination. In contrast, prototype learning enables interpretable modeling by associating features with class-centered semantics, which can provide more stable and interpretable representations under degradation. Motivated by this, we propose HiProto, a new paradigm for interpretable object detection based on hierarchical prototype learning. By constructing structured prototype representations across multiple feature levels, HiProto effectively models class-specific semantics, thereby enhancing both semantic discrimination and interpretability. Building upon prototype modeling, we first propose a Region-to-Prototype Contrastive Loss (RPC-Loss) to enhance the semantic focus of prototypes on target regions. Then, we propose a Prototype Regularization Loss (PR-Loss) to improve the distinctiveness among class prototypes. Finally, we propose a Scale-aware Pseudo Label Generation Strategy (SPLGS) to suppress mismatched supervision for RPC-Loss, thereby preserving the robustness of low-level prototype representations. Experiments on ExDark, RTTS, and VOC2012-FOG demonstrate that HiProto achieves competitive results while offering clear interpretability through prototype responses, without relying on image enhancement or complex architectures. Our code will be available at https://github.com/xjlDestiny/HiProto.git.
[228] Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework
Enzhuo Zhang, Sijie Zhao, Dilxat Muhtar, Zhenshi Li, Xueliang Zhang, Pengfeng Xiao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model’s spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at https://github.com/ZezFuture/TexAdiff.
[229] Depth-Aware Image and Video Orientation Estimation
Muhammad Z. Alam, Larry Stetsiuk, M. Umair Mukati, Zeeshan Kaleem
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper introduces a novel approach for image and video orientation estimation by leveraging depth distribution in natural images. The proposed method estimates the orientation based on the depth distribution across different quadrants of the image, providing a robust framework for orientation estimation suited for applications such as virtual reality (VR), augmented reality (AR), autonomous navigation, and interactive surveillance systems. To further enhance fine-scale perceptual alignment, we incorporate depth gradient consistency (DGC) and horizontal symmetry analysis (HSA), enabling precise orientation correction. This hybrid strategy effectively exploits depth cues to support spatial coherence and perceptual stability in immersive visual content. Qualitative and quantitative evaluations demonstrate the robustness and accuracy of the proposed approach, outperforming existing techniques across diverse scenarios.
[230] POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
Yikun Liu, Yuan Liu, Le Tian, Xiao Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model’s ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
[231] Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
Xiaomin Li, Tala Wang, Zichen Zhong, Ying Zhang, Zirui Zheng, Takashi Isobe, Dezhuang Li, Huchuan Lu, You He, Xu Jia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs’ pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.
[232] Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
Xiaohe Li, Jiahao Li, Kaixin Zhang, Yuqiang Fang, Leilei Lin, Hong Wang, Haohua Wu, Zide Fan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental “temporal blindness”. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.
[233] Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself
Yuhang Dai, Xingyi Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at https://github.com/hiteacherIamhumble/Free-Geometry .
[234] Towards Unconstrained Human-Object Interaction
Francesco Tonini, Alessandro Conti, Lorenzo Vaquero, Cigdem Beyan, Elisa Ricci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi
[235] Training-Free Semantic Multi-Object Tracking with Vision-Language Models
Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.
[236] Don’t Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Ami Baid, Zihui Xue, Kristen Grauman
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.
[237] Geometric Context Transformer for Streaming 3D Reconstruction
Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.
[238] ROSE: Retrieval-Oriented Segmentation Enhancement
Song Tang, Guangquan Jie, Henghui Ding, Yu-Gang Jiang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model’s knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model’s perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs’ lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.
[239] Seedance 2.0: Advancing Video Generation for World Complexity
Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.
[240] One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards \emph{one token per frame} at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into \emph{learnable} and \emph{progressive} modules for \emph{token-level compression} (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate \emph{frame-level compression}, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named \emph{question-conditioned compression} (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, \emph{i.e.}, the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined \emph{token-level} and \emph{frame-level} leads to an e\textbf{x}treme compression model for long video understanding, named \textbf{\name}, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient \emph{supervised compression tuning} stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
[241] SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion
Danish Nazir, Marcus Liwicki, Didier Stricker, Muhammad Zeshan Afzal
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2204.13635: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2204.13635&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[242] What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Bohm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todorovich, Ingo Bax, Roland Memisevic
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2407.08101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.08101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[243] SinkSAM-Net: Knowledge-Driven Self-Supervised Sinkhole Segmentation Using Topographic Priors and Segment Anything Model
Osher Rafaeli, Tal Svoray, Ariel Nahlieli
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2410.01473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.01473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[244] SiLVR: A Simple Language-based Video Reasoning Framework
Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.24869: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.24869&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[245] Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification using Sparsity-Guided Steering Vectors
Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.01247: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01247&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[246] Frozen Forecasting: A Unified Evaluation
Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Sayna Ebrahimi, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, Shiry Ginosar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.13942: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.13942&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[247] Hybrid Approach for Enhancing Lesion Segmentation in Fundus Images
Mohammadmahdi Eshragh, Emad A. Mohammed, Behrouz Far, Ezekiel Weis, Carol L Shields, Sandor R Ferenczy, Trafford Crump
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.25549: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25549&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[248] Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks
Arun Sharma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12102: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12102&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[249] AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning
Xiping Li, Jianghong Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.25699: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25699&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[250] Getting the Numbers Right$\unicode{x2014}$Modelling Multi-Class Object Counting in Dense and Varied Scenes
Villanelle O’Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James M. Brown
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.02213: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02213&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[251] Geometry-Aware Cross Modal Alignment for Light Field-LiDAR Semantic Segmentation
Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, Yihui Fan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.06687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[252] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.10946: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10946&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[253] An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
Quyen Tran, Hai Nguyen, Hoang Phan, Quan Dao, Linh Ngo, Khoat Than, Dinh Phung, Dimitris Metaxas, Trung Le
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2211.16780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2211.16780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[254] Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source
Mykola Lavreniuk, Nataliia Kussul, Andrii Shelestov, Yevhenii Salii, Volodymyr Kuzin, Sergii Skakun, Zoltan Szantoi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.13417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[255] Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.16555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[256] Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?
Dingrui Wang, Zhihao Liang, Hongyuan Ye, Zhexiao Sun, Zhaowei Lu, Yuchen Zhang, Yuyu Zhao, Yuan Gao, Marvin Seegert, Finn Schäfer, Haotong Qin, Wei Li, Luigi Palmieri, Felix Jahncke, Mattia Piccinini, Johannes Betz
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.17792: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17792&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[257] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.23332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.23332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[258] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.02697: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02697&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[259] Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.08639: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08639&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[260] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.14234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[261] Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak, Geert Litjens, Francesco Ciompi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.17326: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17326&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[262] Heavy-Tailed Class-Conditional Priors for Long-Tailed Generative Modeling
Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.02154: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.02154&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[263] Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval
Zhucun Xue, Jiangning Zhang, Juntao Jiang, Jinzhuo Liu, Haoyang He, Teng Hu, Xiaobin Hu, Yong Liu, Shuicheng Yan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.15170: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15170&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[264] Learning Sewing Patterns via Latent Flow Matching of Implicit Fields
Cong Cao, Ren Li, Corentin Dumery, Hao Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.17740: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17740&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[265] Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models
Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Bin Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.01738: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01738&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[266] A Function-Centric Perspective on Flat and Sharp Minima
Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.12451: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12451&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[267] Adaptive Multi-Scale Channel-Spatial Attention Aggregation Framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired
Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.16385: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16385&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[268] LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.20913: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20913&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[269] X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations
Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.04671: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04671&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[270] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.20981: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20981&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[271] Tokenizing Semantic Segmentation with Run Length Encoding
Abhineet Singh, Justin Rozeboom, Nilanjan Ray
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.21627: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21627&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[272] OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation
Kibrom Gebremedhin, Hadush Hailu, Bruk Gebregziabher
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.06885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[273] Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Qishun Yang, Shu Yang, Lijie Hu, Di Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.08486: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08486&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[274] UNBOX: Unveiling Black-box visual models with Natural-language
Simone Carnemolla, Chiara Russo, Simone Palazzo, Quentin Bouniot, Daniela Giordano, Zeynep Akata, Matteo Pennisi, Concetto Spampinato
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.08639: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08639&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[275] Towards Generalizable Robotic Manipulation in Dynamic Environments
Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.15620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[276] From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.19790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[277] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Athos Georgiou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.28554: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28554&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[278] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.03723: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03723&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[279] Action Images: End-to-End Policy Learning via Multiview Video Generation
Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.06168: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06168&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[280] DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection
Jiangbei Yue, Darren Treanor, Venkataraman Subramanian, Sharib Ali
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08261: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08261&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[281] Detecting Diffusion-generated Images via Dynamic Assembly Forests
Mengxin Fu, Yuezun Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09106: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09106&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[282] Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models
Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng, Kezhen Chen, Yi Zhang, Diji Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[283] The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results
Jingkai Wang, Jue Gong, Zheng Chen, Kai Liu, Jiatong Li, Yulun Zhang, Radu Timofte, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yingsi Chen, Yijiao Liu, Hui Li, Yu Wang, Congchao Zhu, Alexandru-Gabriel Lefterache, Anamaria Radoi, Chuanyue Yan, Tao Lu, Yanduo Zhang, Kanghui Zhao, Jiaming Wang, Yuqi Li, WenBo Xiong, Yifei Chen, Xian Hu, Wei Deng, Daiguo Zhou, Sujith Roy V, Claudia Jesuraj, Vikas B, Spoorthi LC, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Yuxuan Jiang, Chengxi Zeng, Tianhao Peng, Fan Zhang, David Bull Wei Zhou, Linfeng Li, Hongyu Huang, Hoyoung Lee, SangYun Oh, ChangYoung Jeong, Axi Niu, Jinyang Zhang, Zhenguo Wu, Senyan Qing, Jinqiu Sun, Yanning Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10532: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10532&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[284] Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
Tencent Hunyuan Team
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11244: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11244&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[285] SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization
Deming Li, Abhay Yadav, Cheng Peng, Rama Chellappa, Anand Bhattad
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11797: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11797&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[286] RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation
Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu, Guo-Jun Qi, Wenjing Jia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12319: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12319&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[287] Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12358: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12358&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[288] Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[289] CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
Zhaoyang Jia, Naifu Xue, Zihan Zheng, Jiahao Li, Bin Li, Xiaoyi Zhang, Zongyu Guo, Yuan Zhang, Houqiang Li, Yan Lu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12525: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12525&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[290] A Faster Path to Continual Learning
Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[291] PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination
Xuan Wang, Kai Ruan, Jiayi Han, Kaiyue Zhou, Gaoang Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12856: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12856&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[292] HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy
Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.00695: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00695&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[293] RoboTAG: End-to-end Robot Configuration Estimation via Topological Alignment Graph
Yifan Liu, Fangneng Zhan, Wanhua Li, Haowen Sun, Katerina Fragkiadaki, Hanspeter Pfister
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.07717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[294] From Instruction to Event: Sound-Triggered Mobile Manipulation
Hao Ju, Shaofei Huang, Hongyu Li, Zihan Ding, Si Liu, Meng Wang, Zhedong Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.21667: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21667&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[295] Exploration and Exploitation Errors Are Measurable for Language Model Agents
Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae Lee
Main category: cs.AI
TL;DR: A framework for evaluating LM agents’ exploration-exploitation tradeoffs in partially observable grid environments with unknown task DAGs, featuring policy-agnostic metrics and controllable difficulty.
Details
Motivation: To address the challenge of systematically distinguishing and quantifying exploration and exploitation behaviors in LM agents without access to their internal policies, particularly for complex open-ended decision-making tasks in embodied AI scenarios.Method: Design controllable 2D grid environments with partial observability and unknown task DAGs, where map generation can be adjusted to emphasize exploration or exploitation difficulty. Create policy-agnostic metrics to quantify exploration and exploitation errors from agent actions.
Result: State-of-the-art LM agents struggle on the task, showing distinct failure modes. Reasoning models perform better, and both exploration and exploitation can be significantly improved through minimal harness engineering.
Conclusion: The proposed framework enables systematic evaluation of exploration-exploitation tradeoffs in LM agents, revealing current limitations and opportunities for improvement in embodied AI decision-making.
Abstract: Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent’s internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from agent’s actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code \href{https://github.com/jjj-madison/measurable-explore-exploit}{here}.
[296] SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications
Qibin Liu, Julia Gonski
Main category: cs.AI
TL;DR: A safe, lightweight agentic framework for autonomous execution of well-defined scientific tasks using isolated environments and self-assessing mechanisms.
Details
Motivation: Existing agentic AI systems face challenges in reliable real-world scientific deployment, needing safer, more user-friendly frameworks for autonomous task execution.Method: Combines isolated execution environment, three-layer agent loop, and self-assessing do-until mechanism to ensure safe operation while leveraging LLMs of varying capabilities.
Result: Enables end-to-end automation of structured scientific tasks with minimal human intervention, allowing researchers to offload routine workloads.
Conclusion: The framework provides a practical solution for deploying agentic AI in scientific research by focusing on well-defined tasks with clear stopping criteria.
Abstract: Recent advances in agentic AI have enabled increasingly autonomous workflows, but existing systems still face substantial challenges in achieving reliable deployment in real-world scientific research. In this work, we present a safe, lightweight, and user-friendly agentic framework for the autonomous execution of well-defined scientific tasks. The framework combines an isolated execution environment, a three-layer agent loop, and a self-assessing do-until mechanism to ensure safe and reliable operation while effectively leveraging large language models of varying capability levels. By focusing on structured tasks with clearly defined context and stopping criteria, the framework supports end-to-end automation with minimal human intervention, enabling researchers to offload routine workloads and devote more effort to creative activities and open-ended scientific inquiry.
[297] Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models
Chashi Mahiul Islam, Alan Villarreal, Mao Nishino, Shaeke Salman, Xiuwen Liu
Main category: cs.AI
TL;DR: Analysis of LLM unpredictability rooted in floating-point precision limitations, revealing chaotic “avalanche effects” and three distinct regimes of numerical behavior in Transformer models.
Details
Motivation: As LLMs are increasingly used in agentic workflows, their unpredictability due to numerical instability has become a critical reliability issue. While studies show downstream effects, the root causes and mechanisms remain poorly understood.Method: Rigorous analysis of how unpredictability stems from finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Identifies chaotic “avalanche effect” in early layers where minor perturbations trigger binary outcomes.
Result: LLMs exhibit universal, scale-dependent chaotic behaviors with three distinct regimes: 1) stable regime (perturbations vanish, constant outputs), 2) chaotic regime (rounding errors dominate, output divergence), and 3) signal-dominated regime (true input variations override numerical noise). Validated across multiple datasets and model architectures.
Conclusion: Numerical instability in LLMs is fundamentally rooted in floating-point precision limitations, creating predictable chaotic behaviors that affect reliability in agentic workflows. Understanding these regimes provides insights for improving model stability and reliability.
Abstract: As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic “avalanche effect” in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. We validate these findings extensively across multiple datasets and model architectures.
[298] Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach
Mohamed-Bachir Belaid
Main category: cs.AI
TL;DR: Earth Observation satellite scheduling with unknown constraints using interactive constraint learning and optimization
Details
Motivation: EO satellite scheduling typically assumes fully specified constraint models, but in practice constraints are often embedded in engineering artefacts or simulators rather than explicit mathematical models, requiring learning from interaction.Method: Conservative Constraint Acquisition (CCA) for efficiently identifying justified constraints while limiting unnecessary tightening, embedded in Learn&Optimize framework that alternates optimization under learned constraint model with targeted oracle queries.
Result: On synthetic instances with up to 50 tasks, L&O improves over greedy baseline and uses far fewer oracle queries than acquire-then-solve baseline. For n≤30, average gap drops from 65-68% to 17.7-35.8%. At n=50, L&O improves on FAO (17.9% vs 20.3%) while using 21.3 queries instead of 100 and 5× less execution time.
Conclusion: The proposed interactive constraint learning approach effectively handles EO scheduling with unknown constraints, achieving better performance with fewer queries than baseline methods.
Abstract: Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under \emph{unknown constraints}: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition~(CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the \textsc{Learn&Optimize} framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50tasks and dense constraint networks, L&O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). For $n\leq 30$, the average gap drops from 65–68% (Priority Greedy) to 17.7–35.8% using L&O. At $n{=}50$, where the CP-SAT reference is the best feasible solution found in 120s, L&O improves on FAO on average (17.9% vs.\ 20.3%) while using 21.3 main queries instead of 100 and about $5\times$ less execution time.
[299] WebXSkill: Skill Learning for Autonomous Web Agents
Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, Huaxiu Yao
Main category: cs.AI
TL;DR: WebXSkill introduces executable skills for web agents that combine parameterized action programs with natural language guidance, bridging the gap between textual workflow skills and code-based skills.
Details
Motivation: Current web agents struggle with long-horizon workflows due to a grounding gap: textual workflow skills provide guidance but aren't executable, while code-based skills are executable but opaque to agents, lacking step-level understanding for error recovery.Method: Three-stage framework: 1) Skill extraction mines reusable action subsequences from synthetic trajectories and abstracts them into parameterized skills, 2) Skill organization indexes skills into URL-based graph for context-aware retrieval, 3) Skill deployment offers grounded mode (fully automated) and guided mode (agent follows step-by-step).
Result: Improves task success rate by up to 9.8 points on WebArena and 12.9 points on WebVoyager over baselines, demonstrating effectiveness of executable skills for web agents.
Conclusion: WebXSkill bridges the grounding gap in web agent skill formulations by combining executability with natural language guidance, enabling both automated execution and agent-driven adaptation for improved performance on complex web tasks.
Abstract: Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL-based graph for context-aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi-step execution and guided mode where skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming-lab/WebXSkill.
[300] Listening Alone, Understanding Together: Collaborative Context Recovery for Privacy-Aware AI
Tanmay Srivastava, Amartya Basu, Shubham Jain, Vaishnavi Ranganathan
Main category: cs.AI
TL;DR: CONCORD is a privacy-aware framework for proactive speech-based AI assistants that uses real-time speaker verification to capture only owner speech, then safely recovers missing context through assistant-to-assistant coordination.
Details
Motivation: As AI assistants evolve from reactive to always-listening proactive systems, they face significant privacy risks by potentially capturing non-consenting speakers' speech, making social deployment challenging. There's a need for privacy-preserving proactive conversational agents.Method: CONCORD enforces owner-only speech capture via real-time speaker verification, producing one-sided transcripts. It recovers missing context through: (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal A2A queries governed by relationship-aware disclosure policies, treating context recovery as negotiated safe exchanges between assistants.
Result: Achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and 97% true negative rate in privacy-sensitive disclosure decisions across multi-domain dialogue datasets.
Conclusion: By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.
Abstract: We introduce CONCORD, a privacy-aware asynchronous assistant-to-assistant (A2A) framework that leverages collaboration between proactive speech-based AI. As agents evolve from reactive to always-listening assistants, they face a core privacy risk (of capturing non-consenting speakers), which makes their social deployment a challenge. To overcome this, we implement CONCORD, which enforces owner-only speech capture via real-time speaker verification, producing a one-sided transcript that incurs missing context but preserves privacy. We demonstrate that CONCORD can safely recover necessary context through (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal A2A queries governed by a relationship-aware disclosure. Instead of hallucination-prone inferring, CONCORD treats context recovery as a negotiated safe exchange between assistants. Across a multi-domain dialogue dataset, CONCORD achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and 97% true negative rate in privacy-sensitive disclosure decisions. By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.
[301] Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension
Vasundra Srinivasan
Main category: cs.AI
TL;DR: MMA2A architecture preserves multimodal signals (voice, image, text) in native formats during agent-to-agent routing, improving task accuracy by 20 percentage points over text-bottleneck baselines, especially for vision-dependent tasks.
Details
Motivation: Current multi-agent systems often bottleneck multimodal signals (audio, vision) into text, losing crucial information needed for accurate cross-modal reasoning. The paper aims to preserve native modality signals across agent boundaries to improve task performance.Method: MMA2A architecture layer that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. Evaluated on CrossModal-CS benchmark with 50 tasks using same LLM backend, varying only routing paths.
Result: MMA2A achieves 52% task completion accuracy vs 32% for text-bottleneck baseline. Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5pp and visual troubleshooting by +16.7pp. Requires 1.8× latency cost for native multimodal processing.
Conclusion: Routing is a first-order design variable in multi-agent systems that determines information available for downstream reasoning. Native modality routing paired with capable agent-level reasoning yields significant accuracy improvements, especially for vision/audio tasks.
Abstract: Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $Δ$TCA: [8, 32] pp; McNemar’s exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a $1.8\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
[302] ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang
Main category: cs.AI
TL;DR: ReSS framework bridges symbolic and neural reasoning for tabular data by using decision trees to extract symbolic scaffolds that guide LLMs to generate faithful natural-language reasoning, improving accuracy and explainability.
Details
Motivation: Tabular data in high-stakes domains like healthcare and finance requires both high accuracy and faithful, human-understandable reasoning. Symbolic models offer verifiable logic but lack semantic expressiveness, while LLMs need specialized fine-tuning for domain-specific tabular reasoning. There's a need to bridge these approaches with scalable data curation and consistent reasoning.Method: ReSS uses decision trees to extract instance-level decision paths as symbolic scaffolds. These scaffolds, along with input features and labels, guide LLMs to generate grounded natural-language reasoning that strictly follows the decision logic. The resulting dataset fine-tunes a pretrained LLM into a specialized tabular reasoning model, enhanced by scaffold-invariant data augmentation for better generalization.
Result: Experimental results on medical and financial benchmarks show ReSS-trained models improve traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
Conclusion: ReSS successfully bridges symbolic and neural reasoning for tabular data, addressing both accuracy and explainability challenges in high-stakes domains through systematic scaffolding and faithful reasoning generation.
Abstract: Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve traditional decision trees and standard fine-tuning approaches up to $10%$ while producing faithful and consistent reasoning
[303] Quantifying and Understanding Uncertainty in Large Reasoning Models
Yangyi Li, Chenxu Zhao, Mengdi Huai
Main category: cs.AI
TL;DR: A novel conformal prediction framework for Large Reasoning Models that quantifies uncertainty in reasoning-answer structures with statistical guarantees, plus an explanation method using Shapley values to identify key training examples and reasoning steps.
Details
Motivation: Traditional uncertainty quantification methods for Large Reasoning Models (LRMs) lack finite-sample guarantees for reasoning-answer generation, ignore logical connections between reasoning traces and final answers, and fail to interpret uncertainty origins while overlooking training factors driving valid reasoning.Method: 1) A novel conformal prediction methodology that quantifies uncertainty in reasoning-answer structures with statistical guarantees; 2) A unified example-to-step explanation framework using Shapley values to identify provably sufficient subsets of training examples and their key reasoning steps while preserving guarantees.
Result: Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods in providing statistically rigorous uncertainty quantification and interpretable explanations for LRMs.
Conclusion: The proposed framework addresses critical gaps in uncertainty quantification for reasoning models by providing statistical guarantees while maintaining interpretability through Shapley-based explanations that identify key training examples and reasoning steps.
Abstract: Large Reasoning Models (LRMs) have recently demonstrated significant improvements in complex reasoning. While quantifying generation uncertainty in LRMs is crucial, traditional methods are often insufficient because they do not provide finite-sample guarantees for reasoning-answer generation. Conformal prediction (CP) stands out as a distribution-free and model-agnostic methodology that constructs statistically rigorous uncertainty sets. However, existing CP methods ignore the logical connection between the reasoning trace and the final answer. Additionally, prior studies fail to interpret the origins of uncertainty coverage for LRMs as they typically overlook the specific training factors driving valid reasoning. Notably, it is challenging to disentangle reasoning quality from answer correctness when quantifying uncertainty, while simultaneously establishing theoretical guarantees for computationally efficient explanation methods. To address these challenges, we first propose a novel methodology that quantifies uncertainty in the reasoning-answer structure with statistical guarantees. Subsequently, we develop a unified example-to-step explanation framework using Shapley values that identifies a provably sufficient subset of training examples and their key reasoning steps to preserve the guarantees. We also provide theoretical analyses of our proposed methods. Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods.
[304] FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
Yupeng Cao, Haohang Li, Weijin Liu, Wenbo Cao, Anke Xu, Lingfei Qian, Xueqing Peng, Minxue Tang, Zhiyuan Yao, Jimin Huang, K. P. Subbalakshmi, Zining Zhu, Jordan W. Suchow, Yangyang Yu
Main category: cs.AI
TL;DR: FinTrace is a benchmark for evaluating financial tool-calling in LLMs with 800 expert-annotated trajectories across 34 financial tasks, using multi-axis evaluation metrics. The study reveals models struggle with information utilization and final answer quality despite good tool selection.
Details
Motivation: Existing benchmarks for financial tool-calling focus on limited scenarios and use call-level metrics that fail to capture trajectory-level reasoning quality. There's a need for comprehensive evaluation that assesses the full reasoning process in financial tool-calling tasks.Method: Created FinTrace benchmark with 800 expert-annotated trajectories across 34 financial task categories with multiple difficulty levels. Uses rubric-based evaluation with 9 metrics across 4 axes: action correctness, execution efficiency, process quality, and output quality. Also created FinTrace-Training dataset with 8,196 curated trajectories for preference learning.
Result: Evaluation of 13 LLMs shows frontier models achieve strong tool selection but all models struggle with information utilization and final answer quality. Fine-tuning Qwen-3.5-9B with DPO on FinTrace-Training improves intermediate reasoning metrics but end-to-end answer quality remains a bottleneck.
Conclusion: There’s a critical gap between invoking the right tools and reasoning effectively over their outputs in financial tool-calling. Trajectory-level training improves intermediate reasoning but doesn’t fully propagate to final output quality, indicating need for better end-to-end reasoning approaches.
Abstract: Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes – action correctness, execution efficiency, process quality, and output quality – enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle with information utilization and final answer quality, exposing a critical gap between invoking the right tools and reasoning effectively over their outputs. To move beyond diagnosis, we construct FinTrace-Training, the first trajectory-level preference dataset for financial tool-calling, containing 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B using supervised fine-tuning followed by direct preference optimization (DPO) and show that training on FinTrace-Training consistently improves intermediate reasoning metrics, with DPO more effectively suppressing failure modes. However, end-to-end answer quality remains a bottleneck, indicating that trajectory-level improvements do not yet fully propagate to final output quality.
[305] Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, Jiajun Bu
Main category: cs.AI
TL;DR: LAMO framework enables lightweight MLLMs for GUI automation through role-oriented data synthesis and two-stage training, supporting both monolithic execution and multi-agent orchestration.
Details
Motivation: Lightweight GUI agents face deployment cost challenges on resource-constrained devices, with limited capacity and poor task scalability under end-to-end episodic learning, hindering adaptation to multi-agent systems.Method: LAMO framework combines role-oriented data synthesis with two-stage training: (1) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (2) reinforcement learning for role-oriented cooperative exploration.
Result: Developed LAMO-3B, a task-scalable native GUI agent supporting monolithic execution and MAS-style orchestration, with extensive static and online evaluations validating the design effectiveness.
Conclusion: LAMO enables lightweight MLLMs to participate in realistic GUI workflows through effective trade-off between cost and scalability, allowing multi-role orchestration to expand capability boundaries for GUI automation.
Abstract: Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role-oriented cooperative exploration. With LAMO, we develop a task-scalable native GUI agent, LAMO-3B, supporting monolithic execution and MAS-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.
[306] RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
Renqi Chen, Zeyin Tao, Jianming Guo, Jing Wang, Zezhou Xu, Jingzhe Zhu, Qingqing Sun, Tianyi Zhang, Shuai Chen
Main category: cs.AI
TL;DR: RiskWebWorld: A realistic interactive benchmark for evaluating GUI agents in e-commerce risk management with 1,513 tasks from production pipelines, revealing significant capability gaps between models.
Details
Motivation: Existing GUI agent benchmarks focus on benign consumer environments, lacking evaluation in high-stakes investigative domains like e-commerce risk management where agents must operate on uncooperative websites with partial environmental information.Method: Created RiskWebWorld benchmark with 1,513 tasks from production risk-control pipelines across 8 domains, built Gymnasium-compliant infrastructure to decouple policy planning from environment mechanics, enabling scalable evaluation and agentic reinforcement learning.
Result: Top-tier generalist models achieved 49.1% success rate, while specialized open-weights GUI models failed completely, showing foundation model scale matters more than zero-shot interface grounding. Agentic RL improved open-source models by 16.2%.
Conclusion: RiskWebWorld provides a practical testbed for developing robust digital workers in professional domains, highlighting the need for better evaluation in realistic, high-stakes environments beyond consumer applications.
Abstract: Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.
[307] Weight Patching: Toward Source-Level Mechanistic Localization in LLMs
Chenghao Sun, Chengsheng Zhang, Guanzheng Qin, Rui Dai, Xinmei Tian
Main category: cs.AI
TL;DR: Weight Patching is a parameter-space intervention method for mechanistic interpretability that analyzes model capabilities by transferring weights between paired same-architecture models with different capability strengths.
Details
Motivation: Current activation-space localization methods may identify modules that merely aggregate or amplify upstream signals rather than encoding target capabilities in their own parameters, creating a gap in understanding where capabilities are actually encoded.Method: Proposes Weight Patching: replaces selected module weights from a behavior-specialized model into a base model under fixed inputs, using paired same-architecture models with different capability strengths. Introduces vector-anchor behavioral interface framework for shared internal criteria of task-relevant control states in open-ended generation.
Result: Analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation/routing modules to downstream execution circuits. Component scores guide mechanism-aware model merging, improving selective fusion across expert combinations.
Conclusion: Weight Patching enables parameter-space causal analysis of model capabilities, bridging activation-space and parameter-space interpretability, with applications in understanding capability localization and guiding model merging.
Abstract: Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters. To address this gap, we propose Weight Patching, a parameter-space intervention method for source-oriented analysis in paired same-architecture models that differ in how strongly they express a target capability under the inputs of interest. Given a base model and a behavior-specialized counterpart, Weight Patching replaces selected module weights from the specialized model into the base model under a fixed input. We instantiate the method on instruction following and introduce a framework centered on a vector-anchor behavioral interface that provides a shared internal criterion for whether a task-relevant control state has been formed or recovered in open-ended generation. Under this framework, the analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation and routing modules, and further to downstream execution circuits. The recovered component scores can also guide mechanism-aware model merging, improving selective fusion across the evaluated expert combinations and providing additional external validation.
[308] Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents
Li Chen
Main category: cs.AI
TL;DR: Tri-Spirit Architecture decomposes AI intelligence into three layers (planning, reasoning, execution) mapped to different hardware substrates, achieving significant latency and energy reductions through cognitive decomposition.
Details
Motivation: Current AI paradigms treat planning, reasoning, and execution as monolithic processes, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity across heterogeneous hardware systems.Method: Three-layer cognitive framework: Super Layer (planning), Agent Layer (reasoning), Reflex Layer (execution) mapped to distinct compute substrates, coordinated via asynchronous message bus. Includes parameterized routing policy, habit-compilation mechanism, convergent memory model, and explicit safety constraints.
Result: In simulation of 2000 synthetic tasks: reduces mean task latency by 75.6%, energy consumption by 71.1%, decreases LLM invocations by 30%, enables 77.6% offline task completion compared to cloud-centric and edge-only baselines.
Conclusion: Cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware, enabling more efficient intelligence distribution across heterogeneous compute substrates.
Abstract: The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms – cloud-centric AI, on-device inference, and edge-cloud pipelines – treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.
[309] The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
Rafflesia Khan, Nafiul Islam Khan
Main category: cs.AI
TL;DR: A feasibility study on Cognitive Companion architectures that monitor LLM agents during multi-step tasks to reduce reasoning degradation, with both LLM-based and zero-overhead probe-based implementations tested on various model sizes.
Details
Motivation: LLM agents suffer from reasoning degradation, looping, drift, and stuck states during multi-step tasks (up to 30% failure rates on hard tasks). Current solutions like hard step limits or LLM-as-judge monitoring have limitations - step limits are abrupt and LLM monitoring adds 10-15% overhead per step.Method: Introduced Cognitive Companion parallel monitoring architecture with two implementations: 1) LLM-based Companion that monitors agent reasoning, and 2) novel zero-overhead Probe-based Companion trained on hidden states from layer 28. Conducted three-batch feasibility study centered on Gemma 4 E4B, with exploratory analysis on smaller models (Qwen 2.5 1.5B and Llama 3.2 1B).
Result: LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with ~11% overhead. Probe-based Companion showed mean effect size of +0.471 at zero measured inference overhead, with strongest probe achieving cross-validated AUROC 0.840. Companion benefit is task-type dependent: most helpful on loop-prone/open-ended tasks, neutral/negative on structured tasks. Small-model experiments suggest scale boundary - companions didn’t improve quality proxy on 1B-1.5B models.
Conclusion: This is a feasibility study showing sub-token monitoring may be useful, identifying task-type sensitivity as practical design constraint, and motivating selective companion activation as promising future direction. Results encourage further exploration but not definitive validation.
Abstract: Large language model (LLM) agents on multi-step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.
[310] AlphaCNOT: Learning CNOT Minimization with Model-Based Planning
Jacopo Cossio, Daniele Lizzio Bosco, Riccardo Romanello, Giuseppe Serra, Carla Piazza
Main category: cs.AI
TL;DR: AlphaCNOT: A reinforcement learning framework using Monte Carlo Tree Search for quantum circuit optimization, specifically focusing on CNOT gate minimization in both unconstrained and topology-aware scenarios.
Details
Motivation: Quantum circuit optimization is crucial for Noisy Intermediate Scale Quantum devices where error propagation scales with operation count. CNOT gates are particularly important as the only 2-qubit gate in the universal Clifford+T set, making their minimization essential for practical quantum computing.Method: AlphaCNOT uses a model-based reinforcement learning framework based on Monte Carlo Tree Search (MCTS) to treat CNOT minimization as a planning problem. Unlike other RL approaches, it leverages lookahead search to evaluate future trajectories, enabling more efficient CNOT sequence discovery.
Result: Achieves up to 32% reduction in CNOT gate count compared to PMH baseline for linear reversible synthesis. In topology-aware synthesis (constrained version), shows consistent gate count reduction across various topologies with up to 8 qubits compared to state-of-the-art RL solutions.
Conclusion: The combination of RL with search-based strategies like MCTS is effective for quantum circuit optimization tasks, including CNOT minimization. This approach can be extended to other optimization problems like Clifford minimization, contributing to the transition toward practical “quantum utility.”
Abstract: Quantum circuit optimization is a central task in Quantum Computing, as current Noisy Intermediate Scale Quantum devices suffer from error propagation that often scales with the number of operations. Among quantum operations, the CNOT gate is of fundamental importance, being the only 2-qubit gate in the universal Clifford+T set. The problem of CNOT gates minimization has been addressed by heuristic algorithms such as the well-known Patel-Markov-Hayes (PMH) for linear reversible synthesis (i.e., CNOT minimization with no topological constraints), and more recently by Reinforcement Learning (RL) based strategies in the more complex case of topology-aware synthesis, where each CNOT can act on a subset of all qubits pairs. In this work we introduce AlphaCNOT, a RL framework based on Monte Carlo Tree Search (MCTS) that address effectively the CNOT minimization problem by modeling it as a planning problem. In contrast to other RL- based solution, our method is model-based, i.e. it can leverage lookahead search to evaluate future trajectories, thus finding more efficient sequences of CNOTs. Our method achieves a reduction of up to 32% in CNOT gate count compared to PMH baseline on linear reversible synthesis, while in the constraint version we report a consistent gate count reduction on a variety of topologies with up to 8 qubits, with respect to state-of-the-art RL-based solutions. Our results suggest the combination of RL with search-based strategies can be applied to different circuit optimization tasks, such as Clifford minimization, thus fostering the transition toward the “quantum utility” era.
[311] GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis
Bo Yu, Cheng Yang, Dongyang Hou, Chengfu Liu, Jiayao Liu, Chi Wang, Zhiming Zhang, Haifeng Li, Wentao Yang
Main category: cs.AI
TL;DR: GeoAgentBench (GABench) is a dynamic evaluation benchmark for LLM-based GIS agents with 117 tools across 53 tasks, featuring novel metrics and a Plan-and-React architecture that outperforms traditional frameworks.
Details
Motivation: Current evaluation of LLM-based GIS agents is inadequate because existing benchmarks use static text/code matching and ignore dynamic runtime feedback and multimodal spatial outputs, failing to capture the complexity of real geospatial workflows.Method: Introduced GeoAgentBench (GABench) with 117 atomic GIS tools across 6 domains, Parameter Execution Accuracy (PEA) metric using “Last-Attempt Alignment,” VLM-based verification for spatial accuracy, and Plan-and-React agent architecture that decouples global planning from step-wise reactive execution.
Result: Experiments with 7 LLMs show Plan-and-React significantly outperforms traditional frameworks, achieving optimal balance between logical rigor and execution robustness, especially in multi-step reasoning and error recovery.
Conclusion: GABench establishes a robust standard for assessing autonomous GeoAI, highlighting current capability boundaries and advancing next-generation spatial analysis systems through dynamic, multimodal evaluation.
Abstract: The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a “Last-Attempt Alignment” strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
[312] Formalizing the Safety, Security, and Functional Properties of Agentic AI Systems
Edoardo Allegrini, Ananth Shreekumar, Z. Berkay Celik
Main category: cs.AI
TL;DR: A formal modeling framework for analyzing safety, security, and functionality of multi-agent AI systems using host agent and task lifecycle models with temporal logic properties.
Details
Motivation: Current agentic AI systems lack unified semantic frameworks for rigorous analysis, creating fragmentation in inter-agent communication protocols (MCP for tools, A2A for coordination) that prevents systematic verification of safety, security, and functional properties.Method: Introduces two central models: (1) host agent model formalizing top-level entity for task decomposition and orchestration, and (2) task lifecycle model detailing sub-task states and transitions. Defines 30 properties (16 for host agent, 14 for task lifecycle) in temporal logic across liveness, safety, completeness, and fairness categories.
Result: Provides first rigorously grounded, domain-agnostic framework enabling formal verification of multi-agent AI systems, detection of coordination edge cases, and prevention of deadlocks and security vulnerabilities.
Conclusion: The framework enables systematic analysis, design, and deployment of correct, reliable, and robust agentic AI systems by addressing current fragmentation in inter-agent communication and providing formal verification capabilities.
Abstract: Agentic AI systems, which leverage multiple autonomous agents and large language models (LLMs), are increasingly used to address complex, multi-step tasks. The safety, security, and functionality of these systems are critical, especially in high-stakes applications. However, the current ecosystem of inter-agent communication is fragmented, with protocols such as the Model Context Protocol (MCP) for tool access and the Agent-to-Agent (A2A) protocol for coordination being analyzed in isolation. This fragmentation creates a semantic gap that prevents the rigorous analysis of system properties and introduces risks such as architectural misalignment and exploitable coordination issues. To address these challenges, we introduce a modeling framework for agentic AI systems composed of two central models: (1) the host agent model formalizes the top-level entity that interacts with the user, decomposes tasks, and orchestrates their execution by leveraging external agents and tools; (2) the task lifecycle model details the states and transitions of individual sub-tasks from creation to completion, providing a fine-grained view of task management and error handling. Together, these models provide a unified semantic framework for reasoning about the behavior of multi-AI agent systems. Grounded in this framework, we define 16 properties for the host agent and 14 for the task lifecycle, categorized into liveness, safety, completeness, and fairness. Expressed in temporal logic, these properties enable formal verification of system behavior, detection of coordination edge cases, and prevention of deadlocks and security vulnerabilities. Through this effort, we introduce the first rigorously grounded, domain-agnostic framework for the analysis, design, and deployment of correct, reliable, and robust agentic AI systems.
[313] AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
Joydeep Biswas, Sheila Schoepp, Gautham Vasan, Anthony Opipari, Arthur Zhang, Zichao Hu, Sebastian Joseph, Matthew Lease, Junyi Jessy Li, Peter Stone, Kiri L. Wagstaff, Matthew E. Taylor, Odest Chadwicke Jenkins
Main category: cs.AI
TL;DR: AI-assisted peer review system deployed at AAAI-26 conference scale, generating reviews for 22,977 papers using frontier models with tool use and safeguards, showing preference over human reviews on technical accuracy.
Details
Motivation: Scientific peer review faces challenges with increasing submission volumes affecting quality, consistency, and timeliness. The community is exploring AI assistance but needs to determine if AI can generate technically sound reviews at real-world conference scale.Method: Deployed a multi-stage AI review system combining frontier models, tool use, and safeguards to generate reviews for all 22,977 full-review papers at AAAI-26 in less than a day. Conducted large-scale survey of authors and program committee members and introduced a novel benchmark for evaluation.
Result: Participants preferred AI reviews over human reviews on key dimensions including technical accuracy and research suggestions. The system substantially outperformed simple LLM-generated review baselines at detecting scientific weaknesses in benchmark evaluations.
Conclusion: State-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward synergistic human-AI teaming for research evaluation.
Abstract: Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.
[314] [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
You Rim Choi, Subeom Park, Hyung-Sin Kim
Main category: cs.AI
TL;DR: ATI is a bio-inspired sensor-first architecture for physical AI that co-designs sensing and inference through a tripartite system enabling adaptive sensing, edge-cloud execution, and foundation model reasoning.
Details
Motivation: As AI moves to robots and wearables, physical constraints (latency, energy, privacy, reliability) require new architectures that co-design sensing and inference, not just scale model capacity.Method: Artificial Tripartite Intelligence (ATI) with four levels: L1 Brainstem for reflexive safety/signal control, L2 Cerebellum for continuous sensor calibration, L3/L4 Cerebral Inference for skill selection/execution and deep reasoning. Modular organization enables sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning in closed-loop.
Result: ATI prototype on mobile camera improved end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3% compared to default auto-exposure.
Conclusion: Co-designing sensing and inference is valuable for embodied AI, enabling performance gains while meeting physical constraints through bio-inspired modular architecture.
Abstract: As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the default auto-exposure setting, ATI (L1/L2 adaptive sensing) improves end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. These results show the value of co-designing sensing and inference for embodied AI.
[315] Reward Design for Physical Reasoning in Vision-Language Models
Derek Lilienthal, Manisha Mukherjee, Sameera Horawalavithana
Main category: cs.AI
TL;DR: Systematic reward ablation study for GRPO-based VLM training on physical reasoning, comparing four reward signals of increasing semantic richness to understand how reward design shapes VLM physical reasoning behavior.
Details
Motivation: Current Vision Language Models (VLMs) fall short of human performance on physics benchmarks despite advances in post-training algorithms. There's poor understanding of how reward design shapes VLM physical reasoning behavior, motivating a systematic study of different reward signals.Method: Conducted systematic reward ablation study for GRPO-based VLM training on physical reasoning using IBM Granite Vision 3.3 (2B). Compared four reward signals: format compliance, answer accuracy, composite rubric reward (answer correctness, physics principle identification, unit consistency), and novel internal reward from model attention weights over input image regions. Evaluated on PhyX benchmark with 3,000 problems spanning six physics domains and six reasoning types across multiple-choice and open-ended formats.
Result: GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Accuracy-based rewards provide strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Internal attention-weight reward improves spatial relation accuracy from 0.27 to 0.50 without requiring spatial annotations.
Conclusion: Reward design does not uniformly improve performance but induces domain-specific reasoning behaviors. Supervising where the model attends during generation is a promising direction for visually grounded physical reasoning. Different reward signals shape different aspects of VLM reasoning capabilities.
Abstract: Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.
[316] Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, Sung Ju Hwang
Main category: cs.AI
TL;DR: Memory Transfer Learning (MTL) enables coding agents to leverage shared memory across heterogeneous coding domains, improving performance by 3.7% through transfer of meta-knowledge like validation routines rather than task-specific code.
Details
Motivation: Existing memory-based self-evolution approaches for coding agents are limited to homogeneous task domains, failing to exploit shared infrastructural foundations (runtime environments, programming languages) that exist across diverse real-world coding problems.Method: Investigate Memory Transfer Learning (MTL) using a unified memory pool from heterogeneous domains. Evaluate across 6 coding benchmarks using four memory representations ranging from concrete traces to abstract insights.
Result: Cross-domain memory improves average performance by 3.7%, primarily transferring meta-knowledge (e.g., validation routines) rather than task-specific code. Abstraction dictates transferability: high-level insights generalize well, while low-level traces often cause negative transfer due to excessive specificity. Transfer effectiveness scales with memory pool size, and memory can be transferred between different models.
Conclusion: Establishes empirical design principles for expanding memory utilization beyond single-domain silos, showing that memory transfer learning enables coding agents to leverage shared knowledge across heterogeneous domains effectively.
Abstract: Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate \textbf{Memory Transfer Learning} (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: https://memorytransfer.github.io/
[317] Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation
Gitesh Malik
Main category: cs.AI
TL;DR: Hierarchical safety-constrained RL framework for power-grid control with runtime safety shield and zero-shot generalization to unseen grids
Details
Motivation: RL shows promise for power-grid automation but faces deployment challenges due to safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies in safety-critical infrastructureMethod: Safety-constrained hierarchical control framework that decouples long-horizon decision-making from real-time feasibility enforcement. High-level RL policy proposes abstract actions, while deterministic runtime safety shield filters unsafe actions using fast forward simulation
Result: Evaluated on Grid2Op benchmark: achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids (ICAPS 2021 large-scale transmission grid) without retraining
Conclusion: Safety and generalization in power-grid control are best achieved through architectural design rather than complex reward engineering, providing a practical path toward deployable learning-based controllers
Abstract: Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.
[318] TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, Yining Li
Main category: cs.AI
TL;DR: TREX is a multi-agent system that automates the entire LLM training lifecycle through collaboration between Researcher and Executor modules, using a search tree approach for experimental planning and evaluated on the FT-Bench benchmark.
Details
Motivation: While LLMs have enabled AI agents to perform isolated scientific tasks, automating complex real-world workflows like LLM training remains challenging. The paper aims to address this gap by creating an automated system for the complete LLM training lifecycle.Method: TREX uses a multi-agent system with two core modules: Researcher (for requirement analysis, literature/data research, training strategy formulation) and Executor (for data recipe preparation, model training and evaluation). The experimental process is modeled as a search tree for efficient exploration path planning, historical result reuse, and insight distillation from iterative trials.
Result: The system was evaluated on FT-Bench, a benchmark with 10 real-world tasks ranging from optimizing fundamental model capabilities to enhancing domain-specific performance. Experimental results show that TREX consistently optimizes model performance on target tasks.
Conclusion: TREX successfully automates the LLM training lifecycle through multi-agent collaboration and search tree modeling, demonstrating effective optimization of model performance across diverse real-world tasks.
Abstract: While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.
[319] Agentic AI Optimisation (AAIO): what it is, how it works, why it matters, and how to deal with it
Luciano Floridi, Carlotta Buttaboni, Nicolas Gentler, Emmie Hine, Jessica Morley, Claudio Novelli, Tyler Schroder
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The emergence of Agentic Artificial Intelligence (AAI) systems capable of independently initiating digital interactions necessitates a new optimisation paradigm designed explicitly for seamless agent-platform interactions. This article introduces Agentic AI Optimisation (AAIO) as an essential methodology for ensuring effective integration between websites and agentic AI systems. Like how Search Engine Optimisation (SEO) has shaped digital content discoverability, AAIO can define interactions between autonomous AI agents and online platforms. By examining the mutual interdependency between website optimisation and agentic AI success, the article highlights the virtuous cycle that AAIO can create. It further explores the governance, ethical, legal, and social implications (GELSI) of AAIO, emphasising the necessity of proactive regulatory frameworks to mitigate potential negative impacts. The article concludes by affirming AAIO’s essential role as part of a fundamental digital infrastructure in the era of autonomous digital agents, advocating for equitable and inclusive access to its benefits.
[320] FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, Shan Jiang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, they are built to detect and document safety hazards, procedural violations, and other critical incidents across real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, our work addresses the fundamental challenge of evaluating agents in the real-world. In this paper, we improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. Our dataset comprises on-site captured images/videos in factories, warehouses and retails. Tasks were meticulously developed through interviews with site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of Multimodal LLM (MLLM) such as GPT-4o is feasible. Furthermore, this study identifies both the effectiveness and limitations of the proposed new evaluation methodology. The complete dataset and evaluation program are publicly accessible on the website (https://en-documents.research.global.fujitsu.com/fieldworkarena/)
[321] Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Model (LLM) agents are reshaping the game industry, by enabling more intelligent and human-preferable characters. Yet, current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for training and evaluating LLM agents across 12 popular video games spanning all major genres. Using a plug-and-play interface built on Model Context Protocol (MCP), Orak supports systematic and reproducible studies of agentic modules in varied game scenarios. We further release a fine-tuning dataset of expert LLM gameplay trajectories covering multiple genres, turning general LLMs into effective game agents. Orak offers a united evaluation framework, including game leaderboards, LLM battle arenas, and \fix{ablation studies} of input modality, agentic strategies, and fine-tuning effects, establishing a foundation towards versatile gaming agents. Code and datasets are available at https://github.com/krafton-ai/Orak and https://huggingface.co/datasets/KRAFTON/Orak.
[322] RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
Yihong Dong, Xue Jiang, Yongding Tao, Huanyu Liu, Kechi Zhang, Lili Mou, Rongyu Cao, Yingwei Ma, Jue Chen, Binhua Li, Zhi Jin, Fei Huang, Yongbin Li, Ge Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM’s immense action space and sparse reward. Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, RL-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that RL-PLUS effectively resolves the capability boundary collapse problem.
[323] Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
Bin Hong, Jiayu Liu, Kai Zhang, Jianwen Sun, Mengdi Zhang, Zhenya Huang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current solutions often compromise reasoning quality or require extensive resources. In this paper, we investigate how to reduce the generation length of LRMs with limited tuning. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence characteristics of various preference optimization objectives under a unified Bradley-Terry loss based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our method significantly reduces the average output length of LRMs by over 50% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
[324] MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
Pengxiang Zhao, Guangyi Liu, YaoZhen Liang, Weiqing He, Zhengxi Lu, WenHao Wang, Yuehao Huang, Yuxiang Chai, Zhaolu Kang, Yaxuan Guo, Hao Wang, Kexin Zhang, Liang Liu, Yong Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Shortcuts such as APIs and deep-links have emerged as efficient complements to flexible GUI operations, fostering a promising hybrid paradigm for MLLM-based mobile automation. However, systematic evaluation of GUI-shortcut hybrid agents remains largely underexplored. To bridge this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent’s capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 9 evaluation metrics. Experiments demonstrate that hybrid agents achieve up to 68.3% success rate and 39% greater execution efficiency than GUI-only counterparts. Furthermore, our evaluation framework effectively reveals the quality gap between predefined and agent-generated shortcuts, validating its capability to assess shortcut generation methods. MAS-Bench addresses the lack of systematic benchmarks for GUI-shortcut hybrid mobile agents, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents. Project page: https://pengxiang-zhao.github.io/MAS-Bench.
[325] ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
Gaole Dai, Shiqi Jiang, Ting Cao, Yuqing Yang, Yuanchun Li, Rui Tan, Mo Li, Lili Qiu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4%. The source code is available at https://github.com/V-Droid-Agent/ProRe.
[326] Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Yihong Dong, Zhaoyu Ma, Xue Jiang, Zhiyuan Fan, Jiaru Qian, Yongmin Li, Jianha Xiao, Zhi Jin, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li, Ge Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, the performance of DLMs on code generation tasks, which have stronger structural constraints, is significantly hampered by the critical trade-off between inference speed and output quality. We observed that accelerating the code generation process by reducing the number of sampling steps usually leads to a catastrophic collapse in performance. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs to achieve better inference speed and output quality in code generation. Specifically, Saber is motivated by two key insights in the DLM generation process: 1) it can be adaptively accelerated as more of the code context is established; 2) it requires a backtracking mechanism to reverse the generated tokens. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average improvement of 1.9% over mainstream DLM sampling methods, meanwhile achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
[327] Empowerment Gain and Causal Model Construction: Children and adults are sensitive to controllability and variability in their causal interventions
Eunice Yiu, Kelsey Allen, Shiry Ginosar, Alison Gopnik
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Learning about the causal structure of the world is a fundamental problem for human cognition. Causal models and especially causal learning have proved to be difficult for large pretrained models using standard techniques of deep learning. In contrast, cognitive scientists have applied advances in our formal understanding of causation in computer science, particularly within the Causal Bayes Net formalism, to understand human causal learning. In the very different tradition of reinforcement learning, researchers have described an intrinsic reward signal called “empowerment” which maximizes mutual information between actions and their outcomes. “Empowerment” may be an important bridge between classical Bayesian causal learning and reinforcement learning and may help to characterize causal learning in humans and enable it in machines. If an agent learns an accurate causal world model, they will necessarily increase their empowerment, and increasing empowerment will lead to a more accurate causal world model. Empowerment may also explain distinctive features of childrens causal learning, as well as providing a more tractable computational account of how that learning is possible. In an empirical study, we systematically test how children and adults use cues to empowerment to infer causal relations, and design effective causal interventions.
[328] Variance Computation for Weighted Model Counting with Knowledge Compilation Approach
Kengo Nakamura, Masaaki Nishino, Norihito Yasuda
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03523: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03523&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[329] 3D Instruction Ambiguity Detection
Jiayu Ding, Haoran Tang, Hongbo Jin, Wei Gao, Ge Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.05991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[330] AMA: Adaptive Memory via Multi-Agent Collaboration
Weiquan Huang, Zixuan Wang, Hehai Lin, Sudong Wang, Bo Xu, Qian Li, Beier Zhu, Linyi Yang, Chengwei Qin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.20352: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20352&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[331] Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models
Moule Lin, Shuhao Guan, Andrea Patane, David Gregg, Goetz Botterweck
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.21003: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21003&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[332] Contextuality from Single-State Ontological Models: An Information-Theoretic Obstruction
Song-Ju Kim
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.16716: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16716&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[333] DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation
Hao Zheng, Guozhao Mo, Xinru Yan, Qianhao Yuan, Wenkai Zhang, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.22839: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22839&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[334] GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning
Yuchen Ying, Weiqi Jiang, Tongya Zheng, Yu Wang, Shunyu Liu, Kaixuan Chen, Mingli Song
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.01410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[335] Animating Petascale Time-varying Data on Commodity Hardware with LLM-assisted Scripting
Ishrat Jahan Eliza, Xuan Huang, Aashish Panta, Alper Sahistan, Zhimin Li, Amy A. Gooch, Valerio Pascucci
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.07053: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07053&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[336] Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
Shuai Zhen, Yanhua Yu, Ruopei Guo, Nan Cheng, Yang Deng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.05808: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05808&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[337] Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
Yu Li, Sizhe Tang, Tian Lan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.07165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[338] ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[339] Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09581: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09581&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[340] Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11465: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11465&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[341] The Non-Optimality of Scientific Knowledge: Path Dependence, Lock-In, and The Local Minimum Trap
Mohamed Mabrok
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11828: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11828&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[342] DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[343] From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention
Seowung Leem, Lin Gu, Ruogu Fang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12865: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12865&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[344] Auto-FP: An Experimental Study of Automated Feature Preprocessing for Tabular Data
Danrui Qi, Jinglin Peng, Yongjun He, Jiannan Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2310.02540: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2310.02540&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[345] RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2502.05740: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.05740&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[346] Autonomous Multi-objective Alloy Design through Simulation-guided Optimization
Penghui Yang, Chendong Zhao, Bijun Tang, Zhonghan Zhang, Xinrun Wang, Yanchen Deng, Xuyu Dong, Yuhao Lu, Jianguo Huang, Yixuan Li, Yushan Xiao, Cuntai Guan, Zheng Liu, Bo An
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.16005: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.16005&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[347] Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Juntao Zhao, Jiuru Li, Chuan Wu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.18454: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.18454&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[348] FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction
Mohammed Daba, Jing Qiu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.05153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[349] Decentralized Rank Scheduling for Energy-Constrained Multi-Task Federated Fine-Tuning in Edge-Assisted IoV Networks
Bokeng Zheng, Jianqiang Zhong, Jiayi Liu, Lei Xue, Xu Chen, Xiaoxi Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.09532: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.09532&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[350] Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs
Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh, Xiao Li, Qibing Ren, Xiaolin Hu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.05367: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.05367&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[351] Neuro-Symbolic AI for Cybersecurity: State of the Art, Challenges, and Opportunities
Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.06921: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.06921&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[352] The Signal is in the Steps: Local Scoring for Reasoning Data Selection
Hoang Anh Just, Myeongseob Ko, Ruoxi Jia
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.03988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[353] A Practitioner’s Guide to Kolmogorov-Arnold Networks
Amir Noorizadegan, Sifan Wang, Leevan Ling, Juan P. Dominguez-Morales
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.25781: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25781&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[354] SAQ: Stabilizer-Aware Quantum Error Correction Decoder
David Zenati, Eliya Nachmani
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.08914: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08914&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[355] ZK-APEX: Zero-Knowledge Approximate Personalized Unlearning with Executable Proofs
Mohammad M Maheri, Sunil Cotterill, Alex Davidson, Hamed Haddadi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.09953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[356] VeruSAGE: A Study of Agent-Based Verification for Rust Systems
Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, Shan Lu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.18436: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18436&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[357] BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs
Muhammad Zeeshan Karamat, Sadman Saif, Christiana Chamon Garcia
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.22174: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22174&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[358] Strategic Response of News Publishers to Generative AI
Hangcheng Zhao, Ron Berman
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.24968: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24968&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[359] Safe-FedLLM: Delving into the Safety of Federated Large Language Models
Mingxiang Tao, Yu Tian, Wenxuan Tu, Yue Yang, Xue Yang, Xiangyan Tang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.07177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[360] Optimized Human-Robot Co-Dispatch Planning for Petro-Site Surveillance under Varying Criticalities
Nur Ahmad Khatim, Mansur Arief
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.07924: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07924&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[361] In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach
Yiran Gao, Kim Hammar, Tao Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.13156: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13156&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[362] Online Navigation Planning for Long-term Autonomous Operation of Underwater Gliders
Victor-Alexandru Darvariu, Charlotte Z. Reed, Jan Stratmann, Bruno Lacerda, Benjamin Allsup, Stephen Woodward, Elizabeth Siddle, Trishna Saeharaseelan, Owain Jones, Dan Jones, Tobias Ferreira, Chloe Baker, Kevin Chaplin, James Kirk, Ashley Iceton-Morris, Ryan D. Patmore, Jeff Polton, Charlotte Williams, Christopher D. J. Auckland, Rob A. Hall, Alexandra Kokkinaki, Alvaro Lorenzo Lopez, Justin J. H. Buck, Nick Hawes
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.19315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[363] FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
Zhihao Ding, Jinming Li, Ze Lu, Jieming Shi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.23636: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23636&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[364] Domain-Adaptive Model Merging Across Disconnected Modes
Junming Liu, Yusen Zhang, Rongchao Zhang, Wenkai Zhu, Tian Wu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.05957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[365] The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection
J Alex Corll
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.11875: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11875&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[366] Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction
Chenghan Wu, Zongmin Yu, Boai Sun, Liu Yang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.12725: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12725&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[367] ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, Shi Jin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.20340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[368] Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.23682: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23682&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[369] A Lightweight, Transferable, and Self-Adaptive Framework for Intelligent DC Arc-Fault Detection in Photovoltaic Systems
Xiaoke Yang, Long Gao, Haoyu He, Hanyuan Hang, Qi Liu, Shuai Zhao, Qiantu Tuo, Rui Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.25749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[370] WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
Mintae Kim, Koushil Sreenath
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08958: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08958&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[371] WybeCoder: Verified Imperative Code Generation
Fabian Gloeckle, Mantas Baksys, Darius Feher, Kunhao Zheng, Amaury Hayat, Sean B. Holden, Gabriel Synnaeve, Peter O’Hearn
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.29088: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29088&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[372] Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators
Griffin Pitts, Neha Rani, Weedguet Mildort
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.01114: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01114&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[373] SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.01473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[374] Optimal Stability of KL Divergence under Gaussian Perturbations
Jialu Pan, Yufeng Zhang, Nan Hu, Keqin Li, Zhenbang Chen, Ji Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11026: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11026&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[375] ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs
Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[376] Exact Structural Abstraction and Tractability Limits
Tristan Simas
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.07349: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07349&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[377] THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture
Augustus Haoyang Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11284: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11284&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[378] eBandit: Kernel-Driven Reinforcement Learning for Adaptive Video Streaming
Mahdi Alizadeh
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[379] Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09613: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09613&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[380] A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
Chen Zhang, Yan Ding, Haotian Wang, Chubo Liu, Keqin Li, Kenli Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09752: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09752&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[381] RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
Xuning Yang, Rishit Dagli, Alex Zook, Hugo Hadfield, Ankit Goyal, Stan Birchfield, Fabio Ramos, Jonathan Tremblay
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.09860: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09860&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[382] Cost-optimal Sequential Testing via Doubly Robust Q-learning
Doudou Zhou, Yiran Zhang, Dian Jin, Yingye Zheng, Lu Tian, Tianxi Cai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[383] CodeTracer: Towards Traceable Agent States
Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11641: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11641&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[384] Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM “VaCoAl” for Ultra-High Speed, Ultra-Low Power, and Low Cost>
Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[385] Melodic contour does not cluster: Reconsidering contour typology
Bas Cornelissen, Willem Zuidema, John Ashley Burgoyne, Henkjan Honing
Main category: cs.SD
TL;DR: The paper questions the adequacy of discrete typologies for melodic phrase contours, finding no evidence of clustering in real musical datasets and suggesting contour is best seen as continuous.
Details
Motivation: To challenge the traditional approach of using small sets of discrete contour types to describe melodic phrases, questioning whether phrase contours actually cluster into distinct types in real musical data.Method: Applied UMAP dimensionality reduction followed by the dist-dip test of multimodality to test for clustering in phrase contours from German and Chinese folksongs, Gregorian chant, and a synthetic dataset for validation.
Result: No evidence of clustering was found in actual phrase contours from any of the musical datasets, though the test correctly identified clustering in synthetic data. This raises problems for discrete typologies, showing type frequencies may be unreliable.
Conclusion: Melodic contour should be viewed as a continuous phenomenon rather than discrete types, challenging existing typologies and suggesting alternative analytical approaches.
Abstract: How to describe the shape of a melodic phrase? Scholars have often relied on typologies with a small set of contour types. We question their adequacy: we find no evidence that phrase contours cluster into discrete types, neither in German or Chinese folksongs, nor in Gregorian chant. The test for clustering we propose applies the dist-dip test of multimodality after a UMAP dimensionality reduction. The test correctly identifies clustering in a synthetic dataset, but not in actual phrase contours. These results raise problems for discrete typologies. In particular, type frequencies may be unreliable, as we see with Huron’s typology. We also show how a recent finding of four contour shapes may be an artefact of the analysis. Our findings suggest that melodic contour is best seen as a continuous phenomenon.
[386] Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals
Mahmoud Fakhry, Abeer FathAllah Brery
Main category: cs.SD
TL;DR: Experimental evaluation of window shapes and lengths for PCG signal segmentation using biLSTM networks, finding Gaussian window with 75ms length performs best for heart sound classification.
Details
Motivation: PCG signals for cardiovascular diagnosis require careful feature extraction due to non-stationarity. Different window shapes and lengths affect feature quality, with some windows causing spectral distortion that impacts classification performance.Method: Evaluated three window shapes (Gaussian, triangular, rectangular) each with three lengths using biLSTM networks trained on statistical features extracted from PCG signals with sliding windows. Compared classification performance across window configurations.
Result: Gaussian window with 75ms length achieved best classification performance. Triangular window competed with Gaussian at 75ms length. Rectangular window performed worst despite being commonly used. Gaussian window outperformed baseline methods.
Conclusion: Window shape and length significantly impact PCG signal classification performance. Gaussian window with 75ms length is optimal for heart sound analysis using biLSTM networks, outperforming traditional rectangular windows.
Abstract: Heart sound signals, phonocardiography (PCG) signals, allow for the automatic diagnosis of potential cardiovascular pathology. Such classification task can be tackled using the bidirectional long short-term memory (biLSTM) network, trained on features extracted from labeled PCG signals. Regarding the non-stationarity of PCG signals, it is recommended to extract the features from multiple short-length segments of the signals using a sliding window of certain shape and length. However, some window contains unfavorable spectral side lobes, which distort the features. Accordingly, it is preferable to adapt the window shape and length in terms of classification performance. We propose an experimental evaluation for three window shapes, each with three window lengths. The biLSTM network is trained and tested on statistical features extracted, and the performance is reported in terms of the window shapes and lengths. Results show that the best performance is obtained when the Gaussian window is used for splitting the signals, and the triangular window competes with the Gaussian window for a length of 75 ms. Although the rectangular window is a commonly offered option, it is the worst choice for splitting the signals. Moreover, the classification performance obtained with a 75 ms Gaussian window outperforms that of a baseline method.
[387] Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
Yanfeng Shi, Pengfei Cai, Jun Liu, Qing Gu, Nan Jiang, Lirong Dai, Ian McLoughlin, Yan Song
Main category: cs.SD
TL;DR: TimePro-RL framework enhances LALMs’ temporal perception using audio-side time prompts and reinforcement learning for better event timing inference
Details
Motivation: Current Large Audio-Language Models (LALMs) have limitations in temporal perception (inferring event onset and offset), which restricts their utility in fine-grained audio understanding scenariosMethod: Proposes Audio-Side Time Prompt (encoding timestamps as embeddings interleaved with audio features) and Reinforcement Learning after Supervised Fine-Tuning to optimize temporal alignment
Result: Significant performance gains across audio temporal tasks including audio grounding, sound event detection, and dense audio captioning
Conclusion: TimePro-RL framework effectively addresses temporal perception limitations in LALMs, enabling more fine-grained audio understanding
Abstract: Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
cs.LG
[388] Sparse Goodness: How Selective Measurement Transforms Forward-Forward Learning
Kamer Ali Yuksel, Hassan Sawaf
Main category: cs.LG
TL;DR: The paper systematically explores goodness function design for Forward-Forward networks, finding that sparse goodness functions (top-k and entmax-weighted energy) significantly outperform traditional sum-of-squares, with adaptive sparsity around alpha=1.5 working best.
Details
Motivation: The Forward-Forward algorithm uses local goodness functions to train neural networks layer by layer, but the default sum-of-squares goodness function may not be optimal. The paper aims to systematically explore the design space of goodness functions to improve FF network performance.Method: 1) Systematically studied 11 different goodness functions, investigating both which activations to measure and how to aggregate them. 2) Introduced top-k goodness that evaluates only the k most active neurons. 3) Proposed entmax-weighted energy with learnable sparse weighting using alpha-entmax transformation. 4) Adopted separate label feature forwarding (FFCL) where class hypotheses are injected at every layer. 5) Conducted controlled experiments across two architectures with sparsity spectrum analysis over k and alpha.
Result: Top-k goodness improved Fashion-MNIST accuracy by 22.6 percentage points over SoS baseline. Entmax-weighted energy provided additional gains. Combined with FFCL, achieved 87.1% accuracy on Fashion-MNIST with 4x2000 architecture, representing 30.7 percentage point improvement over SoS baseline. Adaptive sparsity with alpha≈1.5 consistently outperformed both fully dense and fully sparse alternatives.
Conclusion: Sparsity in the goodness function is the most important design choice in FF networks, with adaptive sparsity around alpha=1.5 being optimal. The work provides principled guidance for designing effective goodness functions in biologically plausible neural network training algorithms.
Abstract: The Forward-Forward (FF) algorithm is a biologically plausible alternative to backpropagation that trains neural networks layer by layer using a local goodness function to distinguish positive from negative data. Since its introduction, sum-of-squares (SoS) has served as the default goodness function. In this work, we systematically study the design space of goodness functions, investigating both which activations to measure and how to aggregate them. We introduce top-k goodness, which evaluates only the k most active neurons, and show that it substantially outperforms SoS, improving Fashion-MNIST accuracy by 22.6 percentage points. We further introduce entmax-weighted energy, which replaces hard top-k selection with a learnable sparse weighting based on the alpha-entmax transformation, yielding additional gains. Orthogonally, we adopt separate label feature forwarding (FFCL), in which class hypotheses are injected at every layer through a dedicated projection rather than concatenated only at the input. Combining these ideas, we achieve 87.1 percent accuracy on Fashion-MNIST with a 4x2000 architecture, representing a 30.7 percentage point improvement over the SoS baseline while changing only the goodness function and the label pathway. Across controlled experiments covering 11 goodness functions, two architectures, and a sparsity spectrum analysis over both k and alpha, we identify a consistent principle: sparsity in the goodness function is the most important design choice in FF networks. In particular, adaptive sparsity with alpha approximately 1.5 outperforms both fully dense and fully sparse alternatives.
[389] The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
Laura Gomezjurado Gonzalez
Main category: cs.LG
TL;DR: Transformers trained on algorithmic tasks like Collatz prediction show delayed generalization (grokking) due to decoder bottlenecks, not encoder learning failures - the encoder learns structure early but decoder struggles to access it.
Details
Motivation: To understand why transformers exhibit long delays between training-set fit and generalization (grokking) in algorithmic tasks, specifically investigating whether the delay stems from failure to learn structure or from limited access to already-learned structure.Method: Study one-step Collatz prediction using encoder-decoder arithmetic models. Conduct causal interventions including encoder/decoder transplantation experiments, freezing converged encoders, and analyzing numeral representation effects across 15 different bases.
Result: Encoder learns parity and residue structure within first few thousand steps while accuracy remains near chance for tens of thousands more. Transplanting trained encoder accelerates grokking 2.75x, while trained decoder hurts performance. Freezing converged encoder eliminates plateau, achieving 97.6% vs 86.1% accuracy. Base choice significantly affects learnability - bases aligned with Collatz arithmetic (base 24) reach 99.8% accuracy while binary fails completely.
Conclusion: Grokking delay in algorithmic tasks stems from decoder bottlenecks in accessing encoder-learned structure, not from failure to learn structure. Numeral representation acts as inductive bias controlling decoder’s ability to exploit local digit structure, explaining large learnability differences for same underlying task.
Abstract: Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder’s job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map’s arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
[390] Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments
Rajat Khanda, Mohammad Baqar Sambuddha Chakrabarti, Satyasaran Changdar
Main category: cs.LG
TL;DR: AMC is a memory architecture for continual reinforcement learning inspired by synaptic tagging theory, using a three-phase crystallization process to consolidate experiences while preventing catastrophic forgetting.
Details
Motivation: Autonomous AI agents need to acquire new capabilities without erasing prior knowledge, addressing the challenge of catastrophic forgetting in continual learning.Method: Adaptive Memory Crystallization (AMC) models memory as a continuous crystallization process with three phases (Liquid-Glass-Crystal) governed by stochastic differential equations, inspired by synaptic tagging and capture theory.
Result: Empirical evaluation shows improvements in forward transfer (+34-43%), reductions in catastrophic forgetting (67-80%), and 62% decrease in memory footprint across Meta-World MT50, Atari, and MuJoCo benchmarks.
Conclusion: AMC provides a principled memory architecture for continual reinforcement learning that effectively balances plasticity and stability while reducing memory requirements.
Abstract: Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid–Glass–Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker–Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34–43% over the strongest baseline), reductions in catastrophic forgetting (67–80%), and a 62% decrease in memory footprint.
[391] Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
Fei Ding, Yongkang Zhang, youwei wang, Zijian Zeng
Main category: cs.LG
TL;DR: The paper analyzes token-level credit assignment in RL fine-tuning for reasoning models, identifies issues with intra-group comparisons, and proposes transformations to restore gradient exchangeability for stable training.
Details
Motivation: Current RL fine-tuning methods for reasoning models using intra-group comparisons suffer from learning tax, solution probability drift, and entropy collapse during long-term training, requiring better understanding of token-level credit assignment.Method: Analyzes necessary conditions for algorithm design from token-level credit assignment perspective, identifies mechanisms disrupting gradient exchangeability, and proposes minimal intra-group transformations to restore cancellation structure in shared token space.
Result: Experimental results show the proposed transformations stabilize training, improve sample efficiency, and enhance final performance, validating the design condition’s value.
Conclusion: Maintaining gradient exchangeability across token updates is crucial for preventing reward-irrelevant drift in RL fine-tuning of reasoning models, and simple transformations can effectively restore this structure.
Abstract: In sparse termination rewards, intra-group comparisons have become the dominant paradigm for fine-tuning reasoning models via reinforcement learning. However, long-term training often leads to issues like ineffective update accumulation (learning tax), solution probability drift, and entropy collapse. This paper presents a necessary condition for algorithm design from a token-level credit assignment perspective: to prevent reward-irrelevant drift, intra-group objectives must maintain gradient exchangeability across token updates, enabling gradient cancellation on weak-credit/high-frequency tokens. We show that two common mechanisms disrupting exchangeability make “non-cancellation” a structural norm. Based on this, we propose minimal intra-group transformations to restore or approximate the cancellation structure in the shared token space. Experimental results demonstrate that these transformations stabilize training, improve sample efficiency, and enhance final performance, validating the value of this design condition.
[392] Spectral Entropy Collapse as an Empirical Signature of Delayed Generalisation in Grokking
Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc
Main category: cs.LG
TL;DR: The paper identifies normalized spectral entropy as an order parameter for grokking (delayed generalization after memorization) in 1-layer Transformers, showing it follows a two-phase pattern of norm expansion then entropy collapse.
Details
Motivation: Grokking phenomenon lacks predictive mechanistic explanation; understanding the underlying dynamics could provide insights into generalization in neural networks.Method: Analyze normalized spectral entropy of representation covariance as order parameter; validate on 1-layer Transformers on group-theoretic tasks; conduct causal interventions and control experiments.
Result: Identified stable entropy threshold (≈0.61) that predicts grokking onset; causal intervention delaying entropy collapse delays grokking; power-law predicts onset with 4.1% error; mechanism holds across different groups.
Conclusion: Spectral entropy collapse is necessary but not sufficient for grokking; architecture matters (MLPs show collapse without grokking); provides mechanistic understanding of generalization dynamics.
Abstract: Grokking – delayed generalisation long after memorisation – lacks a predictive mechanistic explanation. We identify the normalised spectral entropy $\tilde{H}(t)$ of the representation covariance as a scalar order parameter for this transition, validated on 1-layer Transformers on group-theoretic tasks. Five contributions: (i) Grokking follows a two-phase pattern: norm expansion then entropy collapse. (ii) $\tilde{H}$ crosses a stable threshold $\tilde{H}^* \approx 0.61$ before generalisation in 100% of runs (mean lead: 1,020 steps). (iii) A causal intervention preventing collapse delays grokking by +5,020 steps ($p=0.044$); a norm-matched control ($n=30$, $p=5\times10^{-5}$) confirms entropy – not norm – drives the transition. (iv) A power-law $ΔT = C_1(\tilde{H}-\tilde{H}^*)^γ+C_2$ ($R^2=0.543$) predicts grokking onset with 4.1% error. (v) The mechanism holds across abelian ($\mathbb{Z}/97\mathbb{Z}$) and non-abelian ($S_5$) groups. Crucially, MLPs show entropy collapse without grokking, proving collapse is necessary but not sufficient – architecture matters. Code: https://anonymous.4open.science/r/grokking-entropy
[393] When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation
Sandro Andric
Main category: cs.LG
TL;DR: Stronger reasoning in LLMs can make them worse simulators of boundedly rational human behavior in multi-agent negotiations, creating a solver-sampler mismatch where over-optimization reduces fidelity.
Details
Motivation: To challenge the assumption that stronger reasoning always improves simulation fidelity, especially when simulating boundedly rational human behavior rather than solving strategic problems optimally.Method: Tested three reflection conditions (no reflection, bounded reflection, native reasoning) across three multi-agent negotiation environments using different model families and OpenAI’s GPT models.
Result: Bounded reflection produced more diverse and compromise-oriented trajectories than no reflection or native reasoning. GPT-5.2 with native reasoning always ended in authority decisions, while bounded reflection recovered compromise outcomes.
Conclusion: Model capability and simulation fidelity are different objectives; behavioral simulations should qualify models as samplers of plausible behavior, not just as solvers of strategic problems.
Abstract: Large language models are increasingly used as agents in social, economic, and policy simulations. A common assumption is that stronger reasoning should improve simulation fidelity. We argue that this assumption can fail when the objective is not to solve a strategic problem, but to sample plausible boundedly rational behavior. In such settings, reasoning-enhanced models can become better solvers and worse simulators: they can over-optimize for strategically dominant actions, collapse compromise-oriented terminal behavior, and sometimes exhibit a diversity-without-fidelity pattern in which local variation survives without outcome-level fidelity. We study this solver-sampler mismatch in three multi-agent negotiation environments adapted from earlier simulation work: an ambiguous fragmented-authority trading-limits scenario, an ambiguous unified-opposition trading-limits scenario, and a new-domain grid-curtailment case in emergency electricity management. We compare three reflection conditions, no reflection, bounded reflection, and native reasoning, across two primary model families and then extend the same protocol to direct OpenAI runs with GPT-4.1 and GPT-5.2. Across all three experiments, bounded reflection produces substantially more diverse and compromise-oriented trajectories than either no reflection or native reasoning. In the direct OpenAI extension, GPT-5.2 native ends in authority decisions in 45 of 45 runs across the three experiments, while GPT-5.2 bounded recovers compromise outcomes in every environment. The contribution is not a claim that reasoning is generally harmful. It is a methodological warning: model capability and simulation fidelity are different objectives, and behavioral simulation should qualify models as samplers, not only as solvers.
[394] Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals
Bhavana Sajja
Main category: cs.LG
TL;DR: Paper introduces behavioral fidelity as a third dimension for evaluating synthetic tabular data, focusing on preserving temporal, sequential, and structural behavioral patterns of real-world entity activity, with specific applications to fraud detection.
Details
Motivation: Existing synthetic data evaluation focuses only on statistical fidelity (distributions/correlations) and downstream utility (classifier performance), but misses behavioral patterns that real-world detection systems actually rely on for fraud detection and similar applications.Method: Formalizes taxonomy of four behavioral fraud patterns (P1-P4): inter-event timing, burst structure, multi-account graph motifs, and velocity-rule trigger rates. Defines degradation ratio metric calibrated to real-data noise floor. Proves theoretical limitations of row-independent generators. Benchmarks CTGAN, TVAE, GaussianCopula, and TabularARGN on IEEE-CIS Fraud Detection and Amazon Fraud Dataset.
Result: All four generators fail severely: on IEEE-CIS, composite degradation ratios range from 24.4x (TVAE) to 39.0x (GaussianCopula); on Amazon FDB, row-independent generators score 81.6-99.7x, while TabularARGN achieves 17.2x. Row-independent generators are structurally incapable of reproducing graph motifs and positive burst fingerprints.
Conclusion: Behavioral fidelity is a crucial missing dimension for synthetic tabular data evaluation, especially for domains with entity-level sequential data like fraud detection, healthcare, and network security. The P1-P4 framework provides systematic evaluation, and row-independent generators have fundamental limitations for behavioral pattern preservation.
Abstract: We introduce behavioral fidelity – a third evaluation dimension for synthetic tabular data that measures whether generated data preserves the temporal, sequential, and structural behavioral patterns that distinguish real-world entity activity. Existing frameworks evaluate statistical fidelity (marginal distributions and correlations) and downstream utility (classifier AUROC on synthetic-trained models), but neither tests for the behavioral signals that operational detection and analysis systems actually rely on. We formalize a taxonomy of four behavioral fraud patterns (P1-P4) covering inter-event timing, burst structure, multi-account graph motifs, and velocity-rule trigger rates; define a degradation ratio metric calibrated to a real-data noise floor (1.0 = matches real variability, k = k-times worse); and prove that row-independent generators – the dominant paradigm – are structurally incapable of reproducing P3 graph motifs (Proposition 1) and produce non-positive within-entity IET autocorrelation (Proposition 2), making the positive burst fingerprint of fraud sequences unachievable regardless of architecture or training data size. We benchmark CTGAN, TVAE, GaussianCopula, and TabularARGN on IEEE-CIS Fraud Detection and the Amazon Fraud Dataset. All four fail severely: on IEEE-CIS composite degradation ratios range from 24.4x (TVAE) to 39.0x (GaussianCopula); on Amazon FDB, row-independent generators score 81.6-99.7x, while TabularARGN achieves 17.2x. We document generator-specific failure modes and their resolutions. The P1-P4 framework extends to any domain with entity-level sequential tabular data, including healthcare and network security. We release our evaluation framework as open source.
[395] From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Srinidhi Madabhushi, Pranesh Vyas, Swathi Vaidyanathan, Mayur Kurup, Elliott Nash, Yegor Silyutin
Main category: cs.LG
TL;DR: Graph-based anomaly detection system using GCN-GAE embeddings to identify under-represented services in load tests vs real events at Prime Video
Details
Motivation: Load tests for streaming services like Prime Video can miss service behaviors unique to real event traffic, requiring better anomaly detection to identify under-represented services during actual eventsMethod: Unsupervised node-level graph embeddings using GCN-GAE (Graph Convolutional Network - Graph Autoencoder) on directed, weighted service graphs at minute-level resolution, with anomaly detection based on cosine similarity between load test and event embeddings
Result: System identifies incident-related services with 96% precision and 0.08% false positive rate, though recall is limited at 58% under conservative propagation assumptions; demonstrates early detection capability
Conclusion: The framework provides practical utility for Prime Video while offering methodological lessons for broader application across microservice ecosystems, with synthetic anomaly injection enabling controlled evaluation
Abstract: Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that show promising precision (96%) and low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.
[396] Generalization Guarantees on Data-Driven Tuning of Gradient Descent with Langevin Updates
Saumya Goyal, Rohith Rongali, Ritabrata Ray, Barnabás Póczos
Main category: cs.LG
TL;DR: The paper proposes Langevin Gradient Descent Algorithm (LGD) for hyperparameter tuning in regression tasks, provides theoretical guarantees for optimal hyperparameter configuration achieving Bayes optimal solution, and shows meta-learning generalization bounds with O(dh) pseudo-dimension.
Details
Motivation: The paper addresses the problem of learning to learn for regression through hyperparameter tuning, aiming to develop algorithms that can automatically learn optimal hyperparameters from multiple tasks rather than requiring manual tuning for each new task.Method: Proposes Langevin Gradient Descent Algorithm (LGD) which approximates the posterior mean for convex regression tasks. Studies theoretical properties including existence of optimal hyperparameter configuration achieving Bayes optimal solution. Provides generalization bounds for meta-learning hyperparameters with O(dh) pseudo-dimension bound.
Result: Theoretical results show LGD can achieve Bayes optimal solution with optimal hyperparameters. Generalization bounds of O(dh) pseudo-dimension for meta-learning hyperparameters, extending prior work beyond elastic net to convex loss regression. Empirical evidence shows success for few-shot learning on linear regression with synthetic datasets.
Conclusion: LGD provides a theoretically grounded approach to hyperparameter tuning for regression tasks with provable optimality properties and generalization guarantees for meta-learning, extending beyond previous limited hyperparameter settings.
Abstract: We study learning to learn for regression problems through the lens of hyperparameter tuning. We propose the Langevin Gradient Descent Algorithm (LGD), which approximates the mean of the posterior distribution defined by the loss function and regularizer of a convex regression task. We prove the existence of an optimal hyperparameter configuration for which the LGD algorithm achieves the Bayes’ optimal solution for squared loss. Subsequently, we study generalization guarantees on meta-learning optimal hyperparameters for the LGD algorithm from a given set of tasks in the data-driven setting. For a number of parameters $d$ and hyperparameter dimension $h$, we show a pseudo-dimension bound of $O(dh)$, upto logarithmic terms under mild assumptions on LGD. This matches the dimensional dependence of the bounds obtained in prior work for the elastic net, which only allows for $h=2$ hyperparameters, and extends their bounds to regression on convex loss. Finally, we show empirical evidence of the success of LGD and the meta-learning procedure for few-shot learning on linear regression using a few synthetically created datasets.
[397] Depth-Resolved Coral Reef Thermal Fields from Satellite SST and Sparse In-Situ Loggers Using Physics-Informed Neural Networks
Alzayat Saleh, Mostafa Rahimi Azghadi
Main category: cs.LG
TL;DR: A physics-informed neural network (PINN) combines satellite sea surface temperature with sparse in-situ loggers to estimate subsurface ocean temperatures for coral bleaching monitoring, outperforming statistical and physics-only baselines.
Details
Motivation: Satellite SST products only measure ocean surface temperatures, but corals live at various depths where temperatures can be 1-3°C cooler. Applying surface temperatures uniformly to all depths overestimates thermal stress for subsurface corals, creating a need for depth-resolved temperature estimation.Method: Physics-informed neural network (PINN) that fuses NOAA Coral Reef Watch SST with sparse in-situ temperature loggers within the one-dimensional vertical heat equation. The model enforces SST as a hard surface boundary condition and jointly learns effective thermal diffusivity (κ) and light attenuation (Kd).
Result: The PINN achieves 0.25-1.38°C RMSE at unseen depths across four Great Barrier Reef sites. Under extreme sparsity (three training depths), it maintains 0.27°C RMSE at 5m and 0.32°C at 9.1m holdouts, outperforming statistical baselines (>1.8°C) and physics-only finite-difference baselines in 90% of experiments. Depth-resolved thermal stress profiles show attenuation with depth.
Conclusion: Physics-constrained fusion of satellite SST with sparse loggers can extend coral bleaching assessment to the depth dimension using existing observational infrastructure, though PINN predictions provide conservative lower bounds on thermal stress due to smoothing of short-duration peaks.
Abstract: Satellite sea surface temperature (SST) products underpin global coral bleaching monitoring, yet they measure only the ocean skin. Corals inhabit depths from the shallows to beyond 20 metres, where temperatures can be 1-3°C cooler than the surface; applying satellite SST uniformly to all depths therefore overestimates subsurface thermal stress. We present a physics-informed neural network (PINN) that fuses NOAA Coral Reef Watch SST with sparse in-situ temperature loggers within the one-dimensional vertical heat equation, enforcing SST as a hard surface boundary condition and jointly learning effective thermal diffusivity (\k{appa}) and light attenuation (Kd). Validated across four Great Barrier Reef sites (30 holdout experiments), the PINN achieves 0.25-1.38°C RMSE at unseen depths. Under extreme sparsity (three training depths), the PINN maintains 0.27°C RMSE at the 5 metre holdout and 0.32°C at the 9.1 metre holdout, where statistical baselines collapse to >1.8°C; it outperforms a physics-only finite-difference baseline in 90% of experiments. Depth-resolved Degree Heating Day (DHD) profiles show that thermal stress attenuates with depth: at Davies Reef, DHD drops from 0.29 at the surface to zero by 10.7 metres, consistent with logger observations, while satellite DHD remains constant at 0.31 across all depths. However, the PINN underestimates absolute DHD at shallow depths because its smooth predictions attenuate the short-duration peaks that drive threshold exceedances; PINN DHD values should be interpreted as conservative lower bounds on depth-resolved stress. These results demonstrate that physics-constrained fusion of satellite SST with sparse loggers can extend bleaching assessment to the depth dimension using existing observational infrastructure.
[398] Automated co-design of high-performance thermodynamic cycles via graph-based hierarchical reinforcement learning
Wenqing Li, Xu Feng, Peixue Jiang, Yinhai Zhu
Main category: cs.LG
TL;DR: Graph-based hierarchical reinforcement learning for automated co-design of thermodynamic cycles, discovering novel configurations with improved performance.
Details
Motivation: Traditional thermodynamic cycle design methods are inefficient, rely on expert knowledge, and lack scalability, limiting discovery of high-performance cycles.Method: Graph-based hierarchical reinforcement learning with cycles encoded as graphs (components as nodes, connections as edges), using deep learning thermophysical surrogate for decoding, and manager-worker framework for structural exploration and parameter optimization.
Result: Method discovered 18 novel heat pump cycles and 21 novel heat engine cycles with performance improvements of 4.6% and 133.3% respectively compared to classical cycles.
Conclusion: The approach provides a scalable, automated alternative to expert-driven thermodynamic cycle design, balancing efficiency with broad applicability.
Abstract: Thermodynamic cycles are pivotal in determining the efficacy of energy conversion systems. Traditional design methodologies, which rely on expert knowledge or exhaustive enumeration, are inefficient and lack scalability, thereby constraining the discovery of high-performance cycles. In this study, we introduce a graph-based hierarchical reinforcement learning approach for the co-design of structure parameters in thermodynamic cycles. These cycles are encoded as graphs, with components and connections depicted as nodes and edges, adhering to grammatical constraints. A deep learning-based thermophysical surrogate facilitates stable graph decoding and the simultaneous resolution of global parameters. Building on this foundation, we develop a hierarchical reinforcement learning framework wherein a high-level manager explores structural evolution and proposes candidate configurations, whereas a low-level worker optimizes parameters and provides performance rewards to steer the search towards high-performance regions. By integrating graph representation, thermophysical surrogate, and manager-worker learning, this method establishes a fully automated pipeline for encoding, decoding, and co-optimization. Using heat pump and heat engine cycles as case studies, the results demonstrate that the proposed method not only replicates classical cycle configurations but also identifies 18 and 21 novel heat pump and heat engine cycles, respectively. Relative to classical cycles, the novel configurations exhibit performance improvements of 4.6% and 133.3%, respectively, surpassing the traditional designs. This method effectively balances efficiency with broad applicability, providing a practical and scalable intelligent alternative to expert-driven thermodynamic cycle design.
[399] Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
Zijian Zhao, Jing Gao, Sen Li
Main category: cs.LG
TL;DR: CMAT is a centralized multi-agent RL framework using Transformer encoder for joint observations and hierarchical decoder for latent consensus generation, enabling single-agent PPO optimization while maintaining coordination.
Details
Motivation: Cooperative MARL faces challenges like non-stationarity, unstable training, weak coordination, and limited theoretical guarantees when decomposing centralized control into multiple agents. Need a framework that can handle large joint observation/action spaces while maintaining coordination.Method: Proposes Consensus Multi-Agent Transformer (CMAT) - a centralized framework bridging MARL to hierarchical SARL. Uses Transformer encoder for joint observations, Transformer decoder autoregressively generates high-level consensus vector in latent space, then all agents generate actions simultaneously conditioned on consensus.
Result: CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines on StarCraft II, Multi-Agent MuJoCo, and Google Research Football benchmarks.
Conclusion: CMAT effectively handles large joint observation/action spaces through hierarchical factorization, enables order-independent joint decision making, and allows optimization with single-agent PPO while preserving expressive coordination through latent consensus.
Abstract: Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at:https://github.com/RS2002/CMAT .
[400] Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
Aadyot Bhatnagar, Peter Mørch Groth, Ali Madani
Main category: cs.LG
TL;DR: STOMP is a novel offline RL algorithm for multi-objective alignment that extends direct preference optimization using smooth Tchebysheff scalarization to handle conflicting rewards, demonstrated on protein engineering tasks.
Details
Motivation: Real-world applications often require optimizing multiple conflicting objectives simultaneously (e.g., catalytic activity vs specificity in proteins, helpfulness vs harmlessness in chatbots). Traditional linear reward scalarization fails to recover non-convex regions of the Pareto front, necessitating better multi-objective alignment methods.Method: Frames multi-objective RL as an optimization problem using smooth Tchebysheff scalarization, which overcomes limitations of linear scalarization. Derives STOMP algorithm that extends direct preference optimization to multi-objective setting by standardizing individual rewards based on their observed distributions.
Result: Empirically validated on protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets. STOMP achieves highest hypervolumes in 8 out of 9 settings according to both offline off-policy and generative evaluations.
Conclusion: STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and has broader applications beyond protein engineering.
Abstract: Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
[401] Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning
Shentong Mo
Main category: cs.LG
TL;DR: CoUR is a framework that uses LLMs to streamline RL reward function design by quantifying code uncertainty and reusing relevant components, reducing evaluation costs while improving performance.
Details
Motivation: Traditional RL reward function design is labor-intensive, inefficient, and inconsistent due to manual processes that create redundancy and overlook local uncertainties at intermediate decision points.Method: CoUR integrates LLMs with code uncertainty quantification and similarity selection combining textual/semantic analysis to identify reusable reward components, plus Bayesian optimization on decoupled reward terms for efficient search.
Result: CoUR achieves better performance than baselines across 9 IsaacGym environments and 20 Bidexterous Manipulation tasks while significantly reducing reward evaluation costs.
Conclusion: CoUR provides an efficient, robust framework for RL reward design by leveraging LLMs to reduce redundancy and address local uncertainties, offering practical benefits for complex RL applications.
Abstract: Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, our CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We comprehensively evaluate CoUR across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.
[402] KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, Ulf Schlichtmann
Main category: cs.LG
TL;DR: KV Packet: A recomputation-free KV cache reuse framework using trainable soft-token adapters to bridge context discontinuities, achieving near-zero FLOPs and lower TTFT while maintaining accuracy comparable to full recomputation.
Details
Motivation: Standard KV caches are context-dependent, requiring recomputation when reusing cached documents in new contexts. Existing solutions still incur computational overhead and increased latency, creating a need for more efficient cache reuse methods.Method: Treats cached documents as immutable “packets” wrapped in light-weight trainable soft-token adapters. These adapters are trained via self-supervised distillation to bridge context discontinuities without recomputing KV states.
Result: Experiments on Llama-3.1 and Qwen2.5 show KV Packet achieves near-zero FLOPs and lower Time-to-First-Token latency than recomputation-based baselines, while maintaining F1 scores comparable to full recomputation.
Conclusion: KV Packet provides an efficient recomputation-free solution for KV cache reuse that significantly reduces computational overhead and latency while preserving model performance.
Abstract: Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable ``packets’’ wrapped in light-weight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.
[403] Does Dimensionality Reduction via Random Projections Preserve Landscape Features?
Iván Olarte Rodríguez, Anja Jankovic, Thomas Bäck, Elena Raponi
Main category: cs.LG
TL;DR: ELA features computed in randomly projected spaces often don’t reflect original problem characteristics, with most features being sensitive to dimensionality reduction via Random Gaussian Embeddings.
Details
Motivation: Exploratory Landscape Analysis (ELA) struggles in high-dimensional settings due to sparsity, high variance, and computational cost. While dimensionality reduction has been proposed to make ELA applicable, it's unclear whether features computed in reduced spaces still reflect intrinsic properties of the original optimization landscape.Method: Investigates robustness of ELA features under dimensionality reduction via Random Gaussian Embeddings (RGEs). Computes ELA features in projected spaces from the same sampled points and objective values, comparing them to features obtained in the original search space across multiple sample budgets and embedding dimensions.
Result: Linear random projections often alter geometric and topological structure relevant to ELA, yielding feature values unrepresentative of original problems. Most features are highly sensitive to embedding, though a small subset remains comparatively stable. Robustness under projection doesn’t necessarily imply informativeness, as apparently robust features may reflect projection-induced artifacts rather than intrinsic landscape characteristics.
Conclusion: Dimensionality reduction via random projections can distort ELA features, making them unreliable for characterizing original high-dimensional optimization landscapes. Careful validation is needed when applying ELA in reduced spaces.
Abstract: Exploratory Landscape Analysis (ELA) provides numerical features for characterizing black-box optimization problems. In high-dimensional settings, however, ELA suffers from sparsity effects, high estimator variance, and the prohibitive cost of computing several feature classes. Dimensionality reduction has therefore been proposed as a way to make ELA applicable in such settings, but it remains unclear whether features computed in reduced spaces still reflect intrinsic properties of the original landscape. In this work, we investigate the robustness of ELA features under dimensionality reduction via Random Gaussian Embeddings (RGEs). Starting from the same sampled points and objective values, we compute ELA features in projected spaces and compare them to those obtained in the original search space across multiple sample budgets and embedding dimensions. Our results show that linear random projections often alter the geometric and topological structure relevant to ELA, yielding feature values that are no longer representative of the original problem. While a small subset of features remains comparatively stable, most are highly sensitive to the embedding. Moreover, robustness under projection does not necessarily imply informativeness, as apparently robust features may still reflect projection-induced artifacts rather than intrinsic landscape characteristics.
[404] Analog Optical Inference on Million-Record Mortgage Data
Sofia Berloff, Pavel Koptev, Konstantin Malkov
Main category: cs.LG
TL;DR: Analog optical computer benchmarked on mortgage classification shows 3.3% accuracy gap vs XGBoost, with limitations identified at encoding, architecture, and hardware fidelity levels.
Details
Motivation: To evaluate analog optical computers (AOCs) on real-world tabular data beyond small image benchmarks, identifying sources of accuracy loss for practical deployment.Method: Benchmarked AOC digital twin on mortgage approval classification using 5.84 million HMDA records, compared against XGBoost, analyzed three sources of accuracy loss: encoding, architecture, and hardware non-idealities.
Result: AOC achieved 94.6% balanced accuracy vs 97.9% for XGBoost; encoding restriction dropped all models to ~89.5%; hardware non-idealities showed no measurable penalty; architectural limitations identified as primary constraint.
Conclusion: Three-layer limitation framework (encoding, architecture, hardware) identifies where accuracy is lost in analog optical computers, guiding future improvements for practical deployment.
Abstract: Analog optical computers promise large efficiency gains for machine learning inference, yet no demonstration has moved beyond small-scale image benchmarks. We benchmark the analog optical computer (AOC) digital twin on mortgage approval classification from 5.84 million U.S. HMDA records and separate three sources of accuracy loss. On the original 19 features, the AOC reaches 94.6% balanced accuracy with 5,126 parameters (1,024 optical), compared with 97.9% for XGBoost; the 3.3 percentage-point gap narrows by only 0.5pp when the optical core is widened from 16 to 48 channels, suggesting an architectural rather than hardware limitation. Restricting all models to a shared 127-bit binary encoding drops every model to 89.4–89.6%, with an encoding cost of 8pp for digital models and 5pp for the AOC. Seven calibrated hardware non-idealities impose no measurable penalty. The three resulting layers of limitation (encoding, architecture, hardware fidelity) locate where accuracy is lost and what to improve next.
[405] Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference
Kevin Wilkinghoff, Neelu Madan, Juan Miguel Valverde, Kamal Nasrollahi, Radu Tudor Ionescu, Rafal Wisniewski, Thomas B. Moeslund, Wenwu Wang, Zheng-Hua Tan
Main category: cs.LG
TL;DR: The paper proposes reframing multimodal anomaly detection as cross-modal contextual inference, where modalities play asymmetric roles to separate context from observation for conditional abnormality assessment.
Details
Motivation: Current anomaly detection methods assume a single unconditional reference distribution for normal behavior, but anomalies are often context-dependent. In dynamic environments, this leads to structural ambiguity where contextual variation is mistaken for abnormality. While multimodal data is available, existing methods treat all data streams equally without distinguishing contextual information from anomaly signals.Method: The paper proposes reframing multimodal anomaly detection as a cross-modal contextual inference problem where modalities play asymmetric roles - separating context from observation. This enables defining abnormality conditionally rather than relative to a single global reference distribution.
Result: The paper presents a conceptual framework rather than empirical results, outlining implications for model design, evaluation protocols, and benchmark construction in multimodal anomaly detection.
Conclusion: Multimodal anomaly detection should move beyond treating all modalities equally and instead adopt asymmetric roles to explicitly condition abnormality assessments on operating contexts, addressing structural ambiguity in dynamic environments.
Abstract: Anomaly detection aims to identify observations that deviate from expected behavior. Because anomalous events are inherently sparse, most frameworks are trained exclusively on normal data to learn a single reference model of normality. This implicitly assumes that normal behavior can be captured by a single, unconditional reference distribution. In practice, however, anomalies are often context-dependent: A specific observation may be normal under one operating condition, yet anomalous under another. As machine learning systems are deployed in dynamic and heterogeneous environments, these fixed-context assumptions introduce structural ambiguity, i.e., the inability to distinguish contextual variation from genuine abnormality under marginal modeling, leading to unstable performance and unreliable anomaly assessments. While modern sensing systems frequently collect multimodal data capturing complementary aspects of both system behavior and operating conditions, existing methods treat all data streams equally, without distinguishing contextual information from anomaly-relevant signals. As a result, abnormality is often evaluated without explicitly conditioning on operating conditions. We argue that multimodal anomaly detection should be reframed as a cross-modal contextual inference problem, in which modalities play asymmetric roles, separating context from observation, to define abnormality conditionally rather than relative to a single global reference. This perspective has implications for model design, evaluation protocols, and benchmark construction, and outline open research challenges toward robust, context-aware multimodal anomaly detection.
[406] Bias-Corrected Adaptive Conformal Inference for Multi-Horizon Time Series Forecasting
Ankit Lade, Sai Krishna J., Indar Kumar
Main category: cs.LG
TL;DR: BC-ACI improves ACI by adding bias correction to address persistent forecast bias after distribution shifts, reducing interval width while maintaining coverage.
Details
Motivation: Standard Adaptive Conformal Inference (ACI) only adjusts quantile thresholds but cannot shift interval centers when base forecasters develop persistent bias after regime changes, leading to unnecessarily wide and conservative prediction intervals.Method: Proposes Bias-Corrected ACI (BC-ACI) which augments standard ACI with an online exponentially weighted moving average (EWM) estimate of forecast bias. Corrects nonconformity scores before quantile computation and re-centers prediction intervals. Includes adaptive dead-zone threshold to suppress corrections when estimated bias is indistinguishable from noise.
Result: In experiments across 688 runs with two base models, four synthetic regimes, and three real datasets, BC-ACI reduces Winkler interval scores by 13-17% under mean and compound distribution shifts while maintaining equivalent performance on stationary data. Provides finite-sample analysis showing coverage guarantees degrade gracefully with bias estimation error.
Conclusion: BC-ACI effectively addresses the root cause of miscalibration (persistent bias) rather than just the symptom (coverage violation), producing more efficient prediction intervals while maintaining coverage guarantees under distribution shifts.
Abstract: Adaptive Conformal Inference (ACI) provides distribution-free prediction intervals with asymptotic coverage guarantees for time series under distribution shift. However, ACI only adapts the quantile threshold – it cannot shift the interval center. When a base forecaster develops persistent bias after a regime change, ACI compensates by widening intervals symmetrically, producing unnecessarily conservative bands. We propose Bias-Corrected ACI (BC-ACI), which augments standard ACI with an online exponentially weighted moving average (EWM) estimate of forecast bias. BC-ACI corrects nonconformity scores before quantile computation and re-centers prediction intervals, addressing the root cause of miscalibration rather than its symptom. An adaptive dead-zone threshold suppresses corrections when estimated bias is indistinguishable from noise, ensuring no degradation on well-calibrated data. In controlled experiments across 688 runs spanning two base models, four synthetic regimes, and three real datasets, BC-ACI reduces Winkler interval scores by 13–17% under mean and compound distribution shifts (Wilcoxon p < 0.001) while maintaining equivalent performance on stationary data (ratio 1.002x). We provide finite-sample analysis showing that coverage guarantees degrade gracefully with bias estimation error.
[407] Counterfactual Peptide Editing for Causal TCR–pMHC Binding Inference
Sanjar Khudoyberdiev, Arman Bekov
Main category: cs.LG
TL;DR: CIP framework improves TCR-pMHC binding prediction by enforcing invariance to non-anchor peptide edits while amplifying sensitivity to anchor residue changes, reducing shortcut learning.
Details
Motivation: Current neural models for TCR-pMHC binding prediction suffer from shortcut learning - they exploit spurious correlations like peptide length bias or V-gene co-occurrence rather than learning the actual physical binding interface, making predictions brittle under out-of-distribution evaluation.Method: Counterfactual Invariant Prediction (CIP) generates biologically constrained counterfactual peptide edits and enforces two auxiliary objectives: (1) invariance loss penalizing prediction changes under conservative non-anchor substitutions, and (2) contrastive loss encouraging large prediction changes under anchor-position disruptions.
Result: CIP achieves AUROC 0.831 and counterfactual consistency 0.724 under challenging family-held-out evaluation, with a 39.7% reduction in shortcut index relative to baseline. Anchor-aware edit generation is identified as the dominant driver of OOD gains.
Conclusion: CIP provides a practical framework for causally-grounded TCR specificity modeling by reducing shortcut learning through biologically-informed counterfactual training, improving generalization to out-of-distribution scenarios.
Abstract: Neural models for TCR-pMHC binding prediction are susceptible to shortcut learning: they exploit spurious correlations in training data – such as peptide length bias or V-gene co-occurrence – rather than the physical binding interface. This renders predictions brittle under family-held-out and distance-aware evaluation, where such shortcuts do not transfer. We introduce \emph{Counterfactual Invariant Prediction} (CIP), a training framework that generates biologically constrained counterfactual peptide edits and enforces invariance to edits at non-anchor positions while amplifying sensitivity at MHC anchor residues. CIP augments the base classifier with two auxiliary objectives: (1) an invariance loss penalizing prediction changes under conservative non-anchor substitutions, and (2) a contrastive loss encouraging large prediction changes under anchor-position disruptions. Evaluated on a curated VDJdb-IEDB benchmark under family-held-out, distance-aware, and random splits, CIP achieves AUROC 0.831 and counterfactual consistency (CFC) 0.724 under the challenging family-held-out protocol – a 39.7% reduction in shortcut index relative to the unconstrained baseline. Ablations confirm that anchor-aware edit generation is the dominant driver of OOD gains, providing a practical recipe for causally-grounded TCR specificity modeling.
[408] Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation
Yilang Zhang, Abraham Jaeger Mountain, Bingcong Li, Georgios B. Giannakis
Main category: cs.LG
TL;DR: BinomGBML introduces a truncated binomial expansion for efficient and accurate meta-gradient estimation in gradient-based meta-learning, improving upon existing approximation methods with provable error bounds.
Details
Motivation: Gradient-based meta-learning methods suffer from high computational overhead that scales linearly with the number of gradient descent steps. Existing approximation methods for meta-gradient estimation via truncated backpropagation suffer from large approximation errors, motivating the need for more accurate and efficient alternatives.Method: Proposes BinomGBML, which uses a truncated binomial expansion for meta-gradient estimation. This approach leverages efficient parallel computation to incorporate more information in the gradient estimation. The method is applied to model-agnostic meta-learning (MAML) as BinomMAML.
Result: BinomMAML provably enjoys improved error bounds that decay super-exponentially under mild conditions. Numerical tests corroborate the theoretical analysis and show boosted performance with only slightly increased computational overhead compared to existing methods.
Conclusion: The binomial expansion approach provides an efficient and accurate alternative for meta-gradient estimation in gradient-based meta-learning, offering theoretical guarantees and practical performance improvements with manageable computational cost.
Abstract: Meta-learning offers a principled framework leveraging \emph{task-invariant} priors from related tasks, with which \emph{task-specific} models can be fine-tuned on downstream tasks, even with limited data records. Gradient-based meta-learning (GBML) relies on gradient descent (GD) to adapt the prior to a new task. Albeit effective, these methods incur high computational overhead that scales linearly with the number of GD steps. To enhance efficiency and scalability, existing methods approximate the gradient of prior parameters (meta-gradient) via truncated backpropagation, yet suffer large approximation errors. Targeting accurate approximation, this work puts forth binomial GBML (BinomGBML), which relies on a truncated binomial expansion for meta-gradient estimation. This novel expansion endows more information in the meta-gradient estimation via efficient parallel computation. As a running paradigm applied to model-agnostic meta-learning (MAML), the resultant BinomMAML provably enjoys error bounds that not only improve upon existing approaches, but also decay super-exponentially under mild conditions. Numerical tests corroborate the theoretical analysis and showcase boosted performance with slightly increased computational overhead.
[409] Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling
Anton Saenko, Pranshav Gajjar, Abiodun Ganiyu, Vijay K. Shah
Main category: cs.LG
TL;DR: LLM confidence scores in telecom tasks are unreliable and overconfident; Twin-Pass CoT-Ensembling improves calibration by aggregating multiple reasoning evaluations, reducing calibration error by up to 88%.
Details
Motivation: LLMs are increasingly used for complex telecommunications tasks like 3GPP specification analysis and O-RAN troubleshooting, but their confidence scores are often biased and unreliable, showing systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice.Method: Proposes Twin-Pass Chain of Thought (CoT)-Ensembling methodology for improving confidence estimation. It leverages multiple independent reasoning evaluations and aggregates their assessments into a calibrated confidence score. Evaluated on Gemma-3 model family (4B, 12B, 27B parameters) using TeleQnA, ORANBench, and srsRANBench benchmarks.
Result: The approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. Shows that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions.
Conclusion: Highlights limitations of current confidence estimation practices and demonstrates a practical path toward more trustworthy evaluation of LLM outputs in telecommunications through improved calibration techniques.
Abstract: Large Language Models (LLMs) are increasingly applied to complex telecommunications tasks, including 3GPP specification analysis and O-RAN network troubleshooting. However, a critical limitation remains: LLM-generated confidence scores are often biased and unreliable, frequently exhibiting systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice. In this paper, we study confidence calibration in telecom-domain LLMs using the representative Gemma-3 model family (4B, 12B, and 27B parameters), evaluated on TeleQnA, ORANBench, and srsRANBench. We show that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions. To address this, we propose a novel Twin-Pass Chain of Thought (CoT)-Ensembling methodology for improving confidence estimation by leveraging multiple independent reasoning evaluations and aggregating their assessments into a calibrated confidence score. Our approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. These results highlight the limitations of current confidence estimation practices and demonstrate a practical path toward more trustworthy evaluation of LLM outputs in telecommunications.
[410] MOONSHOT : A Framework for Multi-Objective Pruning of Vision and Large Language Models
Gabriel Afriat, Xiang Meng, Shibal Ibrahim, Hussein Hazimeh, Rahul Mazumder
Main category: cs.LG
TL;DR: MOONSHOT is a multi-objective pruning framework that combines layer-wise reconstruction loss and second-order Taylor approximation to improve post-training one-shot pruning across various architectures and sparsity levels.
Details
Motivation: Existing one-shot pruning methods use single objectives (either layer-wise reconstruction loss or second-order Taylor approximation) but neither is consistently effective across different architectures and sparsity levels. The authors aim to create a more robust pruning approach by combining both objectives.Method: MOONSHOT extends any single-objective pruning method into a multi-objective framework that jointly optimizes both layer-wise reconstruction error and second-order Taylor approximation of training loss. It acts as a wrapper around existing pruning algorithms and includes efficient computation of inverse Hessian for scalability to billion-parameter models.
Result: On Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves ImageNet-1k accuracy by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.
Conclusion: MOONSHOT provides a general and flexible framework that significantly improves pruning effectiveness by combining multiple objectives, demonstrating strong performance across language models and vision architectures at various sparsity levels.
Abstract: Weight pruning is a common technique for compressing large neural networks. We focus on the challenging post-training one-shot setting, where a pre-trained model is compressed without any retraining. Existing one-shot pruning methods typically optimize a single objective, such as a layer-wise reconstruction loss or a second-order Taylor approximation of the training loss. We highlight that neither objective alone is consistently the most effective across architectures and sparsity levels. Motivated by this insight, we propose MOONSHOT, a general and flexible framework that extends any single-objective pruning method into a multi-objective formulation by jointly optimizing both the layer-wise reconstruction error and second-order Taylor approximation of the training loss. MOONSHOT acts as a wrapper around existing pruning algorithms. To enable this integration while maintaining scalability to billion-parameter models, we propose modeling decisions and introduce an efficient procedure for computing the inverse Hessian, preserving the efficiency of state-of-the-art one-shot pruners. When combined with state-of-the-art pruning methods on Llama-3.2 and Llama-2 models, MOONSHOT reduces C4 perplexity by up to 32.6% at 2:4 sparsity and improves zero-shot mean accuracy across seven classification benchmarks by up to 4.9 points. On Vision Transformers, it improves accuracy on ImageNet-1k by over 5 points at 70% sparsity, and on ResNet-50, it yields a 4-point gain at 90% sparsity.
[411] Physics-informed reservoir characterization from bulk and extreme pressure events with a differentiable simulator
Harun Ur Rashid, Mingxin Li, Aleksandra Pachalieva, Georg Stadler, Daniel O’Malley
Main category: cs.LG
TL;DR: Physics-informed ML method embeds differentiable subsurface flow simulator into neural network training to infer heterogeneous permeability fields from limited pressure observations while maintaining physical consistency.
Details
Motivation: Traditional subsurface characterization methods rely on expensive full-physics simulations that are infeasible for handling uncertainty and extreme events at scale, while purely data-driven models struggle with physics consistency for sparse observations, complex geology, and extreme events.Method: Physics-informed machine learning approach that embeds a differentiable subsurface flow simulator directly into neural network training. The network infers heterogeneous permeability fields from limited pressure observations, with training minimizing both permeability and pressure losses through the simulator to enforce physical consistency.
Result: The method reduces pressure inference error by half compared to purely data-driven approaches across eight distinct data scenarios. It maintains higher pressure inference accuracy even for extreme events in the tail of sample distributions.
Conclusion: The proposed method enables rapid, physics-consistent subsurface inversion for real-time reservoir characterization and risk-aware decision-making, with the simulator used only during training to keep inference fast.
Abstract: Accurate characterization of subsurface heterogeneity is challenging but essential for applications such as reservoir pressure management, geothermal energy extraction and CO$_2$, H$_2$, and wastewater injection operations. This challenge becomes especially acute in extreme pressure events, which are rarely observed but can strongly affect operational risk. Traditional history matching and inversion techniques rely on expensive full-physics simulations, making it infeasible to handle uncertainty and extreme events at scale. Purely data-driven models often struggle to maintain physics consistency when dealing with sparse observations, complex geology, and extreme events. To overcome these limitations, we introduce a physics-informed machine learning method that embeds a differentiable subsurface flow simulator directly into neural network training. The network infers heterogeneous permeability fields from limited pressure observations, while training minimizes both permeability and pressure losses through the simulator, enforcing physical consistency. Because the simulator is used only during training, inference remains fast once the model is learned. In an initial test, the proposed method reduces the pressure inference error by half compared with a purely data-driven approach. We then extend the test over eight distinct data scenarios, and in every case, our method produces significantly lower pressure inference errors than the purely data-driven model. We also evaluate our method on extreme events, which represent high-consequence data in the tail of the sample distribution. Similar to the bulk distribution, the physics-informed model maintains higher pressure inference accuracy in the extreme event regimes. Overall, the proposed method enables rapid, physics-consistent subsurface inversion for real-time reservoir characterization and risk-aware decision-making.
[412] Some Theoretical Limitations of t-SNE
Rupert Li, Elchanan Mossel
Main category: cs.LG
TL;DR: t-SNE visualization technique loses important data features; mathematical framework shows how and when this occurs
Details
Motivation: t-SNE is popular for data visualization but all dimension reduction techniques lose some data features; need to mathematically understand what specific features are lost when using t-SNEMethod: Develop mathematical framework with theoretical results showing how t-SNE loses important data features in different scenarios
Result: Established multiple mathematical results demonstrating specific ways t-SNE loses important data features during dimension reduction
Conclusion: t-SNE, like other dimension reduction methods, loses important data features; mathematical framework helps understand what features are lost and when
Abstract: t-SNE has gained popularity as a dimension reduction technique, especially for visualizing data. It is well-known that all dimension reduction techniques may lose important features of the data. We provide a mathematical framework for understanding this loss for t-SNE by establishing a number of results in different scenarios showing how important features of data are lost by using t-SNE.
[413] Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Eun Woo Im, Dhruv Madhwal, Vivek Gupta
Main category: cs.LG
TL;DR: Slipform improves vision-language models’ compositional reasoning by using lexical concreteness to generate effective hard negatives and addressing gradient imbalance with a margin-based loss.
Details
Motivation: Vision-language models struggle with compositional reasoning, particularly with word order and attribute binding, due to insufficient informative samples during contrastive pretraining. Existing hard negative mining methods lack explicit mechanisms for determining which linguistic elements to modify.Method: 1) Introduces ConcretePlant to systematically isolate and manipulate perceptually grounded concepts based on lexical concreteness, where modifying highly concrete terms generates stronger learning signals. 2) Proposes Cement loss, a margin-based approach that correlates psycholinguistic scores with sample difficulty to dynamically calibrate penalization and address gradient imbalance in InfoNCE.
Result: The integrated Slipform framework achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single/multi-label linear probing.
Conclusion: Lexical concreteness is a fundamental determinant of negative sample efficacy for improving compositional reasoning in vision-language models. Addressing gradient imbalance through calibrated penalization significantly enhances model performance on nuanced semantic tasks.
Abstract: Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analyses of the InfoNCE further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated utilizing a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, single and multi label linear probing.
[414] Beyond Uniform Sampling: Synergistic Active Learning and Input Denoising for Robust Neural Operators
Samrendra Roy, Souvik Chakraborty, Syed Bahauddin Alam
Main category: cs.LG
TL;DR: Neural operators for physics simulations are vulnerable to adversarial attacks; proposed defense combines active learning-based targeted data generation with input denoising architecture to improve robustness while maintaining accuracy.
Details
Motivation: Neural operators serve as fast surrogate models for physics simulations but are highly vulnerable to adversarial perturbations, creating safety risks for digital twin deployments in critical applications like nuclear reactor monitoring.Method: Combines two approaches: 1) Active learning component that uses differential evolution attacks to probe model weaknesses and generates targeted training data at vulnerability locations with adaptive smooth-ratio safeguard; 2) Input denoising component that adds a learnable bottleneck to filter adversarial noise while preserving physics-relevant features.
Result: On viscous Burgers’ equation benchmark, combined approach achieves 2.04% combined error (87% reduction relative to standard training), outperforming both active learning alone (3.42%) and input denoising alone (5.22%).
Conclusion: Optimal training data for neural operators is architecture-dependent as different architectures concentrate sensitivity in distinct input subspaces; uniform sampling cannot adequately cover vulnerability landscape of all models, with implications for safety-critical energy system deployments.
Abstract: Neural operators have emerged as fast surrogate models for physics simulations, yet they remain acutely vulnerable to adversarial perturbations, a critical liability for safety-critical digital twin deployments. We present a synergistic defense that combines active learning-based data generation with an input denoising architecture. The active learning component adaptively probes model weaknesses using differential evolution attacks, then generates targeted training data at discovered vulnerability locations while an adaptive smooth-ratio safeguard preserves baseline accuracy. The input denoising component augments the operator architecture with a learnable bottleneck that filters adversarial noise while retaining physics-relevant features. On the viscous Burgers’ equation benchmark, the combined approach achieves a 2.04% combined error (1.21% baseline + 0.83% robustness), representing an 87% reduction relative to standard training (15.42% combined) and outperforming both active learning alone (3.42%) and input denoising alone (5.22%). More broadly, our results, combined with cross-architecture vulnerability analysis from prior work, suggest that optimal training data for neural operators is architecture-dependent: because different architectures concentrate sensitivity in distinct input subspaces, uniform sampling cannot adequately cover the vulnerability landscape of all models. These findings have potential implications for the deployment of neural operators in safety-critical energy systems including nuclear reactor monitoring.
[415] Multi-Task LLM with LoRA Fine-Tuning for Automated Cancer Staging and Biomarker Extraction
Jiahao Shao, Anam Nawaz Khan, Christopher Brett, Tom Berg, Xueping Li, Bing Yao
Main category: cs.LG
TL;DR: Fine-tuned Llama-3-8B with LoRA for multi-task extraction of breast cancer staging and biomarkers from pathology reports, achieving high accuracy with parameter-efficient design.
Details
Motivation: Unstructured pathology reports hinder large-scale cancer data curation, while existing LLM approaches face computational cost and hallucination issues, creating need for reliable, efficient extraction methods.Method: Fine-tuned Llama-3-8B-Instruct using LoRA on 10,677 expert-verified reports, with parallel classification heads for multi-task extraction (TNM staging, histologic grade, biomarkers) instead of generative approach.
Result: Achieved Macro F1 score of 0.976, outperforming rule-based NLP, zero-shot LLMs, and single-task LLM baselines, effectively handling complex contextual ambiguities and heterogeneous reporting formats.
Conclusion: Parameter-efficient multi-task architecture enables reliable, scalable pathology data extraction for clinical decision support and oncology research, addressing computational and hallucination limitations of traditional LLM approaches.
Abstract: Pathology reports serve as the definitive record for breast cancer staging, yet their unstructured format impedes large-scale data curation. While Large Language Models (LLMs) offer semantic reasoning, their deployment is often limited by high computational costs and hallucination risks. This study introduces a parameter-efficient, multi-task framework for automating the extraction of Tumor-Node-Metastasis (TNM) staging, histologic grade, and biomarkers. We fine-tune a Llama-3-8B-Instruct encoder using Low-Rank Adaptation (LoRA) on a curated, expert-verified dataset of 10,677 reports. Unlike generative approaches, our architecture utilizes parallel classification heads to enforce consistent schema adherence. Experimental results demonstrate that the model achieves a Macro F1 score of 0.976, successfully resolving complex contextual ambiguities and heterogeneous reporting formats that challenge traditional extraction methods including rule-based natural language processing (NLP) pipelines, zero-shot LLMs, and single-task LLM baselines. The proposed adapter-efficient, multi-task architecture enables reliable, scalable pathology-derived cancer staging and biomarker profiling, with the potential to enhance clinical decision support and accelerate data-driven oncology research.
[416] Text-Attributed Knowledge Graph Enrichment with Large Language Models for Medical Concept Representation
Mohsen Nayebi Kerdabadi, Arya Hadizadeh Moghaddam, Chen Chen, Dongjie Wang, Zijun Yao
Main category: cs.LG
TL;DR: CoMed: LLM-empowered graph learning framework for medical concept representation that combines EHR-mined associations with LLM-generated semantic relations and text attributes to create unified embeddings for clinical prediction tasks.
Details
Motivation: Current medical concept representation learning faces two key challenges: (1) missing/incomplete cross-type dependencies (diagnosis-medication-procedure relations) in existing ontologies, limiting EHR pattern modeling, and (2) difficulty integrating rich clinical text semantics with knowledge graph structure for representation learning.Method: 1. Build global knowledge graph over medical codes by combining statistically reliable associations mined from EHRs with type-constrained LLM prompting to infer semantic relations. 2. Use LLMs to enrich KG into text-attributed graph by generating node descriptions and edge rationales. 3. Jointly train LoRA-tuned LLaMA text encoder with heterogeneous GNN to fuse text semantics and graph structure into unified concept embeddings.
Result: Extensive experiments on MIMIC-III and MIMIC-IV show CoMed consistently improves prediction performance and serves as an effective plug-in concept encoder for standard EHR pipelines.
Conclusion: CoMed successfully addresses the challenges of medical concept representation by leveraging LLMs to bridge the gap between structured EHR data and clinical semantics, creating a powerful framework that enhances downstream clinical prediction tasks.
Abstract: In electronic health record (EHR) mining, learning high-quality representations of medical concepts (e.g., standardized diagnosis, medication, and procedure codes) is fundamental for downstream clinical prediction. However, robust concept representation learning is hindered by two key challenges: (i) clinically important cross-type dependencies (e.g., diagnosis-medication and medication-procedure relations) are often missing or incomplete in existing ontology resources, limiting the ability to model complex EHR patterns; and (ii) rich clinical semantics are often missing from structured resources, and even when available as text, are difficult to integrate with KG structure for representation learning. To address these challenges, we present CoMed, an LLM-empowered graph learning framework for medical concept representation. CoMed first builds a global knowledge graph (KG) over medical codes by combining statistically reliable associations mined from EHRs with type-constrained LLM prompting to infer semantic relations. It then utilizes LLMs to enrich the KG into a text-attributed graph by generating node descriptions and edge rationales, providing semantic signals for both concepts and their relationships. Finally, CoMed jointly trains a LoRA-tuned LLaMA text encoder with a heterogeneous GNN, fusing text semantics and graph structure into unified concept embeddings. Extensive experiments on MIMIC-III and MIMIC-IV show that CoMed consistently improves prediction performance and serves as an effective plug-in concept encoder for standard EHR pipelines.
[417] Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models
Jingyun Jia, Chandan Singh, Rich Caruana, Ben Lengerich
Main category: cs.LG
TL;DR: TabDistill uses tabular foundation models to automatically discover meaningful feature interactions for generalized additive models (GAMs), improving predictive performance without heuristic interaction selection.
Details
Motivation: Traditional GAMs for tabular data rely on heuristic procedures to select feature interactions, which can miss higher-order or context-dependent effects. There's a need for more systematic, data-driven approaches to interaction discovery.Method: TabDistill first fits a tabular foundation model to the dataset, then applies post-hoc interaction attribution methods to extract salient feature interactions from the foundation model, and finally uses these discovered interactions as terms in a GAM.
Result: Interactions identified by TabDistill lead to consistent improvements in downstream GAMs’ predictive performance across tasks, demonstrating that tabular foundation models can effectively guide interaction discovery.
Conclusion: Tabular foundation models serve as effective, data-driven guides for interaction discovery, bridging high-capacity models and interpretable additive frameworks like GAMs.
Abstract: Identifying meaningful feature interactions is a central challenge in building accurate and interpretable models for tabular data. Generalized additive models (GAMs) have shown great success at modeling tabular data, but often rely on heuristic procedures to select interactions, potentially missing higher-order or context-dependent effects. To meet this challenge, we propose TabDistill, a method that leverages tabular foundation models and post-hoc distillation methods. Our key intuition is that tabular foundation models implicitly learn rich, adaptive feature dependencies through large-scale representation learning. Given a dataset, TabDistill first fits a tabular foundation model to the dataset, and then applies a post-hoc interaction attribution method to extract salient feature interactions from it. We evaluate these interactions by then using them as terms in a GAM. Across tasks, we find that interactions identified by TabDistill lead to consistent improvements in downstream GAMs’ predictive performance. Our results suggest that tabular foundation models can serve as effective, data-driven guides for interaction discovery, bridging high-capacity models and interpretable additive frameworks.
[418] When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration
Yiping Li, Zhiyu An, Wan Du
Main category: cs.LG
TL;DR: Orthogonal Backfill (OBF) compresses KV caches in LLM-based multi-agent communication by using eviction-style compression with orthogonal residuals to preserve useful information while reducing communication costs by 79.8%-89.4%.
Details
Motivation: Communication in LLM-based multi-agent systems is moving beyond discrete tokens to preserve richer context through latent messages like full KV caches, but this incurs high memory and communication costs that need to be addressed.Method: Adapts eviction-style KV compression to multi-agent communication and introduces Orthogonal Backfill (OBF) to mitigate information loss from hard eviction by injecting low-rank orthogonal residuals from discarded KV states into retained KV states.
Result: Achieves performance comparable to full KV relay while reducing communication cost by 79.8%–89.4%, and further improves performance to achieve best results on 7 of 9 benchmarks spanning mathematical reasoning, coding, and knowledge-intensive QA.
Conclusion: More information doesn’t necessarily lead to better communication; preserving the most useful information matters more. OBF effectively compresses KV caches while maintaining or improving performance in multi-agent LLM systems.
Abstract: Communication in Large Language Model (LLM)-based multi-agent systems is moving beyond discrete tokens to preserve richer context. Recent work such as LatentMAS enables agents to exchange latent messages through full key-value (KV) caches. However, full KV relay incurs high memory and communication cost. We adapt eviction-style KV compression to this setting and introduce Orthogonal Backfill (OBF) to mitigate information loss from hard eviction. OBF injects a low-rank orthogonal residual from discarded KV states into the retained KV states. We evaluate proposed method against full KV relay on nine standard benchmarks spanning mathematical reasoning, coding, and knowledge-intensive QA. It achieves performance comparable to full KV relay while reducing communication cost by 79.8%–89.4%. OBF further improves the performance and achieves the best results on 7 of the 9 benchmarks. This suggests that more information does not necessarily lead to better communication; preserving the most useful information matters more. Our codebase is publicly available on https://github.com/markli404/When-Less-Latent-Leads-to-Better-Relay.
[419] BioTrain: Sub-MB, Sub-50mW On-Device Fine-Tuning for Edge-AI on Biosignals
Run Wang, Victor J. B. Jung, Philip Wiese, Sebastian Frey, Giusy Spacone, Francesco Conti, Alessio Burrello, Luca Benin
Main category: cs.LG
TL;DR: BioTrain enables full-network fine-tuning of biosignal models on milliwatt-scale MCUs, achieving up to 35% accuracy improvement over non-adapted baselines while reducing memory footprint by 8.1x.
Details
Motivation: Biosignals show cross-subject variability causing domain shifts that degrade AI model performance on edge devices. On-device adaptation is needed for privacy and reliability, but current MCU platforms can't support full backpropagation due to memory and computational constraints.Method: Proposes BioTrain framework for full-network fine-tuning under milliwatt power and sub-megabyte memory constraints. Uses efficient memory allocator and network topology optimization to enable large batch sizes and reduce peak memory usage.
Result: Achieves 35% accuracy improvement over non-adapted baselines, outperforms last-layer updates by ~7% for new-subject calibration. On GAP9 MCU: 17 samples/s for EEG and 85 samples/s for EOG models under 50 mW power. Reduces memory footprint by 8.1x (from 5.4 MB to 0.67 MB).
Conclusion: BioTrain enables practical on-device adaptation of biosignal models on resource-constrained MCUs, addressing domain shift challenges while maintaining privacy and system reliability.
Abstract: Biosignals exhibit substantial cross-subject and cross-session variability, inducing severe domain shifts that degrade post-deployment performance for small, edge-oriented AI models. On-device adaptation is therefore essential to both preserve user privacy and ensure system reliability. However, existing sub-100 mW MCU-based wearable platforms can only support shallow or sparse adaptation schemes due to the prohibitive memory footprint and computational cost of full backpropagation (BP). In this paper, we propose BioTrain, a framework enabling full-network fine-tuning of state-of-the-art biosignal models under milliwatt-scale power and sub-megabyte memory constraints. We validate BioTrain using both offline and on-device benchmarks on EEG and EOG datasets, covering Day-1 new-subject calibration and longitudinal adaptation to signal drift. Experimental results show that full-network fine-tuning achieves accuracy improvements of up to 35% over non-adapted baselines and outperforms last-layer updates by approximately 7% during new-subject calibration. On the GAP9 MCU platform, BioTrain enables efficient on-device training throughput of 17 samples/s for EEG and 85 samples/s for EOG models within a power envelope below 50 mW. In addition, BioTrain’s efficient memory allocator and network topology optimization enable the use of a large batch size, reducing peak memory usage. For fully on-chip BP on GAP9, BioTrain reduces the memory footprint by 8.1x, from 5.4 MB to 0.67 MB, compared to conventional full-network fine-tuning using batch normalization with batch size 8.
[420] Diffusion Sequence Models for Generative In-Context Meta-Learning of Robot Dynamics
Angelo Moroncelli, Matteo Rufolo, Gunes Cagin Aydin, Asad Ali Shahid, Loris Roveda
Main category: cs.LG
TL;DR: Diffusion-based generative meta-models outperform deterministic Transformers for robust robot dynamics system identification under distribution shifts, with inpainting diffusion achieving best performance while meeting real-time control constraints through warm-started sampling.
Details
Motivation: Accurate robot dynamics modeling is crucial for model-based control but remains challenging under distribution shifts and real-time constraints. The paper aims to improve robustness in system identification by exploring generative sequence models as an alternative to deterministic approaches.Method: Formulates system identification as in-context meta-learning problem. Compares deterministic Transformer-based meta-model with two diffusion-based approaches: (1) inpainting diffusion (Diffuser) learning joint input-observation distribution, and (2) conditioned diffusion models (CNN and Transformer) generating future observations conditioned on control inputs. Uses large-scale randomized simulations to analyze performance across in-distribution and out-of-distribution regimes.
Result: Diffusion models significantly improve robustness under distribution shift, with inpainting diffusion achieving the best performance. Warm-started sampling enables diffusion models to operate within real-time constraints, making them viable for control applications.
Conclusion: Generative meta-models represent a promising direction for robust system identification in robotics, with diffusion models offering superior robustness to distribution shifts while maintaining computational feasibility for real-time control.
Abstract: Accurate modeling of robot dynamics is essential for model-based control, yet remains challenging under distributional shifts and real-time constraints. In this work, we formulate system identification as an in-context meta-learning problem and compare deterministic and generative sequence models for forward dynamics prediction. We take a Transformer-based meta-model, as a strong deterministic baseline, and introduce to this setting two complementary diffusion-based approaches: (i) inpainting diffusion (Diffuser), which learns the joint input-observation distribution, and (ii) conditioned diffusion models (CNN and Transformer), which generate future observations conditioned on control inputs. Through large-scale randomized simulations, we analyze performance across in-distribution and out-of-distribution regimes, as well as computational trade-offs relevant for control. We show that diffusion models significantly improve robustness under distribution shift, with inpainting diffusion achieving the best performance in our experiments. Finally, we demonstrate that warm-started sampling enables diffusion models to operate within real-time constraints, making them viable for control applications. These results highlight generative meta-models as a promising direction for robust system identification in robotics.
[421] Linear Probe Accuracy Scales with Model Size and Benefits from Multi-Layer Ensembling
Erik Nordby, Tasha Pais, Aviel Parrack
Main category: cs.LG
TL;DR: Multi-layer linear probe ensembles detect model deception better than single-layer probes, with performance scaling with model size.
Details
Motivation: Single-layer linear probes for detecting when language models produce outputs they "know" are wrong are fragile - the best layer varies across models/tasks, and probes fail on some deception types. Need more robust detection methods.Method: Combine probes from multiple layers into an ensemble to recover strong performance where single-layer probes fail. Analyze geometric properties of deception directions across layers.
Result: Multi-layer ensembles improve AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge tasks. Probe accuracy improves with model scale (~5% AUROC per 10x parameters). Deception directions rotate gradually across layers rather than appearing at one location.
Conclusion: Multi-layer probe ensembles provide more robust detection of model deception than single-layer probes, with performance scaling with model size. The gradual rotation of deception directions across layers explains both the brittleness of single-layer probes and the success of multi-layer ensembles.
Abstract: Linear probes can detect when language models produce outputs they “know” are wrong, a capability relevant to both deception and reward hacking. However, single-layer probes are fragile: the best layer varies across models and tasks, and probes fail entirely on some deception types. We show that combining probes from multiple layers into an ensemble recovers strong performance even where single-layer probes fail, improving AUROC by +29% on Insider Trading and +78% on Harm-Pressure Knowledge. Across 12 models (0.5B–176B parameters), we find probe accuracy improves with scale: ~5% AUROC per 10x parameters (R=0.81). Geometrically, deception directions rotate gradually across layers rather than appearing at one location, explaining both why single-layer probes are brittle and why multi-layer ensembles succeed.
[422] Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Tianyi Li, Kaiyu Tang, Xiao Li, Jing Li
Main category: cs.LG
TL;DR: The paper analyzes non-determinism in Diffusion Language Models (DLMs), showing that dataset-level metrics mask fine-grained instability, and introduces Factor Variance Attribution to decompose non-determinism sources across model and system factors.
Details
Motivation: Current non-determinism evaluations for LLMs rely on dataset-level metrics under fixed inference configurations, which aggregate sample-level prediction quality and systematically attenuate non-determinism in DLMs, leaving fine-grained instability and error patterns uncharacterized.Method: Conducts fine-grained evaluation of non-determinism based on sample-level prediction differences across model-related factors (guidance scale, diffusion steps, Monte Carlo sampling) and system-related factors (batch size, hardware, numerical precision). Introduces Factor Variance Attribution (FVA) to decompose observed non-determinism into variance attributable to different evaluation factor settings.
Result: Non-determinism in DLMs is pervasive and structured, with code generation exhibiting markedly higher sensitivity to factor-level choices than question answering. FVA enables attribution of non-determinism sources across different evaluation factors.
Conclusion: Fine-grained, factor-aware evaluation is needed for reliable non-determinism assessment of diffusion language models, as dataset-level metrics systematically mask important behavioral variations.
Abstract: Diffusion language models (DLMs) have emerged as a promising paradigm for large language models (LLMs), yet the non-deterministic behavior of DLMs remains poorly understood. The existing non-determinism evaluations for LLMs predominantly rely on dataset-level metrics under fixed inference configurations, providing limited insight into how model behavior varies across runs and evaluation conditions. In this work, we show that dataset-level metrics systematically attenuate non-determinism in diffusion language models by aggregating sample-level prediction quality across different runs. As a result, configurations with similar aggregate performance can exhibit substantially different behaviors on individual inputs, leaving fine-grained instability and distinct error patterns uncharacterized. To address this limitation, we conduct a fine-grained evaluation of non-determinism based on sample-level prediction differences across a range of model-related factors-including guidance scale, diffusion steps, and Monte Carlo sampling-as well as system-related factors such as batch size, hardware, and numerical precision. Our analysis reveals that non-determinism in DLMs is pervasive and structured, with code generation exhibiting markedly higher sensitivity to factor-level choices than question answering. To attribute sources of non-determinism evaluation, we introduce Factor Variance Attribution (FVA), a cross-factor analysis metric that decomposes observed non-determinism into variance attributable to different evaluation factor settings. Our findings highlight the need for fine-grained, factor-aware evaluation to enable reliable non-determinism assessment of diffusion language models.
[423] Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Main category: cs.LG
TL;DR: Ensemble methods for dependent data: theoretical analysis of variance reduction in Markov-dependent settings with adaptive algorithm achieving minimax rates
Details
Motivation: Classical ensemble methods assume independent base learners, but real-world data often exhibits Markov dependence (time-series, RL replay buffers, spatial grids). Existing theory doesn't fully quantify how this dependence degrades ensemble performance guarantees.Method: 1) Established information-theoretic lower bound for stationary, reversible, geometrically ergodic Markov chains. 2) Showed dependence-agnostic uniform bagging is suboptimal. 3) Proposed adaptive spectral routing algorithm that partitions training data using empirical Fiedler eigenvector of dependency graph.
Result: Proved minimax rate of O(√(T_mix/n)) for classification risk in Markov settings. Showed uniform bagging has Ω(T_mix/√n) excess risk (√T_mix gap). Adaptive spectral routing achieves minimax rate up to lower-order term without knowing mixing time.
Conclusion: Theoretical framework for ensemble methods with dependent data, with practical algorithm that adapts to data dependence structure. Validated on synthetic Markov chains, spatial grids, UCR archive, and Atari DQN ensembles.
Abstract: Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $Ω(\sqrt{\Tmix/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $Ω(\Tmix/\sqrt{n})$, exhibiting a $\sqrt{\Tmix}$ algorithmic gap. Finally, we propose \emph{adaptive spectral routing}, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{\Tmix/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $\Tmix$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
[424] WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework
Xingjian Zhao, Mohammad Mohammadi Amiri, Malik Magdon-Ismail
Main category: cs.LG
TL;DR: WIN-U is a machine unlearning framework that removes specific data from trained models without needing access to the retained data, using only second-order information and a Newton-style update.
Details
Motivation: Addressing privacy concerns in LLMs by enabling data "right to be forgotten" through machine unlearning, while overcoming practical limitations of existing methods that require direct access to retained data.Method: Proposes WIN-U framework that uses only second-order information from the originally trained model. Performs unlearning via a single Newton-style step using Woodbury matrix identity and generalized Gauss-Newton approximation for forget set curvature.
Result: Achieves state-of-the-art performance on vision and language benchmarks for unlearning efficacy and utility preservation, while being more robust against relearning attacks compared to existing methods.
Conclusion: WIN-U provides an effective retained-data free unlearning solution that approximates the gold-standard retraining optimum, addressing practical privacy and cost constraints in machine unlearning.
Abstract: Privacy concerns in LLMs have led to the rapidly growing need to enforce a data’s “right to be forgotten”. Machine unlearning addresses precisely this task, namely the removal of the influence of some specific data, i.e., the forget set, from a trained model. The gold standard for unlearning is to produce the model that would have been learned on only the rest of the training data, i.e., the retain set. Most existing unlearning methods rely on direct access to the retained data, which may not be practical due to privacy or cost constraints. We propose WIN-U, a retained-data free unlearning framework that requires only second order information for the originally trained model on the full data. The unlearning is performed using a single Newton-style step. Using the Woodbury matrix identity and a generalized Gauss-Newton approximation for the forget set curvature, the WIN-U update recovers the closed-form linear solution and serves as a local second-order approximation to the gold-standard retraining optimum. Extensive experiments on various vision and language benchmarks demonstrate that WIN-U achieves SOTA performance in terms of unlearning efficacy and utility preservation, while being more robust against relearning attacks compared to existing methods. Importantly, WIN-U does not require access to the retained data.
[425] A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing
Main category: cs.LG
TL;DR: Proposes a lightweight, backpropagation-free sensitivity analysis framework using KL divergence to identify quantization-sensitive components in hybrid SSM-Transformer models for efficient edge deployment.
Details
Motivation: Deploying LLMs on edge devices faces computational/memory constraints; hybrid SSM-Transformer architectures offer efficiency but aggressive quantization requires careful management of uneven effects on different components.Method: Lightweight, backpropagation-free surrogate-based sensitivity analysis using forward-pass metrics only; formal analysis shows KL divergence better captures quantization sensitivity than MSE/SQNR for language modeling.
Result: KL-based rankings align with observed performance drops and outperform alternative metrics; KL-guided mixed-precision achieves near-FP16 perplexity with competitive model sizes/throughput vs Uniform INT4 on Intel Lunar Lake hardware.
Conclusion: Framework enables practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss, validated with real-world on-device profiling.
Abstract: Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL-based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss. We further validate our approach with real-world on-device profiling on Intel Lunar Lake hardware, demonstrating that KL-guided mixed-precision achieves near-FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at https://github.com/jasonkongie/kl-ssm-quant.
[426] FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction
Xinjin Li, Jinghan Cao, Mengyue Wang, Yue Wu, Longxiang Yan, Yeyang Zhou, Ziqi Sha, Yu Ma
Main category: cs.LG
TL;DR: FAST combines attention and state-space models for scalable traffic forecasting with linear complexity, achieving state-of-the-art results on traffic benchmarks.
Details
Motivation: Existing traffic forecasting methods face a trade-off between expressiveness (Transformers capture global dependencies but have quadratic complexity) and efficiency (state-space models are efficient but less effective at modeling spatial interactions in graph-structured traffic data).Method: FAST uses a Temporal-Spatial-Temporal architecture: temporal attention modules capture short- and long-term patterns, Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity, plus learnable multi-source spatiotemporal embeddings and multi-level skip prediction for hierarchical feature fusion.
Result: FAST outperforms Transformer-, GNN-, attention-, and Mamba-based baselines on PeMS04, PeMS07, and PeMS08 benchmarks, achieving best MAE and RMSE with up to 4.3% lower RMSE and 2.8% lower MAE than strongest baselines.
Conclusion: FAST demonstrates favorable balance between accuracy, scalability, and generalization for spatiotemporal traffic forecasting by effectively combining attention and state-space modeling.
Abstract: Traffic forecasting requires modeling complex temporal dynamics and long-range spatial dependencies over large sensor networks. Existing methods typically face a trade-off between expressiveness and efficiency: Transformer-based models capture global dependencies well but suffer from quadratic complexity, while recent selective state-space models are computationally efficient yet less effective at modeling spatial interactions in graph-structured traffic data. We propose FAST, a unified framework that combines attention and state-space modeling for scalable spatiotemporal traffic forecasting. FAST adopts a Temporal-Spatial-Temporal architecture, where temporal attention modules capture both short- and long-term temporal patterns, and a Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity. To better represent heterogeneous traffic contexts, FAST further introduces a learnable multi-source spatiotemporal embedding that integrates historical traffic flow, temporal context, and node-level information, together with a multi-level skip prediction mechanism for hierarchical feature fusion. Experiments on PeMS04, PeMS07, and PeMS08 show that FAST consistently outperforms strong baselines from Transformer-, GNN-, attention-, and Mamba-based families. In particular, FAST achieves the best MAE and RMSE on all three benchmarks, with up to 4.3% lower RMSE and 2.8% lower MAE than the strongest baseline, demonstrating a favorable balance between accuracy, scalability, and generalization.
[427] Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks
Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed
Main category: cs.LG
TL;DR: A physics-informed hybrid CNN-BiLSTM model for solar irradiance forecasting that outperforms complex Transformer architectures in arid regions with rapid aerosol fluctuations.
Details
Motivation: To challenge the prevailing "complexity-first" paradigm in solar forecasting and demonstrate that lightweight, physics-informed models can outperform computationally expensive Transformer-based architectures, especially in high-noise meteorological tasks.Method: A hybrid CNN-BiLSTM framework that integrates spatial feature extraction (CNN) with temporal dependency capture (BiLSTM), guided by 15 engineered physics-based features including Clear-Sky indices and Solar Zenith Angle, with Bayesian Optimization for hyperparameter tuning.
Result: Achieved RMSE of 19.53 W/m² using NASA POWER data in Sudan, significantly outperforming attention-based baselines (RMSE 30.64 W/m²), demonstrating the “Complexity Paradox” where simpler physics-guided models beat complex attention mechanisms.
Conclusion: Explicit physical constraints offer more efficient and accurate alternatives to self-attention mechanisms in high-noise meteorological tasks, advocating for hybrid, physics-aware AI for real-time renewable energy management.
Abstract: Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer-based architectures, this paper challenges the prevailing “complexity-first” paradigm. We propose a lightweight, Physics-Informed Hybrid CNN-BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi-Directional LSTM for capturing temporal dependencies. Unlike standard data-driven approaches, our model is explicitly guided by a vector of 15 engineered features including Clear-Sky indices and Solar Zenith Angle - rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics-guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention-based baselines (RMSE 30.64 W/m^2). These results confirm a “Complexity Paradox”: in high-noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self-attention mechanisms. The findings advocate for a shift towards hybrid, physics-aware AI for real-time renewable energy management.
[428] MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
Chaitanya Pallerla, Siavash Mahmoudi, Dongyi Wang
Main category: cs.LG
TL;DR: MyoVision: Mobile smartphone-based transillumination imaging system using 14-bit RAW images and NEATBoost-Attention Ensemble model for classifying poultry myopathies (Normal, Woody Breast, Spaghetti Meat) with 82.4% accuracy.
Details
Motivation: Current methods for detecting poultry myopathies (Woody Breast and Spaghetti Meat) rely on subjective manual evaluation or expensive laboratory imaging systems. There's a need for low-cost, non-destructive classification methods using consumer-grade devices.Method: Developed MyoVision mobile transillumination imaging framework using smartphones to capture 14-bit RAW images. Extracted structural texture descriptors from internal tissue abnormalities. Proposed NEATBoost-Attention Ensemble model combining LightGBM and attention-based MLP with neuroevolution optimization using NEAT for automatic hyperparameter discovery.
Result: Achieved 82.4% test accuracy (F1 = 0.83) on dataset of 336 fillets, outperforming conventional ML and DL baselines. Performance matches hyperspectral imaging systems that cost orders of magnitude more.
Conclusion: Consumer-grade smartphone imaging can effectively support scalable internal tissue assessment for poultry myopathy classification. MyoVision establishes reproducible mobile RGB-D acquisition pipeline for multimodal meat quality research.
Abstract: Woody Breast (WB) and Spaghetti Meat (SM) myopathies significantly impact poultry meat quality, yet current detection methods rely either on subjective manual evaluation or costly laboratory-grade imaging systems. We address the problem of low-cost, non-destructive multi-class myopathy classification using consumer smartphones. MyoVision is introduced as a mobile transillumination imaging framework in which 14-bit RAW images are captured and structural texture descriptors indicative of internal tissue abnormalities are extracted. To classify three categories (Normal, Woody Breast, Spaghetti Meat), we propose a NEATBoost-Attention Ensemble model, which is a neuroevolution-optimized weighted fusion of LightGBM and attention-based MLP models. Hyperparameters are automatically discovered using NeuroEvolution of Augmenting Topologies (NEAT), eliminating manual tuning and enabling architecture diversity for small tabular datasets. On a dataset of 336 fillets collected from a commercial processing facility, our method achieves 82.4% test accuracy (F1 = 0.83), outperforming conventional machine learning and deep learning baselines and matching performance reported by hyperspectral imaging systems costing orders of magnitude more. Beyond classification performance, MyoVision establishes a reproducible mobile RGB-D acquisition pipeline for multimodal meat quality research, demonstrating that consumer-grade imaging can support scalable internal tissue assessment.
[429] Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps
Mohammed Ezzaldin Babiker Abdullah
Main category: cs.LG
TL;DR: Hybrid 1D-CNN-BiLSTM with attention for turbofan RUL prediction using asymmetric loss to penalize dangerous over-estimation, achieving competitive performance with interpretable attention heatmaps.
Details
Motivation: Existing deep learning approaches fail to capture both spatial correlations from multiple sensors and long-range temporal dependencies in engine degradation data, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating remaining useful life, which could lead to catastrophic failures.Method: Proposes a hybrid architecture integrating Twin-Stage 1D-CNNs, BiLSTM network, and custom Bahdanau Additive Attention mechanism. Uses zero-leakage preprocessing, piecewise-linear RUL labeling capped at 130 cycles, and NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation. Trained on NASA C-MAPSS FD001 dataset.
Result: Achieved RMSE of 17.52 cycles and NASA S-Score of 922.06 on 100 test engines. Attention weight heatmaps provide interpretable insights into temporal progression of degradation for each engine, supporting maintenance decision-making.
Conclusion: The framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings by addressing both technical modeling challenges and safety-critical error penalization.
Abstract: Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.
[430] From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning
Zonghuan Xu, Xingjun Ma
Main category: cs.LG
TL;DR: Theoretical analysis of forgetting in continual learning using a distributional perspective, deriving exact operator identities and characterizing convergence rates based on task distribution geometry.
Details
Motivation: Existing continual learning research focuses on empirical forgetting but lacks rigorous theoretical characterization. While previous work analyzed forgetting under random orderings of fixed tasks, this paper shifts to a distributional perspective to understand how the generating task distribution itself governs forgetting.Method: Studies an exact-fit linear regime where tasks are sampled i.i.d. from a task distribution Π. Derives an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Uses this identity to establish unconditional upper bounds, identify leading asymptotic terms, and characterize convergence rates up to constants in generic nondegenerate cases.
Result: Provides theoretical characterization of forgetting rates in continual learning, relating them to geometric properties of the task distribution. Clarifies what drives slow or fast forgetting in the linear model, with convergence rates determined by spectral properties of the task distribution.
Conclusion: The paper establishes a rigorous theoretical foundation for understanding forgetting in continual learning from a distributional perspective, revealing how task distribution geometry fundamentally governs forgetting behavior in linear models.
Abstract: A central challenge in continual learning is forgetting, the loss of performance on previously learned tasks induced by sequential adaptation to new ones. While forgetting has been extensively studied empirically, rigorous theoretical characterizations remain limited. A notable step in this direction is \citet{evron2022catastrophic}, which analyzes forgetting under random orderings of a fixed task collection in overparameterized linear regression. We shift the perspective from order to distribution. Rather than asking how a fixed task collection behaves under random orderings, we study an exact-fit linear regime in which tasks are sampled i.i.d.\ from a task distribution~$Π$, and ask how the generating distribution itself governs forgetting. In this setting, we derive an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Building on this identity, we establish an unconditional upper bound, identify the leading asymptotic term, and, in generic nondegenerate cases, characterize the convergence rate up to constants. We further relate this rate to geometric properties of the task distribution, clarifying what drives slow or fast forgetting in this model.
[431] Adaptive Unknown Fault Detection and Few-Shot Continual Learning for Condition Monitoring in Ultrasonic Metal Welding
Ahmadreza Eslaminia, Kuan-Chieh Lu, Klara Nahrstedt, Chenhui Shao
Main category: cs.LG
TL;DR: Adaptive condition monitoring for ultrasonic metal welding using unknown fault detection and few-shot continual learning with minimal retraining
Details
Motivation: UMW is sensitive to tool wear, surface contamination, and material variability, leading to unexpected faults. Conventional supervised learning assumes all fault types are known in advance, limiting ability to handle unseen faults.Method: Unknown fault detection via hidden-layer MLP representations with statistical thresholding. Continual learning selectively updates final layers to incorporate new faults while preserving existing knowledge. Cosine similarity transformation with clustering reduces labeling effort.
Result: 96% accuracy in detecting unseen fault conditions while maintaining reliable classification of known classes. After incorporating new fault type with only 5 labeled samples, updated model achieves 98% testing classification accuracy.
Conclusion: Proposed approach enables adaptive monitoring with minimal retraining cost and time, providing scalable solution for continual learning in condition monitoring where new process conditions constantly emerge.
Abstract: Ultrasonic metal welding (UMW) is widely used in industrial applications but is sensitive to tool wear, surface contamination, and material variability, which can lead to unexpected process faults and unsatisfactory weld quality. Conventional monitoring systems typically rely on supervised learning models that assume all fault types are known in advance, limiting their ability to handle previously unseen process faults. To address this challenge, this paper proposes an adaptive condition monitoring approach that enables unknown fault detection and few-shot continual learning for UMW. Unknown faults are detected by analyzing hidden-layer representations of a multilayer perceptron and leveraging a statistical thresholding strategy. Once detected, the samples from unknown fault types are incorporated into the existing model through a continual learning procedure that selectively updates only the final layers of the network, which enables the model to recognize new fault types while preserving knowledge of existing classes. To accelerate the labeling process, cosine similarity transformation combined with a clustering algorithm groups similar unknown samples, thereby reducing manual labeling effort. Experimental results using a multi-sensor UMW dataset demonstrate that the proposed method achieves 96% accuracy in detecting unseen fault conditions while maintaining reliable classification of known classes. After incorporating a new fault type using only five labeled samples, the updated model achieves 98% testing classification accuracy. These results demonstrate that the proposed approach enables adaptive monitoring with minimal retraining cost and time. The proposed approach provides a scalable solution for continual learning in condition monitoring where new process conditions may constantly emerge over time and is extensible to other manufacturing processes.
[432] Universality of Gaussian-Mixture Reverse Kernels in Conditional Diffusion
Nafiz Ishtiaque, Syed Arefinul Haque, Kazi Ashraful Alam, Fatima Jahara
Main category: cs.LG
TL;DR: Theoretical proof that conditional diffusion models with finite Gaussian mixture reverse kernels and ReLU-network logits can approximate target distributions well in conditional KL divergence, with error decomposing into terminal mismatch and per-step reverse-kernel errors.
Details
Motivation: To provide theoretical foundations for conditional diffusion models by establishing approximation capabilities and error bounds for models using Gaussian mixture reverse kernels with neural network parameterization.Method: Uses path-space decomposition to separate output error into terminal mismatch and per-step reverse-kernel errors. Applies Norets’ Gaussian-mixture theory with quantitative ReLU bounds for static conditional density approximation. Assumes reverse kernels factor through finite-dimensional feature maps.
Result: Proves that conditional diffusion models can approximate regular target distributions arbitrarily well in context-averaged conditional KL divergence, with neural reverse-kernel class being dense in conditional KL under exact terminal matching.
Conclusion: Provides rigorous theoretical justification for conditional diffusion models, showing their approximation capabilities and establishing error decomposition framework for analysis.
Abstract: We prove that conditional diffusion models whose reverse kernels are finite Gaussian mixtures with ReLU-network logits can approximate suitably regular target distributions arbitrarily well in context-averaged conditional KL divergence, up to an irreducible terminal mismatch that typically vanishes with increasing diffusion horizon. A path-space decomposition reduces the output error to this mismatch plus per-step reverse-kernel errors; assuming each reverse kernel factors through a finite-dimensional feature map, each step becomes a static conditional density approximation problem, solved by composing Norets’ Gaussian-mixture theory with quantitative ReLU bounds. Under exact terminal matching the resulting neural reverse-kernel class is dense in conditional KL.
[433] Computational framework for multistep metabolic pathway design
Peter Zhiping Zhang, Jeffrey D. Varner
Main category: cs.LG
TL;DR: Deep learning approach for retrobiosynthetic pathway design combining neural networks with enzymatic templates to improve computational metabolic pathway generation
Details
Motivation: Existing computational frameworks for retrobiosynthesis have limited success in algorithm-guided xenobiotic biochemical retrosynthesis. Deep learning has shown promise in organic chemistry applications, inspiring its application to biochemical transformations for improved metabolic pathway design.Method: Assembled metabolic reaction and enzymatic template data from public databases, enriched with artificial reactions via data augmentation. Trained two neural network-based binary classifiers to distinguish real from artificial reactions for 1-step and 2-step pathways. Combined these models with enzymatic templates to build a multistep retrobiosynthesis pipeline.
Result: Developed a computational biosynthetic pathway design framework validated by reproducing both natural and non-natural pathways computationally.
Conclusion: Deep learning can be effectively combined with traditional retrobiosynthetic workflows to improve in silico synthetic metabolic pathway designs, demonstrating successful reproduction of biochemical pathways.
Abstract: In silico tools are important for generating novel hypotheses and exploring alternatives in de novo metabolic pathway design. However, while many computational frameworks have been proposed for retrobiosynthesis, few successful examples of algorithm-guided xenobiotic biochemical retrosynthesis have been reported in the literature. Deep learning has improved the quality of synthesis and retrosynthesis in organic chemistry applications. Inspired by this progress, we explored combining deep learning of biochemical transformations with the traditional retrobiosynthetic workflow to improve in silico synthetic metabolic pathway designs. To develop our computational biosynthetic pathway design framework, we assembled metabolic reaction and enzymatic template data from public databases. A data augmentation procedure, adapted from literature, was carried out to enrich the assembled reaction dataset with artificial metabolic reactions generated by enzymatic reaction templates. Two neural network-based pathway ranking models were trained as binary classifiers to distinguish assembled reactions from artificial counterparts; each model output a scalar quantifying the plausibility of a 1-step or 2-step pathway. Combining these two models with enzymatic templates, we built a multistep retrobiosynthesis pipeline and validated it by reproducing some natural and non-natural pathways computationally.
[434] Monthly Diffusion v0.9: A Latent Diffusion Model for the First AI-MIP
Kyle J. C. Hall, Maria J. Molina
Main category: cs.LG
TL;DR: MD-1.5 is a climate emulator using SFNO-inspired CVAE with latent diffusion to model monthly atmospheric variability at 1.5-degree resolution with modest computational requirements.
Details
Motivation: To create an efficient climate emulator that can model low-frequency internal atmospheric variability at monthly timesteps in data-sparse regimes with modest computational requirements.Method: Uses a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture with latent diffusion to model atmospheric evolution at 1.5-degree grid spacing.
Result: Initial results show the MDv0.9 model can forward-step at monthly mean timesteps, though specific performance metrics are not detailed in the abstract.
Conclusion: The MD-1.5 climate emulator demonstrates a novel approach combining SFNO-inspired CVAE with latent diffusion for efficient monthly climate modeling.
Abstract: Here, we describe Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9), a climate emulator that leverages a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture to model the evolution of low-frequency internal atmospheric variability using latent diffusion. MDv0.9 was designed to forward-step at monthly mean timesteps in a data-sparse regime, using modest computational requirements. This work describes the motivation behind the architecture design, the MDv0.9 training procedure, and initial results.
[435] SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization
Xiaole Su, Kasey Zhang, Andy Lyu
Main category: cs.LG
TL;DR: Controlled study shows that keeping SFT and GRPO training data disjoint (0% overlap) consistently outperforms full overlap for Lean 4 autoformalization, with 10.4% semantic gain over SFT alone, while full overlap makes GRPO redundant.
Details
Motivation: The paper investigates the impact of data overlap between Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) stages in post-training recipes, which is a common but unexplored hyperparameter affecting model performance.Method: Conducted controlled ablation study on Qwen3-8B for Lean 4 autoformalization with six conditions: base model, SFT-only, GRPO-only, and three SFT+GRPO configurations with 0%, 30%, or 100% data overlap between SFT and GRPO prompts.
Result: Lower data overlap monotonically improves both compilation and semantic accuracy. At 0% overlap, GRPO yields 10.4 percentage point semantic gain over SFT alone on Gaokao-Formal, while 100% overlap shows no improvement, making GRPO redundant. Dual-metric evaluation reveals compile-semantic gaps exceeding 30 percentage points.
Conclusion: Data overlap between SFT and GRPO stages is a critical post-training hyperparameter. Keeping training data disjoint consistently outperforms full overlap at no additional compute cost, challenging common practice of using overlapping data.
Abstract: Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass at k and semantic pass at k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.
[436] Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
Jing Sun
Main category: cs.LG
TL;DR: Proposes Target Decoupling architecture for multi-timescale RL to address algorithmic pathologies in temporal credit assignment, showing improved performance in LunarLander-v2.
Details
Motivation: Addresses challenges in temporal credit assignment in RL, inspired by neurobiological dopamine systems. Multi-timescale approaches in Actor-Critic architectures can cause severe algorithmic pathologies like surrogate objective hacking and irreversible myopic degeneration (Paradox of Temporal Uncertainty).Method: Target Decoupling architecture: Critic retains multi-timescale predictions for auxiliary representation learning, while Actor strictly isolates short-term signals and updates policy based solely on long-term advantages.
Result: Achieves statistically significant performance improvements in LunarLander-v2 environment, consistently surpasses “Environment Solved” threshold with minimal variance, eliminates policy collapse, and escapes hovering local optima that trap single-timescale baselines.
Conclusion: Proposed architecture effectively addresses pathologies in multi-timescale RL, providing stable and improved performance without hyperparameter hacking, demonstrating the importance of decoupling temporal signals in Actor-Critic frameworks.
Abstract: Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the ‘‘Environment Solved’’ threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.
[437] From Alignment to Prediction: A Study of Self-Supervised Learning and Predictive Representation Learning
Mintu Dutta, Ritesh Vyas, Mohendra Roy
Main category: cs.LG
TL;DR: Survey paper introducing Predictive Representation Learning (PRL) as a new category in self-supervised learning, comparing BYOL, MAE, and I-JEPA approaches with empirical evaluation.
Details
Motivation: Current self-supervised learning methods focus on representation alignment and input reconstruction but lack predictive capabilities for unobserved data components. The paper aims to define a new paradigm that learns predictive structures of data distributions.Method: Proposes Predictive Representation Learning (PRL) taxonomy alongside alignment and reconstruction approaches. Implements and compares three methods: Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for empirical analysis.
Result: MAE achieves perfect similarity (1.00) but weak robustness (0.55). BYOL and I-JEPA show high accuracy (0.98 and 0.95) with better robustness (0.75 and 0.78). I-JEPA demonstrates strong performance as a predictive representation learning approach.
Conclusion: Predictive Representation Learning represents a promising direction for self-supervised learning, with JEPA architectures serving as exemplary implementations. The approach bridges the gap between learning from observed data and predicting unobserved components.
Abstract: Self-supervised learning has emerged as a major technique for the task of learning from unlabeled data, where the current methods mostly revolve around alignment of representations and input recon struction. Although such approaches have demonstrated excellent performance in practice, their scope remains mostly confined to learning from observed data and does not provide much help in terms of a learning structure that is predictive of the data distribution. In this paper, we study some of the recent developments in the realm of self-supervised learning. We define a new category called Predictive Representation Learning (PRL), which revolves around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL along with alignment and reconstruction-based learning approaches. Furthermore, we argue that Joint-Embedding Predictive Architecture(JEPA) can be considered as an exemplary member of this new paradigm. We further discuss theoretical perspectives and open challenges, highlighting predictive representation learning as a promising direction for future self-supervised learning research. In this study, we implemented Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for comparative analysis. The results indicate that MAE achieves perfect similarity of 1.00, but exhibits relatively weak robustness of 0.55. In contrast, BYOL and I-JEPA attain accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.
[438] LEGO-MOF: Equivariant Latent Manipulation for Editable, Generative, and Optimizable MOF Design
Chaoran Zhang, Guangyao Li, Dongxu Ji
Main category: cs.LG
TL;DR: A generative framework for continuous structural manipulation of metal-organic frameworks (MOFs) using SE(3)-equivariant latent space representations for targeted carbon capture optimization.
Details
Motivation: Existing deep generative models for MOF design rely on predefined building blocks and non-differentiable post-optimization, which breaks the information flow needed for continuous structural editing. There's a need for target-driven generative frameworks that enable continuous manipulation of MOF structures for specific applications like carbon capture.Method: Proposes LinkerVAE that maps discrete 3D chemical graphs into a continuous, SE(3)-equivariant latent space. Uses test-time optimization (TTO) with an accurate surrogate model to continuously optimize latent graphs of existing MOFs toward desired properties. Integrates with latent diffusion model and rigid-body assembly for full MOF construction.
Result: Achieves average relative boost of 147.5% in pure CO2 uptake while strictly preserving structural validity. Enables geometry-aware manipulations including implicit chemical style transfer and zero-shot isoreticular expansion.
Conclusion: Establishes a scalable, fully differentiable pathway for automated discovery, targeted optimization, and editing of functional materials like MOFs for carbon capture applications.
Abstract: Metal-organic frameworks (MOFs) are highly promising for carbon capture, yet navigating their vast design space remains challenging. Recent deep generative models enable de novo MOF design but primarily act as feed-forward structure generators. By heavily relying on predefined building block libraries and non-differentiable post-optimization, they fundamentally sever the information flow required for continuous structural editing. Here, we propose a target-driven generative framework focused on continuous structural manipulation. At its core is LinkerVAE, which maps discrete 3D chemical graphs into a continuous, SE(3)-equivariant latent space. This smooth manifold unlocks geometry-aware manipulations, including implicit chemical style transfer and zero-shot isoreticular expansion. Building upon this, we introduce a test-time optimization (TTO) strategy, utilizing an accurate surrogate model to continuously optimize the latent graphs of existing MOFs toward desired properties. This approach systematically enhances carbon capture performance, achieving a striking average relative boost of 147.5% in pure CO2 uptake while strictly preserving structural validity. Integrated with a latent diffusion model and rigid-body assembly for full MOF construction, our framework establishes a scalable, fully differentiable pathway for both the automated discovery, targeted optimization and editing of functional materials.
[439] C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
Kenji Kubo, Shunsuke Kamiya, Masanori Koyama, Kohei Hayashi, Yusuke Iwasawa, Yutaka Matsuo
Main category: cs.LG
TL;DR: C-voting: confidence-based test-time scaling strategy for recurrent neural networks that improves reasoning performance by selecting the best latent trajectory based on prediction confidence.
Details
Motivation: To enhance test-time scaling capabilities of recurrent neural networks for reasoning tasks by developing a voting strategy that works without requiring explicit energy functions, making it applicable to a wider range of models.Method: Introduces confidence-based voting (C-voting) that initializes latent state with multiple candidates using random variables, then selects the trajectory maximizing average top-1 prediction probabilities. Also presents ItrSA++, an attention-based recurrent model with randomized initial values.
Result: C-voting achieves 4.9% higher accuracy on Sudoku-hard than energy-based voting. When combined with ItrSA++, outperforms HRM on Sudoku-extreme (95.2% vs 55.0%) and Maze (78.6% vs 74.5%) tasks.
Conclusion: C-voting is an effective test-time scaling strategy for recurrent models that doesn’t require explicit energy functions, enabling improved performance on reasoning tasks through confidence-based trajectory selection.
Abstract: Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model’s confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.
[440] Learning Inference Concurrency in DynamicGate MLP Structural and Mathematical Justification
Yongil Choi
Main category: cs.LG
TL;DR: DynamicGate MLP enables concurrent learning and inference by separating routing parameters from representation parameters, allowing online adaptation while maintaining inference stability.
Details
Motivation: Conventional neural networks strictly separate learning and inference phases because updating parameters during inference causes unstable outputs and undefined inference functions. This limitation prevents real-time adaptation and on-device learning systems.Method: The paper introduces DynamicGate MLP which structurally separates routing (gating) parameters from representation (prediction) parameters. This separation allows gates to be adapted online while preserving inference stability, or selectively updating weights only within inactive subspaces. The authors mathematically formalize sufficient conditions for concurrent learning and inference.
Result: The paper shows that DynamicGate MLP permits learning inference concurrency, and even under asynchronous or partial updates, the inference output at each time step can be interpreted as a forward computation of a valid model snapshot.
Conclusion: DynamicGate MLP can serve as a practical foundation for online adaptive and on-device learning systems by enabling stable concurrent learning and inference.
Abstract: Conventional neural networks strictly separate learning and inference because if parameters are updated during inference, outputs become unstable and even the inference function itself is not well defined [1, 2, 3]. This paper shows that DynamicGate MLP structurally permits learning inference concurrency [4, 5]. The key idea is to separate routing (gating) parameters from representation (prediction) parameters, so that the gate can be adapted online while inference stability is preserved, or weights can be selectively updated only within the inactive subspace [4, 5, 6, 7]. We mathematically formalize sufficient conditions for concurrency and show that even under asynchronous or partial updates, the inference output at each time step can always be interpreted as a forward computation of a valid model snapshot [8, 9, 10]. This suggests that DynamicGate MLP can serve as a practical foundation for online adaptive and on device learning systems [11, 12].
[441] Parameter-efficient Quantum Multi-task Learning
Hevish Cowlessur, Chandra Thapa, Tansu Alpcan, Seyit Camtepe
Main category: cs.LG
TL;DR: Quantum multi-task learning framework replaces classical linear heads with quantum prediction heads for parameter efficiency while maintaining task specialization.
Details
Motivation: Multi-task learning with hard-parameter-sharing suffers from rapid growth of task-specific parameters as number of tasks increases. Quantum variational circuits offer compact representations in high-dimensional Hilbert spaces, suggesting potential for more parameter-efficient multi-task heads.Method: Proposes QMTL framework with hybrid architecture: shared VQC with task-independent quantum encoding stage, followed by lightweight task-specific ansatz blocks. This enables localized task adaptation while maintaining compact parameterization compared to classical linear heads.
Result: Quantum head parameter cost scales linearly vs quadratic growth for classical heads. Achieves comparable or better performance than classical baselines on NLP, medical imaging, and multimodal sarcasm detection benchmarks, with substantially fewer parameters than existing hybrid quantum MTL models.
Conclusion: Quantum multi-task learning provides parameter-efficient alternative to classical heads, demonstrating feasibility on both simulators and real quantum hardware while maintaining task specialization capabilities.
Abstract: Multi-task learning (MTL) improves generalization and data efficiency by jointly learning related tasks through shared representations. In the widely used hard-parameter-sharing setting, a shared backbone is combined with task-specific prediction heads. However, task-specific parameters can grow rapidly with the number of tasks. Therefore, designing multi-task heads that preserve task specialization while improving parameter efficiency remains a key challenge. In Quantum Machine Learning (QML), variational quantum circuits (VQCs) provide a compact mechanism for mapping classical data to quantum states residing in high-dimensional Hilbert spaces, enabling expressive representations within constrained parameter budgets. We propose a parameter-efficient quantum multi-task learning (QMTL) framework that replaces conventional task-specific linear heads with a fully quantum prediction head in a hybrid architecture. The model consists of a VQC with a shared, task-independent quantum encoding stage, followed by lightweight task-specific ansatz blocks enabling localized task adaptation while maintaining compact parameterization. Under a controlled and capacity-matched formulation where the shared representation dimension grows with the number of tasks, our parameter-scaling analysis demonstrates that a standard classical head exhibits quadratic growth, whereas the proposed quantum head parameter cost scales linearly. We evaluate QMTL on three multi-task benchmarks spanning natural language processing, medical imaging, and multimodal sarcasm detection, where we achieve performance comparable to, and in some cases exceeding, classical hard-parameter-sharing baselines while consistently outperforming existing hybrid quantum MTL models with substantially fewer head parameters. We further demonstrate QMTL’s executability on noisy simulators and real quantum hardware, illustrating its feasibility.
[442] Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning
Qin Zhou, Guoyan Liang, Qianyi Yang, Jingyuan Chen, Sai Wu, Chang Yao, Zhe Wang
Main category: cs.LG
TL;DR: ESC-RL introduces evidence-aware reinforcement learning for radiology report generation with group-wise rewards and self-correcting preference learning to improve clinical faithfulness.
Details
Motivation: Current RL approaches for radiology report generation have two limitations: report-level rewards offer limited guidance for clinical faithfulness, and methods lack explicit self-improving mechanisms to align with clinical preferences.Method: ESC-RL has two components: 1) Group-wise Evidence-aware Alignment Reward (GEAR) that provides group-wise, evidence-aware feedback to reinforce true positives, recover false negatives, and suppress false positives; 2) Self-correcting Preference Learning (SPL) that automatically constructs disease-aware preference datasets from noisy observations and uses LLMs to synthesize refined reports without human supervision.
Result: Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
Conclusion: ESC-RL promotes clinically faithful, disease-aligned rewards and supports continual self-improvement during training for radiology report generation.
Abstract: Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
[443] Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang
Main category: cs.LG
TL;DR: Survey paper analyzing reward hacking in RLHF and related alignment methods for LLMs/MLLMs, proposing Proxy Compression Hypothesis as unifying framework
Details
Motivation: RLHF and related alignment methods have become central for steering LLMs/MLLMs toward human-preferred behaviors, but introduce systemic vulnerability to reward hacking where models exploit imperfections in learned reward signalsMethod: Proposes Proxy Compression Hypothesis (PCH) as unifying framework, formalizing reward hacking as emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives
Result: Framing reward hacking as structural instability of proxy-based alignment under scale, explaining empirical phenomena across RLHF, RLAIF, and RLVR regimes, and how local shortcut learning generalizes into broader misalignment
Conclusion: Highlights open challenges in scalable oversight, multimodal grounding, and agentic autonomy, organizing detection/mitigation strategies around intervention on compression, amplification, or co-adaptation dynamics
Abstract: Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception–reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator–policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
[444] Design Space Exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease
Muhammad Kashif, Hanzalah Mohamed Siraj, Nouhaila Innan, Alberto Marchisio, Muhammad Shafique
Main category: cs.LG
TL;DR: Comprehensive design space exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease diagnosis, benchmarking 625 models across encoding, architecture, measurement, and shot settings.
Details
Motivation: Hybrid Quantum Neural Networks show promise for near-term quantum machine learning, but their practical performance depends heavily on design choices like data encoding, circuit architecture, measurement strategy, and shot settings. There's a need for systematic exploration of these design dimensions to understand their impact on model performance.Method: Benchmarked 625 different HQNN models by combining five encoding schemes, five entanglement architectures, five measurement strategies, and five shot settings. Used a carefully curated clinical dataset for CKD diagnosis, with 10-fold stratified cross-validation and comprehensive evaluation metrics including accuracy, AUC, F1-score, and composite performance score.
Result: Revealed strong non-trivial interactions between encoding choices and circuit architectures. Found that high performance doesn’t necessarily require large parameter counts or complex circuits. Compact architectures with appropriate encodings (e.g., IQP with Ring entanglement) achieved the best trade-off between accuracy, robustness, and efficiency.
Conclusion: Provides actionable insights into how different design dimensions influence learning behavior in HQNNs. Demonstrates that careful selection of encoding and architecture combinations can lead to efficient and effective quantum machine learning models for medical diagnosis applications.
Abstract: Hybrid Quantum Neural Networks (HQNNs) have recently emerged as a promising paradigm for near-term quantum machine learning. However, their practical performance strongly depends on design choices such as classical-to-quantum data encoding, quantum circuit architecture, measurement strategy and shots. In this paper, we present a comprehensive design space exploration of HQNNs for Chronic Kidney Disease (CKD) diagnosis. Using a carefully curated and preprocessed clinical dataset, we benchmark 625 different HQNN models obtained by combining five encoding schemes, five entanglement architectures, five measurement strategies, and five different shot settings. To ensure fair and robust evaluation, all models are trained using 10-fold stratified cross-validation and assessed on a test set using a comprehensive set of metrics, including accuracy, area under the curve (AUC), F1-score, and a composite performance score. Our results reveal strong and non-trivial interactions between encoding choices and circuit architectures, showing that high performance does not necessarily require large parameter counts or complex circuits. In particular, we find that compact architectures combined with appropriate encodings (e.g., IQP with Ring entanglement) can achieve the best trade-off between accuracy, robustness, and efficiency. Beyond absolute performance analysis, we also provide actionable insights into how different design dimensions influence learning behavior in HQNNs.
[445] Golden Handcuffs make safer AI agents
Aram Ebtekar, Michael K. Cohen
Main category: cs.LG
TL;DR: Bayesian RL agent with expanded reward range includes large negative penalty -L to become risk-averse to novel unintended strategies, using mentor override for safety while maintaining capability.
Details
Motivation: Reinforcement learners can achieve high rewards through unintended, potentially harmful strategies. Need to mitigate this by making agents risk-averse to novel schemes while maintaining learning capability.Method: Expand agent’s subjective reward range to include large negative value -L while true rewards are in [0,1]. Use Bayesian approach to make agent risk-averse to novel strategies. Implement override mechanism that yields control to safe mentor when predicted value drops below threshold.
Result: Proves two key properties: (1) Capability - agent attains sublinear regret against best mentor using vanishing mentor-guided exploration, (2) Safety - no decidable low-complexity predicate is triggered by optimizing policy before being triggered by mentor.
Conclusion: Bayesian approach with expanded reward range and mentor override provides principled safety guarantees while maintaining learning capability, addressing the problem of unintended strategies in RL.
Abstract: Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent’s subjective reward range to include a large negative value $-L$, while the true environment’s rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.
[446] Self-Organizing Maps with Optimized Latent Positions
Seiki Ubukata, Akira Notsu, Katsuhiro Honda
Main category: cs.LG
TL;DR: SOM-OLP is a new objective-based topographic mapping method that introduces continuous latent positions for each data point, offering better computational efficiency while maintaining principled optimization.
Details
Motivation: Existing Self-Organizing Maps (SOM) formulations face a trade-off between computational efficiency and clear optimization objectives. Objective-based variants like STVQ are principled but computationally expensive with many latent nodes due to neighborhood-coupled computations.Method: Proposes SOM-OLP with continuous latent positions for each data point. Constructs separable surrogate local cost from STVQ’s neighborhood distortion, formulates entropy-regularized objective, and uses block coordinate descent with closed-form updates for assignment probabilities, latent positions, and reference vectors.
Result: Achieves competitive neighborhood preservation and quantization performance, favorable scalability for large numbers of latent nodes and large datasets, and best average rank among compared methods on 16 benchmark datasets.
Conclusion: SOM-OLP provides an efficient, objective-based topographic mapping method with linear per-iteration complexity that maintains theoretical guarantees while scaling well to large datasets and many latent nodes.
Abstract: Self-Organizing Maps (SOM) are a classical method for unsupervised learning, vector quantization, and topographic mapping of high-dimensional data. However, existing SOM formulations often involve a trade-off between computational efficiency and a clearly defined optimization objective. Objective-based variants such as Soft Topographic Vector Quantization (STVQ) provide a principled formulation, but their neighborhood-coupled computations become expensive as the number of latent nodes increases. In this paper, we propose Self-Organizing Maps with Optimized Latent Positions (SOM-OLP), an objective-based topographic mapping method that introduces a continuous latent position for each data point. Starting from the neighborhood distortion of STVQ, we construct a separable surrogate local cost based on its local quadratic structure and formulate an entropy-regularized objective based on it. This yields a simple block coordinate descent scheme with closed-form updates for assignment probabilities, latent positions, and reference vectors, while guaranteeing monotonic non-increase of the objective and retaining linear per-iteration complexity in the numbers of data points and latent nodes. Experiments on a synthetic saddle manifold, scalability studies on the Digits and MNIST datasets, and 16 benchmark datasets show that SOM-OLP achieves competitive neighborhood preservation and quantization performance, favorable scalability for large numbers of latent nodes and large datasets, and the best average rank among the compared methods on the benchmark datasets.
[447] (How) Learning Rates Regulate Catastrophic Overtraining
Mark Rofin, Aditya Varre, Nicolas Flammarion
Main category: cs.LG
TL;DR: SFT can cause catastrophic overtraining in LLMs by exacerbating forgetting through learning rate dynamics and increased model sharpness.
Details
Motivation: To understand why supervised fine-tuning (SFT) harms fundamental LLM capabilities (catastrophic overtraining), particularly after long pretraining, by investigating the interplay between optimization dynamics in pretraining and finetuning.Method: 1) Investigate catastrophic forgetting through implicit regularization of learning rate, showing how different learning rates converge to qualitatively different models at same SFT loss. 2) Link forgetting to overtraining by showing learning rate decay increases pretrained model sharpness, which exacerbates forgetting during SFT.
Result: Learning rate mediates optimization in SFT: large vs small steps lead to different model behaviors despite same loss. Learning rate decay increases model sharpness, which worsens catastrophic forgetting, leading to overtraining.
Conclusion: Provides mechanistic understanding of overtraining in LLMs, showing how learning rate dynamics during pretraining affect SFT outcomes through sharpness and forgetting mechanisms.
Abstract: Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.
[448] Ordinary Least Squares is a Special Case of Transformer
Xiaojun Tan, Yuchen Zhao
Main category: cs.LG
TL;DR: Transformers are shown to be mathematically equivalent to Ordinary Least Squares (OLS) regression in a specific parameter setting, revealing their statistical nature as classical computational algorithms rather than universal approximators.
Details
Motivation: To understand the fundamental statistical nature of Transformer architectures - whether they are universal approximators or implementations of known computational algorithms like Ordinary Least Squares.Method: Using algebraic proof and spectral decomposition of empirical covariance matrices to construct specific parameter settings where single-layer Linear Transformer’s attention mechanism becomes mathematically equivalent to OLS closed-form projection.
Result: Demonstrated that attention can solve OLS problems in one forward pass without iteration, uncovered decoupled slow/fast memory mechanisms, and established continuity between Transformers and classical statistical inference.
Conclusion: Transformers are better understood as neural implementations of classical computational algorithms (like OLS) rather than universal approximators, with clear mathematical connections to statistical inference methods.
Abstract: The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer’s basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism’s forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.
[449] A Bayesian Framework for Uncertainty-Aware Explanations in Power Quality Disturbance Classification
Yinsong Chen, Samson S. Yu, Kashem M. Muttaqi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Advanced deep learning methods have shown remarkable success in power quality disturbance (PQD) classification. To enhance model transparency, explainable AI (XAI) techniques have been developed to provide instance-specific interpretations of classifier decisions. However, conventional XAI methods yield deterministic explanations, overlooking uncertainty and limiting reliability in safety-critical applications. This paper proposes a Bayesian explanation framework that models explanation uncertainty by generating a relevance attribution distribution for each instance. This method allows experts to select explanations based on confidence percentiles, thereby tailoring interpretability according to specific disturbance types. Extensive experiments on synthetic and real-world power quality datasets demonstrate that the proposed framework improves the transparency and reliability of PQD classifiers through uncertainty-aware explanations.
[450] Optimization with SpotOptim
Thomas Bartz-Beielstein
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The spotoptim package implements surrogate-model-based optimization of expensive black-box functions in Python. Building on two decades of Sequential Parameter Optimization (SPO) methodology, it provides a Kriging-based optimization loop with Expected Improvement, support for continuous, integer, and categorical variables, noise-aware evaluation via Optimal Computing Budget Allocation (OCBA), and multi-objective extensions. A steady-state parallelization strategy overlaps surrogate search with objective evaluation on multi-core hardware, and a success-rate-based restart mechanism detects stagnation while preserving the best solution found. The package returns scipy-compatible OptimizeResult objects and accepts any scikit-learn-compatible surrogate model. Built-in TensorBoard logging provides real-time monitoring of convergence and surrogate quality. This report describes the architecture and module structure of spotoptim, provides worked examples including neural network hyperparameter tuning, and compares the framework with BoTorch, Optuna, Ray Tune, BOHB, SMAC, and Hyperopt. The package is open-source.
[451] Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent’s actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
[452] Physics-Informed Neural Networks for Solving Derivative-Constrained PDEs
Kentaro Hoshisashi, Carolyn E Phelan, Paolo Barucca
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Physics-Informed Neural Networks (PINNs) recast PDE solving as an optimisation problem in function space by minimising a residual-based objective, yet many applications require additional derivative-based relations that are just as fundamental as the governing equations. In this paper, we present Derivative-Constrained PINNs (DC-PINNs), a general framework that treats constrained PDE solving as an optimisation guided by a minimum objective function criterion where the physics resides in the minimum principle. DC-PINNs embed general nonlinear constraints on states and derivatives, e.g., bounds, monotonicity, convexity, incompressibility, computed efficiently via automatic differentiation, and they employ self-adaptive loss balancing to tune the influence of each objective, reducing reliance on manual hyperparameters and problem-specific architectures. DC-PINNs consistently reduce constraint violations and improve physical fidelity versus baseline PINN variants, representative hard-constraint formulations on benchmarks, including heat diffusion with bounds, financial volatilities with arbitrage-free, and fluid flow with vortices shed. Explicitly encoding derivative constraints stabilises training and steers optimisation toward physically admissible minima even when the PDE residual alone is small, providing reliable solutions of constrained PDEs grounded in energy minimum principles.
[453] Spectral Thompson sampling
Tomas Kocak, Michal Valko, Remi Munos, Shipra Agrawal
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in the computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has application both in recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver the analysis showing that the regret of SpectralTS scales as dsqrt(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a dsqrt(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.
[454] Online learning with noisy side observations
Tomáš Kocák, Gergely Neu, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of $\widetilde{O}(\sqrt{α^* T})$ after $T$ rounds, where $α^$ is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of $α^$. For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor and Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.
[455] Soft $Q(λ)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces
Pranav Mahajan, Ben Seymour
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(λ)$, an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations propose a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.
[456] Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning
Dongjie Fu, Fangming Feng, Xize Cheng, Linjun Li, Zhou Zhao, Tao Jin
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.
[457] Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
Jaemin Kim, Sungkyun Kim, Junyeol Lee, Jiwon Seo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent Hessian-based PTQ methods compensate quantization error via cross-channel dependencies, but such approaches degrade at low bit-widths due to noisy curvature estimates from limited calibration data. We propose DASH-Q, a robust PTQ framework using diagonal Hessian approximation and iterative weighted least squares. By discarding noise-prone dependencies, DASH-Q filters sampling noise while prioritizing the preservation of salient feature power. We outperform other PTQ baselines in ultra low-bit regime, improving zero-shot accuracy by 7.01% on average and up to 14.01% over the strongest baselines across five baseline LLM models, while showing robust and stable performance with very small calibration data.
[458] Composite Silhouette: A Subsampling-based Aggregation Strategy
Aggelos Semoglou, Aristidis Likas, John Pavlopoulos
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.
[459] RPS: Information Elicitation with Reinforcement Prompt Selection
Tao Wang, Jingyao Lu, Xibo Wang, Haonan Huang, Su Yao, Zhiqiang Hu, Xingyan Chen, Enmao Diao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather complete and contextually relevant inputs. In this work, we define the problem of information elicitation in open-ended dialogue settings and propose Reinforcement Prompt Selection (RPS), a lightweight reinforcement learning framework that formulates prompt selection as a sequential decision-making problem. To analyze this problem in a controlled setting, we design a synthetic experiment, where a reinforcement learning agent outperforms a random query baseline, illustrating the potential of policy-based approaches for adaptive information elicitation. Building on this insight, RPS learns a policy over a pool of prompts to adaptively elicit concealed or incompletely expressed information from users through dialogue. We also introduce IELegal, a new benchmark dataset constructed from real legal case documents, which simulates dialogue-based information elicitation tasks aimed at uncovering case-relevant facts. In this setting, RPS outperforms static prompt baselines, demonstrating the effectiveness of adaptive prompt selection for eliciting critical information in LLM-driven dialogue systems.
[460] SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
Hongtao Xu, Jianchao Tan, Yuxuan Hu, Pengju Lu, Hongyu Wang, Pingwei Sun, Yerui Sun, Yuchen Xie, Xunliang Cai, Mingzhen Li, Weile Jia
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both \textit{1)} sequence length and \textit{2)} sparsity sensitivity, leading to a severe imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on single issue, failing to systematically co-optimize these two problems. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework, which exploits the sparsity and sequence heterogeneity to optimize model accuracy and system efficiency jointly. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance, which complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33$\times$ end-to-end speedup while still improving the long-context capability by 0.46% on the LongBench benchmark.
[461] UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot’s strong generalization to real-world GUI tasks.
[462] Beyond State Consistency: Behavior Consistency in Text-Based World Models
Youling Huang, Guanqiao Chen, Junchi Yao, Lu Wang, Fangkai Yang, Chao Du, ChenZhuo Zhao, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR-based training improves long-term alignment in several settings, with the clearest gains in WebShop and less movement in near-ceiling regimes, while preserving or improving single-step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference-time lookahead planning.
[463] Evaluating Supervised Machine Learning Models: Principles, Pitfalls, and Metric Selection
Xuanyan Liu, Ignacio Cabrera Martin, Marcello Trovati, Xiaolong Xu, Nikolaos Polatidis
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The evaluation of supervised machine learning models is a critical stage in the development of reliable predictive systems. Despite the widespread availability of machine learning libraries and automated workflows, model assessment is often reduced to the reporting of a small set of aggregate metrics, which can lead to misleading conclusions about real-world performance. This paper examines the principles, challenges, and practical considerations involved in evaluating supervised learning algorithms across classification and regression tasks. In particular, it discusses how evaluation outcomes are influenced by dataset characteristics, validation design, class imbalance, asymmetric error costs, and the choice of performance metrics. Through a series of controlled experimental scenarios using diverse benchmark datasets, the study highlights common pitfalls such as the accuracy paradox, data leakage, inappropriate metric selection, and overreliance on scalar summary measures. The paper also compares alternative validation strategies and emphasizes the importance of aligning model evaluation with the intended operational objective of the task. By presenting evaluation as a decision-oriented and context-dependent process, this work provides a structured foundation for selecting metrics and validation protocols that support statistically sound, robust, and trustworthy supervised machine learning systems.
[464] Simulation-Based Optimisation of Batting Order and Bowling Plans in T20 Cricket
Tinniam V Ganesh
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper develops a unified Markov Decision Process (MDP) framework for optimising two recurring in-match decisions in T20 cricket namely batting order selection and bowling plan assignment, directly in terms of win and defend probability rather than expected runs. A three-phase player profile engine (Powerplay, Middle, Death) with James-Stein shrinkage is estimated from 1,161 IPL ball-by-ball records (2008-2025). Win/defend probabilities are evaluated by vectorised Monte Carlo simulation over N = 50,000 innings trajectories. Batting orders are searched by exhaustive enumeration. Bowling plans are computed by simulated annealing over the remaining quota with the constraint that the same bowler cannot bowl consecutive overs. Applied to two 2026 IPL matches, the optimal batting order improves Mumbai Indians’ win probability by 4.1 percentage points (52.4% to 56.5%), and the optimal Gujarat Titans bowling plan improves defend probability by 5.2 percentage points (39.1% to 44.3%). In both cases the observed sub-optimality is consistent with phase-agnostic deployment in decisions that appear reasonable by aggregate metrics but are exposed as costly when phase-specific profiles are applied.
[465] Hardware-Efficient Neuro-Symbolic Networks with the Exp-Minus-Log Operator
Eymen Ipek
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep neural networks (DNNs) deliver state-of-the-art accuracy on regression and classification tasks, yet two structural deficits persistently obstruct their deployment in safety-critical, resource-constrained settings: (i) opacity of the learned function, which precludes formal verification, and (ii) reliance on heterogeneous, library-bound activation functions that inflate latency and silicon area on edge hardware. The recently introduced Exp-Minus-Log (EML) Sheffer operator, eml(x, y) = exp(x) - ln(y), was shown by Odrzywolek (2026) to be sufficient - together with the constant 1 - to express every standard elementary function as a binary tree of identical nodes. We propose to embed EML primitives inside conventional DNN architectures, yielding a hybrid DNN-EML model in which the trunk learns distributed representations and the head is a depth-bounded, weight-sparse EML tree whose snapped weights collapse to closed-form symbolic sub-expressions. We derive the forward equations, prove computational-cost bounds, analyse inference and training acceleration relative to multilayer perceptrons (MLPs) and physics-informed neural networks (PINNs), and quantify the trade-offs for FPGA/analog deployment. We argue that the DNN-EML pairing closes a literature gap: prior neuro-symbolic and equation-learner approaches (EQL, KAN, AI-Feynman) work with heterogeneous primitive sets and do not exploit a single hardware-realisable Sheffer element. A balanced assessment shows that EML is unlikely to accelerate training, and on commodity CPU/GPU it is also unlikely to accelerate inference; however, on a custom EML cell (FPGA logic block or analog circuit) the asymptotic latency advantage can reach an order of magnitude with simultaneous gain in interpretability and formal-verification tractability.
[466] ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection
Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.
[467] Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety
Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro, Ilaria Sergi, Massimo De Vittorio, Luigi Patrono
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Driver drowsiness significantly impairs the ability to accurately judge safe braking distances and is estimated to contribute to 10%-20% of road accidents in Europe. Traditional driver-assistance systems lack adaptability to real-time physiological states such as drowsiness. This paper proposes a deep reinforcement learning-based autonomous braking system that integrates vehicle dynamics with driver physiological data. Drowsiness is detected from ECG signals using a Recurrent Neural Network (RNN), selected through an extensive benchmark analysis of 2-minute windows with varying segmentation and overlap configurations. The inferred drowsiness state is incorporated into the observable state space of a Double-Dueling Deep Q-Network (DQN) agent, where driver impairment is modeled as an action delay. The system is implemented and evaluated in a high-fidelity CARLA simulation environment. Experimental results show that the proposed agent achieves a 99.99% success rate in avoiding collisions under both drowsy and non-drowsy conditions. These findings demonstrate the effectiveness of physiology-aware control strategies for enhancing adaptive and intelligent driving safety systems.
[468] HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark
Jiacheng Wang, Jinchang Hou, Fabian Wang, Ping Jian, Chenfu Bao, Zhonghou Lv
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of \emph{intrinsic} risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce \emph{non-attack intrinsic risk auditing} and present \textbf{HINTBench}, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
[469] MolCryst-MLIPs: A Machine-Learned Interatomic Potentials Database for Molecular Crystals
Adam Lahouari, Shen Ai, Jihye Han, Jillian Hoffstadt, Philipp Hoellmer, Charlotte Infante, Pulkita Jain, Sangram Kadam, Maya M. Martirossyan, Amara McCune, Hypatia Newton, Shlok J. Paul, Willmor Pena, Jonathan Raghoonanan, Sumon Sahu, Oliver Tan, Andrea Vergara, Jutta Rogal, Mark E. Tuckerman
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present an open Molecular Crystal (MC) database of Machine-Learned Interatomic Potentials (MLIP) called MolCryst-MLIPs. The first release comprises fine-tuned MACE models for nine molecular crystal systems – Benzamide, Benzoic acid, Coumarin, Durene, Isonicotinamide, Niacinamide, Nicotinamide, Pyrazinamide, and Resorcinol – developed using the Automated Machine Learning Pipeline (AMLP), which streamlines the entire MLIP development workflow, from reference data generation to model training and validation, into a reproducible and user-friendly pipeline. Models are fine-tuned from the MACE-MH-1 foundation model (omol head), yielding a mean energy MAE of 0.141 kJ/mol/atom and a mean force MAE of 0.648 kJ/mol/Angstrom across all systems. Dynamical stability and structural integrity, as assessed through energy conservation, P2 orientational order parameters, and radial distribution functions, are evaluated using molecular dynamics simulations. The released models and datasets constitute a growing open database of validated MLIPs, ready for production MD simulations of molecular crystal polymorphism under different thermodynamic conditions.
[470] DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.
[471] MAny: Merge Anything for Multimodal Continual Instruction Tuning
Zijian Gao, Wangwang Jia, Xingxing Zhang, Pengfei Qian, Tao Sun, Bo Ding, Yong Dou, Huaimin Wang, Kele Xu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present \textbf{MAny} (\textbf{M}erge \textbf{Any}thing), a framework that merges task-specific knowledge through \textbf{C}ross-modal \textbf{P}rojection \textbf{M}erging (\textbf{CPM}) and \textbf{L}ow-rank \textbf{P}arameter \textbf{M}erging (\textbf{LPM}). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
[472] Unsupervised Anomaly Detection in Process-Complex Industrial Time Series: A Real-World Case Study
Sergej Krasnikov, Lukas Meitz, Samineh Bagheri, Michael Heider, Thorsten Schöler, Jörg Hähner
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Industrial time-series data from real production environments exhibits substantially higher complexity than commonly used benchmark datasets, primarily due to heterogeneous, multi-stage operational processes. As a result, anomaly detection methods validated under simplified conditions often fail to generalize to industrial settings. This work presents an empirical study on a unique dataset collected from fully operational industrial machinery, explicitly capturing pronounced process-induced variability. We evaluate which model classes are capable of capturing this complexity, starting with a classical Isolation Forest baseline and extending to multiple autoencoder architectures. Experimental results show that Isolation Forest is insufficient for modeling the non-periodic, multi-scale dynamics present in the data, whereas autoencoders consistently perform better. Among them, temporal convolutional autoencoders achieve the most robust performance, while recurrent and variational variants require more careful tuning.
[473] Quantum Machine Learning for Colorectal Cancer Data: Anastomotic Leak Classification and Risk Factors
Vojtěch Novák, Ivan Zelinka, Lenka Přibylová, Lubomír Martínek, Vladimír Benčurík, Martin Beseda
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This study evaluates colorectal risk factors and compares classical models against Quantum Neural Networks (QNNs) for anastomotic leak prediction. Analyzing clinical data with 14% leak prevalence, we tested ZZFeatureMap encodings with RealAmplitudes and EfficientSU2 ansatze under simulated noise. $F_β$-optimized quantum configurations yielded significantly higher sensitivity (83.3%) than classical baselines (66.7%). This demonstrates that quantum feature spaces better prioritize minority class identification, which is critical for low-prevalence clinical risk prediction. Our work explores various optimizers under noisy conditions, highlighting key trade-offs and future directions for hardware deployment.
[474] Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation
Shangzhe Li, Weitong Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We study value adaptation in offline-to-online reinforcement learning under general function approximation. Starting from an imperfect offline pretrained $Q$-function, the learner aims to adapt it to the target environment using only a limited amount of online interaction. We first characterize the difficulty of this setting by establishing a minimax lower bound, showing that even when the pretrained $Q$-function is close to optimal $Q^\star$, online adaptation can be no more efficient than pure online RL on certain hard instances. On the positive side, under a novel structural condition on the offline-pretrained value functions, we propose O2O-LSVI, an adaptation algorithm with problem-dependent sample complexity that provably improves over pure online RL. Finally, we complement our theory with neural-network experiments that demonstrate the practical effectiveness of the proposed method.
[475] First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs
Kavya Gupta, Nektarios Kalampalikis, Christoph Heitz, Isabel Valera
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fairness in algorithmic decision-making is often defined in the predictive space, where predictive performance - used as a proxy for decision-maker (DM) utility - is traded off against prediction-based fairness notions, such as demographic parity or equality of opportunity. This perspective, however, ignores how predictions translate into decisions and ultimately into utilities and welfare for both DM and decision subjects (DS), as well as their allocation across social-salient groups. In this paper, we propose a multi-stakeholder framework for fair algorithmic decision-making grounded in welfare economics and distributive justice, explicitly modeling the utilities of both the DM and DS, and defining fairness via a social planner’s utility that captures inequalities in DS utilities across groups under different justice-based fairness notions (e.g., Egalitarian, Rawlsian). We formulate fair decision-making as a post-hoc multi-objective optimization problem, characterizing the achievable performance-fairness trade-offs in the two-dimensional utility space of DM utility and the social planner’s utility, under different decision policy classes (deterministic vs. stochastic, shared vs. group-specific). Using the proposed framework, we then identify conditions (in terms of the stakeholders’ utilities) under which stochastic policies are more optimal than deterministic ones, and empirically demonstrate that simple stochastic policies can yield superior performance-fairness trade-offs by leveraging outcome uncertainty. Overall, we advocate a shift from prediction-centric fairness to a transparent, justice-based, multi-stakeholder approach that supports the collaborative design of decision-making policies.
[476] BOAT: Navigating the Sea of In Silico Predictors for Antibody Design via Multi-Objective Bayesian Optimization
Jackie Rao, Ferran Gonzalez Hernandez, Leon Gerard, Alexandra Gessner
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Antibody lead optimization is inherently a multi-objective challenge in drug discovery. Achieving a balance between different drug-like properties is crucial for the development of viable candidates, and this search becomes exponentially challenging as desired properties grow. The ever-growing zoo of sophisticated in silico tools for predicting antibody properties calls for an efficient joint optimization procedure to overcome resource-intensive sequential filtering pipelines. We present BOAT, a versatile Bayesian optimization framework for multi-property antibody engineering. Our `plug-and-play’ framework couples uncertainty-aware surrogate modeling with a genetic algorithm to jointly optimize various predicted antibody traits while enabling efficient exploration of sequence space. Through systematic benchmarking against genetic algorithms and newer generative learning approaches, we demonstrate competitive performance with state-of-the-art methods for multi-objective protein optimization. We identify clear regimes where surrogate-driven optimization outperforms expensive generative approaches and establish practical limits imposed by sequence dimensionality and oracle costs.
[477] TIP: Token Importance in On-Policy Distillation
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher–student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher–student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher–student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<$$20%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
[478] Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning
Zekai Lin, Chao Xue, Di Liang, Xingsheng Han, Peiyang Liu, Xianjie Wu, Lei Jiang, Yu Lu, Haibo Shi, Shuang Liang, Minlong Peng
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Supervised Fine-Tuning (SFT) of large language models often suffers from task interference and catastrophic forgetting. Recent approaches alleviate this issue by isolating task-critical parameters during training. However, these methods represent a static solution to a dynamic problem, assuming that parameter importance remains fixed once identified. In this work, we empirically demonstrate that parameter importance exhibits temporal drift over the course of training. To address this, we propose Evolving Parameter Isolation (EPI), a fine-tuning framework that adapts isolation decisions based on online estimates of parameter importance. Instead of freezing a fixed subset of parameters, EPI periodically updates isolation masks using gradient-based signals, enabling the model to protect emerging task-critical parameters while releasing outdated ones to recover plasticity. Experiments on diverse multi-task benchmarks demonstrate that EPI consistently reduces interference and forgetting compared to static isolation and standard fine-tuning, while improving overall generalization. Our analysis highlights the necessity of synchronizing isolation mechanisms with the evolving dynamics of learning diverse abilities.
[479] PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
Zichao Yan, Yan Wu, Mica Xu Ji, Chaitra Agrahar, Esther Wershof, Marcel Nassar, Mehrshad Sadria, Ridvan Eksi, Vladimir Trifonov, Ignacio Ibarra, Telmo Felgueira, Błażej Osiński, Rory Stark
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Predicting the effects of perturbations in-silico on cell state can identify drivers of cell behavior at scale and accelerate drug discovery. However, modeling challenges remain due to the inherent heterogeneity of single cell gene expression and the complex, latent gene dependencies. Here, we present PRiMeFlow, an end-to-end flow matching based approach to directly model the effects of genetic and small molecule perturbations in the gene expression space. The distribution-fitting approach taken by PRiMeFlow enables it to accurately approximate the empirical distribution of single-cell gene expression, which we demonstrate through extensive benchmarking inside PerturBench. Through ablation studies, we also validate important model design choices such as operating in gene expression space and parameterizing the velocity field with a U-Net architecture. The PRiMeFlow architecture was used as the basis for the model that won the Generalist Prize in the first ARC Virtual Cell Challenge.
[480] $π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a multi-agent self-evolution framework. In $π$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.
[481] Unsupervised domain transfer: Overcoming signal degradation in sleep monitoring by increasing scoring realism
Mohammad Ahangarkiasari, Andreas Tind Damgaard, Casper Haurum, Kaare B. Mikkelsen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Objective: Investigate whether hypnogram ‘realism’ can be used to guide an unsupervised method for handling arbitrary types of signal degradation in mobile sleep monitoring. Approach: Combining a pretrained, state-of-the-art ‘u-sleep’ model with a ‘discriminator’ network, we align features from a target domain with a feature space learned during pretraining. To test the approach, we distort the source domain with realistic signal degradations, to see how well the method can adapt to different types of degradation. We compare the performance of the resulting model with best-case models designed in a supervised manner for each type of transfer. Main Results: Depending on the type of distortion, we find that the unsupervised approach can increase Cohen’s kappa with as little as 0.03 and up to 0.29, and that for all transfers, the method does not decrease performance. However, the approach never quite reaches the estimated theoretical optimal performance, and when tested on a real-life domain mismatch between two sleep studies, the benefit was insignificant. Significance: ‘Discriminator-guided fine tuning’ is an interesting approach to handling signal degradation for ‘in the wild’ sleep monitoring, with some promise. In particular, what it says about sleep data in general is interesting. However, more development will be necessary before using it ‘in production’.
[482] From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model’s existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
[483] Physics-Informed Neural Networks for Methane Sorption: Cross-Gas Transfer Learning, Ensemble Collapse Under Physics Constraints, and Monte Carlo Dropout Uncertainty Quantification
Mohammad Nooraiepour, Zezhang Song, Wei Li, Sarah Perez
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate methane sorption prediction across heterogeneous coal ranks requires models that combine thermodynamic consistency, efficient knowledge transfer across data-scarce geological systems, and calibrated uncertainty estimates, capabilities that are rarely addressed together in existing frameworks. We present a physics-informed transfer learning framework that adapts a hydrogen sorption PINN to methane sorption prediction via Elastic Weight Consolidation, coal-specific feature engineering, and a three-phase curriculum that progressively balances transfer preservation with thermodynamic fine-tuning. Trained on 993 equilibrium measurements from 114 independent coal experiments spanning lignite to anthracite, the framework achieves R2 = 0.932 on held-out coal samples, a 227% improvement over pressure-only classical isotherms, while hydrogen pre-training delivers 18.9% lower RMSE and 19.4% faster convergence than random initialization. Five Bayesian uncertainty quantification approaches reveal a systematic divergence in performance across physics-constrained architectures. Monte Carlo Dropout achieves well-calibrated uncertainty at minimal overhead, while deep ensembles, regardless of architectural diversity or initialization strategy, exhibit performance degradation because shared physics constraints narrow the admissible solution manifold. SHAP and ALE analyses confirm that learned representations remain physically interpretable and aligned with established coal sorption mechanisms: moisture-volatile interactions are most influential, pressure-temperature coupling captures thermodynamic co-dependence, and features exhibit non-monotonic effects. These results identify Monte Carlo Dropout as the best-performing UQ method in this physics-constrained transfer learning framework, and demonstrate cross-gas transfer learning as a data-efficient strategy for geological material modeling.
[484] A Complete Symmetry Classification of Shallow ReLU Networks
Pranavkrishnan Ramakrishnan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Parameter space is not function space for neural network architectures. This fact, investigated as early as the 1990s under terms such as reverse engineering," or parameter identifiability", has led to the natural question of parameter space symmetries\textemdash the study of distinct parameters in neural architectures which realize the same function. Indeed, the quotient space obtained by identifying parameters giving rise to the same function, called the \textit{neuromanifold}, has been shown in some cases to have rich geometric properties, impacting optimization dynamics. Thus far, techniques towards complete classifications have required the analyticity of the activation function, notably excising the important case of ReLU. Here, in contrast, we exploit the non-differentiability of the ReLU activation to provide a complete classification of the symmetries in the shallow case.
[485] LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr, Natasha Jaques, Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
[486] Neural architectures for resolving references in program code
Gergő Szalay, Gergely Zsolt Kovács, Sándor Teleki, Balázs Pintér, Tibor Gregorics
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures are struggling on these benchmarks. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples that are ten times longer compared to the best baseline. We measure the impact of our architecture in the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.
[487] Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
[488] Complex Interpolation of Matrices with an application to Multi-Manifold Learning
Adi Arbel, Stefan Steinerberger, Ronen Talmon
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Given two symmetric positive-definite matrices $A, B \in \mathbb{R}^{n \times n}$, we study the spectral properties of the interpolation $A^{1-x} B^x$ for $0 \leq x \leq 1$. The presence of `common structures’ in $A$ and $B$, eigenvectors pointing in a similar direction, can be investigated using this interpolation perspective. Generically, exact log-linearity of the operator norm $|A^{1-x} B^x|$ is equivalent to the existence of a shared eigenvector in the original matrices; stability bounds show that approximate log-linearity forces principal singular vectors to align with leading eigenvectors of both matrices. These results give rise to and provide theoretical justification for a multi-manifold learning framework that identifies common and distinct latent structures in multiview data.
[489] On an $L^2$ norm for stationary ARMA processes
Anand Ganesh, Babhrubahan Bose, Anand Rajagopalan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2408.10610: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.10610&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[490] Hybrid Attention Model Using Feature Decomposition and Knowledge Distillation for Glucose Forecasting
Ebrahim Farahmand, Shovito Barua Soumma, Nooshin Taheri Chatrudi, Hassan Ghasemzadeh
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2411.10703: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.10703&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[491] A ghost mechanism: An analytical model of abrupt learning in recurrent networks
Fatih Dinc, Ege Cirakman, Bariscan Kurtkaya, Mert Yuksekgonul, Yiqi Jiang, Mark J. Schnitzer, Hidenori Tanaka
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2501.02378: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.02378&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[492] Biased Federated Learning under Wireless Heterogeneity
Muhammad Faraz Ul Abrar, Nicolò Michelusi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2503.06078: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.06078&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[493] Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
Anna C.M. Thöni, Yoram Bachrach, Tal Kachman
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2504.13228: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13228&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[494] MDPs with a State Sensing Cost
Vansh Kapoor, Jayakrishnan Nair
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.03280: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.03280&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[495] RANDPOL: Parameter-Efficient End-to-End Quadruped Locomotion via Randomized Policy Learning
Zhuochen Liu, Rahul Jain, Quan Nguyen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.19054: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19054&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[496] Scalable unsupervised feature selection via weight stability
Xudong Zhang, Renato Cordeiro de Amorim
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.06114: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06114&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[497] Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes
Daniel Jenson, Jhonathan Navott, Piotr Grynfelder, Mengyan Zhang, Makkunda Sharma, Elizaveta Semenova, Seth Flaxman
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.09163: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09163&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[498] mLaSDI: Multi-stage latent space dynamics identification
William Anderson, Seung Whan Chung, Robert Stephany, Youngsoo Choi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.09207: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09207&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[499] On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes
Mathieu Godbout, Audrey Durand
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.14005: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.14005&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[500] Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Zhengyan Wan, Yidong Ouyang, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.21912: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21912&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[501] Power Transform Revisited: Numerically Stable, and Federated
Xuefeng Xu, Graham Cormode
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.04995: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04995&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[502] Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student’s perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD’s apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
[503] Modeling Student Learning with 3.8 Million Program Traces
Alexis Ross, Megha Srivastava, Jeremiah Blanchard, Jacob Andreas
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.05056: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05056&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[504] Empowering Targeted Neighborhood Search via Hyper Tour for Large-Scale TSP
Tongkai Lu, Shuai Ma, Chongyang Tao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.20169: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20169&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[505] Data-Efficient RLVR via Off-Policy Influence Guidance
Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.26491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[506] Think Outside the Policy: In-Context Steered Policy Optimization
Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Clive Bai, Saiyong Yang, Yunfang Wu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.26519: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26519&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[507] Learning Dynamics from Input-Output Data with Hamiltonian Gaussian Processes
Jan-Hendrik Ewering, Robin E. Herrmann, Niklas Wahlström, Thomas B. Schön, Thomas Seel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.05330: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05330&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[508] Mitigating Barren Plateaus in Quantum Denoising Diffusion Probabilistic Model
Haipeng Cao, Kaining Zhang, Dacheng Tao, Zhaofeng Su
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.06695: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06695&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[509] Guided Transfer Learning for Discrete Diffusion Models
Julian Kleutgens, Claudio Battiloro, Lingkai Kong, Benjamin Grewe, Francesca Dominici, Mauricio Tec
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.10877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[510] SHARe-KAN: Post-Training Vector Quantization for Cache-Resident KAN Inference
Jeff Smith
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.15742: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15742&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[511] Learning-Based Estimation of Spatially Resolved Scatter Radiation Fields in Interventional Radiology
Felix Lehner, Pasquale Lombardo, Susana Castillo, Oliver Hupe, Marcus Magnor
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.17654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[512] A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios
Haley Rosso, Talea Mayo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.23748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[513] Predicting Time Pressure of Powered Two-Wheeler Riders for Proactive Safety Interventions
Sumit S. Shevtekar, Chandresh K. Maurya, Gourab Sil
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03173: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03173&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[514] Swap Regret Minimization Through Response-Based Approachability
Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.06264: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06264&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[515] Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare
Elizabeth W. Miller, Jeffrey D. Blume
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.00192: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00192&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[516] ReproMIA: A Comprehensive Analysis of Model Reprogramming for Proactive Membership Inference Attacks
Chihan Huang, Huaijin Wang, Shuai Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.28942: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28942&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[517] Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It
Nida Zamir, I-Hong Hou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.04101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[518] When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs
Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10062: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10062&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[519] Robust Adversarial Policy Optimization Under Dynamics Uncertainty
Mintae Kim, Koushil Sreenath
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10974: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10974&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[520] Fast and principled equation discovery from chaos to climate
Yuzheng Zhang, Weizhen Li, Rui Carvalho
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[521] Convex Hulls of Reachable Sets
Thomas Lew, Riccardo Bonalli, Marco Pavone
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2303.17674: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2303.17674&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[522] Nonparametric Sparse Online Learning of the Koopman Operator
Boya Hou, Sina Sanjari, Nathan Dahlin, Alec Koppel, Subhonmesh Bose
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2405.07432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.07432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[523] Fast training of accurate physics-informed neural networks without gradient descent
Chinmay Datar, Taniya Kapoor, Abhishek Chandra, Qing Sun, Erik Lien Bolager, Iryna Burak, Anna Veselovska, Massimo Fornasier, Felix Dietrich
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2405.20836: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.20836&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[524] Fast and Simple Densest Subgraph with Predictions
Thai Bui, Luan Nguyen, Hoa T. Vu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.12600: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12600&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[525] Flow-based Generative Modeling of Potential Outcomes and Counterfactuals
Dongze Wu, David I. Inouye, Yao Xie
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.16051: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16051&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[526] Geminet: Learning the Duality-based Iterative Process for Lightweight Traffic Engineering in Changing Topologies
Ximeng Liu, Zhuoran Liu, Yingming Mao, Yatao Li, Shizhen Zhao, Xinbing Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.23640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[527] A Comprehensive Survey on Network Traffic Synthesis: From Statistical Models to Deep Learning
Nirhoshan Sivaroopan, Kaushitha Silva, Chamara Madarasingha, Thilini Dahanayaka, Guillaume Jourjon, Anura Jayasumana, Kanchana Thilakarathna
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.01976: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.01976&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[528] Neural Two-Stage Stochastic Optimization for Solving Unit Commitment Problem
Zhentong Shao, Jingtao Qin, Nanpeng Yu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.09503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.09503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[529] Random Walk Learning and the Pac-Man Attack
Xingran Chen, Parimal Parag, Rohit Bhagat, Zonghong Liu, Salim El Rouayheb
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.05663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[530] Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
Patrick Forré, Abel Jansma
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.05786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[531] Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models
Mingen Li, Houjian Yu, Yixuan Huang, Youngjin Hong, Hantao Ye, Changhyun Choi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.19268: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19268&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[532] Zero-Shot Function Encoder-Based Differentiable Predictive Control
Hassan Iqbal, Xingjian Li, Tyler Ingebrand, Adam Thorpe, Krishna Kumar, Ufuk Topcu, Ján Drgoňa
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.05757: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05757&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[533] Robust Verification of Controllers under State Uncertainty via Hamilton-Jacobi Reachability Analysis
Albert Lin, Alessandro Pinto, Somil Bansal
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.14755: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14755&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[534] Fluids You Can Trust: Property-Preserving Operator Learning for Incompressible Flows
Ramansh Sharma, Matthew Lowery, Houman Owhadi, Varun Shankar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.15472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[535] Mini-Batch Covariance, Diffusion Limits, and Oracle Complexity in Stochastic Gradient Descent: A Sampling-Design Perspective
Daniel Zantedeschi, Kumar Muthuraman
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.02417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[536] LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.03959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[537] Spectral methods: crucial for machine learning, natural for quantum computers?
Vasilis Belis, Joseph Bowles, Rishabh Gupta, Evan Peters, Maria Schuld
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.24654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[538] Transcriptomic Models for Immunotherapy Response Prediction Show Limited Cross-cohort Generalisability
Yuheng Liang, Lucy Chhuo, Ahmadreza Argha, Nona Farbehi, Lu Chen, Roohallah Alizadehsani, Mehdi Hosseinzadeh, Amin Beheshti, Thantrira Porntaveetusm, Youqiong Ye, Hamid Alinejad-Rokny
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.05478: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05478&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[539] Parameter-Free Non-Ergodic Extragradient Algorithms for Solving Monotone Variational Inequalities
Lingqing Shen, Fatma Kılınç-Karzan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.07662: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07662&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[540] Evaluating Differential Privacy Against Membership Inference in Federated Learning: Insights from the NIST Genomics Red Team Challenge
Gustavo de Carvalho Bertoli
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12737: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12737&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[541] C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic–Vehicle Coordination
Yuyang Chen, Kaiyan Zhao, Yiming Wang, Ming Yang, Bin Rao, Zhenning Li
Main category: cs.MA
TL;DR: C2T framework uses LLM knowledge distillation to create intrinsic rewards for multi-agent traffic control, outperforming traditional MARL approaches on traffic efficiency, safety, and energy metrics.
Details
Motivation: Current MARL-based traffic control systems are limited by hand-crafted, myopic rewards that fail to capture high-level human-centric goals like safety, flow stability, and comfort. There's a need for more intelligent reward functions that can better coordinate traffic light controllers and autonomous vehicles.Method: C2T framework distills “common-sense” knowledge from a Large Language Model into a learned intrinsic reward function. This LLM-derived reward is then used to guide the coordination policy of a cooperative multi-intersection TLC MARL system on CityFlow-based benchmarks.
Result: The framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy-related proxy. It also demonstrates flexibility by allowing distinct “efficiency-focused” versus “safety-focused” policies through LLM prompt modifications.
Conclusion: LLM knowledge distillation can effectively create better intrinsic rewards for multi-agent traffic control systems, enabling more human-centric coordination policies that outperform traditional approaches across multiple metrics.
Abstract: State-of-the-art (SOTA) urban traffic control increasingly employs Multi-Agent Reinforcement Learning (MARL) to coordinate Traffic Light Controllers (TLCs) and Connected Autonomous Vehicles (CAVs). However, the performance of these systems is fundamentally capped by their hand-crafted, myopic rewards (e.g., intersection pressure), which fail to capture high-level, human-centric goals like safety, flow stability, and comfort. To overcome this limitation, we introduce C2T, a novel framework that learns a common-sense coordination model from traffic-vehicle dynamics. C2T distills “common-sense” knowledge from a Large Language Model (LLM) into a learned intrinsic reward function. This new reward is then used to guide the coordination policy of a cooperative multi-intersection TLC MARL system on CityFlow-based multi-intersection benchmarks. Our framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy-related proxy. We further highlight C2T’s flexibility in principle, allowing distinct “efficiency-focused” versus “safety-focused” policies by modifying the LLM prompt.
[542] Learning Probabilistic Responsibility Allocations for Multi-Agent Interactions
Isaac Remy, Caleb Chang, Karen Leung
Main category: cs.MA
TL;DR: A method for learning probabilistic responsibility allocation models in multi-agent interactions using conditional variational autoencoders and trajectory forecasting, evaluated on driving datasets.
Details
Motivation: Understanding how people allocate responsibility in interactive settings is crucial for designing socially compliant and trustworthy autonomous systems, as human behavior is shaped by both individual objectives and shared constraints like safety.Method: Uses conditional variational autoencoder latent space combined with multi-agent trajectory forecasting techniques to learn a distribution over responsibility allocations conditioned on scene and agent context. Incorporates a differentiable optimization layer that maps responsibility allocations to induced controls since ground-truth labels are unavailable.
Result: The method achieves strong predictive performance on the INTERACTION driving dataset and provides interpretable insights into patterns of multi-agent interaction through the lens of responsibility.
Conclusion: The proposed approach successfully learns probabilistic responsibility allocation models that capture multimodal uncertainty in multi-agent interactions, offering valuable insights for autonomous system design.
Abstract: Human behavior in interactive settings is shaped not only by individual objectives but also by shared constraints with others, such as safety. Understanding how people allocate responsibility, i.e., how much one deviates from their desired policy to accommodate others, can inform the design of socially compliant and trustworthy autonomous systems. In this work, we introduce a method for learning a probabilistic responsibility allocation model that captures the multimodal uncertainty inherent in multi-agent interactions. Specifically, our approach leverages the latent space of a conditional variational autoencoder, combined with techniques from multi-agent trajectory forecasting, to learn a distribution over responsibility allocations conditioned on scene and agent context. Although ground-truth responsibility labels are unavailable, the model remains tractable by incorporating a differentiable optimization layer that maps responsibility allocations to induced controls, which are available. We evaluate our method on the INTERACTION driving dataset and demonstrate that it not only achieves strong predictive performance but also provides interpretable insights, through the lens of responsibility, into patterns of multi-agent interaction.
[543] MIND: AI Co-Scientist for Material Research
Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na, Hyunsoo Cho, Sookyung Kim
Main category: cs.MA
TL;DR: MIND is an LLM-driven multi-agent framework for automated hypothesis validation in materials research, integrating ML interatomic potentials for in-silico experiments.
Details
Motivation: Current LLM-based scientific discovery systems are limited to text-based reasoning without automated experimental verification, creating a gap between hypothesis generation and validation.Method: Multi-agent pipeline with hypothesis refinement, experimentation using Machine Learning Interatomic Potentials (SevenNet-Omni), and debate-based validation; modular design with web interface.
Result: Framework enables automated hypothesis testing in materials research with scalable in-silico experiments; code and demo available.
Conclusion: MIND bridges the gap between LLM-driven hypothesis generation and experimental validation, providing an adaptable framework for scientific discovery with potential for broader applications.
Abstract: Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to textbased reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS-Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.
[544] [COMP25] The Automated Negotiating Agents Competition (ANAC) 2025 Challenges and Results
Reyhan Aydoğan, Tim Baarslag, Tamara C. P. Florijn, Katsuhide Fujita, Catholijn M. Jonker, Yasser Mohammad
Main category: cs.MA
TL;DR: ANAC 2025 competition results and analysis focusing on multi-deal negotiations and concurrent negotiation agents in supply chain management
Details
Motivation: To present research challenges and findings from the 15th International Automated Negotiating Agents Competition, focusing on advancing automated negotiation techniques in complex environmentsMethod: Analysis of competition results from ANAC 2025, focusing on two domains: multi-deal negotiations and concurrent negotiation agents in supply chain management environments
Result: Key findings from the competition are presented, along with analysis of agent performance in complex negotiation scenarios
Conclusion: The paper outlines strategic directions for future iterations of automated negotiation competitions and research
Abstract: This paper presents the primary research challenges and key findings from the 15th International Automated Negotiating Agents Competition (ANAC 2025), one of the official competitions of IJCAI 2025. We focus on two critical domains: multi-deal negotiations and the development of agents capable of concurrent negotiation within complex supply chain management environments. Furthermore, this work analyzes the results of the competition and outlines strategic directions for future iterations.
[545] RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
Kai Zhang, Corey D Barrett, Jangwon Kim, Lichao Sun, Tara Taghavi, Krishnaram Kenthapadi
Main category: cs.MA
TL;DR: RadAgents is a multi-agent framework for chest X-ray interpretation that combines clinical priors with multimodal reasoning, following a radiologist-style workflow with verification mechanisms to address limitations in current methods.
Details
Motivation: Current CXR interpretation systems have three main limitations: (1) reasoning lacks clinical interpretability and guideline alignment, (2) insufficient multimodal evidence fusion leading to text-only rationales, and (3) inability to detect/resolve cross-tool inconsistencies without verification mechanisms.Method: RadAgents uses a multi-agent framework that encodes radiologist-style workflow into modular pipeline, integrates clinical priors with task-aware multimodal reasoning, and employs grounding and multimodal retrieval-augmentation to verify and resolve context conflicts.
Result: The framework produces outputs that are more reliable, transparent, and consistent with clinical practice compared to existing methods.
Conclusion: RadAgents bridges gaps in current CXR interpretation by providing clinically interpretable, multimodal reasoning with verification mechanisms, creating a more auditable and reliable system aligned with clinical workflows.
Abstract: Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework that couples clinical priors with task-aware multimodal reasoning and encodes a radiologist-style workflow into a modular, auditable pipeline. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.
[546] GUIDE: Guided Updates for In-context Decision Evolution in LLM-Driven Spacecraft Operations
Alejandro Carrasco, Mariko Storey-Matsutani, Victor Rodriguez-Fernandez, Richard Linares
Main category: cs.MA
TL;DR: GUIDE is a framework for LLM-based spacecraft control that evolves decision rules across episodes without weight updates, using a playbook that improves through offline reflection on past trajectories.
Details
Motivation: Current LLM-based approaches for spacecraft operations use static prompting and don't improve across repeated executions, limiting their adaptability and performance in dynamic space environments.Method: GUIDE uses a non-parametric policy improvement framework with a structured, state-conditioned playbook of natural-language decision rules. A lightweight acting model handles real-time control while offline reflection updates the playbook from prior trajectories.
Result: Evaluated on an adversarial orbital interception task in Kerbal Space Program Differential Games, GUIDE’s evolution consistently outperforms static baselines, demonstrating effective policy search over structured decision rules.
Conclusion: Context evolution in LLM agents functions as policy search over structured decision rules, enabling real-time closed-loop spacecraft interaction without requiring weight updates.
Abstract: Large language models (LLMs) have been proposed as supervisory agents for spacecraft operations, but existing approaches rely on static prompting and do not improve across repeated executions. We introduce \textsc{GUIDE}, a non-parametric policy improvement framework that enables cross-episode adaptation without weight updates by evolving a structured, state-conditioned playbook of natural-language decision rules. A lightweight acting model performs real-time control, while offline reflection updates the playbook from prior trajectories. Evaluated on an adversarial orbital interception task in the Kerbal Space Program Differential Games environment, GUIDE’s evolution consistently outperforms static baselines. Results indicate that context evolution in LLM agents functions as policy search over structured decision rules in real-time closed-loop spacecraft interaction.
cs.MM
[547] AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang
Main category: cs.MM
TL;DR: AVID is a large-scale benchmark for evaluating audio-visual inconsistency understanding in long videos, featuring 11.2K videos with 39.4K inconsistency events across 8 categories, addressing a critical gap in multimodal AI evaluation.
Details
Motivation: Current multimodal LLMs excel at aligned audio-visual tasks but struggle with perceiving cross-modal conflicts, which is crucial for trustworthy AI. Existing benchmarks focus on aligned events or deepfake detection, leaving a gap in evaluating inconsistency perception in long-form video contexts.Method: AVID uses a scalable pipeline: 1) temporal segmentation classifying video content into Active Speaker, Voiceover, and Scenic categories; 2) agent-driven strategy planner selecting appropriate inconsistency categories; 3) five specialized injectors for diverse audio-visual conflict injection. The benchmark includes 11.2K long videos with 39.4K annotated inconsistency events.
Result: Comprehensive evaluation shows state-of-the-art models have significant limitations in temporal grounding and reasoning. The fine-tuned baseline AVID-Qwen achieves 2.8× higher BLEU-4 in segment reasoning and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15).
Conclusion: AVID provides an effective testbed for advancing trustworthy multimodal AI systems by addressing the critical capability of audio-visual inconsistency understanding, which is fundamental for building reliable AI that can perceive cross-modal conflicts like humans.
Abstract: We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg. 235.5s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation across detection, temporal grounding, classification, and reasoning with 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, achieves substantial improvements over the base model (2.8$\times$ higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI systems.
[548] AudioX: A Unified Framework for Anything-to-Audio Generation
Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
Main category: cs.MM
TL;DR: AudioX is a unified multimodal framework for anything-to-audio generation that integrates text, video, and audio control signals using a Multimodal Adaptive Fusion module, trained on a large-scale dataset IF-caps with 7M+ samples.
Details
Motivation: Addressing two key challenges in audio/music generation: 1) lack of unified multimodal modeling framework, and 2) scarcity of large-scale, high-quality training data for multimodal-conditioned audio generation.Method: Proposes AudioX framework with Multimodal Adaptive Fusion module for effective fusion of diverse multimodal inputs (text, video, audio). Constructs IF-caps dataset with 7M+ samples through structured data annotation pipeline for comprehensive multimodal supervision.
Result: Achieves superior performance against SOTA methods across various tasks, especially in text-to-audio and text-to-music generation. Demonstrates powerful instruction-following potential for audio generation under multimodal control signals.
Conclusion: AudioX provides a unified solution for multimodal-conditioned audio generation with strong performance, addressing both framework unification and data scarcity challenges through innovative fusion module and large-scale dataset construction.
Abstract: Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, and 2) large-scale, high-quality training data. As such, we propose AudioX, a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals) in this work. The core design in this framework is a Multimodal Adaptive Fusion module, which enables the effective fusion of diverse multimodal inputs, enhancing cross-modal alignment and improving overall generation quality. To train this unified model, we construct a large-scale, high-quality dataset, IF-caps, comprising over 7 million samples curated through a structured data annotation pipeline. This dataset provides comprehensive supervision for multimodal-conditioned audio generation. We benchmark AudioX against state-of-the-art methods across a wide range of tasks, finding that our model achieves superior performance, especially in text-to-audio and text-to-music generation. These results demonstrate our method is capable of audio generation under multimodal control signals, showing powerful instruction-following potential. The code and datasets will be available at https://zeyuet.github.io/AudioX/.
eess.AS
[549] ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks
Aurosweta Mahapatra, Ismail Rasim Ulgen, Kong Aik Lee, Nicholas Andrews, Berrak Sisman
Main category: eess.AS
TL;DR: ProSDD: A two-stage speech deepfake detection framework that learns prosodic variability from real speech through masked prediction, then jointly optimizes with spoof classification, achieving strong generalization to emotional spoofing attacks.
Details
Motivation: Current speech deepfake detection systems fail to generalize to expressive and emotional spoofing attacks because they rely on spoof-heavy training data and learn dataset-specific artifacts rather than transferable cues of natural speech.Method: Two-stage framework: Stage I learns prosodic variability from real speech through supervised masked prediction of speaker-conditioned prosodic variation (pitch, voice activity, energy). Stage II jointly optimizes this objective with spoof classification.
Result: ProSDD consistently outperforms baselines, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), with 50% relative reductions on EmoFake and EmoSpoof-TTS datasets.
Conclusion: Learning prosodic variability from real speech enables better generalization to emotional spoofing attacks, addressing limitations of current spoof-heavy training approaches.
Abstract: Speech deepfake detection (SDD) systems perform well on standard benchmarks datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.
[550] Classical Machine Learning Baselines for Deepfake Audio Detection on the Fake-or-Real Dataset
Faheem Ahmad, Ajan Ahmed, Masudul Imtiaz
Main category: eess.AS
TL;DR: Interpretable classical ML baseline for deepfake audio detection using prosodic, voice-quality, and spectral features achieves ~93% accuracy with RBF SVM, identifying pitch variability and spectral richness as key discriminative cues.
Details
Motivation: Deep learning enables highly realistic synthetic speech, raising concerns about fraud and disinformation. While neural detectors exist, there's a need for transparent baselines to reveal which acoustic cues reliably separate real from synthetic speech.Method: Extract prosodic, voice-quality, and spectral features from 2-second clips at 44.1 kHz (high-fidelity) and 16 kHz (telephone-quality). Use statistical analysis (ANOVA, correlation heatmaps) to identify significant features. Train multiple classifiers including Logistic Regression, LDA, QDA, Gaussian Naive Bayes, SVMs, and GMMs.
Result: Best model (RBF SVM) achieves ~93% test accuracy and ~7% EER on both sampling rates. Linear models reach ~75% accuracy. Feature analysis reveals pitch variability and spectral richness (spectral centroid, bandwidth) as key discriminative cues.
Conclusion: Provides a strong, interpretable baseline for future deepfake audio detectors, demonstrating that classical ML with carefully selected acoustic features can effectively detect synthetic speech while offering transparency about which cues matter most.
Abstract: Deep learning has enabled highly realistic synthetic speech, raising concerns about fraud, impersonation, and disinformation. Despite rapid progress in neural detectors, transparent baselines are needed to reveal which acoustic cues reliably separate real from synthetic speech. This paper presents an interpretable classical machine learning baseline for deepfake audio detection using the Fake-or-Real (FoR) dataset. We extract prosodic, voice-quality, and spectral features from two-second clips at 44.1 kHz (high-fidelity) and 16 kHz (telephone-quality) sampling rates. Statistical analysis (ANOVA, correlation heatmaps) identifies features that differ significantly between real and fake speech. We then train multiple classifiers – Logistic Regression, LDA, QDA, Gaussian Naive Bayes, SVMs, and GMMs – and evaluate performance using accuracy, ROC-AUC, EER, and DET curves. Pairwise McNemar’s tests confirm statistically significant differences between models. The best model, an RBF SVM, achieves ~93% test accuracy and ~7% EER on both sampling rates, while linear models reach ~75% accuracy. Feature analysis reveals that pitch variability and spectral richness (spectral centroid, bandwidth) are key discriminative cues. These results provide a strong, interpretable baseline for future deepfake audio detectors.
[551] Few-Shot and Pseudo-Label Guided Speech Quality Evaluation with Large Language Models
Ryandhimas E. Zezario, Dyah A. M. G. Wisnu, Szu-Wei Fu, Sabato Marco Siniscalchi, Hsin-Min Wang, Yu Tsao
Main category: eess.AS
TL;DR: GatherMOS uses LLMs as meta-evaluators to aggregate diverse signals (acoustic descriptors, DNSMOS, VQScore) for predicting speech quality scores, outperforming existing methods in limited-data scenarios.
Details
Motivation: The paper addresses the challenge of non-intrusive speech quality evaluation by proposing a framework that leverages LLMs' reasoning capabilities to aggregate heterogeneous quality signals, overcoming limitations of single models and improving performance in data-scarce conditions.Method: GatherMOS integrates lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore, using LLMs as meta-evaluators to reason over these heterogeneous inputs and predict perceptual mean opinion scores (MOS). The framework explores both zero-shot and few-shot in-context learning setups.
Result: Experiments on VoiceBank-DEMAND dataset show GatherMOS consistently outperforms DNSMOS, VQScore, naive score averaging, and learning-based models (CNN-BLSTM, MOS-SSL) under limited labeled-data conditions. Zero-shot GatherMOS maintains stable performance across diverse conditions, while few-shot guidance yields large gains when support samples match test conditions.
Conclusion: LLM-based aggregation shows promise as a practical strategy for non-intrusive speech quality evaluation, demonstrating that LLMs can effectively reason over heterogeneous quality signals to produce accurate MOS predictions, especially in data-limited scenarios.
Abstract: In this paper, we introduce GatherMOS, a novel framework that leverages large language models (LLM) as meta-evaluators to aggregate diverse signals into quality predictions. GatherMOS integrates lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore, enabling the LLM to reason over heterogeneous inputs and infer perceptual mean opinion scores (MOS). We further explore both zero-shot and few-shot in-context learning setups, showing that zero-shot GatherMOS maintains stable performance across diverse conditions, while few-shot guidance yields large gains when support samples match the test conditions. Experiments on the VoiceBank-DEMAND dataset demonstrate that GatherMOS consistently outperforms DNSMOS, VQScore, naive score averaging, and even learning-based models such as CNN-BLSTM and MOS-SSL when trained under limited labeled-data conditions. These results highlight the potential of LLM-based aggregation as a practical strategy for non-intrusive speech quality evaluation.
[552] SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion
Zhiyong Chen, Shuhang Wu, Yingjie Duan, Xinkang Xu, Xinhui Hu
Main category: eess.AS
TL;DR: Improved open-set speaker identification using pretrained foundation models with enhanced learning objectives, model fusion strategy, and model selection method, achieving significant EER reduction.
Details
Motivation: To improve open-set speaker identification by addressing limitations in existing approaches, particularly in robustness, generalization, and stability during few-shot tuning with pretrained speaker foundation models.Method: 1) Enhanced open-set learning objective combining reciprocal points learning with logit normalization and adaptive anchor learning; 2) Model fusion strategy to stabilize few-shot tuning; 3) Model selection method for optimal fusion performance.
Result: Achieved 0.09% EER on Vox1-O-like test set (93% relative reduction from 1.28%), demonstrating effectiveness across VoxCeleb, ESD and 3D-Speaker datasets under diverse conditions.
Conclusion: The proposed improvements significantly enhance open-set speaker identification performance, robustness, and stability when using pretrained speaker foundation models.
Abstract: This paper proposes an improved approach for open-set speaker identification based on pretrained speaker foundation models. Building upon the previous Speaker Reciprocal Points Learning framework (V1), we first introduce an enhanced open-set learning objective by integrating reciprocal points learning with logit normalization (LogitNorm) and incorporating adaptive anchor learning to better constrain target speaker representations and improve robustness. Second, we propose a model fusion strategy to stabilize and enhance the few-shot tuning process, effectively reducing result randomness and improving generalization. Furthermore, we introduce a model selection method to ensure optimal performance in model fusion. Experimental evaluations on the VoxCeleb, ESD and 3D-Speaker datasets demonstrate the effectiveness and robustness of the proposed method under diverse conditions. On a newly proposed Vox1-O-like test set, our method reduces the EER from 1.28% to 0.09%, achieving a relative reduction of approximately 93%.
eess.IV
[553] Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
Lakmali Nadeesha Kumari, Sen-Ching Samson Cheung
Main category: eess.IV
TL;DR: DFA introduces dynamic focal attention with learnable per-class bias in cross-attention to address class imbalance in histopathology segmentation, moving beyond frequency-based reweighting to capture true difficulty factors like morphological variability.
Details
Motivation: Current approaches to class imbalance in semantic segmentation rely on frequency-based loss reweighting, which assumes rare classes are difficult. However, true difficulty arises from morphological variability, boundary ambiguity, and contextual similarity - factors that frequency cannot capture. There's a need for methods that can learn class-specific difficulty directly from data.Method: Proposes Dynamic Focal Attention (DFA), a mechanism that learns class-specific difficulty within the cross-attention of query-based mask decoders. DFA introduces a learnable per-class bias to attention logits, enabling representation-level reweighting before prediction rather than gradient-level reweighting after prediction. Initialized from a log-frequency prior to prevent gradient starvation, the bias is optimized end-to-end, allowing the model to adaptively capture difficulty signals during training.
Result: On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU metrics, matching or exceeding a difficulty-aware baseline without requiring a separate estimator or additional training stage.
Conclusion: Encoding class difficulty at the representation level through attention mechanisms provides a principled alternative to conventional loss reweighting for imbalanced segmentation, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework.
Abstract: Semantic segmentation of histopathology images under class imbalance is typically addressed through frequency-based loss reweighting, which implicitly assumes that rare classes are difficult. However, true difficulty also arises from morphological variability, boundary ambiguity, and contextual similarity-factors that frequency cannot capture. We propose Dynamic Focal Attention (DFA), a simple and efficient mechanism that learns class-specific difficulty directly within the cross-attention of query-based mask decoders. DFA introduces a learnable per-class bias to attention logits, enabling representation-level reweighting prior to prediction rather than gradient-level reweighting after prediction. Initialised from a log-frequency prior to prevent gradient starvation, the bias is optimised end-to-end, allowing the model to adaptively capture difficulty signals through training, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework. On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU, matching or exceeding a difficulty-aware baseline without a separate estimator or additional training stage. These results demonstrate that encoding class difficulty at the representation level provides a principled alternative to conventional loss reweighting for imbalanced segmentation.
[554] Cyclic 2.5D Perceptual Loss for Cross-Modal 3D Medical Image Synthesis: T1w MRI to Tau PET
Junho Moon, Symac Kim, Haejun Chung, Ikbeom Jang
Main category: eess.IV
TL;DR: A method for synthesizing tau PET images from structural MRI using cyclic 2.5D perceptual loss to improve volumetric consistency and standardization by scanner manufacturer to reduce variability.
Details
Motivation: PET imaging is valuable for Alzheimer's disease diagnosis but limited by cost, regulatory restrictions, and invasiveness. Cross-modal image synthesis can reconstruct unavailable PET modalities from routine MRI scans, addressing access barriers.Method: Proposes cyclic 2.5D perceptual loss that alternates optimization across axial, coronal, and sagittal planes during training to improve volumetric consistency. Also standardizes PET SUVRs by scanner manufacturer to reduce inter-manufacturer variability. Uses various architectures (U-Net, UNETR, SwinUNETR, CycleGAN, Pix2Pix) for synthesis.
Result: Method generalizes across multiple architectures with strong performance. Improves agreement between synthesized SUVRs and measured PET in brain regions relevant to Alzheimer-type tau pathology. Validated on ADNI and SCAN cohorts spanning ADRD spectrum.
Conclusion: Cyclic 2.5D perceptual loss with scanner manufacturer standardization enables effective synthesis of tau PET from structural MRI, addressing limitations of existing perceptual losses for 3D medical image synthesis.
Abstract: Positron emission tomography (PET) provides molecular biomarkers for Alzheimer’s disease and related dementias (ADRD) and is increasingly used for diagnosis, staging, and clinical trial enrichment. However, its use is limited by cost, regulatory restrictions, and the invasiveness of radiotracer injection. Although current frameworks emphasize multimodal biomarker assessment, including the amyloid/tau/neurodegeneration (A/T/N) scheme, these barriers constrain access to PET imaging. Cross-modal image synthesis may help address this gap by reconstructing unavailable modalities from routine scans. Because PET is clinically valuable for regional uptake patterns rather than exact voxel-wise intensities, perceptual losses that capture higher-level semantic features are well suited to PET synthesis. Existing 2D, 3D, and 2.5D perceptual losses for 3D synthesis each have limitations, including restricted volumetric context, scarcity of pretrained 3D models, and difficulty balancing optimization across anatomical planes. In this study, we synthesize tau PET from structural MRI by generating 3D pseudo-[18F]flortaucipir standardized uptake value ratio (SUVR) maps from 3D T1-weighted MR images. We propose a cyclic 2.5D perceptual loss that alternates optimization across axial, coronal, and sagittal planes during training to improve volumetric consistency. We also standardize PET SUVRs by scanner manufacturer, reducing inter-manufacturer variability and better preserving high-uptake regions. Using cohorts spanning the ADRD spectrum from the ADNI and the SCAN cohort, we show that the method generalizes across U-Net, UNETR, SwinUNETR, CycleGAN, and Pix2Pix, with strong performance. Notably, it improves agreement between synthesized SUVRs and measured PET in brain regions relevant to Alzheimer-type tau pathology. Code is publicly available at https://github.com/labhai/Cyclic-2.5D-Perceptual-Loss.
[555] The Gaussian Latent Machine: Efficient Prior and Posterior Sampling for Inverse Problems
Muhamed Kuric, Martin Zach, Andreas Habring, Michael Unser, Thomas Pock
Main category: eess.IV
TL;DR: A novel Gaussian latent machine framework for efficient sampling from product-of-experts models in Bayesian imaging, unifying existing algorithms and enabling efficient two-block Gibbs sampling.
Details
Motivation: The paper addresses the challenge of sampling from complex product-of-experts models commonly used in Bayesian imaging, which often suffer from computational inefficiency. There's a need for a unified framework that can handle various prior and posterior distributions while providing efficient sampling algorithms.Method: Proposes lifting product-of-experts models into a novel latent variable model called Gaussian latent machine. This framework enables general sampling approaches that unify existing algorithms, with a focus on efficient two-block Gibbs sampling for general cases and direct sampling for special cases.
Result: The Gaussian latent machine framework successfully unifies many existing sampling algorithms and provides efficient sampling methods. Numerical experiments demonstrate the approach’s effectiveness across various Bayesian imaging problems involving different prior and posterior distributions.
Conclusion: The proposed Gaussian latent machine offers a powerful, unified framework for sampling from product-of-experts models in Bayesian imaging, providing both theoretical unification and practical efficiency improvements over existing methods.
Abstract: We consider the problem of sampling from a product-of-experts-type model that encompasses many standard prior and posterior distributions commonly found in Bayesian imaging. We show that this model can be easily lifted into a novel latent variable model, which we refer to as a Gaussian latent machine. This leads to a general sampling approach that unifies and generalizes many existing sampling algorithms in the literature. Most notably, it yields a highly efficient and effective two-block Gibbs sampling approach in the general case, while also specializing to direct sampling algorithms in particular cases. Finally, we present detailed numerical experiments that demonstrate the efficiency and effectiveness of our proposed sampling approach across a wide range of prior and posterior sampling problems from Bayesian imaging.