Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 220]
cs.CV [Total: 332]
cs.AI [Total: 221]
cs.SD [Total: 20]
cs.LG [Total: 248]
cs.MA [Total: 11]
cs.MM [Total: 1]
eess.AS [Total: 10]
eess.IV [Total: 12]

Abstract: Job Skill Named Entity Recognition (JobSkillNER) aims to automatically extract key skill information from large-scale job posting data, which is important for improving talent-market matching efficiency and supporting personalized employment services. To the best of our knowledge, this work presents the first Chinese JobSkillNER dataset for recruitment texts. We propose annotation guidelines tailored to Chinese job postings and an LLM-empowered Macro-Micro collaborative annotation pipeline. The pipeline leverages the contextual understanding ability of large language models (LLMs) for initial annotation and then refines the results through expert sentence-level adjudication. Using this pipeline, we annotate more than 20,000 instances collected from four major recruitment platforms over the period 2014-2025. Based on these efforts, we release Chinese-SkillSpan, the first Chinese JobSkillNER dataset aligned with the ESCO occupational skill standard across four dimensions: knowledge, skill, transversal competence, and language competence (LSKT). Experimental results show that the dataset supports effective model training and evaluation, indicating that Chinese-SkillSpan helps fill a major gap in Chinese JobSkillNER resources and provides a useful benchmark for intelligent recruitment research. Code and data are available at https://sites.google.com/view/cn-skillspan-resources .

Meizhu Liu, Matthew Rowe, Amit Agarwal, Michael Avendi, Yassi Abbasi, Hitesh Laxmichand Patel, Paul Li, Kyu J. Han, Tao Sheng, Sujith Ravi, Dan Roth

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.

[8] Evaluating Temporal Consistency in Multi-Turn Language Models

Yash Kumar Atri, Steven L. Johnson, Tom Hartvigsen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Language models are increasingly deployed in interactive settings where users reason about facts over time rather than in isolation. In such scenarios, correct behavior requires models to maintain and update implicit temporal assumptions established earlier in a conversation. We study this challenge through the lens of temporal scope stability: the ability to preserve, override, or transfer time-scoped factual context across dialogue turns. We introduce ChronoScope, a large-scale diagnostic benchmark designed to isolate temporal scope behavior in controlled multi-turn interactions, comprising over one million deterministically generated question chains grounded in Wikidata. ChronoScope evaluates whether models can correctly retain inferred temporal scope when follow-up questions omit explicit time references, spanning implicit carryover, explicit scope switching, cross-entity transfer, and longer temporal trajectories. Through extensive evaluation of state-of-the-art language models, we find that temporal scope stability is frequently violated in controlled multi-turn settings, with models often drifting toward present-day assumptions despite correct underlying knowledge. These failures intensify with interaction length and persist even under oracle context conditions, revealing a gap between single-turn factual accuracy and coherent temporal reasoning under sequential interaction. We make our dataset and evaluation suite publicly available at https://github.com/yashkumaratri/ChronoScope

[9] DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

Youze Zheng, Jianyou Wang, Yuhan Chen, Matthew Feng, Longtian Bao, Hanyuan Zhang, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Umber Dube, Ramamohan Paturi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.

[10] Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort

Baris Karacan, Barbara Di Eugenio, Patrick Thornton, Joanna Tess, Subhash Kumar Kolar

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Clinical framing – the linguistic manner in which clinical information is presented – can influence patient understanding and decision-making, with important implications for healthcare outcomes. Obstetrics is a high-stakes domain in which physicians counsel patients on delivery mode choices such as vaginal birth after cesarean (VBAC) and repeat cesarean section (RCS), yet counseling language remains underexplored in large-scale clinical text analysis. In this work, we analyze physician counseling language in 2,024 obstetric history and physical narratives for a rigorously defined cohort of patients for whom both VBAC and RCS were clinically viable options. To control for confounding due to medical contraindications, we first construct a VBAC-eligible cohort using structured clinical data supplemented by a large language model (LLM)-based extraction pipeline constrained to grounded, verbatim evidence from free-text narratives. We then apply a zero-shot LLM framework to categorize counseling segments into predefined framing categories capturing how physicians linguistically present delivery options. Our analysis reveals a significant difference in counseling framing distributions between VBAC and RCS notes; risk-focused language accounts for a substantially larger share of counseling segments in RCS documentation than in VBAC, with category-level differences confirmed by statistical testing, highlighting the value of controlled LLM-based framing analysis in obstetric care.

[11] ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents

Yating Wu, Yuhao Zhang, Sayan Ghosh, Sourya Basu, Anoop Deoras, Jun Huan, Gaurav Gupta

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent’s interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.

[12] Mixture of Heterogeneous Grouped Experts for Language Modeling

Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes,creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE) which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.

[13] Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings

Nilanjana Das, Manas Gaur

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, the features in the layers [16-25] were relatively more vulnerable to steering. All three methods confirmed that mid to later layer feature subgroups are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups of mid to later layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.

Qiyuan Jin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Memes convey meaning through the interaction of visual and textual signals, often combining humor, irony, and offense in subtle ways. Detecting harmful or sensitive content in memes requires accurate modeling of these multimodal cues. Existing CLIP-based approaches rely on static fusion, which struggles to capture fine grained dependencies between modalities. We propose DARC-CLIP, a CLIP-based framework for adaptive multimodal fusion with a hierarchical refinement stack. DARC-CLIP introduces Adaptive Cross-Attention Refiners to for bidirectional information alignment and Dynamic Feature Adapters for task-sensitive signal adaptation. We evaluate DARC-CLIP on the PrideMM benchmark, which includes hate, target, stance, and humor classification, and further test generalization on the CrisisHateMM dataset. DARC-CLIP achieves highly competitive classification accuracy across tasks, with significant gains of +4.18 AUROC and +6.84 F1 in hate detection over the strongest baseline. Ablation studies confirm that ACAR and DFA are the main contributors to these gains. These results show that adaptive cross-signal refinement is an effective strategy for multimodal content analysis in socially sensitive classification.

[15] Measuring Temporal Linguistic Emergence in Diffusion Language Models

Harry Lu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion language models expose an explicit denoising trajectory, making it possible to ask when different kinds of information become measurable during generation. We study three independent 32-step runs of LLaDA-8B-Base on masked WikiText-103 text, each with 1{,}000 probe-training sequences and 200 held-out evaluation sequences. From saved trajectories, we derive four temporal measurements: token commitment; linear recoverability of part-of-speech (POS), coarse semantic category, and token identity; confidence and entropy dynamics; and sensitivity under mid-trajectory re-masking. Across seeds, the same ordering recurs: content categories stabilize earlier than function-heavy categories, POS and coarse semantic labels remain substantially more linearly recoverable than exact lexical identity under our probe setup, uncertainty remains higher for tokens that ultimately resolve incorrectly even though late confidence becomes less calibrated, and perturbation sensitivity peaks in the middle of the trajectory. A direct/collateral decomposition shows that this peak is overwhelmingly local to the perturbed positions themselves. In this LLaDA+WikiText setting, denoising time is therefore a useful analysis axis: under our measurements, coarse labels are recovered earlier and more robustly than lexical identity, trajectory-level uncertainty tracks eventual correctness, and mid-trajectory states are the most intervention-sensitive.

[16] Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt

Zhenzhen Huang, Chaoning Zhang, Fachrina Dewi Puspitasari, Jiaquan Zhang, Yitian Zhou, Shuxu Chen, Yang Yang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) are increasingly utilized in various complex reasoning tasks due to their excellent instruction following capability. However, the model’s performance is highly dependent on the open-ended characteristics of the users’ input prompt. Natural prompts often do not follow proper syntactic rules, which creates ambiguous queries that yield multiple interpretations. Such ambiguous prompts confuse the model in choosing the correct reasoning paths to answer questions. Prior works address this challenge by applying query editing during the LLM inference process without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism via explicit prompt disambiguation. Particularly, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method can produce a more focused attention distribution to the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only $0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method without disturbing the internal mechanism of LLM inference.

[17] Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

Bishwamittra Ghosh, Soumi Das, Till Speicher, Qinyuan Wu, Mohammad Aflah Khan, Deepak Garg, Krishna P. Gummadi, Evimaria Terzi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.

[18] Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.

[19] From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors

Yitian Zhou, Chaoning Zhang, Jiaquan Zhang, Zhenzhen Huang, Jinyu Guo, Sung-Ho Bae, Lik-Hang Lee, Caiyan Qin, Yang Yang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Long-context large language models remain computationally expensive to run and often fail to reliably process very long inputs, which makes context compression an important component of many systems. Existing compression approaches typically rely on trained compressors, dense retrieval-style selection, or heuristic trimming, and they often struggle to jointly preserve task relevance, topic coverage, and cross-sentence coherence under a strict token budget. To address this, we propose a training-free and model-agnostic compression framework that selects a compact set of sentences guided by structural graph priors. Our method constructs a sparse hybrid sentence graph that combines mutual k-NN semantic edges with short-range sequential edges, extracts a topic skeleton via clustering, and ranks sentences using an interpretable score that integrates task relevance, cluster representativeness, bridge centrality, and a cycle coverage cue. A budgeted greedy selection with redundancy suppression then produces a readable compressed context in original order. Experimental results on four datasets show that our approach is competitive with strong extractive and abstractive baselines, demonstrating larger gains on long-document benchmarks.

[20] Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

Rikuto Kotoge, Yuichi Sasaki

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and utterance-level formulation prevents fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO that eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. TKTO improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.

[21] EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

Minhyeong Yu, Wonduk Seo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with Reinforcement Learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, reasoning–preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality–cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.

[22] Au-M-ol: A Unified Model for Medical Audio and Language Understanding

Meizhu Liu, Nistha Mitra, Paul Li, Amine Abdaoui, Adam Ledyard, Tao Sheng

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.

[23] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which directly process spoken language and enable speech-to-text translation (ST) and other downstream tasks, bypassing traditional transcription-based pipelines. Whether this integration improves ST quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable solution overall, but most recent SpeechLLMs can match or even outperform cascades in various settings while SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

[24] Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

Zhisong Qiu, Shuofei Qiao, Kewei Xu, Yuqi Zhu, Lun Du, Ningyu Zhang, Huajun Chen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present a empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.

[25] Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Bhaskar Singh, Shobhit Banga, Pranav Sharma

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe – large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

[26] $\mathcal{S}^2$IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction

Bingfeng Chen, Chenjie Qiu, Yifeng Xie, Boyan Xu, Ruichu Cai, Zhifeng Hao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi-step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax-guided Extraction and 2) Local Syntax-guided Classification, integrating both global and local syntactic structure information. Finally, Fine-grained Structural Tuning enhances the model’s understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state-of-the-art performance across multiple datasets. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/S2IT.

[27] Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

Xinzhu Chen, Wei He, Huichuan Fan, Wenzhe Niu, Zhongxiang Sun, Xuanru Wang, Jiuchong Gao, Jinghua Hao, Renqing He, Weijie Yu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose \textbf{S}pan-level \textbf{H}idden state \textbf{E}nabled \textbf{A}dvantage \textbf{R}eweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.

[28] Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

Yangyang Zhao, Linfan Dai, Li Cai, Bowen Xing, Libo Qin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while Reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.

[29] Evaluating Large Language Models on Computer Science University Exams in Data Structures

Edan Gabay, Yael Maoz, Jonathan Stahl, Naama Maoz, Abdo Amer, Orr Eilat, Hanoch Levy, Michal Kleinbort, Amir Rubinstein, Adi Haviv

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs’ abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI’s GPT 4o and Anthropic’s Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.

[30] When Chain-of-Thought Fails, the Solution Hides in the Hidden States

Houman Mehrafarin, Amit Parekh, Ioannis Konstas

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Whether intermediate reasoning is computationally useful or merely explanatory depends on whether chain-of-thought (CoT) tokens contain task-relevant information. We present a mechanistic causal analysis of CoT on GSM8K using activation patching: transferring token-level hidden states from a CoT generation to a direct-answer run for the same question, then measuring the effect on final-answer accuracy. Across models, generating after patching yields substantially higher accuracy than both direct-answer prompting and the original CoT trace, revealing that individual CoT tokens can encode sufficient information to recover the correct answer, even when the original trace is incorrect. This task-relevant information is more prevalent in correct than incorrect CoT runs and is unevenly distributed across tokens, concentrating in mid-to-late layers and appearing earlier in the reasoning trace. Moreover, patching language tokens such as verbs and entities carry task-solving information that steers generation toward correct reasoning, whereas mathematical tokens encode answer-proximal content that rarely succeeds. Patched outputs are often shorter and yet exceed the accuracy of a full CoT trace, suggesting complete reasoning chains are not always necessary. Together, these findings demonstrate that CoT encodes recoverable, token-level problem-solving information, offering new insight into how reasoning is represented and where it breaks down.

[31] VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

Yurui Xiang, Xingyi Mao, Rui Sheng, Zixin Chen, Zelin Zang, Yuyang Wu, Haipeng Zeng, Huamin Qu, Yushi Sun, Yanna Lin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.

[32] Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

Arthur Amalvy, Vincent Labatut, Xavier Bost, Hen-Hsen Huang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator’s version. We publicly release novelshare, a Python implementation of our method.

[33] Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition

Rujing Yao, Yufei Shi, Yang Wu, Ang Li, Zhuoren Jiang, XiaoFeng Wang, Haixu Tang, Xiaozhong Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Cloud-hosted Large Language Models (LLMs) offer unmatched reasoning capabilities and dynamic knowledge, yet submitting raw queries to these external services risks exposing sensitive user intent. Conversely, relying exclusively on trusted local models preserves privacy but often compromises answer quality due to limited parameter scale and knowledge. To resolve this dilemma, we propose Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework that formulates the trade-off between knowledge utility and privacy as a strategic game. GTKA consists of three components: (i) a privacy-aware sub-query generator that decomposes sensitive intent into generalized, low-risk fragments; (ii) an adversarial reconstruction attacker that attempts to infer the original query from these fragments, providing adaptive leakage signals; and (iii) a trusted local integrator that synthesizes external responses within a secure boundary. By training the generator and attacker in an alternating adversarial manner, GTKA optimizes the sub-query generation policy to maximize knowledge acquisition accuracy while minimizing the reconstructability of the original sensitive intent. To validate our approach, we construct two sensitive-domain benchmarks in the biomedical and legal fields. Extensive experiments demonstrate that GTKA significantly reduces intent leakage compared to state-of-the-art baselines while maintaining high-fidelity answer quality.

[34] Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLMs decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.

[35] Scheming Ability in LLM-to-LLM Strategic Interactions

Thao Pham

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM-to-LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain-of-thought reasoning. When prompted, most models, especially Gemini-2.5-pro and Claude-3.7-Sonnet, achieved near-perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95-100% rates. These findings highlight the need for robust evaluations using high-stakes game-theoretic scenarios in multi-agent settings.

[36] AI Safety Training Can be Clinically Harmful

Suhas BN, Andrew M. Sherrill, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model’s task completeness dropped from 92% to 71% while the frontier model’s safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.

[37] Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

Zhengqing Yuan, Yiyang Li, Weixiang Sun, Zheyuan Zhang, Kaiwen Shi, Keerthiram Murugesan, Yanfang Ye

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Food insecurity remains a persistent public health emergency in the United States, tightly interwoven with chronic disease, mental illness, and opioid misuse. Yet despite the existence of thousands of food banks and pantries, access remains fragmented: 1) current retrieval systems depend on static directories or generic search engines, which provide incomplete and geographically irrelevant results; 2) LLM-based chatbots offer only vague nutritional suggestions and fail to adapt to real-world constraints such as time, mobility, and transportation; and 3) existing food recommendation systems optimize for culinary diversity but overlook survival-critical needs of food-insecure populations, including immediate proximity, verified availability, and contextual barriers. These limitations risk leaving the most vulnerable individuals, those experiencing homelessness, addiction, or digital illiteracy, unable to access urgently needed resources. To address this, we introduce Food4All, the first multi-agent framework explicitly designed for real-time, context-aware free food retrieval. Food4All unifies three innovations: 1) heterogeneous data aggregation across official databases, community platforms, and social media to provide a continuously updated pool of food resources; 2) a lightweight reinforcement learning algorithm trained on curated cases to optimize for both geographic accessibility and nutritional correctness; and 3) an online feedback loop that dynamically adapts retrieval policies to evolving user needs. By bridging information acquisition, semantic analysis, and decision support, Food4All delivers nutritionally annotated and guidance at the point of need. This framework establishes an urgent step toward scalable, equitable, and intelligent systems that directly support populations facing food insecurity and its compounding health risks.

[38] A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection

Khalid Hasan, Jamil Saquer

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The growing availability of online support groups has opened up new windows to study mental health through natural language processing (NLP). However, it is hindered by a lack of high-quality, well-validated datasets. Existing studies have a tendency to build task-specific corpora without collecting them into widely available resources, and this makes reproducibility as well as cross-task comparison challenging. In this paper, we present a uniform benchmark set of four Reddit-based datasets for disjoint but complementary tasks: (i) detection of suicidal ideation, (ii) binary general mental disorder detection, (iii) bipolar disorder detection, and (iv) multi-class mental disorder classification. All datasets were established upon diligent linguistic inspection, well-defined annotation guidelines, and human-judgmental verification. Inter-annotator agreement metrics always exceeded the baseline agreement score of 0.8, ensuring the labels’ trustworthiness. Previous work’s evidence of performance on both transformer and contextualized recurrent models demonstrates that these models receive excellent performances on tasks (F1 ~ 93-99%), further validating the usefulness of the datasets. By combining these resources, we establish a unifying foundation for reproducible mental health NLP studies with the ability to carry out cross-task benchmarking, multi-task learning, and fair model comparison. The presented benchmark suite provides the research community with an easy-to-access and varied resource for advancing computational approaches toward mental health research.

[39] JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

Rohith Reddy Bellibatlu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near JSS about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.

[40] Your Students Don’t Use LLMs Like You Wish They Did

Sebastian Kobler, Matthew Clemson, Angela Sun, Jonathan K. Kummerfeld

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Educational NLP systems are typically evaluated using engagement metrics and satisfaction surveys, which are at best a proxy for meeting pedagogical goals. We introduce six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. We validate our metrics through analysis of 12,650 messages across 500 conversations from four courses. Using our metrics, we identify a fundamental misalignment: educators design conversational tutors for sustained learning dialogue, but students mainly use them for answer-extraction. Deployment context is the strongest predictor of usage patterns, outweighing student preference or system design: when AI tools are optional, usage concentrates around deadlines; when integrated into course structure, students ask for solutions to verbatim assignment questions. Whole-dialogue evaluation misses these turn-by-turn patterns. Our metrics will enable researchers building educational dialogue systems to measure whether they are achieving their pedagogical goals.

Vijay Yadav

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Early detection of mental health conditions, particularly stress and depression, from social media text remains a challenging open problem in computational psychiatry and natural language processing. Automated systems must contend with figurative language, implicit emotional expression, and the high noise inherent in user-generated content. Existing approaches either leverage external commonsense knowledge to model mental states explicitly, or apply self-augmentation and contrastive training to improve generalization, but seldom do both in a principled, unified framework. We propose K-SENSE (Knowledge-guided Self-augmented Encoder for Neuro-Semantic Evaluation of Mental Health), a framework that jointly exploits external psychological reasoning and internal representation robustness. K-SENSE adopts a three-stage encoding pipeline: (1) inferential commonsense knowledge is extracted from the COMET model across five mental state dimensions; (2) a semantic anchor is constructed by combining hidden representations from two parallel encoding streams, projected into a shared space before fusion; and (3) a supervised contrastive learning objective aligns same-class representations while encouraging the attention mechanism to suppress irrelevant knowledge noise. We evaluate K-SENSE on Dreaddit (stress detection) and Depression_Mixed (depression detection), achieving mean F1-scores of 86.1 (0.6%) and 94.3 (0.8%), respectively, over five independent runs. These represent improvements of approximately 2.6 and 1.5 percentage points over the strongest prior baselines. Ablation experiments confirm the contribution of each architectural component, including the temporal knowledge integration strategy and the choice to keep the knowledge encoder frozen during fine-tuning.

[42] MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings

Yiqun Zhang, Hao Li, Zihan Wang, Shi Feng, Xiaocui Yang, Daling Wang, Bo Zhang, Lei Bai, Shuyue Hu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history-model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance-cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity’s Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter

[43] Pref-CTRL: Preference Driven LLM Alignment using Representation Editing

Imranul Ashrafi, Inigo Jauregi Unanue, Massimo Piccardi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE-Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM’s hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach has outperformed RE-Control on two benchmark datasets and showed greater generalization on out-of-domain datasets. Our source code is available at https://github.com/UTS-nlPUG/pref-ctrl.

[44] RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.

[45] LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation

Fanjin Meng, Jingtao Ding, Nian Li, Yizhou Sun, Yong Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Human daily behavior unfolds as complex sequences shaped by intentions, preferences, and context. Effectively modeling these behaviors is crucial for intelligent systems such as personal assistants and recommendation engines. While recent advances in deep learning and behavior pre-training have improved behavior prediction, key challenges remain–particularly in handling long-tail behaviors, enhancing interpretability, and supporting multiple tasks within a unified framework. Large language models (LLMs) offer a promising direction due to their semantic richness, strong interpretability, and generative capabilities. However, the structural and modal differences between behavioral data and natural language limit the direct applicability of LLMs. To address this gap, we propose Behavior Understanding Alignment (BUA), a novel framework that integrates LLMs into human behavior modeling through a structured curriculum learning process. BUA employs sequence embeddings from pretrained behavior models as alignment anchors and guides the LLM through a three-stage curriculum, while a multi-round dialogue setting introduces prediction and generation capabilities. Experiments on two real-world datasets demonstrate that BUA significantly outperforms existing methods in both tasks, highlighting its effectiveness and flexibility in applying LLMs to complex human behavior modeling.

[46] ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Financial institutions must track over 60,000 regulatory events annually, overwhelming manual compliance teams; the industry has paid over USD 300 billion in fines and settlements since the 2008 financial crisis. We present ComplianceNLP, an end-to-end system that automatically monitors regulatory changes, extracts structured obligations, and identifies compliance gaps against institutional policies. The system integrates three components: (1) a knowledge-graph-augmented RAG pipeline grounding generations in a regulatory knowledge graph of 12,847 provisions across SEC, MiFID II, and Basel III; (2) multi-task obligation extraction combining NER, deontic classification, and cross-reference resolution over a shared LEGAL-BERT encoder; and (3) compliance gap analysis that maps obligations to internal policies with severity-aware scoring. On our benchmark, ComplianceNLP achieves 87.7 F1 on gap detection, outperforming GPT-4o+RAG by +3.5 F1, with 94.2% grounding accuracy ($r=0.83$ vs. human judgments) and 83.4 F1 under realistic end-to-end error propagation. Ablations show that knowledge-graph re-ranking contributes the largest marginal gain (+4.6 F1), confirming that structural regulatory knowledge is critical for cross-reference-heavy tasks. Domain-specific knowledge distillation (70B $\to$ 8B) combined with Medusa speculative decoding yields $2.8\times$ inference speedup; regulatory text’s low entropy ($H=2.31$ bits vs. $3.87$ general text) produces 91.3% draft-token acceptance rates. In four months of parallel-run deployment processing 9,847 updates at a financial institution, the system achieved 96.0% estimated recall and 90.7% precision, with a $3.1\times$ sustained analyst efficiency gain. We report deployment lessons on trust calibration, GRC integration, and distributional shift monitoring for regulated-domain NLP.

[47] XITE: Cross-lingual Interpolation for Transfer using Embeddings

Barah Fazili, Preethi Jyothi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Facilitating cross-lingual transfer in multilingual language models remains a critical challenge. Towards this goal, we propose an embedding-based data augmentation technique called XITE. We start with unlabeled text from a low-resource target language, identify an English counterpart in a task-specific training corpus using embedding-based similarities and adopt its label. Next, we perform a simple interpolation of the source and target embeddings to create synthetic data for task-specific fine-tuning. Projecting the target text into a language-rich subspace using linear discriminant analysis (LDA), prior to interpolation, further boosts performance. Our cross-lingual embedding-based augmentation technique XITE yields significant improvements of up to 35.91% for sentiment analysis and up to 81.16% for natural language inference, using XLM-R, for a diverse set of target languages including Korean, Arabic, Urdu and Hindi. Apart from boosting cross-lingual transfer, adaptation using XITE also safeguards against forgetting and maintains task performance on the high-resource language.

[48] Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

Tanay Kumar, Shreya Gautam, Aman Chadha, Vinija Jain, Francesco Pierri

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.

[49] Applications of the Transformer Architecture in AI-Assisted English Reading Comprehension

Ping Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper studies interpretable and fair artificial intelligence architectures for understanding English reading. Introduced transformer-based models, integrating advanced attention mechanisms and gradient-based feature attribution. The model’s lack of interpretability, reduction of algorithmic bias, and unreliable performance in learning environments are the current issues faced in natural language teaching. A unified technical pipeline has been constructed, including adversarial bias correction methods, token-level attribution analysis, and multi-head attention heatmap visualization. Experimental validation was conducted using a large-scale labeled English reading comprehension dataset, and the data partitioning scheme and parameter optimization procedures have been determined. The method significantly outperforms the state-of-the-art models for this task in terms of accuracy and macro-average F1 score; in some aspects, it even surpasses or closely matches the results of human evaluations. In multi-week user experiments, the explainable transformer improved teachers’ trust and operability in feedback-based assessments within the scoring system. The proposed method aims to ensure high prediction accuracy and fairness for different learners. This indicates that it is a real-world educational application based on artificial intelligence with a focus on interpretation. Improve the user experience in AI-assisted reading comprehension systems, counteract biases, and enhance the details explained by transformers.

[50] GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

Tao Feng, Haozhen Zhang, Zijie Lei, Peixuan Han, Jiaxuan You

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.

[51] Neural Grammatical Error Correction for Romanian

Teodor-Mihai Cotet, Stefan Ruseti, Mihai Dascalu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Resources for Grammatical Error Correction (GEC) in non-English languages are scarce, while available spellcheckers in these languages are mostly limited to simple corrections and rules. In this paper we introduce a first GEC corpus for Romanian consisting of 10k pairs of sentences. In addition, the German version of ERRANT (ERRor ANnotation Toolkit) scorer was adapted for Romanian to analyze this corpus and extract edits needed for evaluation. Multiple neural models were experimented, together with pretraining strategies, which proved effective for GEC in low-resource settings. Our baseline consists of a small Transformer model trained only on the GEC dataset (F0.5 of 44.38), whereas the best performing model is produced by pretraining a larger Transformer model on artificially generated data, followed by finetuning on the actual corpus (F0.5 of 53.76). The proposed method for generating additional training examples is easily extensible and can be applied to any language, as it requires only a POS tagger

[52] Benchmarking Testing in Automated Theorem Proving

Jongyoon Kim, Hojae Han, Seung-won Hwang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

[53] Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge

Wentao Zhang, Qi Zhang, Mingkun Xu, Mu You, Henghua Shen, Zhongzhi He, Keyan Jin, Derek F. Wong, Tao Fang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Crop disease diagnosis from field photographs faces two recurring problems: models that score well on benchmarks frequently hallucinate species names, and when predictions are correct, the reasoning behind them is typically inaccessible to the practitioner. This paper describes Agri-CPJ (Caption-Prompt-Judge), a training-free few-shot framework in which a large vision-language model first generates a structured morphological caption, iteratively refined through multi-dimensional quality gating, before any diagnostic question is answered. Two candidate responses are then generated from complementary viewpoints, and an LLM judge selects the stronger one based on domain-specific criteria. Caption refinement is the component with the largest individual impact: ablations confirm that skipping it consistently degrades downstream accuracy across both models tested. On CDDMBench, pairing GPT-5-Nano with GPT-5-mini-generated captions yields \textbf{+22.7} pp in disease classification and \textbf{+19.5} points in QA score over no-caption baselines. Evaluated without modification on AgMMU-MCQs, GPT-5-Nano reached 77.84% and Qwen-VL-Chat reached 64.54%, placing them at or above most open-source models of comparable scale despite the format shift from open-ended to multiple-choice. The structured caption and judge rationale together constitute a readable audit trail: a practitioner who disagrees with a diagnosis can identify the specific caption observation that was incorrect. Code and data are publicly available https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis

[54] AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models

Michael Keeman

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Mechanistic interpretability research on emotion in large language models – linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction – depends on stimuli that contain the words for the emotions they test. When a probe fires on “I am furious”, it is unclear whether the model has detected anger or detected the word “furious”. The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik’s eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery – bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier – confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p < 10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.

[55] Multimodal QUD: Inquisitive Questions from Scientific Figures

Yating Wu, William Rudman, Venkata S Govindarajan, Alexandros G. Dimakis, Junyi Jessy Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Asking inquisitive questions while reading, and looking for their answers, is an important part in human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate Vision-Language Models (VLMs) capabilities, current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, conditioned on both the figure and the paper’s context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD) from being text-only to multimodal, where implicit questions are raised and resolved as discourse progresses. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high-level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.

[56] Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

Avi-ad Avraam Buskila

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.

[57] LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

Tianchun Li, Haochen Liu, Vishwa Pardeshi, Xingchen Wang, Tianci Liu, Huijun Zhao, Wei Fan, Jing Gao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Small language models (SLMs) are promising for real-world deployment due to their efficiency and low operational cost. However, their limited capacity struggles with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. Furthermore, training SLMs for such tasks demands high-quality, concise reasoning trajectories, which are prohibitively expensive to manually collect and difficult to curate via standard rejection sampling, lacking granularity beyond final verdicts. To address these challenges, we propose {LegalDrill}, a diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a capable teacher via fine-grained prompting, then a self-reflective verification is employed to adaptively select the most effective data for the SLM student. The resulting data empower SLM training through supervised fine-tuning and direct preference optimization. Extensive experiments on several legal benchmarks demonstrate that {LegalDrill} significantly bolsters the legal reasoning capabilities of representative SLMs while bypassing the need for scarce expert annotations, paving a scalable path toward practical legal reasoning systems.

[58] DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute

Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Rachel Rudinger, Eunsol Choi, Jordan Lee Boyd-Graber, Doug Downey, Aakanksha Naik

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Scientific Deep Research (DR) agents answer user queries by synthesizing research papers into multi-section reports. User feedback can improve their utility, but existing protocols only score the final report, making it hard to study and learn which intermediate actions DR agents should take to improve reports. We collect DRACULA, the first dataset with user feedback on intermediate actions for DR. Over five weeks, nineteen expert CS researchers ask queries to a DR system that proposes actions (e.g., “Add a section on datasets”). Our users select actions they prefer, then judge whether an output report applied their selections successfully, yielding 8,103 action preferences and 5,230 execution judgments. After confirming a DR agent can execute DRACULA’s actions, we study the predictability of user-preferred actions via simulation-how well LLMs predict the actions users select-a step toward learning to generate useful actions. We discover: (1) LLM judges initially struggle to predict action selections, but improve most when using a user’s full selection history, rather than self-reported or extrapolated user context signals; (2) Users’ selections for the same query differ based on unstated goals, bottlenecking simulation and motivating affordances that let users steer reports; and (3) Our simulation results inform an online intervention that generates new actions based on the user’s past interactions, which users pick most often in follow-up studies. Overall, while work extensively studies execution, DRACULA reveals a key challenge is deciding which actions to execute in the first place. We open-source DRACULA’s study design, user feedback, and simulation tasks to spur future work on action feedback for long-horizon agents.

[59] Resource-Lean Lexicon Induction for German Dialects

Robert Litschko, Barbara Plank, Diego Frassinelli

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.

[60] One Size Fits None: Heuristic Collapse in LLM Investment Advice

Jillian Ross, Andrew W. Lo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models are increasingly deployed as advisors in high-stakes domains – answering medical questions, interpreting legal documents, recommending financial products – where good advice requires integrating a user’s full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client’s full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.

[61] Reheat Nachos for Dinner? Evaluating AI Support for Cross-Cultural Communication of Neologisms

Dayeon Ki, Yu Hou, Rachel Rudinger, Hal Daumé, Marine Carpuat, Fumeng Yang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Neologisms and emerging slang are central to daily conversation, yet challenging for non-native speakers (NNS) to interpret and use appropriately in cross-cultural communication with native speakers (NS). NNS increasingly make use of Artificial Intelligence (AI) tools to learn these words. We study the utility of such tools in mediating an informal communication scenario through a human-subjects study (N=234): NNS participants learn English neologisms with AI support, write messages using the learned word to an NS friend, and judge contextual appropriateness of the neologism in two provided writing samples. Using both NS evaluator-rated communicative competence of NNS-produced writing and NNS’ contextual appropriateness judgments, we compare three AI-based support conditions: AI Definition, AI Rewrite into simpler English, AI Explanation of meaning and usage, and Non-AI Dictionary for comparison. We show that AI Explanation yields the largest gains over no support in NS-rated competence, while contextual appropriateness judgments show indifference across support. NNS participants’ self-reported perceptions tend to overestimate NS ratings, revealing a mismatch between perceived and actual competence. We further observe a significant gap between NNS- and NS-produced writing, highlighting the limitations of current AI tools and informing design for future tools.

[62] Translate or Simplify First: An Analysis of Cross-lingual Text Simplification in English and French

Ido Dahan, Omer Toledano, Roey J. Gafter, Sharon Pardo, Oren Tsur, Hila Zahavi, Elior Sulem

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Cross-Lingual Text Simplification (CLTS) aims to make content more accessible across languages by simultaneously addressing both linguistic complexity and translation. This study investigates the effectiveness of different prompting strategies for CLTS between English and French using large language models (LLMs). We examine five distinct prompting systems: a direct prompt instructing the LLM to perform both translation and simplification simultaneously, two Composition approaches that either translate-then-simplify or simplify-then-translate within a single prompt, and two decomposition approaches that perform the same operations in separate, consecutive prompts. These systems are evaluated across a diverse set of five corpora of different genres (Wikipedia and medical texts) using seven state-of-the-art LLMs. Output quality is assessed through a multi-faceted evaluation framework comprising automatic metrics, comprehensive linguistic feature analysis, and human evaluation of simplicity and meaning preservation. Our findings reveal that while direct prompting consistently achieves the highest BLEU scores, indicating meaning fidelity, Translate-then-Simplify approaches demonstrate the highest simplicity, as measured by the linguistic features.

[63] Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows

Nikita Borovkov, Elisei Rykov, Olga Tsymboi, Sergei Filimonov, Nikita Surnachev, Dmitry Bitman, Anatolii Potapov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present a deployed system that automates end-to-end customer support workflows inside an enterprise Business Process Management (BPM) platform. The approach is scalable in production and reaches selective automation within two weeks for a new process, leveraging supervision already generated at scale: structured per-case UI interaction traces and low-overhead copilot feedback, where operators either accept a suggestion or provide a correction. A staged deployment pipeline trains a next UI action policy, learns a critic from copilot feedback to calibrate abstention, and executes only high-confidence steps in the background while deferring uncertain decisions to operators and resuming from the updated UI state. This setup lets one operator supervise multiple concurrent sessions and be interrupted only when the system is uncertain. The system operates on a schema-driven view of the BPM interface and includes monitoring and safe fallbacks for production. In production, it automated 45% of sessions and reduced average handling time by 39% without degrading support quality level.

[64] Knowledge Vector of Logical Reasoning in Large Language Models

Zixuan Wang, Yuanyuan Lei

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Logical reasoning serve as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows that each form of logical reasoning can be captured as a reasoning-specific knowledge vector in a linear representation space, yet these vectors are largely independent of each other. Motivated by cognitive science theory that these subforms of logical reasoning interact closely in the human brain, as well as our observation that the reasoning process for one type can benefit from the reasoning chain produced by another, we further propose to refine the knowledge representations of each reasoning type in LLMs to encourage complementarity between them. To this end, we design a complementary subspace-constrained refinement framework, which introduces a complementary loss that enables each reasoning vector to leverage auxiliary knowledge from the others, and a subspace constraint loss that prevents erasure of their unique characteristics. Through steering experiments along reasoning vectors, we find that refined vectors incorporating complementary knowledge yield consistent performance gains. We also conduct a mechanism-interpretability analysis of each reasoning vector, revealing insights into the shared and specific features of different reasoning in LLMs.

[65] TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Xiaochen Zheng, Zhiwen Jiang, Melanie Guerard, Klas Hatje, Tatyana Doktorova

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic targets. This process is inherently iterative and expert-driven, posing challenges in scalability and reproducibility. We present TSAssistant, a multi-agent framework designed to support TSA report drafting through a modular, section-based, and human-in-the-loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence-grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain-specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re-invoke agents to revise specific sections, with the system maintaining conversational memory across iterations. TSAssistant is designed to reduce the mechanical burden of evidence synthesis and report drafting, supporting a hybrid model in which agentic AI augments evidence synthesis while toxicologists retain final decision authority.

[66] KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters

SungHo Kim, Juhyeong Park, Yeachan Kim, SangKeun Lee

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The Korean writing system, \textit{Hangeul}, has a unique character representation rigidly following the invention principles recorded in \textit{Hunminjeongeum}.\footnote{\textit{Hunminjeongeum} is a book published in 1446 that describes the principles of invention and usage of \textit{Hangeul}, devised by King Sejong \cite{Hunminjeongeum_Guide}.} However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which firstly brings the invention principles of \textit{Hangeul} to represent character. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: https://github.com/SungHo3268/KOMBO.

[67] Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity

Yao Wang, Zixu Geng, Jun Yan

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Knowledge graphs (KGs) are increasingly used to support large lan guage model (LLM) reasoning, but standard triplet-based KGs treat each relation as globally valid. In many settings, whether a relation should count as evidence depends on the context. We therefore formulate triplet validity as a triplet-specific function of context and refer to this formulation as a Quantum Knowledge Graph (QKG). We instantiate QKG in medicine using a diabetes-centered PrimeKG subgraph, whose 68,651 context-sensitive relations are further annotated with patient-group-specific constraints. We evaluate it in a reasoner–validator pipeline for medical question answering on a KG-grounded subset of MedReason containing 2,788 questions. With Haiku-4.5 as both the Reasoner and the Validator, KG-backed validation significantly improves over a no-validator baseline ($+0.61$ pp), and QKG with context matching yields the largest gain, outperforming both KG validation without context matching ($+0.79$ pp) and the no-validator baseline ($+1.40$ pp; paired McNemar, all $p<0.05$). Under a stronger validator (Qwen-3.6-Plus), the raw QKG gain over the no-validator baseline grows from $+1.40$ pp to $+5.96$ pp; the context-matching gap is non-significant ($p=0.73$) on the raw set but becomes borderline significant ($p=0.05$) after adjustment for knowledge leakage and suspicious questions, consistent with a benchmark-gold ceiling rather than a QKG limitation. Taken together, the results support the view that the value of a KG in LLM-based clinical reasoning lies not merely in storing medically related facts, but in representing whether those facts are applicable to the specific patient context. For reproducibility and further research, we release the curated QKG datasets and source code.\footnote{https://github.com/HKAI-Sci/QKG}

[68] Propagation Structure-Semantic Transfer Learning for Robust Fake News Detection

Mengyang Chen, Lingwei Wei, Han Cao, Wei Zhou, Zhou Yan, Songlin Hu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Fake news generally refers to false information that is spread deliberately to deceive people, which has detrimental social effects. Existing fake news detection methods primarily learn the semantic features from news content or integrate structural features from propagation. However, in practical scenarios, due to the semantic ambiguity of informal language and unreliable user interactive behaviors on social media, there are inherent semantic and structural noises in news content and propagation. Although some recent works consider the effect of irrelevant user interactions in a hybrid-modeling way, they still suffer from the mutual interference between structural noise and semantic noise, leading to limited performance for robust detection. To alleviate this issue, this paper proposes a novel Propagation Structure-Semantic Transfer Learning framework (PSS-TL) for robust fake news detection under a teacher-student architecture. Specifically, we design dual teacher models to learn semantics knowledge and structure knowledge from noisy news content and propagation structure independently. Besides, we design a Multi-channel Knowledge Distillation (MKD) loss to enable the student model to acquire specialized knowledge from the teacher models, thereby avoiding mutual interference. Extensive experiments on two real-world datasets validate the effectiveness and robustness of our method.

[69] Stabilizing Efficient Reasoning with Step-Level Advantage Selection

Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, many approaches are post-trained under a much shorter context window than base-model training, a factor whose effect has not been systematically isolated. We first show that short-context post-training alone, using standard GRPO without any length-aware objective, already induces substantial reasoning compression-but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step-level Advantage Selection (SAS), which operates at the reasoning-step level and assigns a zero advantage to low-confidence steps in correct rollouts and to high-confidence steps in verifier-failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy-efficiency trade-off.

[70] From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Qiliang Liang, Hansi Wang, Zhong Liang, Yang Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agent both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface. An explicit representation of skill knowledge may therefore help make these artifacts easier for machines to acquire and leverage. Drawing on Memory Organization Packets, Script Theory, and Conceptual Dependency from Schank and Abelson’s classical work on linguistic knowledge representation, we introduce what is, to our knowledge, the first structured representation for agent skill artifacts that disentangles skill-level scheduling signals, scene-level execution structure, and logic-level action and resource-use evidence: the Scheduling-Structural-Logical (SSL) representation. We instantiate SSL with an LLM-based normalizer and evaluate it on a corpus of skills in two tasks, Skill Discovery and Risk Assessment, and superiorly outperform the text-only baselines: in Skill Discovery, SSL improves MRR from 0.573 to 0.707; in Risk Assessment, it improves macro F1 from 0.744 to 0.787. These findings reveal that explicit, source-grounded structure makes agent skills easier to search and review. They also suggest that SSL is best understood as a practical step toward more inspectable, reusable, and operationally actionable skill representations for agent systems, rather than as a finished standard or an end-to-end mechanism for managing and using skills.

[71] Improving Robustness of Tabular Retrieval via Representational Stability

Kushal Raj Bhandari, Adarsh Singh, Jianxi Gao, Soham Dan, Vivek Gupta

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as $\texttt{csv}$, $\texttt{tsv}$, $\texttt{html}$, $\texttt{markdown}$, and $\texttt{ddl}$, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat serialization embedding as noisy views of a shared semantic signal and use its centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across $\texttt{MPNet}$, $\texttt{BGE-M3}$, $\texttt{ReasonIR}$, and $\texttt{SPLADE}$. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at $\href{https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval}{https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval}$.

[72] Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

Jon-Paul Cacioli

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701). The shuffled-target control showed no improvement (0.501). On MMLU, accuracy improved from 54.2% to 77.4% with the shuffled model at baseline (56.1%), supporting a target-dependent interpretation. The result is exploratory, binary rather than continuously calibrated, and observed at a single scale. It identifies two design lessons: confidence training requires label entropy, and correct targets regularise output format.

[73] PeeriScope: A Multi-Faceted Framework for Evaluating Peer Review Quality

Sajad Ebrahimi, Soroush Sadeghian, Ali Ghorbanpour, Negar Arabzadeh, Sara Salamat, Seyed Mohammad Hosseini, Hai Son Le, Mahdi Bashari, Ebrahim Bagheri

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The increasing scale and variability of peer review in scholarly venues has created an urgent need for systematic, interpretable, and extensible tools to assess review quality. We present PeeriScope, a modular platform that integrates structured features, rubric-guided large language model assessments, and supervised prediction to evaluate peer review quality along multiple dimensions. Designed for openness and integration, PeeriScope provides both a public interface and a documented API, supporting practical deployment and research extensibility. The demonstration illustrates its use for reviewer self-assessment, editorial triage, and large-scale auditing, and it enables the continued development of quality evaluation methods within scientific peer review. PeeriScope is available both as a live demo at https://app.reviewer.ly/app/peeriscope and via API services at https://github.com/Reviewerly-Inc/Peeriscope.

[74] How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

Xinran Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity ranges from 39.6 percentage points for copyright to 0 percentage points for harassment. A supplementary multi-judge experiment using three judge models shows that judge-model choice adds further variance. Our results demonstrate that judge prompt wording is a substantial, previously under-examined source of measurement variance in safety benchmarking.

[75] The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

Jisoo Yang, Jongwon Ryu, Minuk Ma, Trung X. Pham, Junyeong Kim

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference – implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Codes are available at https://github.com/JiSoo-Yang/Persona_Bridging.git

[76] BiMol-Diff: A Unified Diffusion Framework for Molecular Generation and Captioning

Aditya Hemant Shahane, Anuj Kumar Sirohi, Devansh Arora, Nitin Kumar, Prathosh A P, Sandeep Kumar

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Bridging molecular structures and natural language is essential for controllable design. Autoregressive models struggle with long-range dependencies, while standard diffusion processes apply uniform corruption across positions, which can distort structurally informative tokens. We present BiMol-Diff, a unified diffusion framework for the paired tasks of text-conditioned molecule generation and molecule captioning. Our key component is a token-aware noise schedule that assigns position-dependent corruption based on token recovery difficulty, preserving harder-to-recover substructures during the forward process. On ChEBI-20 and M3-20M, BiMol-Diff improves molecule reconstruction with a 15.4% relative gain in Exact Match and achieves strong captioning results, attaining best BLEU and BERTScore among compared baselines. These results indicate token-aware noising improves fidelity in molecular structure-language modelling.

[77] Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising

Aditya Hemant Shahane, Anuj Kumar Sirohi, Tanmoy Chakraborty, Prathosh A P, Sandeep Kumar

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Fine-tuned autoregressive models for graph-to-sequence generation (G2S) often struggle with factual grounding and edit sensitivity. To tackle these issues, we propose a non-autoregressive diffusion framework that generates text by iterative refinement conditioned on an input graph, named as Diffusion Language Model for Graphs (DLM4G). By aligning graph components (entities/relations) with their corresponding sequence tokens, DLM4G employs an adaptive noising strategy. The proposed strategy uses per-token denoising error as a signal to adaptively modulate noise on entity and relation tokens, improving preservation of graph structure and enabling localized updates under graph edits. Evaluated on three datasets, DLM4G consistently outperforms competitive G2S diffusion baselines trained on identical splits across both surface-form and embedding-based metrics. DLM4G further exceeds fine-tuned autoregressive baselines up to 12x larger (e.g., T5-Large) and is competitive with zero-shot LLM transfer baselines up to 127x larger. Relative to the strongest fine-tuned PLM baseline, DLM4G improves factual grounding (FGT@0.5) by +5.16% and edit sensitivity (ESR) by +7.9%; compared to the best diffusion baseline, it yields gains of +3.75% in FGT@0.5 and +23.6% in ESR. We additionally demonstrate applicability beyond textual graphs through experiments on molecule captioning, indicating the method’s generality for scientific G2S generation.

[78] IRIS: Interleaved Reinforcement with Incremental Staged Curriculum for Cross-Lingual Mathematical Reasoning

Navya Gupta, Rishitej Reddy Vyalla, Avinash Anand, Chhavi Kirtani, Erik Cambria, Zhengchen Zhang, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Curriculum learning helps language models tackle complex reasoning by gradually increasing task difficulty. However, it often fails to generate consistent step-by-step reasoning, especially in multilingual and low-resource settings where cross-lingual transfer from English to Indian languages remains limited. We propose IRIS: Interleaved Reinforcement with Incremental Staged Curriculum, a two-axis framework that combines Supervised Fine-Tuning on progressively harder problems (vertical axis) with Reverse Curriculum Reinforcement Learning to reduce reliance on step-by-step guidance (horizontal axis). We design a composite reward combining correctness, step-wise alignment, continuity, and numeric incentives, optimized via Group Relative Policy Optimization (GRPO). We release CL-Math, a dataset of 29k problems with step-level annotations in English, Hindi, and Marathi. Across standard benchmarks and curated multilingual test sets, IRIS consistently improves performance, with strong results on math reasoning tasks and substantial gains in low-resource and bilingual settings, alongside modest improvements in high-resource languages.

[79] Psychologically-Grounded Graph Modeling for Interpretable Depression Detection

Rishitej Reddy Vyalla, Kritarth Prasad, Avinash Anand, Erik Cambria, Shaoxiong Ji, Faten S. Alamri, Zhengkui Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Automatic depression detection from conversational interactions holds significant promise for scalable screening but remains hindered by severe data scarcity and a lack of clinical interpretability. Existing approaches typically rely on black-box deep learning architectures that struggle to model the subtle, temporal evolution of depressive symptoms or account for participant-specific heterogeneity. In this work, we propose PsyGAT (Psychological Graph Attention Network), a psychologically grounded framework that models conversational sessions as dynamic temporal graphs. We introduce Psychological Expression Units (PEUs) to explicitly encode utterance-level clinical evidence, structuring the session graph to capture transitions in psychological states rather than mere semantic dependencies. To address the critical class imbalance in depression datasets, we employ clinically approved persona-based data augmentation, enable robust model learning. Additionally, we integrate session-level personality context directly into the graph structure to disentangle trait-based behavior from acute depressive symptoms. PsyGAT achieves state-of-the-art performance, surpassing both strong graph-based baselines and closed-source LLMs like GPT-5, achieving 89.99 and 71.37 Macro F1 scores in DAIC-WoZ and E-DAIC, respectively. We further introduce Causal-PsyGAT, an interpretability module that identifies symptom triggers. Experiments show a 20% improvement in MRR for identifying causal indicators, effectively bridging the gap between depression monitoring and clinical explainability. The full augmented dataset is publicly available at https://doi.org/10.6084/m9.figshare.31801921.

[80] AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models

Yimin Deng, Yejing Wang, Zhenxi Lin, Zichuan Fu, Guoshuai Zhao, Derong Xu, Yefeng Zheng, Xiangyu Zhao, Xian Wu, Li Zhu, Xueming Qian

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.

[81] MemeScouts@LT-EDI 2026: Asking the Right Questions – Prompted Weak Supervision for Meme Hate Speech Detection

Ivo Bueno, Lea Hirlimann, Enkelejda Kasneci

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.

[82] MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning

Yimin Deng, Zhenxi Lin, Yejing Wang, Guoshuai Zhao, Pengyue Jia, Zichuan Fu, Derong Xu, Yefeng Zheng, Xiangyu Zhao, Li Zhu, Xian Wu, Xueming Qian

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted case, and clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction.~Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.

[83] Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk

Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Frontier image generation has moved from artistic synthesis toward synthetic visual evidence. Systems such as GPT Image 2, Nano Banana Pro, Nano Banana 2, Grok Imagine, Qwen Image 2.0 Pro, and Seedream 5.0 Lite combine photorealistic rendering, readable typography, reference consistency, editing control, and in several cases reasoning or search-grounded image construction. These capabilities create large benefits for design, education, accessibility, and communication, yet they also weaken one of society’s most common trust shortcuts: the belief that a plausible picture is a reliable record. This paper provides a source-grounded technical and policy analysis of synthetic visual risk. We first summarize the public capabilities of recent image models, then analyze public incidents involving fake crisis images, celebrity and public-figure imagery, medical scans, forged-looking documents, synthetic screenshots, phishing assets, and market-moving rumors. We introduce a capability-weighted risk framework that links model affordances to real-world harm in finance, medicine, news, law, emergency response, identity verification, and civic discourse. Our findings show that risk is driven less by photorealism alone than by the convergence of realism, legible text, identity persistence, fast iteration, and distribution context. We argue for layered control: model-side restrictions, cryptographic provenance, visible labeling, platform friction, sector-grade verification, and incident response. The paper closes with practical recommendations for model providers, platforms, newsrooms, financial institutions, healthcare systems, legal organizations, regulators, and ordinary users.

[84] Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer

Shun Shao, Binxu Wang, Shay B. Cohen, Anna Korhonen, Yonatan Belinkov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We introduce \textbf{Differentiable Faithfulness Alignment (DFA)}, a framework that transfers circuit information from a smaller source model to a larger target model through a learned differentiable alignment. DFA projects source-model node importance scores into the target model and trains this mapping with a soft faithfulness objective, avoiding full circuit discovery on the target model. We evaluate DFA on Llama-3 and Qwen-2.5 across six tasks spanning factual retrieval, multiple-choice reasoning, and arithmetic. The strongest results occur on Llama-3 $1$B$\rightarrow3$B, where aligned circuits are often competitive with direct node attribution and zero-shot transfer remains effective. Recovery weakens for larger source–target gaps and is substantially lower on Qwen-2.5, suggesting that transfer becomes harder as architectural and scaling differences increase. Overall, DFA consistently outperforms simple baselines and, in some settings, recovers target-model circuits with faithfulness comparable to or stronger than direct attribution. These results suggest that smaller models can provide useful mechanistic priors for larger ones, while highlighting both the promise and the limits of node-level cross-model circuit alignment.\footnote{Code is available at https://github.com/jasonshaoshun/dfa-circuits.

[85] DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language model (LLM) agents that follow the sequential “reason-then-act” paradigm have achieved superior performance in many complex tasks.However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)

[86] Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

Daria Berdyugina, Anaëlle Cohen, Yohann Rioual

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Standard Retrieval-Augmented Generation (RAG) chunking methods often create excessive redundancy, increasing storage costs and slowing retrieval. This study explores chunk filtering strategies, such as semantic, topic-based, and named-entity-based methods in order to reduce the indexed corpus while preserving retrieval quality. Experiments are conducted on multiple corpora. Retrieval performance is evaluated using a token-based framework based on precision, recall, and intersection-over-union metrics. Results indicate that entity-based filtering can reduce vector index size by approximately 25% to 36% while maintaining high retrieval quality close to the baseline. These findings suggest that redundancy introduced during chunking can be effectively reduced through lightweight filtering, improving the efficiency of retrieval-oriented components in RAG pipelines.

[87] OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents

Zheng Wu, Yi Hua, Zhaoyuan Huang, Chenhao Xue, Yijie Lu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Gongshen Liu, Xinghao Jiang, Zhuosheng Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.

[88] Culture-Aware Machine Translation in Large Language Models: Benchmarking and Investigation

Zekun Yuan, Yangfan Ye, Xiaocheng Feng, Baohang Li, Qichen Hong, Yunfei Lu, Dandan Tu, Bing Qin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models’ recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.

[89] SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution

Sichun Luo, Yi Huang, Haochen Luo, Fengyuan Liu, Guanzhi Deng, Lei Li, Qinghua Yao, Zefa Hu, Junlan Feng, Qi Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: LLM-guided evolutionary search has emerged as a promising paradigm for automated algorithm discovery, yet most systems track search progress primarily through executable programs and scalar fitness. Even when natural-language reflection is used, it is often used locally in mutation prompts or stored without an explicit population-level organization of strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce \model, a modular strategy-space layer that elevates natural-language strategy descriptions from transient prompt context to first-class population-level evolutionary state in LLM-driven program search. \model augments each candidate program with an explicit natural language strategy description and uses this representation in three ways: Strategy Articulation turns mutation into a diagnose-direct-implement process; Stratified Experience Retrieval organizes the archive into strategy clusters and selects inspirations by behavioral complementarity; and Strategic Landscape Navigation periodically summarizes effective, saturated, and underexplored strategy families to guide future mutations. Across mathematical algorithm discovery, systems optimization, and agent-scaffold benchmarks, \model improves the underlying evolutionary backbones in most settings, with particularly large gains (21% relative improvement) on open-ended system optimization tasks. These results suggest that persistent strategy representations provide a practical mechanism for improving the robustness and efficiency of LLM-guided evolutionary search, suggesting a path toward compound AI systems that accumulate algorithmic knowledge over time.

[90] MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

Phung Gia Huy, Hai An Vu, Minh-Phuc Truong, Thang Duc Tran, Linh Ngo Van, Thanh Hong Nguyen, Trung Le

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3, Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed under extreme low-dimensional.

[91] Learning Evidence of Depression Symptoms via Prompt Induction

Eliseo Bao, Anxo Perez, David Otero, Javier Parapar

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Depression places substantial pressure on mental health services, and many people describe their experiences outside clinical settings in high-volume user-generated text (e.g., online forums and social media). Automatically identifying clinical symptom evidence in such text can therefore complement limited clinical capacity and scale to large populations. We address this need through sentence-level classification of 21 depression symptoms from the BDI-II questionnaire, using BDI-Sen, a dataset annotated for symptom relevance. This task is fine-grained and highly imbalanced, and we find that common LLM approaches (zero-shot, in-context learning, and fine-tuning) struggle to apply consistent relevance criteria for most symptoms. We propose Symptom Induction (SI), a novel approach which compresses labeled examples into short, interpretable guidelines that specify what counts as evidence for each symptom and uses these guidelines to condition classification. Across four LLM families and eight models, SI achieves the best overall weighted F1 on BDI-Sen, with especially large gains for infrequent symptoms. Cross-domain evaluation on an external dataset further shows that induced guidelines generalize across other diseases shared symptomatology (bipolar and eating disorders).

[92] Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.

[93] Scaling Properties of Continuous Diffusion Spoken Language Models

Jason Ramapuram, Eeshan Gunesh Dhekane, Amitis Shidani, Dan Busbridge, Bogdan Mazoure, Zijin Gu, Russ Webb, Tatiana Likhomanenko, Navdeep Jaitly

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Speech-only spoken language models (SLMs) lag behind text and text-speech models in performance, with recent discrete autoregressive (AR) SLMs indicating significant computational and data demands to match text models. Since discretizing continuous speech for AR creates bottlenecks, we explore whether continuous diffusion (CD) SLM is more viable. To quantify the SLMs linguistic quality, we introduce the phoneme Jensen-Shannon divergence (pJSD) metric. Our analysis reveals CD SLMs, mirroring AR behavior, exhibit scaling laws for validation loss and pJSD, and show optimal token-to-parameter ratios decreasing as compute scales. However, for the latter, loss becomes insensitive to choice of data and model sizes, showing potential for fast inference. Scaling CD SLMs to 16B parameters with tens of millions of hours of conversational data enables generation of emotive, prosodic, multi-speaker, multilingual speech, though achieving long-form coherence remains a significant challenge.

[94] A Multi-Dimensional Audit of Politically Aligned Large Language Models

Lisa Korver, Mohamed Mostagir, Sherief Reda

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas’ Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.

[95] Kwai Summary Attention Technical Report

Chenglong Chu, Guorui Zhou, Guowang Zhang, Han Li, Hao Peng, Hongtao Cheng, Jian Liang, Jiangxia Cao, Kun Gai, Lingzhi Zhou, Lu Ren, Qi Zhang, Ruiming Tang, Ruitao Wang, Xinchen Luo, Yi Su, Zhiyuan Liang, Ziqi Wang, Boyang Ding, Chengru Song, Dunju Zang, Hui Wang, Jiao Ou, Jiaxin Deng, Jijun Shi, Jinghao Zhang, Junmin Chen, Lejian Ren, Minxuan Lv, Qianqian Wang, Qigen Hu, Shiyao Wang, Siyang Mao, Tao Wang, Xingmei Wang, Zhixin Ling, Ziming Li, Zixing Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Long-context ability, has become one of the most important iteration direction of next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence and recommendation system. However, the standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, leading the training and inference costs of extremely long sequences deteriorate rapidly. Existing solutions mitigate this issue through two technique routings: i) Reducing the KV cache per layer, such as from the head-level compression GQA, and the embedding dimension-level compression MLA, but the KV cache remains linearly dependent on the sequence length at a 1:1 ratio. ii) Interleaving with KV Cache friendly architecture, such as local attention SWA, linear kernel GDN, but often involve trade-offs among KV Cache and long-context modeling effectiveness. Besides the two technique routings, we argue that there exists an intermediate path not well explored: {Maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression through a specific ratio $k$}. This $O(n/k)$ path does not pursue a ``minimum KV cache’’, but rather trades acceptable memory costs for complete, referential, and interpretable retention of long distant dependency. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.

[96] Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style

Connor Baumler, Calvin Bao, Huy Nghiem, Xinchen Yang, Marine Carpuat, Hal Daumé

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ($n=81$) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants’ unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text still remains stylistically closer in style to LLM text than to participants’ unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity, with post-edited text often perceived as representative of participants’ personal style despite remaining detectable LLM stylistic traces.

[97] Zero-shot Large Language Models for Automatic Readability Assessment

Riley Grossman, Yi Chen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.

[98] SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering

Yuqing Fu, Yimin Deng, Wanyu Wang, Yuhao Wang, Yejing Wang, Hongshi Liu, Yiqi Wang, Xiao Han, Maolin Wang, Guoshuai Zhao, Yi Chang, Xiangyu Zhao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R.

[99] Generating Place-Based Compromises Between Two Points of View

Sumanta Bhattacharyya, Francine Chen, Scott Carter, Yan-Ying Chen, Tatiana Lau, Nayeli Suseth Bravo, Monica P. Van, Kate Sieck, Charlene C. Wu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the gen erated compromises was evaluated for acceptability in a 50-participant study. We found that the best method for generating compromises between two views used external empathic similarity between a compromise and each viewpoint as iterative feedback, outperforming stan dard Chain of Thought (CoT) reasoning. The results indicate that the use of empathic neutrality improves the acceptability of compromises. The dataset of generated compromises was then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation during inference.

[100] Aligned Multi-View Scripts for Universal Chart-to-Code Generation

Zhihan Zhang, Lizi Liao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Chart-to-code generation converts a chart image into an executable plotting script, enabling faithful reproduction and editable visualizations. Existing methods are largely Python-centric, limiting practical use and overlooking a critical source of supervision: the same chart can be expressed by semantically equivalent scripts in different plotting languages. To fill this gap, we introduce Chart2NCode, a dataset of 176K charts paired with aligned scripts in Python, R, and LaTeX that render visually equivalent outputs, constructed via a metadata-to-template pipeline with rendering verification and human quality checks. Building on a LLaVA-style architecture, we further propose CharLuMA, a parameter-efficient adaptation module that augments the multimodal projector with a language-conditioned mixture of low-rank subspaces, allowing the model to share core chart understanding while specializing code generation to the target language through lightweight routing. Extensive experiments show consistent gains in executability and visual fidelity across all languages, outperforming strong open-source baselines and remaining competitive with proprietary systems. Further analyses reveal that balanced multi-language supervision benefits all languages and that the adapter allocates a compact shared core plus language-specific capacity. Codes and data are available at https://github.com/Zhihan72/CharLuMA.

[101] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models’ ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

Xihang Wang, Zihan Wang, Chengkai Huang, Quan Z. Sheng, Lina Yao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.

[103] Skill Retrieval Augmentation for Agentic AI

Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang, Yiteng Tu, Yiqun Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model’s ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

[104] Evaluation of Pose Estimation Systems for Sign Language Translation

Catherine O’Brien, Gerard Sant, Mathias Müller, Sarah Ebling

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Many sign language translation (SLT) systems operate on pose sequences instead of raw video to reduce input dimensionality, improve portability, and partially anonymize signers. The choice of pose estimator is often treated as an implementation detail, with systems defaulting to widely available tools such as MediaPipe Holistic or OpenPose. We present a systematic comparison of pose estimators for pose-based SLT, covering widely used baselines (MediaPipe Holistic, OpenPose) and newer whole-body/high-capacity models (MMPose WholeBody, OpenPifPaf, AlphaPose, SDPose, Sapiens, SMPLest-X). We quantify downstream impact by training a controlled SLT pipeline on RWTH-PHOENIX-Weather 2014 where only the pose representation varies, evaluating with BLEU and BLEURT. To contextualize translation outcomes, we analyze temporal stability, missing hand keypoints, and robustness to occlusion using higher-resolution videos from the Signsuisse dataset. SDPose and Sapiens achieve the best translation performance (BLEU ~11.5), outperforming the common MediaPipe baseline (BLEU ~10). In occlusion cases, Sapiens is correct in all tested instances (15/15), while OpenPifPaf fails in nearly all (1/15) and also yields the weakest translation scores. Estimators that frequently leave out hand keypoints are associated with lower BLEU/BLEURT. We release code that can be used not only to reproduce our experiments, but also considerably lowers the barrier for other researchers to use alternative pose estimators.

[105] Looking for the Bottleneck in Fine-grained Temporal Relation Classification

Hugo Sousa, Ricardo Campos, Alípio Jorge

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Temporal relation classification is the task of determining the temporal relation between pairs of temporal entities in a text. Despite recent advancements in natural language processing, temporal relation classification remains a considerable challenge. Early attempts framed this task using a comprehensive set of temporal relations between events and temporal expressions. However, due to the task complexity, datasets have been progressively simplified, leading recent approaches to focus on the relations between event pairs and to use only a subset of relations. In this work, we revisit the broader goal of classifying interval relations between temporal entities by considering the full set of relations that can hold between two time intervals. The proposed approach, Interval from Point, involves first classifying the point relations between the endpoints of the temporal entities and then decoding these point relations into an interval relation. Evaluation on the TempEval-3 dataset shows that this approach can yield effective results, achieving a temporal awareness score of $70.1$ percent, a new state-of-the-art on this benchmark.

[106] K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

Soyeon Kim, Cheongwoong Kang, Myeongjin Lee, Eun-Chul Chang, Jaedeok Lee, Jaesik Choi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench .

[107] DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Zahra Dehghanighobadi, Asja Fischer

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.

[108] Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

Sercan Karakaş, Yusuf Şimşek

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.

[109] Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu, Yiming Zhang, Jia Zhou, Yanfei Zhang, Junbo Zhao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning,that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.

[110] Contextual Linear Activation Steering of Language Models

Brandon Hsu, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Mikhail Belkin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.

[111] The Chameleon’s Limit: Investigating Persona Collapse and Homogenization in Large Language Models

Yunze Xiao, Vivienne J. Zhang, Chenghao Yang, Ningshan Ma, Weihao Xuan, Jen-tse Huang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term \emph{Persona Collapse}: agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, \textbf{the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations}. We release our toolkit and data to support population-level evaluation of LLMs.

[112] Green Shielding: A User-Centric Approach Towards Trustworthy AI

Aaron J. Li, Nicolas Sanchez, Hao Huang, Ruijiang Dong, Jaskaran Bains, Katrin Jaradeh, Zhen Xiang, Bo Li, Feng Liu, Aaron Kornblith, Bin Yu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.

[113] Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution \emph{HyLo} (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to $32\times$ through efficient post-training and reduces KV-cache memory by more than $90%$, enabling up to 2M-token prefill and decoding in our \texttt{vLLM} inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.

[114] Sentiment and Emotion Classification of Indonesian E-Commerce Reviews via Multi-Task BiLSTM and AutoML Benchmarking

Hermawan Manurung, Ibrahim Al-Kahfi, Ahmad Rizqi, Martin Clinton Tosima Manullang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Indonesian marketplace reviews mix standard vocabulary with slang, regional loanwords, numeric shorthands, and emoji, making lexicon-based sentiment tools unreliable in practice. This paper describes a two-track classification pipeline applied to the PRDECT-ID dataset, which contains 5,400 product reviews from 29 Indonesian e-commerce categories, each labeled for binary sentiment (Positive/Negative) and five-class emotion (Happy, Sad, Fear, Love, Anger). The first track applies TF-IDF vectorization with a PyCaret AutoML sweep across standard classifiers. The second track is a PyTorch Bidirectional Long Short-Term Memory (BiLSTM) network with a shared encoder and two task-specific output heads. A preprocessing module applies 14 sequential cleaning steps, including a 140-entry slang dictionary assembled from marketplace corpora. Four configurations are benchmarked: BiLSTM Baseline, BiLSTM Improved, BiLSTM Large, and TextCNN. Training uses class-weighted cross-entropy loss, ReduceLROnPlateau scheduling, and early stopping. Both tracks are deployed as Gradio applications on Hugging Face Spaces. Source code is publicly available at https://github.com/ikii-sd/pba2026-crazyrichteam.

[115] Survey in Characterizing Semantic Change

Jader Martins Camboim de Sá, Marcos Da Silveira, Cédric Pruski

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Live languages continuously evolve to integrate the cultural change of human societies. This evolution manifests through neologisms (new words) or \textbf{semantic changes} of words (new meaning to existing words). Understanding the meaning of words is vital for interpreting texts coming from different cultures (regionalism or slang), domains (e.g., technical terms), or periods. In computer science, these words are relevant to computational linguistics algorithms such as translation, information retrieval, question answering, etc. Semantic changes can potentially impact the quality of the outcomes of these algorithms. Therefore, it is important to understand and characterize these changes formally. The study of this impact is a recent problem that has attracted the attention of the computational linguistics community. Several approaches propose methods to detect semantic changes with good precision, but more effort is needed to characterize how the meaning of words changes and to reason about how to reduce the impact of semantic change. This survey provides an understandable overview of existing approaches to the \textit{characterization of semantic changes} and also formally defines three classes of characterizations: if the meaning of a word becomes more general or narrow (change in dimension) if the word is used in a more pejorative or positive/ameliorated sense (change in orientation), and if there is a trend to use the word in a, for instance, metaphoric or metonymic context (change in relation). We summarized the main aspects of the selected publications in a table and discussed the needs and trends in the research activities on semantic change characterization.

[116] EXCEEDS: Extracting Complex Events via Nugget-based Grid Modeling in Scientific Domain

Yi-Fan Lu, Xian-Ling Mao, Bo Wang, Xiao Liu, Heyan Huang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: It is crucial to understand a specific domain by events. Extensive event extraction research has been conducted in many domains such as news, finance, and biology. However, event extraction in scientific domain is still insufficiently supported by comprehensive datasets and tailored methods. Compared with other domains, scientific domain has two characteristics: (1) denser nuggets and events, and (2) more complex information forms. To solve the above problem, considering these two characteristics, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for scientific domain. It consists of 2,508 documents and 24,381 events under multi-stage manual annotation and quality control. Then, we propose EXCEEDS, an end-to-end scientific event extraction framework by encoding dense nuggets into a grid matrix and simplifying complex event extraction as a nugget-based grid modeling task. Experiments on SciEvents demonstrate state-of-the-art performances of EXCEEDS. Both the SciEvents dataset and the EXCEEDS framework are released publicly to facilitate future research.

[117] AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models

Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.

[118] EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models

He Hu, Lianzhong You, Hongbo Xu, Qianning Wang, Fei Richard Yu, Fei Ma, Zebang Cheng, Zheng Lian, Yucheng Zhou, Laizhong Cui

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: With the integration of multimodal large language models (MLLMs) into robotic systems and AI applications, embedding emotional intelligence (EI) capabilities is essential for enabling these models to perceive, interpret, and respond to human emotions effectively in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real interactions and fail to capture the dynamic, context-dependent nature of emotional expressions, rendering them inadequate for evaluating MLLMs’ EI capabilities. To address these limitations, we introduce EmoBench-M, a systematic benchmark grounded in established psychological theories, designed to evaluate MLLMs across 13 evaluation scenarios spanning three hierarchical dimensions: foundational emotion recognition (FER), conversational emotion understanding (CEU), and socially complex emotion analysis (SCEA). Evaluation was conducted on 27 state-of-the-art MLLMs, using both objective task-specific metrics and LLM-based evaluation, revealing a substantial performance gap relative to human-level competence. Even the best performing models, Gemini-3.0-Pro and GPT-5.2, achieve the highest scores on EmoBench-M, 70.5 and 66.5 points respectively. Specialized models such as AffectGPT exhibit uneven performance across EmoBench-M, demonstrating strengths in certain scenarios but generally lacking comprehensive emotional intelligence. By providing a comprehensive, multimodal evaluation framework, EmoBench-M captures both the strengths and weaknesses of current MLLMs across diverse emotional contexts. All benchmark resources, including datasets and code, are publicly available at https://emo-gml.github.io/, facilitating further research and advancement in MLLM emotional intelligence.

[119] Quantifying and Improving the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data

Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hongzhi Li, Hengyuan Zhang, Angel X. Chang, Dongmei Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks implicit noise (spurious features). Moreover, previous studies on spurious features in LLMs are limited to specific types (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we identify and study spurious features in the RAG paradigm, a robustness issue caused by the sensitivity of LLMs to semantic-agnostic features. We then propose a novel framework, SURE, to empirically quantify the robustness of RALMs against spurious features. Beyond providing a comprehensive taxonomy and metrics for evaluation, the framework’s data synthesis pipeline facilitates training-based strategies to improve robustness. Further analysis suggests that spurious features are a widespread and challenging problem in the field of RAG. Our code is available at https://github.com/maybenotime/RAG-SpuriousFeatures .

[120] Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference

Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this study, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types$-$question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.

[121] Always Tell Me The Odds: Fine-grained Conditional Probability Estimation

Liaoyaqi Wang, Zhengping Jiang, Anqi Liu, Benjamin Van Durme

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. Recent advances in large language models (LLMs) have significantly enhanced their reasoning capabilities, particularly on well-defined tasks with complete information. However, LLMs continue to struggle with making accurate and well-calibrated probabilistic predictions under uncertainty or partial information. While incorporating uncertainty into model predictions often boosts performance, obtaining reliable estimates of that uncertainty remains understudied. In particular, LLM probability estimates tend to be coarse and biased towards more frequent numbers. Through a combination of human and synthetic data creation and assessment, scaling to larger models, and better supervision, we propose a set of strong and precise probability estimation models. We conduct systematic evaluations across tasks that rely on conditional probability estimation and show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.

[122] What Prompts Don’t Say: Understanding and Managing Underspecification in LLM Prompts

Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, Tongshuang Wu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Prompt underspecification is a common challenge when interacting with LLMs. In this paper, we present an in-depth analysis of this problem, showing that while LLMs can often infer unspecified requirements by default (41.1%), such behavior is fragile: Under-specified prompts are 2x as likely to regress across model or prompt changes, sometimes with accuracy drops exceeding 20%. This instability makes it difficult to reliably build LLM applications. Moreover, simply specifying all requirements does not consistently help, as models have limited instruction-following ability and requirements can conflict. Standard prompt optimizers likewise provide little benefit. To address these issues, we propose requirements-aware prompt optimization mechanisms that improve performance by 4.8% on average over baselines. We further advocate for a systematic process of proactive requirements discovery, evaluation, and monitoring to better manage prompt underspecification in practice.

[123] DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

Yuxuan Jiang, Dawei Li, Francis Ferraro

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student’s reasoning capacity is critical for effective knowledge transfer and performance gains.

[124] CUB: Benchmarking Context Utilisation Techniques for Language Models

Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) - the first comprehensive benchmark designed to help diagnose CMTs under diverse noisy context conditions within retrieval-augmented generation (RAG). With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to 11 LMs. Our findings expose critical gaps in current CMT evaluation practices, demonstrating the need for holistic testing. We reveal that most existing CMTs struggle to handle the full spectrum of context types encountered in real-world RAG scenarios. We also find that many CMTs display inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples.

[125] SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Wenjie Yang, Mao Zheng, Mingyang Song, Zheng Li, Sitong Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

[126] Explaining Sources of Uncertainty in Automated Fact-Checking

Jingyi Sun, Greta Warren, Irina Shklovski, Isabelle Augenstein

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Understanding sources of a model’s uncertainty regarding its predictions is crucial for effective human-AI collaboration. Prior work proposes using numerical uncertainty or hedges (“I’m not sure, but …”), which do not explain uncertainty that arises from conflicting evidence, leaving users unable to resolve disagreements or rely on the output. We introduce CLUE (Conflict-and-Agreement-aware Language-model Uncertainty Explanations), the first framework to generate natural language explanations of model uncertainty by (i) identifying relationships between spans of text that expose claim-evidence or inter-evidence conflicts and agreements that drive the model’s predictive uncertainty in an unsupervised way, and (ii) generating explanations via prompting and attention steering that verbalize these critical interactions. Across three language models and two fact-checking datasets, we show that CLUE produces explanations that are more faithful to the model’s uncertainty and more consistent with fact-checking decisions than prompting for uncertainty explanations without span-interaction guidance. Human evaluators judge our explanations to be more helpful, more informative, less redundant, and more logically consistent with the input than this baseline. CLUE requires no fine-tuning or architectural changes, making it plug-and-play for any white-box language model. By explicitly linking uncertainty to evidence conflicts, it offers practical support for fact-checking and generalises readily to other tasks that require reasoning over complex information.

[127] Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Nathaniel Getachew, Abulhair Saparov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, or rely on an LLM for generation, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of LLMs show that most models achieve higher accuracy on WM tasks than on ToM tasks, and that models tend to reason more accurately when the subject of reasoning is a person rather than an inanimate object. Additionally, our framework enabled us to find evidence of heuristic behavior and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

[128] Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources

Deshan Sumanathilaka, Sameera Perera, Sachithya Dharmasiri, Maneesha Athukorala, Anuja Dilrukshi Herath, Rukshan Dias, Pasindu Gamage, Ruvan Weerasinghe, Y. H. P. P. Priyadarshana

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.

[129] The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

Skyler Hallinan, Jaehun Jung, Melanie Sclar, Ximing Lu, Abhilasha Ravichander, Sahana Ramnath, Yejin Choi, Sai Praneeth Karimireddy, Niloofar Mireshghallah, Xiang Ren

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Membership inference attacks serves as useful tool for fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models’ hidden states or probability distribution, which prevents investigation into more widely-used, API-access only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while also impressively achieving comparable or even better performance to state-of-the-art white-box attacks - despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget - as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.

[130] For-Value: Efficient Forward-Only Data Valuation for finetuning LLMs and VLMs

Wenlong Deng, Qi Zeng, Jiaming Zhang, Minghui Chen, Zixin Ding, Christos Thrampoulidis, Boying Gong, Xiaoxiao Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Data valuation is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing methods typically rely on gradient computations, making them computationally prohibitive for billion-parameter models and precluding batch parallelization. In this work, we introduce For-Value, a forward-only data valuation framework that enables efficient batch-scalable value estimation while maintaining effectiveness. Leveraging the expressive power of pretrained LLMs/VLMs, we theoretically demonstrate that data valuation can be captured by the alignment between the final hidden representations and prediction errors at the last layer. In light of this insight, For-Value computes data value using a simple closed-form expression with a single forward pass, eliminating the need for costly backpropagation and enabling efficient batch calculating at scale. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in detecting influential data and mislabeled data, while achieving significant efficiency improvements.

[131] CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model’s parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.

[132] Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

Rikuto Kotoge, Mai Nishimura, Jiaxin Ma

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reinforcement Learning has emerged as a dominant post-training approach to elicit agentic RAG behaviors such as search and planning from language models. Despite its success with larger models, applying RL to compact models (e.g., 0.5–1B parameters) presents unique challenges. The compact models exhibit poor initial performance, resulting in sparse rewards and unstable training. To overcome these difficulties, we propose Distillation-Guided Policy Optimization (DGPO), which employs cold-start initialization from teacher demonstrations and continuous teacher guidance during policy optimization. To understand how compact models preserve agentic behavior, we introduce Agentic RAG Capabilities (ARC), a fine-grained metric analyzing reasoning, search coordination, and response synthesis. Comprehensive experiments demonstrate that DGPO enables compact models to achieve sophisticated agentic search behaviors, even outperforming the larger teacher model in some cases. DGPO makes agentic RAG feasible in computing resource-constrained environments.

[133] Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Ilham Wicaksono, Zekun Wu, Rahul Patel, Theo King, Adriano Koshiyama, Philip Treleaven

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models increasingly deployed into agentic systems, existing methods face critical gaps in observing, assessing, and mitigating deployment-specific risks. We present a comprehensive, observability-driven workflow: we introduce \textbf{AgentSeer}, observability tool which decomposes agentic executions into granular \emph{action-component} graphs; we use this decomposition to rigorously quantify the gap between model-level and agent-level jailbreaking risk via cross-model validation on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks; we leverage action-graph risk signals to automate iterative prompt hardening against direct and iterative jailbreak attacks. Stark differences is revealed between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover “agentic-only” vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, where agent transfer operations as highest-risk tools, with semantic pattern revealed rather than syntactic vulnerability mechanisms. Direct attack transfer from model-level to agentic contexts shows degraded performance of successful prompts (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic vulnerabilities gaps. Action-based prompt improvement substantially reduces action-averaged agentic jailbreak success on GPT-OSS-20B (direct: 45.3%

[134] Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain

Gang Cheng, Haibo Jin, Wenbin Zhang, Haohan Wang, Jun Zhuang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) are increasingly deployed in finance, where unsafe behavior can lead to serious regulatory risks. However, most red-teaming research focuses on overtly harmful content and overlooks attacks that appear legitimate on the surface yet induce regulatory-violating responses. We address this gap by introducing a controllable black-box multi-turn risk-concealed red-teaming framework (CoRT) that progressively conceals surface-level risk while exploiting regulatory-violating behaviors. CoRT contains two key components: (i) a Risk Concealment Attacker (RCA) that generates multi-turn prompts via iterative refinement, and (ii) a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer RCA’s follow-up style. We also built a domain-specific benchmark, FinRisk-Bench, with 522 instructions spanning six financial risk categories. Experiments on nine widely used LLMs show that CoRT (RCA) achieves 93.19% average attack success rate (ASR), and CoRT (RCA+RCC) further improves the average ASR to 95.00%. Our code and FinRisk-Bench are available at https://github.com/gcheng128/CoRT.

[135] SWE-QA: Can Language Models Answer Repository-level Code Questions?

Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, Xiaodong Gu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.

[136] V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

Qidong Wang, Junjie Hu, Ming Jiang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLaVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.

[137] OLaPh: Optimal Language Phonemizer

Johannes Wirth

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. This work introduces OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show that the OLaPh framework significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework’s performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework’s capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual G2P research.

[138] Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models

Yu Wang, Leyi Lao, Langchu Huang, Gabriel Skantze, Yang Xu, Hendrik Buschmeier

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Backchannels and fillers are important linguistic expressions in dialogue, but often treated as ’noise’ to be bypassed in modern transformer-based language models (LMs). Here, we study how they are represented in LMs using three fine-tuning strategies on three dialogue corpora in English and Japanese, in which backchannels and fillers are both preserved and annotated. This allows us to investigate how fine-tuning can help LMs learn these representations. We first apply clustering analysis to the learnt representation of backchannels and fillers, and find increased silhouette scores in representations from fine-tuned models, which suggests that fine-tuning enables LMs to distinguish the nuanced semantic variation in different backchannel and filler use. We also employ natural language generation metrics and qualitative analyses to verify that utterances produced by fine-tuned LMs resemble those produced by humans more closely. Our findings suggest the potential for transforming general LMs into conversational LMs that can produce human-like language more adequately.

[139] CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical system – a critical and labor-intensive component – remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components – CFDQuery, CFDCodeBench, and FoamBench – designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.

[140] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury, Sekh Mainul Islam, Isabelle Augenstein

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework’s broad applicability, we evaluate four HE methods – three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task – across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.

[141] Evaluating Language Models’ Evaluations of Games

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reasoning is not just about solving problems – it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems’ evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more “jaggedness” across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

[142] ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static `rewrite, retrieve, and generate’ pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1’s performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-aware behavior than static CQA pipelines.

[143] BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

Jia-Chen Gu, Junyi Zhang, Di Wu, Yuankai Li, Kai-Wei Chang, Nanyun Peng

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua’s 9x, while requiring only 23% of its computational overhead.

[144] POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

Yizhuo Chen, Xin Liu, Ruijie Wang, Zheng Li, Pei Chen, Changlong Yu, Qingyu Yin, Priyanka Nigam, Meng Jiang, Bing Yin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) are typically aligned with population-level preferences, despite substantial variation across individual users. We introduce POPI, a user-level personalization framework that separates the problem into two components connected by a natural-language interface: a shared inference model that distills heterogeneous user signals into a concise preference summary, and a shared generator that conditions on this summary to produce personalized responses. Both components are trained under a unified preference-optimization objective, with reinforcement learning handling the non-differentiable inference step. This objective decomposes into generator approximation error and summary informativeness, revealing how a single loss simultaneously drives accurate generation and informative summarization. Because the interface is natural language, learned summaries can be inferred once per user and reused across different generators – including frozen, black-box commercial APIs. Across four personalization benchmarks, POPI generally improves personalization quality while reducing context overhead by up to an order of magnitude.

[145] AI use in American newspapers is widespread, uneven, and rarely disclosed

Jenna Russell, Marzena Karpinska, Destiny Akinode, Katherine Thai, Bradley Emi, Max Spero, Mohit Iyyer

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1.5K American newspapers published in the summer of 2025. Using Pangram, a state-of-the-art AI detector, we discover that approximately 9% of newly-published articles are either partially or fully AI-generated. This AI use is unevenly distributed, appearing more frequently in smaller, local outlets, in specific topics such as weather and technology, and within certain ownership groups. We also analyze 45K opinion pieces from Washington Post, New York Times, and Wall Street Journal, finding that they are 6.4 times more likely to contain AI-generated content than news articles from the same publications, with many AI-flagged op-eds authored by prominent public figures. Despite this prevalence, we find that AI use is rarely disclosed: a manual audit of 100 AI-flagged articles found only five disclosures of AI use. Overall, our audit highlights the immediate need for greater transparency and updated editorial standards regarding the use of AI in journalism to maintain public trust.

[146] ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

Lvhua Wu, Xuefeng Jiang, Sheng Sun, Yan Lei, Tian Wen, Yuwei Wang, Min Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The rapid spread of fake news threatens social stability and public trust, highlighting the urgent need for its effective detection. Although large language models (LLMs) show potential in fake news detection, they are limited by knowledge cutoff and easily generate factual hallucinations when handling time-sensitive news. Furthermore, the thinking of a single LLM easily falls into early stance locking and confirmation bias, making it hard to handle both content reasoning and fact checking simultaneously. To address these challenges, we propose ZoFia, a two-stage zero-shot fake news detection framework. In the first retrieval stage, we propose novel Hierarchical Salience and Salience-Calibrated Minimum Marginal Relevance (SC-MMR) algorithm to extract core entities accurately, which drive dual-source retrieval to overcome knowledge and evidence gaps. In the subsequent stage, a multi-agent system conducts multi-perspective reasoning and verification in parallel and achieves an explainable and robust result via adversarial debate. Comprehensive experiments on two public datasets show that ZoFia outperforms existing zero-shot baselines and even most few-shot methods. Our code has been open-sourced to facilitate the research community at https://github.com/SakiRinn/ZoFia.

[147] Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

Max Schaffelder, Albert Gatt

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, we observe a tendency for higher output quality in the latter case, thus making outputs potentially more usable and dangerous. Finally, we also find evidence that fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.

[148] Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. In this work, we ask whether selectively skipping latent iterations may improve accuracy. We reveal significant potential with an oracle iteration policy that boosts model performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration only at tokens that are likely incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the LLM’s objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider modules, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

[149] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

Michelle Wastl, Jannis Vamvas, Rico Sennrich

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recognizing semantic differences across documents is crucial for text generation evaluation and content alignment, especially in cross-lingual settings. However, as a standalone task, it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English–German, English–French, and English–Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and dataset publicly available.

[151] AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu, Jinghao Liu, Kaiyang Wan, Rui Xing, Xiuying Chen, Timothy Baldwin, Wanxiang Che

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) excel at text comprehension and generation, making them ideal for automated tasks like code review and content moderation. However, our research identifies a vulnerability: LLMs can be manipulated by “adversarial instructions” hidden in input data, such as resumes or code, causing them to deviate from their intended task. Notably, while defenses may exist for mature domains such as code review, they are often absent in other common applications such as resume screening and peer review. This paper introduces a benchmark to assess this vulnerability in resume screening, revealing attack success rates exceeding 80% for certain attack types. We evaluate two defense mechanisms: prompt-based defenses achieve 10.1% attack reduction with 12.5% false rejection increase, while our proposed FIDS (Foreign Instruction Detection through Separation) using LoRA adaptation achieves 15.4% attack reduction with 10.4% false rejection increase. The combined approach provides 26.3% attack reduction, demonstrating that training-time defenses outperform inference-time mitigations in both security and utility preservation.

[152] Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-first case study provides the first direct comparison of state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term “narcissism.” Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients’ sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.

[153] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Jinhuan Song, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a straightforward averaging strategy or a principled graphical model-based truth inference algorithm to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results across four datasets show that the two variants of the proposed approach outperform the advanced model Smoothie-Global by 6.9% and 7.3% points, cross diverse task types including factual recall QA, math reasoning, and instruction following.

[154] NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation equipped with a Wiktionary-based search toolkit. Specifically, we first construct a dedicated dataset for neologism-aware machine translation and build a search toolkit grounded in Wiktionary. The dataset covers 16 languages and 75 translation directions in total, derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search toolkit is also constructed from around 3 million cleaned records of the same dump. We then leverage the dataset and toolkit to train a translation agent via reinforcement learning (RL) and to evaluate the accuracy of neologism-aware machine translation. Furthermore, we propose an RL training framework featuring a novel reward design and an adaptive rollout generation strategy that exploits translation difficulty to further improve the translation quality of translation agents using our search toolkit.

Mutaz Ayesh, Saif M. Mohammad, Nedjma Ousidhoum

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Warmth (W) (often further broken down intoTrust (T) and Sociability (S)) and Competence (C) are central dimensions along which people evaluate individuals and social groups (Fiske, 2018). While these constructs are well established in social psychology, they are only starting to get attention in NLP research through word-level lexicons, which do not fully capture their contextual expression in larger text units and discourse. In this work, we introduce Warmth and Competence Sentences (W&C-Sent), the first sentence-level dataset annotated for warmth and competence. The dataset includes over 1,600 English sentence–target pairs annotated along three dimensions: trust and sociability (components of warmth), and competence. The sentences in W&C-Sent are social media posts that express attitudes and opinions about specific individuals or social groups (the targets of our annotations). We describe the data collection, annotation, and quality-control procedures in detail, and evaluate a range of large language models (LLMs) on their ability to identify trust, sociability, and competence in text. W&C-Sent provides a new resource for analyzing warmth and competence in language and supports future research at the intersection of NLP and computational social science.

[156] Stress-Testing Emotional Support Models: Moving from Homogeneous to Diverse Help Seekers

Chaewon Heo, Cheyon Jin, Yohan Jo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model via a Mixture-of-Experts (MoE) architecture, which effectively differentiates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating 7 prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.

[157] A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

Gonzalo Ariel Meyoyan, Luciano Del Corro

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline. Multi-backbone experiments on dense and mixture-of-experts architectures (Llama-3.2-3B, GPT-OSS-20B, Qwen3-30B-A3B) confirm that these findings generalize beyond a single model family.

[158] CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning

Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan, Fei Huang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a “sparse retrieval” assumption-that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic reasoning over vast, unstructured text without relying on fallible human annotation. We further demonstrate the utility of our framework beyond evaluation, showing that fine-tuning on our synthesized data effectively enhances an LLM’s general long-context reasoning capabilities. Extensive experiments reveal that even state-of-the-art long-context LLMs struggle as input length increases, and standard retrieval-augmented generation systems collapse entirely. Our findings indicate that memory-augmented agentic architectures offer a more robust alternative, suggesting a critical shift is needed from simply extending context windows to developing advanced architectures for global information synthesis.

[159] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA

Yutong Song, Shiva Shrestha, Chenhan Lyu, Elahe Khatibi, Pengfei Zhang, Honghui Xu, Nikil Dutt, Amir Rahmani

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at https://github.com/RainieLLM/MedSpeak.

Polina Tsvilodub, Jan-Felix Klumpp, Amir Mohammadpour, Jennifer Hu, Michael Franke

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent “social world models”, i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs’ performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected “social world models” rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.

[161] Gated Tree Cross-Attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs

Xinyu Gao, Shaonan Wang, Nai Ding

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.

[162] A Lightweight Explainable Guardrail for Prompt Safety

Md Asiful Islam, Mihai Surdeanu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We propose a lightweight explainable guardrail (LEG) method to detect unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained on synthetic explanation data, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG’s training process uses a novel loss that captures global explanation signals as a weak supervision and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches.

[163] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Baris Karacan, Barbara Di Eugenio, Patrick Thornton

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

[164] CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, Yanfang Ye

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.

[165] A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Autoregressive (AR) language models build representations incrementally via left-to-right prediction, while diffusion language models (dLLMs) are trained through full-sequence denoising. Although recent dLLMs match AR performance, whether diffusion objectives fundamentally reshape internal representations remains unclear. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B), using cosine similarity across layers and tokens alongside static inference-time layer-skipping as an analytical probe of redundancy. We find that diffusion objectives produce more global representations with substantial early-layer redundancy and reduced recency bias, while AR objectives yield tightly coupled, locally structured representations. AR-initialized dLLMs retain AR-like dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this redundancy, native dLLMs absorb up to 18.75% FLOPs reduction while retaining over 90% performance on math-reasoning and coding benchmarks, whereas AR models collapse under identical skipping, revealing that diffusion objectives, rather than architecture alone, induce depth redundancy that enables principled compression.

[166] Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

Miriam Winkler, Verena Blaschke, Barbara Plank

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource as well as high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset, that contains artificial data generated by GPT-4o-mini. We find that IQA is a pragmatically hard task that comes with various challenges, based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa). We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that the IQA performance is poor in high- (English, German) and low-resource languages (Bavarian) and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.

[167] When Annotators Agree but Labels Disagree: The Projection Problem in Stance Detection

Bowen Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Stance detection is nearly always formulated as classifying text into Favor, Against, or Neutral. This convention was inherited from debate analysis and has been applied without modification to social media since SemEval-2016. However, attitudes toward complex targets are not unitary. A person can accept climate science while opposing carbon taxes, expressing support on one dimension and opposition on another. When annotators must compress such multi-dimensional attitudes into a single label, different annotators may weight different dimensions, producing disagreement that reflects different compression choices rather than confusion. We call this the projection problem. We conduct an annotation study across five targets from three stance benchmarks (SemEval-2016, P-Stance, COVID-19-Stance), with the same three annotators labeling all targets. For each target, annotators assign both a standard stance label and per-dimension judgments along target-specific dimensions discovered through bottom-up analysis, using the same number of categories for both. Across all fifteen target–dimension pairs, dimensional agreement consistently exceeds label agreement. The gap appears to scale with target complexity: modest for a single-entity target like Joe Biden (AC1: 0.87 vs. 0.95), but large for a multi-faceted policy target like school closures (AC1: 0.21 vs. 0.71).

[168] Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

Yusheng Zhao, Hourun Li, Bohan Wu, Yichun Yin, Lifeng Shang, Jingyang Yuan, Meng Zhang, Ming Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to take the benefits from both sides by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to encourage the model towards efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.

[169] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

Ramon Ferrer-i-Cancho

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.

[170] Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias

Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) have demonstrated advanced capabilities but often suffer from factual inaccuracies (hallucinations) and systematic biases. These issues, sometimes amplified in specific architectures like Mixture-of-Experts (MoE) which motivate our work, pose risks for reliable deployment. To address these challenges, we propose the Council Mode, a multi-agent consensus framework. Our approach dispatches queries to multiple heterogeneous frontier LLMs in parallel and synthesizes their outputs using a dedicated consensus model. The pipeline consists of three phases: an intelligent triage for query complexity, parallel generation across diverse models, and a structured synthesis that identifies agreement, disagreement, and unique findings. In our evaluation, conducted under controlled no-web settings, the Council Mode achieved a 35.9% relative reduction in hallucination rates on a 1,200-sample HaluEval subset and a 7.8-point improvement on TruthfulQA compared to the top-performing individual model. On our curated MDR-500 multi-domain reasoning benchmark, the Council Mode achieved a Quality Score of 91.7%, representing a 10.2-point improvement over the best individual model. The framework also exhibited lower measured bias variance under our rubric-based evaluation protocol. We provide a cost-effectiveness analysis showing that the framework incurs a 4.2x token-cost overhead, making it most suitable for accuracy-prioritized applications where the cost of errors exceeds the added inference cost. These findings suggest that structured multi-agent consensus is a promising direction for enhancing the reliability and factual grounding of LLM-generated content.

[171] Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs

Jason Chan, Robert Gaizauskas, Zhixue Zhao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models (LLMs) are increasing integrated into fact-checking pipelines, formal logic is often proposed as a rigorous means by which to mitigate bias, errors and hallucinations in these models’ outputs. For example, some neurosymbolic systems verify claims by using LLMs to translate natural language into logical formulae and then checking whether the proposed claims are logically sound, i.e. whether they can be validly derived from premises that are verified to be true. We argue that such approaches structurally fail to detect misleading claims due to systematic divergences between conclusions that are logically sound and inferences that humans typically make and accept. Drawing on studies in cognitive science and pragmatics, we present a typology of cases in which logically sound conclusions systematically elicit human inferences that are unsupported by the underlying premises. Consequently, we advocate for a complementary approach: leveraging human-like reasoning tendencies of LLMs as a feature rather than a bug, and using these models to validate the outputs of formal components in neurosymbolic systems against potentially misleading conclusions.

[172] AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

Hongyi Cen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals in our experiments. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.

[173] Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era

Nabelanita Utami, Ryohei Sasano

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The evolution of writing assistance tools from machine translation to large language models (LLMs) has changed how researchers write. This study investigates whether this shift is homogenizing research papers by analyzing native language identification (NLI) trends in ACL Anthology papers across three eras: pre-neural network (NN), pre-LLM, and post-LLM. We construct a labeled dataset using a semi-automated framework and fine-tune a classifier to detect linguistic fingerprints of author backgrounds. Our analysis shows a consistent decline in NLI performance over time. Interestingly, the post-LLM era reveals anomalies: while Chinese and French show unexpected resistance or divergent trends, Japanese and Korean exhibit sharper-than-expected declines.

[174] Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

Xinyi Huang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.

[175] Robust Explanations for User Trust in Enterprise NLP Systems

Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun, Jerry Ting

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.

[176] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Tomer Ashuach, Shai Gretz, Yoav Katz, Yonatan Belinkov, Liat Ein-Dor

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model’s own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement of answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.

[177] One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness

Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness under trivial constraints? We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14–48% of comprehensiveness across seven models spanning five families (7B–70B, open- and closed-weight). A blinded human evaluation with 10 STEM-trained evaluators confirms genuine content loss, with information criteria degrading $1.5$–$2.3\times$ more than surface criteria, a finding corroborated by over 4,100 automated pairwise comparisons (77–100% baseline preference) across three LLM judges from two model families. Diagnostic analysis identifies this as a \emph{planning failure}: two-pass generation recovers 59–96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$–$0.94$ before generation begins. The same probes yield negative $R^2$ on base models, confirming that instruction tuning introduces the representational structure underlying the collapse. Base models show no systematic degradation under identical constraints, demonstrating that instruction tuning couples task competence to narrow surface-form templates. The effect extends to realistic deployment constraints (preamble suppression, corporate tone guidelines, legal compliance hedging, accessibility requirements) causing comparable degradation ($-$22% to $-$34%), with suppressing the conversational opener alone (``Certainly!’’) causing 40% collapse on our most fragile model despite restricting only the opening tokens. We further show that standard independent LLM-as-judge evaluation detects only a 3.5% quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in current evaluation practice.

[178] EVE: A Domain-Specific LLM Framework for Earth Intelligence

Àlex R. Atrio, Antonio Lopez, Jino Rohit, Yassine El Ouahidi, Marcello Politi, Vijayasri Iyer, Umar Jamil, Sébastien Bratières, Nicolas Longépé

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sciences benchmarks, it outperforms comparable models while preserving general capabilities. We release curated training corpora and the first systematic domain-specific evaluation benchmarks, covering MCQA, open-ended QA, and factuality. EVE further integrates RAG and a hallucination-detection pipeline into a production system deployed via API and GUI, supporting 350 pilot users so far. All models, datasets, and code are ready to be released under open licenses as contributions to our field at huggingface.co/eve-esa and github.com/eve-esa.

[179] Peer-Predictive Self-Training for Language Model Reasoning

Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, Yiling Chen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.

[180] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

[181] PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations

Yuhe Wu, Guangyu Wang, Yuran Chen, Jiatong Zhang, Yutong Zhang, Yujie Chen, Jiaming Shang, Guang Zhang, Zhuang Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models (LLMs) evolve from conversational assistants into agents capable of handling complex tasks, they are increasingly deployed in high-risk domains. However, existing benchmarks largely rely on mixed queries and posterior evaluation, output-level scoring, which quantifies hallucination severity but offers limited insight into where and why hallucinations arise in the generation pipeline. We therefore reformulate hallucination evaluation as a diagnostic problem and propose PRISM, a controlled benchmark that disentangles hallucinations into four dimensions: knowledge missing, knowledge errors, reasoning errors, and instruction-following errors, grounded in three stages of generation (memory, instruction, and reasoning). PRISM contains 9,448 instances across 65 tasks and supports fine-grained, stage-aware diagnostic evaluation. Evaluating 24 mainstream open-source and proprietary LLMs, we uncover consistent trade-offs across instruction following, memory retrieval, and logical reasoning, showing that mitigation strategies often improve specific dimensions at the expense of others. We hope PRISM provides a framework for understanding the specific mechanisms behind LLMs hallucinations, ultimately accelerating the development of trustworthy large language models.

[182] HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution

Hanhua Hong, Yizhi LI, Jiaoyan Chen, Sophia Ananiadou, Xiaoli Li, Jung-jae Kim, Chenghua Lin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17745: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17745&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[183] AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment

Yixuan Wang, Yue Huang, Hong Qian, Yunzhao Wei, Yifei Ding, Wenkai Wang, Zhi Liu, Zhongjing Huang, Aimin Zhou, Jiajun Guo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18398: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18398&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[184] The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19139: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19139&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[185] Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

Dmitry Pronin, Evgeny Kazartsev

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19499: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19499&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[186] MathDuels: Evaluating LLMs as Problem Posers and Solvers

Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21916: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21916&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[187] Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose $\textbf{Abstract Chain-of-Thought}$, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen ‘‘abstract’’ tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i.) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii.) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to $11.6\times$ fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.

[188] MERIT: Modular Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning

Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.17590: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17590&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[189] PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2410.05970: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.05970&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[190] In-depth Analysis of Graph-based RAG in a Unified Framework

Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, Yixiang Fang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2503.04338: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.04338&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[191] PARASITE: Conditional System Prompt Poisoning to Hijack LLMs

Viet Pham, Thai Le

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.16888: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16888&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[192] VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

Di Wu, Yixin Wan, Kai-Wei Chang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.20291: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20291&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.00084: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.00084&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[194] Beyond Context: Large Language Models’ Failure to Grasp Users’ Intent

Ahmed M. Hussain, Salahuddin Salahuddin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.21110: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21110&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Hao Ban, Kaiyi Ji

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.25414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[196] On the Reasoning Abilities of Masked Diffusion Language Models

Anej Svete, Ashish Sabharwal

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.13117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[197] AP-BMM: Approximating Capability-Efficiency Pareto Sets of LLMs via Asynchronous Prior-guided Bayesian Model Merging

Kesheng Chen, Yamin Hu, Zhenqian Zhu, Yiya Diao, Wenjian Luo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.09972: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09972&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[198] DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack

Hao Li, Yubing Ren, Yanan Cao, Yingjie Li, Fang Fang, Shi Wang, Li Guo

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.16182: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[199] CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

Eric Onyame, Akash Ghosh, Subhadip Baidya, Sriparna Saha, Xiuying Chen, Chirag Agarwal

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.13262: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13262&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[200] KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?

Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Lingwei Wu, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.13240: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13240&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[201] SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, Xiaodong Gu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.16746: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16746&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[202] The Consensus Trap: Dissecting Subjectivity and the “Ground Truth” Illusion in Data Annotation

Sheza Munir, Benjamin Mah, Krisha Kalsi, Shivani Kapania, Julian Posada, Edith Law, Ding Wang, Syed Ishtiaque Ahmed

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.11318: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11318&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[203] Reinforcement Learning with Backtracking Feedback

Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.08377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[204] KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie, Xinlong Yang, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.17547: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17547&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[205] LongFlow: Efficient KV Cache Compression for Reasoning Models

Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.11504: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11504&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[206] AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Liang Ding

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.21357: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21357&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[207] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Liang Ding

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.21362: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21362&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[208] Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, Dongbin Zhao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.25562: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25562&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.00829: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00829&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[210] Arch: An AI-Native Hardware Description Language for Register-Transfer Clocked Hardware Design

Shuqing Zhao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.05983: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05983&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[211] How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM’s Residual Role in a Planning Agent

Sungwoo Jung, Seonil Son

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07236: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07236&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[212] Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan, Bingxiang He, Han Hao, Weiyang Jin, Dianqiao Lei, Qingle Liu, Houde Qian, Bowen Wang, Situ Wang, Youjie Zheng, Yifan Zhou, Calvin Xiao, Eren Cai, Qinhuai Na

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12290: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12290&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[213] The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Prashant C. Raju

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17698: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17698&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[214] Reducing Maintenance Burden in Behaviour-Driven Development: A Paraphrase-Robust Duplicate-Step Detector with a 1.1M-Step Open Benchmark

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.20462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.20462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[215] AVISE: Framework for Evaluating the Security of AI Systems

Mikko Lempinen, Joni Kemppainen, Niklas Raesalmi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.20833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.20833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[216] How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21106: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21106&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[217] Hyperloop Transformers

Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21254: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21254&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[218] Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

Shihai Wang, Tao Chen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21380: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21380&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[219] Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

Grigory Sapunov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[220] Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Liyao Jiang, Ruichen Chen, Keith G. Mills

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Pre-training is a general method that is used in a range of deep learning tasks. By first training a model on one task, and then further training on the downstream task used for final evaluation, the model is forced to learn a more general understanding of the input data. While pre-training has been applied to 3D Human Pose Estimation (HPE) previously, the scope of datasets used is typically very limited to some strong benchmarks, like Human3.6M. Therefore, in this project, we expand the scope of an existing 3D HPE scheme to be compatible with additional 2D and 3D HPE datasets, like Occlusion Person. We perform an extensive study on how aspects of 2D pre-training, such as model size, affect downstream performance, and to what extent pre-training can help the model generalize to different datasets. Experimental results show that 2D pre-training consistently outperforms training on 3D data alone, particularly in terms of computational efficiency. Finally, using MPII and Human3.6M, we are able to obtain an MPJPE score of under 64.5mm.

Yahui Li, Yinfeng Yu, Liejun Wang, Shengjie Shen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Emotionally talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an \textbf{E}motion-\textbf{A}ware \textbf{D}iffusion model-based \textbf{Net}work, called \textbf{EAD-Net}. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.

[245] Intervention-Aware Multiscale Representation Learning from Imaging Phenomics and Perturbation Transcriptomics

Jiayuan Chen, Ruoqi Liu, Zishan Gu, Ping Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Microscopy-based phenotypic profiling is scalable for drug discovery but lacks the mechanistic depth of transcriptomics, which remains costly and scarce. Existing multimodal approaches either use images to support other modalities or naively align representations by sample identity, ignoring cell-type and dose variations in weakly paired data-limiting generalization to unseen interventions. In this paper, we introduce an intervention-aware distillation framework that leverages perturbational transcriptomics to guide image representation learning. A transcriptome-conditioned teacher integrates gene expression and intervention metadata to produce soft distributions over a chemistry-aware codebook organized by drug similarity. The teacher employs a fine-tuned single-cell foundation model to encode cell-type context and disentangle dose effects. An image-only student learns to predict these distributions from microscopy alone, distilling mechanistic knowledge while operating independently at test time. This design emphasizes intervention semantics rather than identity alignment and explicitly handles dose and cell-type mismatches. We provide theoretical guarantees showing that transcriptomic guidance tightens the risk bound for image-based prediction. On Cell Painting and RxRx datasets paired with L1000, our method significantly improves one-shot transfer to unseen interventions and drug-target gene discovery compared to self-supervised and alignment baselines.

[246] ZID-Net: Zero-Inference Diffusion Prior Decoupling Network for Single Image Dehazing

Xinheng Li, Minghao Chen, Mengqing Wu, Yan Liu, Guanying Huo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Single image dehazing is often constrained by a trade-off between restoration quality and computational efficiency. While efficient, CNN networks struggle to learn robust priors for dense and non-homogeneous haze. Conversely, diffusion models provide strong generative priors but suffer from severe inference latency and sampling instability. To address these limitations, we propose ZID-Net, a novel framework that explicitly decouples diffusion supervision from feed-forward inference. For efficient inference, we design a frequency-spatial decoupled feed-forward backbone. Within this backbone, a Channel-Spatial Laplacian Mask (CSLM) filters haze-amplified noise to extract purified structural details, while Lightweight Global Context Blocks (LGCBs) establish long-range spatial dependencies to capture the global variations of haze. A Dynamic Feature Arbitration Block (DFAB) then adaptively fuses these semantic and structural features for robust reconstruction. To provide this backbone with physical priors without the inference cost, we introduce a Zero-Inference Prior Propagation Head (ZI-PPH) during training. ZI-PPH leverages a conditional diffusion process to predict residual noise, providing degradation-aware structural supervision to the backbone. By discarding the diffusion branch at test time, ZID-Net integrates diffusion priors into a pure feed-forward architecture for accurate and efficient restoration. ZID-Net achieves 40.75 dB PSNR on the synthetic RESIDE dataset and outperforms existing methods with a 1.13 dB gain on real-world datasets. Additionally, it yields a 3.06 dB PSNR gain on the StateHaze1k remote sensing dataset with an inference time of just 19.35 ms. The project code is available at: https://github.com/XoomitLXH/ZID-Net.

[247] Building a Precise Video Language with Human-AI Oversight

Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/

[248] WebSerial Vision Training for Microcontrollers: A Browser-Based Companion to On-Device CNN Training

Jeremy Ellis

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper presents webmcu-vision-web, a single-file, zero-install browser application for end-to-end TinyML vision model training and deployment on the Seeed Studio XIAO ESP32-S3 Sense (XIAO ML Kit, $15–40 USD). Acting as a browser-based companion to the on-device Arduino firmware of Paper 1 [1], it provides a private, fully local machine learning pipeline, from firmware flashing through image collection, CNN training, weight export, and live activation visualization, without any software installation beyond a Chromium-based browser. The system targets educators, small businesses, and researchers who need to train task-specific visual classifiers under their exact deployment conditions. Key capabilities include: in-browser firmware flashing via esptool-js; an SD card file browser with image preview and inline editing; config.json live-sync for zero-recompile hyperparameter adjustment; webcam and ESP32 OV2640 camera image capture; TensorFlow.js CNN training completing a three-class run (~30 images per class, 20 epochs) in approximately 1 minute browser-side versus 9 minutes on-device, enabling a complete collect-train-deploy cycle in under 10 minutes; weight export as myWeights.bin and myWeights.h; confusion matrix; and a live Conv2 activation heatmap streamed from the ESP32 during inference. No data leaves the local machine at any stage. A five-run consistency evaluation on the three-class reference problem (0Blank, 1Cup, 2Pen) demonstrates stable convergence with mean accuracy and standard deviation reported; all artefacts are released at the repository link below. The repository is a living template for LLM-assisted adaptation to new hardware and tasks. All source code is MIT-licensed at https://github.com/webmcu-ai/webmcu-vision-web.

[249] Robust Grounding with MLLMs against Occlusion and Small Objects via Language-guided Semantic Cues

Beomchan Park, Seongho Kim, Hyunjun Kim, Sungjune Park, Yong Man Ro

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.

[250] ParkingScenes: A Structured Dataset for End-to-End Autonomous Parking in Simulation Scenes

Haonan Chen, Kaiwen Xiao, Bin Tian, Jun Fu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Autonomous parking remains a critical yet challenging task in intelligent driving systems, particularly within constrained urban environments where maneuvering space is limited and precise control is essential. While recent advances in end-to-end learning have shown great promise, the lack of high-quality, structured datasets tailored for parking scenarios remains a significant bottleneck.To address this gap, we present ParkingScenes, a comprehensive multimodal dataset specifically designed for end-to-end autonomous parking in simulated scenes. Built on the CARLA simulator, ParkingScenes features structured parking trajectories generated by a Hybrid A* planner and a Model Predictive Controller (MPC), providing accurate and reproducible supervision signals. The dataset includes 16 reverse-in and 6 parallel parking scenarios, each executed under two pedestrian conditions (present and absent), resulting in 704 structured episodes and approximately 105000 frames. Each scenario is repeated 16 times to ensure consistent coverage. Each frame contains synchronized data from four RGB cameras, four depth sensors, vehicle motion states, and Bird’s-Eye View (BEV) representations, enabling rich multimodal fusion and context-aware learning. To demonstrate the utility of our dataset, we compare models trained on ParkingScenes with those trained on unstructured, manually collected simulation data under identical conditions. Results show significant improvements in performance, underscoring the effectiveness of structured supervision for robust and accurate parking policy learning. By releasing both the dataset and the collection framework, ParkingScenes establishes a scalable and reproducible benchmark for advancing learning-based autonomous parking systems. The dataset and collection framework will be released at: https://github.com/haonan-ai/ParkingScenes

[251] Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution

Shyang-En Weng, Yi-Cheng Liao, Yu-Syuan Xu, Wei-Chen Chiu, Ching-Chun Huang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.

[252] AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

Deshui Miao, Chao Yang, Chao Tian, Guoqing Zhu, Kai Yang, Zhifan Mo, Xin Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This report describes a Ref-VOS pipeline centered on Sa2VA and organized with explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage. If the referred object does not exist in the video, the system directly outputs zero masks. Otherwise, Sa2VA receives the video and referring prompt and produces a coarse mask trajectory over the full video. This trajectory is treated as a semantic prior rather than a final answer. A planner agent decomposes the query, temporal partition agents identify informative blocks, scout agents search for anchor frames, and refinement agents convert reliable Sa2VA masks into boxes and points for SAM3 propagation. A critic scores candidate trajectories, a reflection controller repairs weak hypotheses, and a collaboration controller reconciles multiple agent branches. The result is a Ref-VOS system in which Sa2VA is responsible for dense grounded understanding, while the agent layer handles presence verification, temporal search, confidence-aware revision, and final mask refinement.

[253] OAMVOS:2nd Report for 5th PVUW MOSE Track

Deshui Miao, Xingsen Huang, Yameng Gu, Xiaogang yu, Xin Li, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: SAM-based dense trackers provide strong short-term mask propagation but remain fragile under long occlusion, fast motion, viewpoint change, and distractors. The problem is especially severe for small objects, where a few incorrect memory updates can dominate later predictions. This report presents an occlusion- and reappearance-aware extension of DAM4SAM that improves memory control rather than changing the backbone. The method augments the original SAM3 tracker with four ingredients: a reliability-aware tracking state machine, branch-based recovery, delayed DRM promotion, and a selective policy for native SAM3 memory selection. During stable tracking, the model follows the original single-path propagation process. Once confidence drops, the tracker enters an ambiguous or recovery mode, maintains a small set of candidate branches, and commits memory only after a branch is reconfirmed. For small-object disappearance and reappearance, native memory selection is temporarily bypassed so older anchors remain accessible. In addition, the first conditioning frame is explicitly preserved, and the conditioning-memory budget is moderately enlarged to improve long-gap recovery. The resulting design keeps DAM4SAM efficient in easy cases while improving robustness in sequences dominated by occlusion and reappearance.

[254] Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning

Xin Ning, Qiankun Li, Xiaolong Huang, Qiupu Chen, Feng He, Weijun Li, Prayag Tiwari, Xinwang Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: With the accumulation of resources in the era of big data and the rise of pre-trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine-tuning pre-trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real-time layer-wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. For more importantly fine-tuning, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine-tuning performance. Additionally, we extend the layer-wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state-of-the-art performance of DualOpt. Code is available at https://github.com/qklee-lz/OLOR-AAAI-2024.

[255] From Skeletons to Pixels: Few-Shot Precise Event Spotting via Representation and Prediction Distillation

Zhong Han Ervin Yeoh, Jiang Kan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Precise Event Spotting (PES) is essential in fast-paced sports such as tennis, where fine-grained events occur within very short temporal windows. Accurate frame-level localization is challenging because of motion blur, subtle action differences, and limited annotated data. We study two complementary distillation strategies for few-shot PES: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively weights teacher supervision on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level framework that transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling. Both methods use multimodal distillation to improve generalization under limited supervision. We evaluate them on F3Set-Tennis(sub) under few-shot k-clip settings, where they consistently outperform single-modality baselines and prior PES approaches. After observing the stronger performance of representation-level distillation on tennis, we further validate AMD-FED on a second sports dataset, Figure Skating, where it also shows robust performance in the k-clip scenario. These results highlight the effectiveness of multimodal distillation, especially representation-level transfer, for few-shot precise event spotting.

[256] Unified Multi-Foundation-Model Slide Representation for Pan-Cancer Recognition and Text-Guided Tumor Localization

Tianyang Wang, Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Lina Gokhale, Charles Rabolli, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.

[257] Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes

Tim Merino, Sam Earle, Ryunosuke Iwai, Julian Togelius, Edoardo Cetin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We introduce Dream-Cubed, a large-scale dataset of Minecraft worlds at voxel resolution, and a family of models using cubes as powerful compositional units for efficient generation of interactive 3D environments. Dream-Cubed comprises tens of billions of tokens from a carefully curated mixture of procedural biome terrain and high-quality human-authored maps. We use this dataset to conduct the first large-scale study of 3D diffusion models for voxel generation, analyzing discrete and continuous diffusion formulations, data compositions, and architectural design choices. Our models operate directly in the space of blocks, enabling efficient and semantically grounded generation while supporting interactive user workflows such as inpainting and outpainting from user-authored blocks. To quantitatively evaluate our models, we adapt the FID metric to assess semantic differences between real and generated world renderings, and validate generation quality through a human preference study. We release the full dataset, code, and all our pretrained models, which we hope will provide a foundation for future research in efficient generative modeling for structured, interactive 3D environments.

[258] LunarDepthNet: Generation of Digital Elevation Models using Deep Learning and Monocular Satellite Images

Aaranay Aadi, Jai Gopal Singla, Amitabh, Nitant Dube, Praveen Kumar Shukla, Vijaypal Singh Dhaka

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent times have seen an increase in demand of high quality Digital Elevation Models (DEMs) for the lunar surface, because they are highly important for studying the moon and planning future missions. However, there is an evident lack of detailed elevation data on the Moon. To overcome this limitation, this study proposes a novel deep learning method that estimates and generates a surface elevation map directly from monocular images of the surface. The dataset used comprises of the Chandrayaan-2 Terrain Mapping Camera (TMC) images with their corresponding Digital Terrain Models (DTMs). The study proposes LunarDepthNet, which comprises of a UNet architecture to generate DEMS. It incorporates an EfficientNet encoder and custom layers to correctly learn how the light shadows on the surface relate to the actual elevation values. A combined loss function was also utilized to keep the terrain details accurate and smooth. During validation, the model showed a stable loss convergence of 12%. It achieved a mean nRMSE of 0.437 and an MAE of 4.5m in the testing stage. These results prove the model can generate dependable elevation maps from single orbital images, which are quite useful in regions of the moon where stereo-images are not available.

[259] Accelerating New Product Introduction for Visual Quality Inspection via Few-Shot Diffusion-Based Defect Synthesis

Serkan Hamdi Güğül, Kemal Levi, Burak Acar

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Industrial visual inspection systems often suffer from a severe scarcity of labeled defect data, particularly during the early stages of New Product Introduction (NPI). This limitation hinders the deployment of robust supervised detectors precisely when automated quality control is most needed. We present an end-to-end generative framework for high-fidelity, few-shot defect synthesis that enables both in-domain augmentation and cross-domain transfer. Our approach disentangles defect morphology from background appearance by combining masked textual inversion for defect representation learning, noise-blended conditioned generation for surface-aware synthesis, and gradient-aware post-processing for seamless visual integration. We evaluate the framework in two practically relevant settings: few-shot data augmentation, where synthetic samples enrich a small set of real defects, and zero-shot adaptation, where defects learned from a source domain are transferred to a novel target surface without any real target-domain defect examples. Using RF-DETR as the downstream detector, we show that the proposed pipeline substantially narrows the domain gap on a private industrial dataset. In the few-shot setting, synthetic augmentation improves mAP from 78.8% to 83.3%. In the zero-shot setting, synthetic domain adaptation improves mAP from 65.0% to 85.1%. These results demonstrate that high-fidelity defect synthesis can meaningfully accelerate NPI by enabling effective inspection models before sufficient real defect data has been collected.

[260] EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

Finn Rasmus Schäfer, Yuan Gao, Dingrui Wang, Thomas Stauner, Stephan Günnemann, Mattia Piccinini, Sebastian Schmidt, Johannes Betz

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Vision-Language Models (VLMs) have advanced highlevel reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model’s internal physical logic from its visual perception. Our large-scale empirical audit spanning 20 + models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: egomotion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion - Physical Reasoning - Foundation Models

[261] FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods

Chao Pan, Xin Yao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Fast Adversarial Training (FastAT) seeks to achieve adversarial robustness at a fraction of the computational cost incurred by standard multi-step methods such as PGD-AT. Although numerous FastAT techniques have been proposed in recent years, fair comparison among them remains elusive. Existing benchmarks and public leaderboards typically permit diverse model architectures, varying training configurations, and external data sources, making it unclear whether reported improvements reflect genuine algorithmic advances or merely more favorable experimental conditions. To address this problem, we introduce the FastAT Benchmark, a controlled evaluation framework built on three core design principles: unified architecture requirements, standardized training settings, and strict prohibition of external or synthetic data. The benchmark implements over twenty representative FastAT methods within a single codebase, enabling direct and reproducible comparison. Each method is assessed through a dual-metric evaluation framework that measures both adversarial robustness (accuracy under PGD, AutoAttack, and CR Attack) and computational cost (GPU training time and peak memory footprint). Comprehensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet provide reliable baseline measurements and reveal that well-designed single-step methods can match or surpass PGD-AT robustness at substantially lower cost, while no single method dominates across all evaluation dimensions. The complete benchmark, including source code, configuration files, and experimental results, is publicly available to support transparent and fair evaluation of future FastAT research.

[262] MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer

R. M. Krishna Sureddi, T. Satyanarayana Murthy, Nomula Varsha Reddy, Adi Kanishka, Nalla Manvika Reddy

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Transformer architectures, including nnFormer,have demonstrated promising results in volumetric medical image segmentation by being able to capture long-range spatial interactions. Although they have high performance, these models need large quantities of labeled training data and are also likely to overfit and become training unstable. This is a serious practical problem because it is not only time-consuming but also expensive to obtain medical images that are annotated by experts. Moreover, fully supervised traditional training pipelines do not take advantage of the available large amounts of unlabeled medical imaging data that can be easily obtained in the clinics. We have solved these drawbacks by advancing the efficiency of the nnFormer with a self-supervised pretraining framework, which is based on the Masked Autoencoders (MAE). In this method, the model is pretrained on unlabeled volumetric medical images to reconstruct randomly masked parts of the input. This allows the encoder to learn meaningful anatomical and structural representations . The encoder is then further fine-tuned on a labeled dataset on the downstream segmentation task. Conducted Experiment shows that the offered method leads to a higher segmentation performance on the count of Dice score, a quicker convergence rate on the course of the fine-tuning procedure, and a superior generalization on the basis of limited labeled data . These findings validate that self-supervised learning combined with transformer-based segmentation models is an appropriate approach to the problem of data shortage in medical image analysis.

[263] Evaluating Remote Sensing Image Captions Beyond Metric Biases

Ziyun Chen, Fan Liu, Liang Yao, Chuanyi Zhang, Yuye Ma, Wei Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore’s reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.

[264] Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems

Syed Sajid Ullah, Muhammad Zunair Zamir, Ahsan Ishfaq, Salman Khan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate vehicle detection is a critical component of autonomous driving, traffic surveillance, and intelligent transportation systems. This paper presents an enhanced YOLOv8n-based model that integrates the Ghost Module, Convolutional Block Attention Module (CBAM), and Deformable Convolutional Networks v2 (DCNv2) to improve detection performance. The Ghost Module reduces feature redundancy through efficient feature generation, CBAM refines feature representation via channel and spatial attention, and DCNv2 enhances adaptability to geometric variations in vehicle structures. Evaluated on the KITTI dataset, the proposed model achieves 95.4% mAP@0.5, representing an 8.97% improvement over the baseline YOLOv8n, along with 96.2% precision, 93.7% recall, and a 94.93% F1-score. Comparative analysis against seven state-of-the-art detectors demonstrates consistent superiority across key performance metrics, while ablation studies validate the individual and combined contributions of the integrated modules. By addressing feature redundancy, attention refinement, and spatial adaptability, the proposed approach offers a robust and computationally efficient solution for vehicle detection in diverse and complex traffic environments.

[265] IoT-Enhanced CNN-Based Labelled Crack Detection for Additive Manufacturing Image Annotation in Industry 4.0

Mohsen Asghari Ilani, Yaser Mike Banad

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper presents an IoT-enhanced deep learning framework for automated crack detection in Additive Manufacturing (AM) surfaces using convolutional neural networks (CNNs). By integrating IoT-enabled real-time monitoring, high-resolution imaging, and edge computing, the system enables continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving both accuracy and efficiency in AM quality control. The framework supports supervised and semi-supervised learning, enabling robust performance on large, sparsely annotated datasets. Using LabelImg for annotation and OpenCV for preprocessing, the system achieves 99.54% accuracy on 14,982 images, with 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation significantly improve generalization, increasing accuracy from 32% to 99%. Beyond detection, the framework establishes a linkage between AM process parameters, defect formation, and surface topology, supporting predictive analytics and defect mitigation. Aligned with Industry 4.0, it incorporates Digital Twin (DT) technology for real-time process simulation, predictive maintenance, and adaptive control. Key contributions include an IoT-based monitoring system using edge devices (Raspberry Pi 4B), an optimized CNN with model quantization and batch processing reducing inference latency by 47%, and an MQTT-based low-latency data streaming system with 5G connectivity, lowering transmission overhead by 35%. DT integration further enables predictive defect analysis and dynamic adjustment of AM parameters. This work advances intelligent AM quality control by providing a scalable, high-accuracy, and low-latency framework. Future directions include multimodal data fusion, hybrid architectures, and enhanced Digital Twin simulations for AI-driven defect prevention.

[266] A Digital Pathology Resource for Liver Cancer Quantification with Datasets, Benchmarks, and Tools

Ying Xiao, Shimiao Tang, Xitong Ling, Weiming Chen, Jun Wang, Jiawen Li, Huaitian Yuan, Jianghui Yang, Bowen Li, Huan Li, Yiting Meng, Tian Guan, Yonghong He, Hongfang Yin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Liver cancer, especially hepatocellular carcinoma (HCC), imposes a substantial global disease burden. Accurate diagnosis and prognostic assessment directly influence treatment selection and patient survival, and pathological examination remains the gold standard for liver cancer diagnosis. Identifying diverse tissue components and pathological subtypes on histopathology slides is crucial for estimating postoperative recurrence risk and overall prognosis. However, most publicly available resources are still provided at the whole-slide image (WSI) level, and well-annotated datasets for fine-grained tissue component identification in liver cancer are scarce, which hinders reproducible model development and the deployment of quantitative analysis tools. To address this gap, we release HepatoBench, a patch-level image database for liver cancer with annotations for seven key tissue categories. Based on HepatoBench, we train and open-source a deep learning classification model as a tissue recognition tool. Furthermore, we train a WSI-level tumor/non-tumor segmentation model to automatically localize lesion regions across entire slides. By integrating the patch-level tissue classifier with the WSI-level segmentation model, we build HepatoQuant, an end-to-end, disease-specific regional quantification tool for liver cancer, enabling a unified workflow from WSIs to tissue composition parsing and quantitative statistics. We also open-source HepatoBench, the benchmarking protocol, and supporting tools, providing a solid foundation for automated regional quantification and fair method comparison in liver cancer pathology.

[267] MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

Yisheng He, Steven Hoi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.

[268] Probing Visual Planning in Image Editing Models

Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao, Bo Zhao, Xiaojian Ma

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that they all struggle in the zero-shot setting, finetuning on basic scales enables remarkable generalization to larger in-domain scales and out-of-domain scales and geometries. However, our best model that runs on high-end hardware fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.

[269] Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles

Md Tanjemul Islam, Md Rafiul Kabir

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Autonomous vehicles (AVs) rely on real-time perception systems to understand road environments and ensure safe navigation. However, implementing reliable perception algorithms on resource-constrained embedded platforms remains challenging due to limited computational resources. This paper presents a lightweight vision-based framework that integrates lane detection, lane tracking, and traffic sign recognition for embedded autonomous vehicles. A computationally efficient threshold-based lane segmentation method combined with perspective transformation and histogram-based curvature estimation is used for robust lane tracking under varying illumination conditions. A rule-based steering controller generates steering commands to maintain stable vehicle navigation. For traffic sign recognition, two lightweight convolutional neural networks (CNNs), EfficientNet-B0 and MobileNetV2, are evaluated using a custom dataset captured from the vehicle’s onboard camera. Experimental results show that the system achieves real-time performance while maintaining accurate lane tracking with only 3.16% maximum offset RMSE. EfficientNet-B0 achieves a high offline classification accuracy of 98.77% on the test dataset, while achieving 90% accuracy during real-time on-device deployment, outperforming MobileNetV2 in both settings. MobileNetV2, however, offers slightly faster inference and lower computational cost. These results highlight the effectiveness of lightweight vision-based perception pipelines for resource-constrained autonomous driving applications.

[270] SketchVLM: Vision language models can annotate images to explain thoughts and guide users

Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model’s stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.

[271] NeuroAPS-Net: Neuro-Anatomically Aware Point Cloud Representation for Efficient Alzheimer’s Disease Classification

Towhidul Islam, Mufti Mahmud

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder and a major cause of dementia. Structural MRI is widely used to analyze AD-related brain atrophy; however, most deep learning methods rely on computationally expensive 3D convolutional neural networks (CNNs), limiting deployment in resource-constrained settings. This work introduces two main contributions. First, we propose a pipeline that converts T1-weighted MRI into anatomically informed 2D point clouds using Anatomical Priority Sampling (APS), producing ADNI-2DPC, the first neuroanatomically labeled MRI-derived point cloud dataset. Second, we present NeuroAPS-Net, a lightweight geometric deep learning model that incorporates anatomical priors via region-aware feature encoding and ROI token aggregation. Experiments on ADNI-2DPC demonstrate that NeuroAPS-Net achieves competitive classification accuracy while significantly reducing inference latency and GPU memory compared to state-of-the-art point cloud methods. These results highlight the potential of anatomically guided point cloud learning as an efficient and interpretable alternative to voxel-based CNNs for AD classification.

[272] Can Multimodal Large Language Models Truly Understand Small Objects?

Fujun Han, Junan Chen, Xintong Zhu, Jingqi Ye, Xuanjie Mao, Tao Chen, Peng Ye

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, they remain blank and unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first and comprehensive benchmark for exploring the small objects understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervising fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance the latest MLLM’s ability to understand small objects. Comprehensive experimental results demonstrate that, the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.

Hefeng Zhou, Xuan Liu, Sicheng Chen, Wutong Zhang, Wu Yan, Jiong Lou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.

[274] Breaking Degradation Coupling: A Structural Entropy Guided Decoupled Framework and Benchmark for Infrared Enhancement

Pu Li, Huafeng Li, Yafei Zhang, Yu Liu, Wen Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Thermal infrared image enhancement aims to restore high-quality images from complex compound degradations. Existing all-in-one approaches typically employ a single shared backbone to handle diverse degradations, which causes gradient interference and parameter competition. To address this, we propose a Structural Entropy-Guided Decoupled (SEGD) Framework. Unlike unified modeling paradigms, SEGD decomposes compound degradations into independent sub-processes and models them in a divide-and-conquer manner through Degradation-Specific Residual Modules (DRMs). Each DRM focuses on residual estimation for a specific degradation, enabling task decoupling while remaining jointly trainable, which mitigates parameter contention. A Degradation-Aware Evidential Network further estimates degradation type and intensity, providing priors that adaptively regulate DRM restoration strength. To handle compound cases, DRMs are composed in varying orders to form multiple restoration paths, from which the most informative features are aggregated under a structural-entropy criterion, yielding decoder-ready representations with structural fidelity and degradation awareness. Integrating divide-and-conquer restoration, evidential perception, and entropy-guided adaptation, SEGD achieves fine-grained and interpretable enhancement. We also construct a nighttime TIR benchmark for evaluation under real low-light conditions. Experimental results demonstrate that SEGD surpasses state-of-the-art methods while achieving higher efficiency with fewer parameters.

[275] Text-Guided Multimodal Unified Industrial Anomaly Detection

Zewen Li, Shuo Ye, Zitong Yu, Weicheng Xie, Linlin Shen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.

[276] On the Complementarity of Quantum and Classical Features: Adaptive Hybrid Quantum-Classical Feature Fusion for Breast Cancer Classification

Yasmin Rodrigues Sobrinho, João Renato Ribeiro Manesco, João Paulo Papa

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The integration of quantum machine learning with classical deep learning offers promising avenues for medical image analysis by mapping data into high-dimensional Hilbert spaces. However, effectively unifying these distinct paradigms remains challenging due to common optimization asymmetries. In this paper, a novel hybrid quantum-classical architecture for breast cancer diagnosis based on a dual-branch feature-extraction pipeline is proposed. Our framework extracts and unifies complementary representations from classical models and quantum circuits, exploring both trainable and deterministic (non-trainable) quantum paradigms. To integrate these embeddings, three progressive feature fusion strategies are introduced: Static Hybrid Fusion (SHF) for offline extraction, Dynamic Hybrid Fusion (DHF) for end-to-end co-adaptation, and a novel Temperature-Scaled Hybrid Fusion (TSHF). The TSHF strategy incorporates a learnable scalar, inspired by multimodal learning, that dynamically balances hybrid gradient dynamics and resolves optimization bottlenecks. Empirical validation on the BreastMNIST dataset confirms our hypothesis that unifying diverse feature representations creates a richer data context. The TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, achieved a peak accuracy of 87.82%, F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines. These results demonstrate that the proposed hybrid framework improves classification accuracy and threshold reliability, providing a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.

[277] VS-DDPM: Efficient Low-Cost Diffusion Model for Medical Modality Translation

Nikoo Moradi, Gijs Luijten, Behrus Hinrichs-Puladi, Jens Kleesiek, Victor Alves, Jan Egger, André Ferreira

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion models produce high-quality synthetic data but suffer from slow inference. We propose 3D Variable-Step Denoising Diffusion Probabilistic Model (VS-DDPM) a framework engineered to maintain generative quality while accelerating inference by several factors. We tested our approach on four tasks (missing MRI, tumor removal, MRI-to-sCT, and CBCT-to-sCT) within the BraTS2025 and SynthRAD2025 challenges. Designed for high efficiency under hardware and time constrains imposed by both challenges. VS-DDPM achieved state-of-the-art (SOTA) performance in missing MRI synthesis, yielding Dice scores of 0.80, 0.83, and 0.88 for the enhancing tumor, tumor core, and whole tumor regions, respectively, alongside a structural similarity index (SSIM) of 0.95. For MRI tumor removal, the model attained a root mean squared error (RMSE) of 0.053, a peak signal-to-noise ratio (PSNR) of 26.77, and an SSIM of 0.918. While the framework demonstrated competitive performance in MRI-to-sCT and CBCT-to-sCT tasks, it did not reach SOTA benchmarks, potentially due to sensitivities in data pre and post-processing pipelines or specific loss function configurations. These results demonstrate that VS-DDPM provides a robust and tunable solution for high-fidelity 3D medical image synthesis. The code is available in https://github.com/andre-fs-ferreira/SynthRAD_by_Faking_it.

[278] AnemiaVision: Non-Invasive Anemia Detection via Smartphone Imagery Using EfficientNet-B3 with TrivialAugmentWide, Mixup Augmentation, and Persistent Patient History Management

Rahul Patel

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Anemia affects over one billion people globally and remains severely under-diagnosed in low-resource regions where laboratory blood tests are inaccessible. This paper presents AnemiaVision, an end-to-end web-based system for non-invasive anemia screening from smartphone photographs of the palpebral conjunctiva and fingernail beds. The proposed pipeline fine-tunes a pre-trained EfficientNet-B3 backbone with a redesigned three-layer classifier head incorporating BatchNorm, GELU activations, and high-rate Dropout (0.45/0.35). Training employs four orthogonal accuracy-boosting techniques: TrivialAugmentWide for policy-free image augmentation, RandomErasing for spatial regularisation, Mixup (alpha=0.2) for inter-class smoothing, and cosine-annealing scheduling with linear warmup. Early stopping is governed by peak validation accuracy rather than validation loss to prevent premature termination on high-variance epochs. The deployed Flask application integrates persistent patient-history management backed by PostgreSQL on Render, with an automated database-migration entrypoint ensuring zero data loss across redeploys. Ablation experiments demonstrate that accuracy-first early stopping contributes +1.6% and Mixup contributes +2.8% to final validation accuracy. Overall, the proposed system achieves a validation accuracy of 96.2% and AUC-ROC of 0.98, compared with 44.9% validation accuracy and AUC-ROC of 0.58 from the three-epoch CPU-only baseline. Sensitivity for the anemic class reaches 0.96, making the system suitable as a first-line screening tool for community health workers in rural settings. The system is publicly accessible and source code is openly available.

[279] BrickNet: Graph-Backed Generative Brick Assembly

Peter Kulits, Cordelia Schmid

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We train a language model to generate LEGO-brick build sequences. While prior work has been restricted to discrete, voxel-like towers, we consider a much broader set of pieces, encompassing thousands of part types with diverse connection semantics. To enable this, we first collect a large-scale dataset of over 100,000 human-designed LDraw brick objects and scenes. The complexity of our setting makes it challenging to autoregressively assemble structures that satisfy physical constraints. When predicting block pose directly, build sequences quickly become invalid after a small number of steps. Although pieces are placed in 3D space, it is the spatial relationships of the parts which define the whole. With this in mind, we design a graph-based program representation that parametrizes structure through connectivity, improving the physical grounding of generated sequences. To enable future applications, we make our dataset and models available for research purposes. https://kulits.github.io/BrickNet

[280] CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

Ashwin Kumar, Robbie Holland, Corey Barrett, Jangwon Kim, Maya Varma, Zhihong Chen, Yunhe Gao, Greg Zaharchuk, Tara Taghavi, Krishnaram Kenthapadi, Akshay Chaudhari

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon’s autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.

[281] Hard to See, Hard to Label: Generative and Symbolic Acquisition for Subtle Visual Phenomena

Renjith Prasad, Rishabh Sharma, Andrew E. Shao, Annmary Justine Koomthanam, Shreyas Kulkarni, Suparna Bhattacharya, Martin Foltin, Amit Sheth, David Orozco, Brian Sammuli

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Subtle visual anomalies such as hairline cracks, sub-millimeter voids, and low-contrast inclusions are structurally atypical yet visually ambiguous, making them both difficult to annotate and easy to overlook during active learning. Standard acquisition heuristics based on discriminative uncertainty or feature diversity often overselect dominant patterns while underexploring sparse yet important regions of the data space. This failure mode is especially severe in industrial defect inspection, where anomalies may be both low-prevalence and difficult to distinguish from surrounding structure. To resolve this, we propose GSAL, an active learning framework for object detection that combines a diffusion-based difficulty signal with a hierarchical semantic coverage prior. The diffusion component scores images and proposals using reconstruction discrepancy and denoising variability, prioritizing visually atypical or ambiguous examples. However, diffusion alone does not prevent acquisition from repeatedly favoring hard samples within dominant semantic modes. The semantic component therefore organizes candidate samples in a three-level concept graph and promotes coverage of underrepresented semantic regions while providing interpretable acquisition rationales. By balancing visual difficulty with semantic coverage, GSAL improves retrieval of subtle and rare targets that are often missed by uncertainty-only selection. Experiments on a proprietary thin-film defect, Pascal VOC and MS COCO dataset show consistent gains in label efficiency and rare-class retrieval over uncertainty-, diversity-, and hybrid-based baselines

[282] Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation

Vitalii Tutevych, Raphael Memmesheimer, Luca Eichler, Dmytro Pavlichenko, Fynn Schilke, Rodja Krudewig, Sven Behnke

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.

[283] GenAssets: Generating in-the-wild 3D Assets in Latent Space

Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, Raquel Urtasun

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: High-quality 3D assets for traffic participants are critical for multi-sensor simulation, which is essential for the safe end-to-end development of autonomy. Building assets from in-the-wild data is key for diversity and realism, but existing neural-rendering based reconstruction methods are slow and generate assets that render well only from viewpoints close to the original observations, limiting their usefulness in simulation. Recent diffusion-based generative models build complete and diverse assets, but perform poorly on in-the-wild driving scenes, where observed actors are captured under sparse and limited fields of view, and are partially occluded. In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. Key to our method is a “reconstruct-then-generate” approach that first leverages occlusion-aware neural rendering trained over multiple scenes to build a high-quality latent space for objects, and then trains a diffusion model that operates on the latent space. We show our method outperforms existing reconstruction and generation based methods, unlocking diverse and scalable content creation for simulation.

[284] AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

Mohammad Sadegh Salehi, Alex Perkins, Igor Maurell, Ashkan Dabbagh, Raymond Wong

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Web-scale 3D asset collections are abundant, but rarely deployment-ready. Assets ship with arbitrary metric scale, incorrect pivots and forward axes, brittle geometry, and textures that do not support relighting, which limits their utility for embodied AI, robotics simulation, game development, and AR/VR. We present AmaraSpatial-10K, a dataset of over 10,000 synthetic 3D assets designed for downstream use rather than volume alone. Each asset is released as a metric-scaled, semantically anchored .glb with separated PBR material maps, a convex collision hull, a paired reference image, and rich multi-sentence text metadata. The dataset spans indoor objects, vehicles, architecture, creatures, and props under a unified spatial convention. Alongside the dataset, we introduce an evaluation suite for 3D asset banks. The suite comprises a continuous Scale Plausibility Score (SPS) with an LLM-as-Judge interval protocol, an LLM Concept Density score for metadata, an anchor-error metric, and a cross-modal CLIP coherence protocol, and we use it to audit AmaraSpatial-10K alongside matched subsets from Objaverse, HSSD, ABO, and GSO. Compared with Objaverse-sourced assets, we demonstrate that AmaraSpatial-10K substantially improves text-based retrieval precision (CLIP Recall@5 of 0.612 vs 0.181, a 3.4x improvement with median rank falling from 267 to 3), and we establish that it satisfies the spatial and semantic prerequisites for physics-aware scene composition and embodied-AI asset banks, leaving those downstream evaluations to future work. AmaraSpatial-10K is publicly available on Hugging Face.

[285] Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery

Sulagna Saha, Arthur Ouaknine, Etienne Laliberté, Carol Altimas, Evan M. Gora, Adriane Esquivel Muelbert, Ian R. McGregor, Cesar Gutierrez, Vanessa E. Rubio, David Rolnick

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.

[286] Urban Flood Observations (UFO): A hand-labeled training and validation dataset of post-flood inundation

Rohit Mukherjee, Hannah K. Friedrich, Beth Tellman, Ariful Islam, Zhijie Zhang, Jonathan Giezendanner, Upmanu Lall, Venkataraman Lakshmi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: ‘inundated’ (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and ’non-inundated’. To demonstrate the dataset’s utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google’s 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.

[287] From Pixels to Explanations: Interpretable Diabetic Retinopathy Grading with CNN-Transformer Ensembles, Visual Explainability and Vision-Language Models

Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba, Sule Yildirim Yayilgan, Sarang Shaikh

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The quality of diabetic retinopathy (DR) screening relies on the ability to correctly grade severity; however, many deep-learning (DL) classifiers cannot be easily interpreted in the clinical context. This study presents a methodology that combines strong discriminative models with multimodal explanations, converting retinal pixels into clinically interpretable outputs. Using the APTOS 2019 benchmark, we evaluated six representative CNN- and transformer-based backbones under a controlled protocol with stratified five-fold cross-validation. We then compared ensembling strategies (hard voting, weighted soft voting, stacking) and investigated a hybrid class-level fusion variant to exploit grade-specific advantages. For interpretability, we produced Grad-CAM++ visual attribution maps and short textual rationales using vision-language models (VLMs) conditioned on the fundus image and classifier outputs under conservative prompting constraints. Modern CNN backbones (ResNet-50 and ConvNeXt-Tiny) provided the strongest single-model baselines, with cross-validated QWK up to 0.919 and 0.914, respectively. Ensembling improved ordinal agreement, and weighted soft voting was the most consistent across folds (QWK 0.934 +/- 0.017). Hybrid class-level fusion was competitive but did not yield a statistically reliable improvement over standard fusion in paired fold comparisons (Holm-adjusted p >= 1.000). For explanation quality, Grad-CAM++ offered plausible but coarse localization, and VLM rationales were generally grade-consistent. Quantitatively, VLM variants showed a trade-off between clinical completeness and template-level semantic similarity (coverage 0.700 vs. BERTScore 0.072), while image-text alignment was comparable (CLIPScore approximately 0.34).

[288] Toward Real-World Adoption of Portrait Relighting via Hybrid Domain Knowledge Fusion

Qian Huang, Mayoore Selvarasa Jaiswal, Zhen Zhong, Rochelle Pereira, Jianyuan Min

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The real-world adoption of portrait relighting is hindered by dataset domain gaps, camera sensitivity, and computational costs. We address these challenges with Hybrid Domain Knowledge Fusion, a paradigm that fuses the specialized strengths of synthetic, One-Light-at-A-Time (OLAT), and real-world datasets into a compact model. Our approach features specialized prior models hardened by domain-aware adaptation, followed by augmented knowledge distillation into a lightweight student model with multi-domain expertise. Our method demonstrates a 6x to 240x inference speedup while maintaining state-of-the-art (SOTA) visual quality in the experiments. Additionally, we construct a massive, high-fidelity synthetic dataset with diverse ground-truth intrinsics to support our training pipeline.

[289] INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public~Safety

Alexander Nikitas Dimopoulos, Joseph Grasso, John Beltz

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70{,}496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO~~19164-compliant scene graphs with ${\sim}10^{4}{\times}$ compression; role-filtered payloads transmit in ${<}15$,s at 1,Mbps over FirstNet Band~~14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.

[290] Transferable Physical-World Adversarial Patches Against Object Detection in Autonomous Driving

Zihui Zhu, Ziqi Zhou, Yichen Wang, Lulu Xue, Minghui Li, Shengshan Hu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Deep learning drives major advances in autonomous driving (AD), where object detectors are central to perception. However, adversarial attacks pose significant threats to the reliability and safety of these systems, with physical adversarial patches representing a particularly potent form of attack. Physical adversarial patch attacks pose severe risks but are usually crafted for a single model, yielding poor transferability to unseen detectors. We propose AdvAD, a transfer-based physical attack against object detection in autonomous driving. Instead of targeting a specific detector, AdvAD optimizes adversarial patches over multiple detection models in a unified framework, encouraging the learned perturbations to capture shared vulnerabilities across architectures. The optimization process adaptively balances model contributions and enforces robustness to physical variations. It further employs data augmentation and geometric transformations to maintain patch effectiveness under diverse physical conditions. Experiments in both digital and real-world settings show that AdvAD consistently outperforms state-of-the-art (SOTA) attacks in performance and transferability.

[291] Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label

Mengke Li, Haiquan Ling, Yiqun Zhang, Yang Lu, Hui Huang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting their effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.

[292] CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model

Syed Ibad Hasnain, Muhammad Faris, Hafiza Syeda Yusra Tirmizi, Rabail Khowaja, Hafsa Israr

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Early detection and classifying brain tumors using Magnetic Resonance Imaging (MRI) images is highly important but difficult to extract in medical images. Convolutional Neural Networks (CNNs) are good at capturing both local texture and spatial information whereas Vision Transformers (ViTs) are good at capturing long-range global dependencies. We propose a new hybrid architecture that combines a SqueezeNet-style CNN branch with a MobileViT-style global transformer branch, through an Adaptive Attention Gate mechanism, in this paper. The gate learns dynamically per-sample, per-feature weights to weight the contribution of each branch, allowing context-sensitive merging of local and global representations. The proposed model has a test accuracy of 97.60, a precision of 97.30, a recall of 97.50, an F1-score of 97.40, and a macro-average area under the curve (AUC) of 0.9946 with a trained and evaluated on the Brain Tumor MRI Dataset (Kaggle). These scores are higher than single CNN and ViT baselines, and current competitive fusion methods, showing that dynamic feature weighting is an effective way to classify medical images.

[293] UpstreamQA: A Modular Framework for Explicit Reasoning on Video Question Answering Tasks

Jason Nguyen, Ameet Rao, Alexander Chang, Ishaan Kumar, Erin Tan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Video Question Answering (VideoQA) demands models that jointly reason over spatial, temporal, and linguistic cues. However, the task’s inherent complexity often requires multi-step reasoning that current large multimodal models (LMMs) perform implicitly, leaving their internal decision process opaque. In contrast, large reasoning models (LRMs) explicitly generate intermediate logical steps that enhance interpretability and can improve multi-hop reasoning accuracy. Yet, these models are not designed for native video understanding, as they typically rely on static frame sampling. We propose UpstreamQA, a modular framework that disentangles and evaluates core video reasoning components through explicit upstream reasoning modules. Specifically, we employ multimodal LRMs to perform object identification and scene context generation before passing enriched reasoning traces to downstream LMMs for VideoQA. We evaluate UpstreamQA on the OpenEQA and NExTQA datasets using two LRMs (o4-mini, Gemini 2.5 Pro) and two LMMs (GPT-4o, Gemini 2.5 Flash). Our results demonstrate that introducing explicit reasoning can significantly boost performance and interpretability of downstream VideoQA, but can also lead to performance degradation when baseline performance is sufficiently high. Overall, UpstreamQA offers a principled framework for combining explicit reasoning and multimodal understanding, advancing both performance and diagnostic transparency in VideoQA in several scenarios.

[294] BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

Hongxiang Peng, Dewei Bai, Hong Qu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.

[295] A Topology fixated Shape Gradient Framework for Non Simple Boundary Extraction for CIE Lab color images with Repulsive Energy

Shafeequdheen Palengara, Jyotiranjan Nayak, Vijayakrishna Rowthu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: A levelset free but a hybrid image segmentation approach based on a modified version of the piece wise constant shape gradient of an Mumford Shah shape functional and a repulsive function is considered. The segmentation is performed a non-local shape based through an evolution of discrete curves driven by a non local shape based energy to segment images containing disjoint regions and multiple boundaries. This formulation has a novel additional component as a multivariable function dependent on a few sampled points of the curves that handles the occurrence of self intersection during boundary curves evolution. The method is applied to a few gray scale and color images, including images with nested structures and astronomical objects. The results indicate effective segmentation in complex scenarios with absolute control on the topology of the segments and self-intersections of the boundaries

[296] One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition

Balaji Darur, Amanmeet Garg, Makarand Tapaswi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Video Situation Recognition (VidSitu) addresses the challenging problem of “who did what to whom, with what, how, and where” in a video. It tests thorough video understanding by requiring identification of salient actions and associated short descriptions for event roles across multiple events. Grounding with VidSitu requires spatio-temporal localization of key entities across shots and varied appearances. We posit that coherent video understanding requires consistent identification of entities that play different roles. We propose Multimodal Entity Coreference (MEC) to unite entity descriptions in text with grounding across the video. Towards this, we introduce CineMEC, a multi-stage approach that unites event role mention groups with visual clusters of entities, without explicit grounding supervision during training. Our approach is designed to exploit the synergy between visual grounding and captioning, where improving one influences the other and vice versa. For evaluation, we extend the VidSitu dataset with grounding annotations. While previous work focuses primarily on descriptions, CineMEC improves consistency across both: captioning (+2.5% CIDEr, +7% LEA) and visual grounding (+18% HOTA).

[297] DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark

Niamh Belton, Victoria Joppin, Aonghus Lawlor, Catherine Masson, Thierry Bege, David Bendahan, Kathleen M. Curran

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This work introduces DyABD, a novel and complex benchmark dataset of dynamic abdominal MRIs from patients with abdominal hernias and associated high quality abdominal muscle annotations. DyABD is the first-of-its-kind in four key ways; (1) it proposes the first abdominal muscle segmentation task, (2) the dynamic MRIs are acquired whilst the patients perform various exercises, introducing extreme anatomical variability, making it one of the most challenging segmentation datasets to date, (3) it includes both pre and post corrective MRIs and (4) DyABD promotes clinical research into the high recurrence rates of abdominal hernias. Beyond dataset introduction, this work provides a comprehensive evaluation of the generalisation capabilities of existing segmentation models across Supervised, Few Shot and Zero Shot paradigms on the unseen DyABD dataset. This work reveals that there is still room for substantial improvement in the field of medical image segmentation, with the majority of techniques achieving a Dice Coefficient of 0.82. This work therefore sheds light on the true progress of the field and redefines the benchmark for progress in medical image segmentation.

Yihan Wang, Lei Li, Yao Lai, Jing Wang, Yan Lu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search. We first build a high-quality dataset on top of Masala-CHAI through a two-stage repair pipeline that raises the netlist compile rate from 22% to 100%. Built on this foundation, AnalogRetriever encodes schematics and descriptions with a vision-language model and netlists with a port-aware relational graph convolutional network, mapping all three modalities into a shared embedding space via curriculum contrastive learning. Experiments show that AnalogRetriever achieves an average Recall@1 of 75.2% across all six cross-modal retrieval directions, significantly outperforming existing baselines. When integrated into the AnalogCoder agentic framework as a retrieval-augmented generation module, it consistently improves functional pass rates and enables previously unsolved tasks to be completed. Our code and dataset will be released.

[299] Micro-Expression-Aware Avatar Fingerprinting via Inter-Frame Feature Differencing

Masoumeh Chapariniya, Jean-Marc Odobez, Volker Dellwo, Teodora Vuković

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Avatar fingerprinting, i.e., verifying who drives a synthetic talking-head video rather than whether it is real, is a critical safeguard for authorized use of face-reenactment technology. Existing methods rely on a fixed, non-differentiable landmark extraction stage that prevents the fingerprinting model from being optimized end-to-end from raw pixels. We propose a preprocessing-free system built on a micro-expression-aware backbone operating on raw video frames, with inter-frame feature differencing as the core design principle: consecutive feature maps are subtracted in the learned deep feature space, so that temporally stable appearance dimensions contribute zero to the output while driver-specific motion dynamics are preserved. A controlled ablation on NVFAIR confirms that temporal motion accounts for the large majority of discriminative performance, and that raw appearance features actively degrade identity separation. Both the choice of backbone and the differencing principle are essential: differencing alone is insufficient when applied to a generic encoder, as appearance-dominated features collapse to near-identical representations across adjacent frames, while the micro-expression-aware F5C backbone retains measurable motion variation that the differencing operation can exploit. Without any external preprocessing, our model achieves an overall AUC of 0.877 on NVFAIR and matches or exceeds the landmark-based baseline on the majority of cross-generator pairs.

[300] MotionHiFlow: Text-to-motion via hierarchical flow matching

Heng Li, Xiaotong Lin, Ling-An Zeng, Yulei Kang, Shuai Li, Jian-Fang Hu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Text-to-motion generation aims to generate 3D human motions that are tightly aligned with the input text while remaining physically plausible and rich in fine-grained detail. Although recent approaches can produce complex and natural movements, they usually operate at only one temporal scale, which limits both semantic alignment and temporal coherence. Inspired by the fact that complex motions are conceptualized hierarchically rather than at a single temporal scale in the human cognitive system, we propose \textit{MotionHiFlow}, a hierarchical flow matching framework to generate motion progressively by constructing flow path from low to high temporal scales. The flows at lower scales capture high-level semantics and coarse motion structures, while flows at higher scales refine temporal details. To link the flows across scales, we introduce a novel cross-scale transition process, ensuring continuity and preserving noise consistency. Furthermore, by integrating a Text-Motion Diffusion Transformer and a topology-aware Motion VAE, MotionHiFlow explicitly models structural dependencies among joints via joint-aware positional encoding and skeletal topology, enabling precise semantic alignment alongside fine-grained motion details. Extensive experiments on HumanML3D and KIT-ML benchmarks demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of the hierarchical design and key components. Code is available at https://github.com/ai-lh/MotionHiFlow.

[301] LatentBurst: A Fast and Efficient Multi Frame Super-Resolution for Hexadeca-Bayer Pattern CIS images

Sangwook Baek, Vin Van Duong, Karam Park, Pilkyu Park

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper introduces a novel multi frame super-resolution network (MFSR) for burst hexadeca Bayer pattern Contact Image Sensor (CIS) images, which includes demosaicing, denoising, multi-frame fusion, and super-resolution. Designing a high-quality reconstruction network poses several challenges as follows: 1) Unlike the Bayer color filter array (CFA) pattern, it is hard to interpolate hexadeca-Bayer pattern since the pixel distance between the same color groups increases; 2) Due to large object motion and camera movements, the final fusion result usually suffers the misalignment resulting a blurry image or ghosting artifacts; 3) The proposed network should be fast and efficient enough to operate in real-time on mobile devices. To overcome these challenges, we propose a novel network, called LatentBurst, which contains: 1) a pyramid align and fusion approach in latent feature to deal with large motion scenario; 2) an efficient UNet-based structure which can run efficiently on mobile device; 3) fine-tuned optical flow estimation and two-step knowledge distillation to reduce domain-gap more effectively. Experimental results in various scenarios demonstrate the effectiveness of our proposed method compared with other state-of-the-art methods.

[302] A Hierarchical Ensemble Inference Pipeline for Robust White Blood Cell Classification Under Domain Shifts

Ruyi Dai, Tingkwong Ng, Hao Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Automated white blood cell (WBC) classification is essential for scalable leukaemia screening. However, real-world deployment is challenged by domain shifts caused by staining protocols, scanner characteristics, and inter-laboratory variability, which often degrade model performance. The White Blood Cell Classification Challenge (WBCBench) at ISBI 2026 aims to advance robust WBC recognition, with a focus on accurately identifying blast cells and other clinically critical rare subtypes. We propose a memory-augmented, hierarchical ensemble pipeline for WBC classification under domain shifts, leveraging a feature bank and a DinoBloom backbone fine-tuned with LoRA. Our three-stage inference hierarchy combines k-nearest neighbors (kNN) retrieval at each level, reducing over-reliance on any single decision. Evaluated on the WBCBench dataset, our method ranks within the top ten by macro F1-score in the final testing phase.

[303] SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation

Kaiwen Huang, Yi Zhou, Yizhe Zhang, Jingxiong Li, Tao Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Semi-supervised learning addresses label scarcity and high annotation costs in medical image segmentation by exploiting the latent information in unlabeled data to enhance model performance. Traditional discriminative segmentation relies on segmentation masks, neglecting feature-level distribution constraints. This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. Our SemiGDA overcomes the reliance of discriminative methods on large labeled datasets by aligning feature and semantic distributions to boost semantic learning and scene adaptability. Specifically, we propose a Dual-distribution Alignment Module (DAM), which employs two structurally distinct encoders to model image and mask feature distributions. It enforces their alignment in the latent space via distributional constraints, establishing structured feature consistency. Moreover, we design a Consistency-Driven Skip Adapter (CDSA) strategy, which introduces dual skip adapters (Image and Mask) to fuse multi-scale features via skip connections. Using a consistency loss, CDSA enhances cross-branch semantic alignment and reinforces fine-grained semantic consistency. Experimental results on diverse medical datasets show that our method outperforms other state-of-the-art semi-supervised segmentation methods. Code is released at: https://github.com/taozh2017/SemiGDA.

[304] Lightweight and Production-Ready PDF Visual Element Parsing

Meizhu Liu, Yassi Abbasi, Matthew Rowe, Michael Avendi, Paul Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: PDF documents contain critical visual elements such as figures, tables, and forms whose accurate extraction is essential for document understanding and multimodal retrieval-augmented generation (RAG). Existing PDF parsers often miss complex visuals, extract non-informative artifacts (e.g., watermarks, logos), produce fragmented elements, and fail to reliably associate captions with their corresponding elements, which degrades downstream retrieval and question answering. We present a lightweight and production level PDF parsing framework that can accurately detect visual elements and associates captions using a combination of spatial heuristics, layout analysis, and semantic similarity. On popular benchmark datasets and internal product data, the proposed solution achieves $\geq96%$ visual element detection accuracy and $93%$ caption association accuracy. When used as a preprocessing step for multimodal RAG, it significantly outperforms state-of-the-art parsers and large vision-language models on both internal data and the MMDocRAG benchmark, while reducing latency by over $2\times$. We have deployed the proposed system in challenging production environment.

[305] STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning

Yanpei Gong, Beichen Zhang, Hao Wang, Zhaobo Qi, Xinyan Liu, Yuanrong Xu, Ruiyang Gao, Weigang Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Remote sensing image change captioning (RSICC) aims to describe the difference between two remote sensing images. While recent methods have explored video modeling, they largely overlook the inherent ambiguities in viewpoint, scale, and prior knowledge, lacking effective constraints on the encoder. In this paper, we present STAND, a Semantic Anchoring Constraint with Dual-Granularity Disambiguation for RSICC, to progressively resolve these ambiguities. Specifically, to establish a reliable feature foundation, we first introduce an interpretable constraint to regularize temporal representations. Operating on these purified features, a dual-granularity disambiguation module resolves spatial uncertainties by coupling macro-level global context aggregation for viewpoint confusion with micro-level frequency-refocused attention for small-object scale enhancement. Ultimately, to translate these visually disambiguated features into precise text, a semantic concept anchoring module leverages language categorical priors to tackle knowledge ambiguity during decoding. Extensive experiments verify the superiority of STAND and its effectiveness in addressing ambiguities.

[306] Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM

Jingxuan Kang, Ziqi Zhang, Shaoming Zheng, Shuang Li, Uday Bharat Patel, Alexander Harry Fitzhugh, Phillip Lung, Yusuf Kiberu, Nikesh Jathanna, Shahnaz Jamil-Copley, Bernhard Kainz, Chen Qin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.

[307] KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

Zhaoxiang Liu, Zhicheng Ma, Kaikai Zhao, Kai Wang, Shiguo Lian

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The Convolutional Neural Networks (CNNs) have been the dominant and effective approach for general computer vision tasks. Recently, Kolmogorov-Arnold neural networks (KANs), based on the Kolmogorov-Arnold representation theorem, have shown potential to replace Multi-Layer Perceptrons (MLPs) in deep learning. KANs, which use learnable nonlinear activations on edges and simple summation on nodes, offer fewer parameters and greater explainability compared to MLPs. However, there has been limited exploration of integrating the Kolmogorov-Arnold representation theorem with convolutional methods for computer vision tasks. Existing attempts have merely replaced learnable activation functions with weights, undermining KANs’ theoretical foundation and limiting their potential effectiveness. Additionally, the B-spline curves used in KANs suffer from computational inefficiency and a tendency to overfit. In this paper, we propose a novel Kolmogorov-Arnold Convolutional Layer that deeply integrates the Kolmogorov-Arnold representation theorem with convolution. This layer provides stronger method interpretability because it is based on established mathematical theorems and its design has theoretical alignment. Building on the Kolmogorov-Arnold Convolutional Layer, we design an efficient network architecture called KAConvNet, which outperforms existing methods combining KAN and convolution, and achieves competitive performance compared to mainstream ViTs and CNNs. We believe that our work offers valuable insight into the field of artificial intelligence and will inspire the development of more innovative CNNs in the 2020s. The code is publicly available at https://github.com/UnicomAI/KAConvNet.

[308] H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading

Chandravardhan Singh Raghaw, Anushka Parwal, Shahid Shafi Dar, Prajakta Darade, Nagendra Kumar

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Knee osteoarthritis (KOA) is a degenerative joint disease that can lead to chronic pain, reduced mobility, and long-term disability. Automated severity grading from knee radiographs can support early assessment, but current methods heavily depend on large labeled datasets and remain sensitive to class imbalance, noisy samples, and variability in clinical annotations. To alleviate these limitations, we propose a Hierarchical fusion of Semi-Supervised framework with Self-Supervision (H-SemiS) for KOA severity grading in knee X-ray samples using limited annotated data. Rather than treating severity grading as a flat multi-class problem, H-SemiS decomposes the task into a sequence of binary sub-tasks within a semi-supervised teacher-student architecture, directly mitigating the impact of class imbalance. To further enhance feature learning from unlabeled data, the framework integrates an adversarial self-supervised reconstruction module that encourages the network to capture robust anatomical structures. In parallel, a teacher-student design with quantum-inspired feature mixing improves discrimination boundaries between adjacent grades when pseudo-labels are noisy. We comprehensively evaluate H-SemiS on two challenging multi-class datasets and assess its generalizability on two binary-class datasets. Our experimental results demonstrate the superiority of the proposed H-SemiS framework across multiple evaluation metrics, consistently outperforming several competing baselines and state-of-the-art methods. The code is publicly available at https://github.com/chandravardhan-singh-raghaw/H-SemiS.

[309] Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

Sanghoon Lee, Geon Lee, Hyekang Park, Bumsub Ham

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across hierarchical semantic levels (class, super- and sub-category). We also present LoCLIP, a parameter-efficient adaptation of CLIP that incorporates an objectness token to mitigate base class bias problem of RPNs and provide reliable objectness estimations for novel object classes. Extensive experiments on standard OVD benchmarks, including COCO and LVIS, demonstrate that our approach clearly sets a new state of the art, validating the effectiveness of our approach. Project site: https://cvlab.yonsei.ac.kr/projects/HCC

[310] EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs

He Hu, Tengjin Weng, Zebang Cheng, Yu Wang, Jiachen Luo, Björn Schuller, Zheng Lian, Laizhong Cui

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.

[311] Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis

Sisipho Hamlomo, Marcellin Atemkeng, Habte Tadesse Likassa, Blaise Ravelo, Thierry Bouwmans, Sébastien Lalléchère, Antoine Vacavant, Ding-Geng Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Convolutional neural networks (CNNs) have become increasingly difficult to deploy in resource-constrained environments due to their large memory and computational requirements. Although low-rank compression methods can reduce this burden, most existing approaches compress spatial and channel redundancy independently and therefore do not fully exploit the localised structure within convolutional feature maps. This paper proposes a hierarchical spatio-channel low-rank compression framework for CNNs that exploits redundancy across spatial regions and channel activations. Unlike conventional methods, which apply a uniform decomposition across an entire layer, the proposed approach first partitions feature maps into spatial regions, then groups channels according to their co-activation patterns within each region, and finally applies rank-adaptive SVD to each resulting spatio-channel cluster. The method is evaluated on an AlexNet-based brain tumour MRI classification model and compared with Global SVD and Tucker decomposition under (3\times) and (6\times) compression budgets. Our method outperforms both baselines, reducing FLOPs from (8.21,\mathrm{G}) to (1.55,\mathrm{G}) ((81.1%) reduction), achieving a (1.38\times) inference speed-up, and increasing classification accuracy from (87.76%) to (89.80%). The method also improves the macro (F_1)-score and performance on challenging classes such as meningioma. A hyper-parameter trade-off analysis demonstrates that the framework provides Pareto-optimal configurations, enabling control over the balance between compression and predictive performance. Moderate clustering with adaptive rank selection yields strong results. Bootstrap standard errors are reported for all classification metrics.

[312] Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera

Zhe Wang, Qijin Song, Zihao Li, Jingyu Xiao, Weibang Bai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate 6-DoF pose estimation of objects is critical for robots to perform precise manipulation tasks. However, for dynamic object pose estimation, conventional camera-based approaches face several major challenges, such as motion blur, sensor noise, and low-light limitation. To address these issues, we employ event cameras, whose high dynamic range and low latency offer a promising solution. Furthermore, we propose a keypoint-based detection and tracking approach for dynamic object pose estimation. Firstly, a keypoint detection network is constructed to extract keypoints from the time surface generated by the event stream. Subsequently, the polarity and spatial coordinates of the events are leveraged, and the event density in the vicinity of each keypoint is utilized to achieve continuous keypoint tracking. Finally, a hash mapping is established between the 2D keypoints and the 3D model keypoints, and the EPnP algorithm is employed to estimate the 6-DoF pose. Experimental results demonstrate that, whether in simulated or real event environments, the proposed method outperforms the event-based state-of-the-art methods in terms of both accuracy and robustness.

[313] Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation

Sheng-Wei Chan, Xin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: High-performance semantic segmentation has achieved significant progress in recent years, often driven by increasingly large backbones and higher computational budgets. While effective, such approaches introduce substantial computational overhead and limit accessibility under constrained hardware settings. In this paper, we propose DGM-Net (Directional Geometric Mamba Network), an efficient architecture that improves modeling capability through structural design rather than increasing model capacity. We introduce Directional Geometric Mamba (G-Mamba), a linear-complexity O(N) operator as an alternative to conventional context modeling modules such as ASPP and PPM. To further enhance structural awareness in state space model (SSM)-based modeling, we design the DGM-Module, which extracts centripetal flow fields and topological skeletons to guide the scanning process and improve boundary preservation. Without relying on large-scale pretraining or heavy backbone scaling, DGM-Net achieves 80.8% mIoU within 28k iterations, 82.3% mIoU on Cityscapes test set, and 45.24% mIoU on ADE20K. In addition, the model maintains stable performance under constrained hardware settings (e.g., batch size of 2 on 8GB VRAM), highlighting its efficiency and practicality. These results demonstrate that incorporating geometric guidance into SSM-based architectures provides an effective and resource-efficient direction for semantic segmentation.

[314] Learn&Drop: Fast Learning of CNNs based on Layer Dropping

Giorgio Cruciata, Luca Cruciata, Liliana Lo Presti, Jan Van Gemert, Marco La Cascia

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper proposes a new method to improve the training efficiency of deep convolutional neural networks. During training, the method evaluates scores to measure how much each layer’s parameters change and whether the layer will continue learning or not. Based on these scores, the network is scaled down such that the number of parameters to be learned is reduced, yielding a speed up in training. Unlike state-of-the-art methods that try to compress the network to be used in the inference phase or to limit the number of operations performed in the backpropagation phase, the proposed method is novel in that it focuses on reducing the number of operations performed by the network in the forward propagation during training. The proposed training strategy has been validated on two widely used architecture families: VGG and ResNet. Experiments on MNIST, CIFAR-10 and Imagenette show that, with the proposed method, the training time of the models is more than halved without significantly impacting accuracy. The FLOPs reduction in the forward propagation during training ranges from 17.83% for VGG-11 to 83.74% for ResNet-152. These results demonstrate the effectiveness of the proposed technique in speeding up learning of CNNs. The technique will be especially useful in applications where fine-tuning or online training of convolutional models is required, for instance because data arrive sequentially.

[315] PushupBench: Your VLM is not good at counting pushups

Shengzhi Li, Jiarun Chen, Karun Sharma, Jiaqi Su, Shichao Pei

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large vision-language models (VLMs) can recognize \textit{what} happens in video but fail to count \textit{how many} times. We introduce \textbf{PushupBench}, 446 long-form clips (avg. 36.7s) for evaluating repetition counting. The best frontier model achieves 42.1% exact accuracy; open-source 4B models score $\sim$6%, matching supervised baselines. We show that accuracy alone misleads – weaker models exploit the modal count rather than reason temporally. Fine-tuning on counting with 1k samples transfers to general video understanding: MVBench (+2.15), PerceptionTest (+1.88), TVBench (+4.54), suggesting counting is a proxy for broader temporal reasoning.PushupBench incorporated in \texttt{lmms-eval} (https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/1262) and hosted on (pushupbench.com/)

[316] A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis

Md. Afzalur Rahaman, Tahmid Rahman

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Most two-stream action recognition networks apply the same convolutional backbone to both RGB and optical flow streams, ignoring the fact that the two modalities have fundamentally different structural properties. Optical flow captures fine-grained motion patterns, while RGB frames carry rich appearance and scene context - treating them identically discards this distinction. We propose DualStreamHybrid, a heterogeneous two-stream architecture that assigns each stream a backbone suited to its input: a pretrained ViT-Tiny/16 for RGB frames, and a MobileNetV2 trained from scratch on a 20-channel stacked optical flow representation. A learned projection layer maps the two differently-sized feature vectors to a common dimensionality before fusion, enabling the two streams to interact without forcing architectural symmetry. We design five fusion strategies within a unified framework - late fusion, concatenation, cross-attention, weighted fusion, and gated fusion - and evaluate them on UCF11 (1,600 videos, 11 classes) and UCF50 (6,681 videos, 50 classes) to study how fusion behaviour scales with dataset size. On UCF11, cross-attention achieves 98.12% test accuracy, outperforming the RGB-only ViT-Tiny baseline of 95.94%, which suggests that explicit inter-modal attention is particularly effective on smaller, less complex datasets. On UCF50, weighted fusion reaches 96.86% and proves the most consistent strategy across both benchmarks. The learned stream weights reveal an interesting pattern: UCF11 sees near-equal modality contribution (RGB: 0.507, flow: 0.493), while UCF50 favours the RGB stream slightly more (RGB: 0.554, flow: 0.446) - arguably reflecting the larger and more visually diverse action space. Taken together, these results suggest that even a lightweight motion stream meaningfully complements a strong appearance encoder, and that the optimal fusion strategy depends on dataset scale.

[317] Enhanced Privacy and Communication Efficiency in Non-IID Federated Learning with Adaptive Quantization and Differential Privacy

Emre Ardıç, Yakup Genç

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Federated learning (FL) is a distributed machine learning method where multiple devices collaboratively train a model under the management of a central server without sharing underlying data. One of the key challenges of FL is the communication bottleneck caused by variations in connection speed and bandwidth across devices. Therefore, it is essential to reduce the size of transmitted data during training. Additionally, there is a potential risk of exposing sensitive information through the model or gradient analysis during training. To address both privacy and communication efficiency, we combine differential privacy (DP) and adaptive quantization methods. We use Laplacian-based DP to preserve privacy, which is relatively underexplored in FL and offers tighter privacy guarantees than Gaussian-based DP. We propose a simple and efficient global bit-length scheduler using round-based cosine annealing, along with a client-based scheduler that dynamically adapts based on client contribution estimated through dataset entropy analysis. We evaluate our approach through extensive experiments on CIFAR10, MNIST, and medical imaging datasets, using non-IID data distributions across varying client counts, bit-length schedulers, and privacy budgets. The results show that our adaptive quantization methods reduce total communicated data by up to 52.64% for MNIST, 45.06% for CIFAR10, and 31% to 37% for medical imaging datasets compared to 32-bit float training while maintaining competitive model accuracy and ensuring robust privacy through differential privacy.

[318] Sphere-Depth: A Benchmark for Depth Estimation Methods with Varying Spherical Camera Orientations

Soulayma Gazzeh, Giuseppe Mazzola, Liliana Lo Presti, Marco La Cascia

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reliable depth estimation from spherical images is crucial for 360° vision in robotic navigation and immersive scene understanding. However, the onboard spherical camera can experience unintentional pose variations in real-world robotic platforms that, along with the geometric distortions inherent in equirectangular projections, significantly impact the effectiveness of depth estimation. To study this issue, a novel public benchmark, called Sphere-Depth, is introduced to systematically evaluate the robustness of monocular depth estimation models from equirectangular images in a reproducible way. Camera pose perturbations are simulated and used to assess the performance of a popular perspective-based model, Depth Anything, and of spherical-aware models such as Depth Anywhere, ACDNet, Bifuse++, and SliceNet. Furthermore, to ensure meaningful evaluation across models, a depth calibration-based error protocol is proposed to convert predicted relative depth values into metric depth values using supervised learned scaling factors for each model. Experiments show that even models explicitly designed to process spherical images exhibit substantial performance degradation when variations in the camera pose are observed with respect to the canonical pose. The full benchmark, evaluation protocol, and dataset splits are made publicly available at: https://github.com/sgazzeh/Sphere_depth

[319] Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

Azmul A. Irfan, Nur Ahmad Khatim, Alfan Alfian Irfan, Achmad Zaki, Erike A. Suwarsono, Mansur M. Arief

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Radiographic grading of knee osteoarthritis (KOA) with the Kellgren-Lawrence (KL) system is limited by inter-reader variability and the opacity of current deep learning approaches, which predict KL grades directly from images without decomposing structural features. We present Knee-xRAI, a modular framework that independently quantifies the three cardinal radiographic features of KOA (joint space narrowing [JSN], osteophytes, and subchondral sclerosis) and integrates them into an explainable KL grade classification. The pipeline combines U-Net++ segmentation for contour-based JSN measurement, an SE-ResNet-50 network for per-site osteophyte grading (OARSI scale), and a hybrid texture-CNN classifier for binary sclerosis quantification. The resulting 50-dimensional structured feature vector feeds two complementary classification paths. An XGBoost path supports SHAP-based feature attribution. A ConvNeXt hybrid path combines the structured vector with a full-image encoder for enhanced predictive performance. Evaluated on 8,260 radiographs from an OAI-derived dataset, the JSN module achieved a Dice coefficient of 0.8909 and an mJSW intraclass correlation of 0.8674 against manual annotations. The ConvNeXt hybrid path reached a test quadratic weighted kappa (QWK) of 0.8436 and AUC of 0.9017. The transparent XGBoost path achieved a test QWK of 0.6294 with full feature-level audit capability. Ablation confirmed JSN as the dominant predictor (QWK = 0.6103 alone), with osteophyte features providing consistent incremental gain (+0.0183) and sclerosis contributing marginally. Inference-time ablation of Path B confirmed the structured pathway contributes materially beyond the image encoder, with QWK drops of 0.098 (feature zeroing) and 0.284 (feature-image permutation). Knee-xRAI explicitly quantifies all three KL-defining radiographic features within a single auditable pipeline.

[320] Resource-Constrained UAV-Based Weed Detection for Site-Specific Management on Edge Devices

Linyuan Wang, Haibo Yao, Te-Ming Tseng, Kelvin Betitame, Xin Sun, Hanbo Huang, Dong Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Weeds compete with crops for light, water, and nutrients, reducing yield and crop quality. Efficient weed detection is essential for site-specific weed management (SSWM). Although deep learning models have been deployed on UAV-based edge systems, a systematic understanding of how different model architectures perform under real-world resource constraints is still lacking. To address this gap, this study proposes a deployment-oriented framework for real-time UAV-based weed detection on resource-constrained edge platforms. The framework integrates UAV data acquisition, model development, and on-device inference, with a focus on balancing detection accuracy and computational efficiency. A diverse set of state-of-the-art object detection models is evaluated, including convolution-based YOLO models (v8-v12) and transformer-based RT-DETR models (v1-v2). Experiments on three edge devices (Jetson Orin Nano, Jetson AGX Xavier, and Jetson AGX Orin) demonstrate clear trade-offs between accuracy and inference latency across models and hardware configurations. Results show that high-capacity models achieve up to 86.9% mAP50 but suffer from high latency, limiting real-time deployment. In contrast, lightweight models achieve 66%-71% mAP50 with significantly lower latency, enabling real-time performance. Among all models, RT-DETRv2-R50-M achieves competitive accuracy (79% mAP50) with improved efficiency, while YOLOv10n provides the fastest inference speed. YOLOv11s and RT-DETRv2-R50-M offer the best balance between accuracy and speed, making them strong candidates for real-time UAV deployment.

[321] From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers

Jainum Sanghavi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.

[322] Leveraging Spatial Transcriptomics as Alternative to Manual Annotations for Deep Learning-Based Nuclei Analysis

Kazuya Nishimura, Ryoma Bise, Haruka Hirose, Yasuhiro Kojima

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Deep learning-based nuclei segmentation and classification in pathology images typically rely on large-scale pixel-level manual annotations, which are costly and difficult to obtain across diverse tissues and staining conditions. To address this limitation, we propose a framework that leverages spatial transcriptomics (ST) data as supervision for nuclei segmentation and classification. By incorporating cell-level ST data, we obtain gene expression profiles and corresponding nuclear masks from histopathological images. Gene expression profiles are converted into cell-type labels and used as training data for image-based classification. Because existing gene expression-based cell-type classification methods are not designed for image recognition, we introduce an image-oriented classification approach that bridges gene expression-based cell typing and image-based cell classification. To evaluate generalization, we conduct segmentation experiments on previously unseen organs and compare our method with conventional supervised models. Despite being trained on fewer organ types, our framework achieves higher segmentation accuracy, demonstrating strong transferability. Classification experiments further show consistent improvements over existing approaches.

[323] BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors

Dong Huo, Tristan Aumentado-Armstrong, Samrudhdhi B. Rangrej, Maitreya Suin, Angela Ning Ye, Zhiming Hu, Amanpreet Walia, Amirhossein Kazerouni, Konstantinos G. Derpanis, Iqbal Mohomed, Alex Levinshtein

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.

[324] Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model

Jingni Huang, Peter Bloodsworth

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction[1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics[4-5]. This paper investigates whether facial expression-derived emotion embeddings can provide auxiliary conditional signals for short-term pose prediction. To further evaluate multimodal conditionation in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive unfolding prediction using a recurrent sequence model based on a two-layer LSTM architecture. Experiments were conducted on two small-scale pose-emotion video datasets: controlled motion sequences with minimal facial expression changes and, natural emotion-driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly enhances the performance of emotion-driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in multimodal input, suggesting that facial expression embeddings act as auxiliary conditional signals rather than redundant features. In summary, these results indicate that incorporating facial expression-derived emotion embeddings into emotion-conditional short-term pose prediction based on a lightweight predictive world model architecture is a feasible approach.

[325] $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

Haosen Li, Wenshuo Chen, Shaofeng Liang, Lei Wang, Kaishen Yuan, Yutao Yue

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion models have achieved unprecedented success in text-aligned generation, largely driven by Classifier-Free Guidance (CFG). However, standard CFG operates strictly on instantaneous gradients, omitting the intrinsic curvature of the data manifold. Recent methods like Zigzag-sampling (Z-Sampling) explicitly traverse multi-step forward-backward trajectories to probe this curvature, significantly improving semantic alignment. Yet, these explicit traversals triple the Neural Function Evaluation (NFE) cost and introduce unconstrained truncation errors from off-manifold evaluations, causing cumulative drift from the true marginal distribution. In this paper, we theoretically demonstrate that the explicit zigzag sequence is topologically reducible. We propose Implicit Z-Sampling, rigorously proving that intermediate states can be algebraically annihilated via operator dualities, physically eliminating off-manifold approximation errors. To push sampling efficiency to its theoretical lower bound, we introduce $Z^2$-Sampling (Zero-cost Zigzag Sampling). Exploiting the Probability Flow ODE’s temporal coherence, $Z^2$-Sampling couples implicit algebraic collapse with a dynamically cached Temporal Semantic Surrogate. This restores the standard 2-NFE baseline without sacrificing semantic exploration. We formally prove via Backward Error Analysis that this discrete collapse inherently synthesizes a directional derivative curvature penalty. Finally, extensive evaluations demonstrate that $Z^2$-Sampling structurally shatters the performance-efficiency Pareto frontier. We validate its universal applicability across diverse architectures (U-Nets, DiTs) and modalities (image/video), establishing seamless orthogonality with advanced alignment frameworks (AYS, Diffusion-DPO).

[326] Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

Haosen Li, Wenshuo Chen, Lei Wang, Shaofeng Liang, Haozhe Jia, Yutao Yue

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Text-to-image diffusion models have achieved remarkable generative capabilities, yet accurately aligning complex textual prompts with synthesized layouts remains an ongoing challenge. In these models, the initial Gaussian noise acts as a critical structural seed dictating the macroscopic layout. Recent online optimization and search methods attempt to refine this noise to enhance text-image alignment. However, relying on unconstrained Euclidean gradient ascent mathematically inflates the latent norm and destroys the standard Gaussian prior, causing severe visual artifacts like color over-saturation. Furthermore, these methods suffer from inefficient semantic routing and easily fall into the ``reward hacking’’ trap of external proxy models. To address these intertwined bottlenecks, we propose Oracle Noise, a zero-shot framework reframing noise initialization as semantic-driven optimization strictly confined to a Riemannian hypersphere. Instead of relying on complex external parsers, we directly identify the most impactful structural words in the prompt to efficiently route optimization energy. By updating the noise strictly along a spherical path, we mathematically preserve the original Gaussian distribution. This geometric constraint eliminates norm inflation and unlocks aggressive step sizes for rapid convergence. Extensive experiments demonstrate that Oracle Noise significantly accelerates semantic alignment and achieves superior aesthetics without black-box models. It completely mitigates Euclidean-induced degradation, establishing state-of-the-art performance across human preference metrics (e.g., HPSv2, ImageReward), semantic alignment (CLIP Score), and sample diversity, all within a strict 2-second optimization budget.

[327] AusSmoke meets MultiNatSmoke: a fully-labelled diverse smoke segmentation dataset

Weihao Li, Hongjin Zhao, Gao Zhu, Ge-Peng Ji, Nicholas Wilson, Marta Yebra, Nick Barnes

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Wildfires are an escalating global concern due to the devastating impacts on the environment, economy, and human health, with notable incidents such as the 2019-2020 Australian bushfires and the 2025 California wildfires underscoring the severity of these events. AI-enabled camera-based smoke detection has emerged as a promising approach for the rapid detection of wildfires. However, existing wildfire smoke segmentation datasets that are used for training detection and segmentation models are limited in scale, geographically constrained, and often rely on synthetic imagery, which hinders effective training and generalization. To overcome these limitations, we present AusSmoke, a new smoke segmentation dataset collected from Australia to address the data scarcity in this region. Furthermore, we introduce a MultiNational geographically diverse and substantially larger fully-labelled benchmark, called MultiNatSmoke, that consolidates publicly available international datasets with the newly collected Australian imagery, expanding the scale by an order of magnitude over previous collections. Finally, we benchmark smoke segmentation models, demonstrating improved performance and enhanced generalization across diverse geographical contexts. The project is available at \href{https://github.com/henryzhao0615/MultiNatSmoke}{Github}.

[328] COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training

Zhuoqi Lyu, Qing Ke

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Optical chemical structure recognition (OCSR) translates molecular images into machine-readable representations like SMILES strings or molecular graphs, but remains challenging in real-world documents due to inexhaustible variations in chemical structures, shorthand conventions, and visual noise. Most existing deep-learning-based approaches rely on teacher forcing with token-level Maximum Likelihood Estimation (MLE). This training paradigm suffers from exposure bias, as models are trained under ground-truth prefixes but must condition on their own previous predictions during inference. Moreover, token-level MLE objectives hinder the optimization towards molecular-level evaluation criteria such as chemical validity and structural similarity. Here we introduce Minimum Risk Training (MRT) to OCSR and propose COMO (Closed-loop Optical Molecule recOgnition), a closed-loop framework that mitigates exposure bias by directly optimizing over molecule-level, non-differentiable objectives, by iteratively sampling and evaluating the model’s own predictions. Experiments on ten benchmarks including synthetic and real-world chemical diagrams from patent and scientific literature demonstrate that COMO substantially outperforms existing rule-based and learning-based methods with less training data. Ablation studies further show that MRT is architecture-agnostic, demonstrating its potential for broad application to end-to-end OCSR systems.

[329] Spatiotemporal Degradation-Aware 3D Gaussian Splatting for Realistic Underwater Scene Reconstruction

Shaohua Liu, Ning Gao, Zuoya Gu, Hongkun Dou, Yue Deng, Hongjue Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reconstructing realistic underwater scenes from underwater video remains a meaningful yet challenging task in the multimedia domain. The inherent spatiotemporal degradations in underwater imaging, including caustics, flickering, attenuation, and backscattering, frequently result in inaccurate geometry and appearance in existing 3D reconstruction methods. While a few recent works have explored underwater degradation-aware reconstruction, they often address either spatial or temporal degradation alone, falling short in more real-world underwater scenarios where both types of degradation occur. We propose MarineSTD-GS, a novel 3D Gaussian Splatting-based framework that explicitly models both temporal and spatial degradations for realistic underwater scene reconstruction. Specifically, we introduce two paired Gaussian primitives: Intrinsic Gaussians represent the true scene, while Degraded Gaussians render the degraded observations. The color of each Degraded Gaussian is physically derived from its paired Intrinsic Gaussian via a Spatiotemporal Degradation Modeling (SDM) module, enabling self-supervised disentanglement of realistic appearance from degraded images. To ensure stable training and accurate geometry, we further propose a Depth-Guided Geometry Loss and a Multi-Stage Optimization strategy. We also construct a simulated benchmark with diverse spatial and temporal degradations and ground-truth appearances for comprehensive evaluation. Experiments on both simulated and real-world datasets show that MarineSTD-GS robustly handles spatiotemporal degradations and outperforms existing methods in novel view synthesis with realistic, water-free scene appearances.

[330] PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics

Tianyidan Xie, Zhentao Huang, Mingjie Wang, Xin Huang, Jun Zhou, Minglun Gong, Zili Yi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.

Zehua Cheng, Wei Dai, Jiahao Sun

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-modal retrieval-augmented generation (MRAG) systems retrieve visual evidence from large image corpora to ground the responses of large multi-modal models, yet the retrieved images frequently contain human faces whose identities constitute sensitive personal information. Existing anonymization techniques that destroy the non-identity visual cues that downstream reasoning depends on or fail to provide principled privacy guarantees. We propose Identity-Decoupled MRAG, a framework that interposes a generative anonymization module between retrieval and generation. Our approach consists of three components: (i)a disentangled variational encoder that factorizes each face into an identity code and a spatially-structured attribute code, regularized by a mutual-information penalty and a gradient-based independence term; (ii)a manifold-aware rejection sampler that replaces the identity code with a synthetic one guaranteed to be both distinct from the original and realistic; and (iii)a conditional latent diffusion generator that synthesizes the anonymized face from the replacement identity and the preserved attributes, distilled into a latent consistency model for low-latency deployment. Privacy is enforced through a multi-oracle ensemble of face recognition models with a hinge-based loss that halts optimization once identity similarity drops below the impostor-regime threshold.

[332] Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation

Simone Mosco, Daniel Fusaro, Alberto Pretto

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects is crucial in real-world environments, as done in Anomaly Segmentation. However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To cover this lack, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real-synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out-of-distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state-of-the-art and competitive results on the existing real-world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido-page/.

[333] Comparative Study of Weighted and Coupled Second- and Fourth-Order PDEs for Image Despeckling in Grayscale, Color, SAR, and Ultrasound

Manish Kumar, Rajendra K. Ray

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Partial Differential Equation (PDE)-based approaches have gained significant attention in image despeckling due to their strong capability to preserve structural details while suppressing noise. However, conventional second-order PDE models tend to generate blocky artifacts, whereas higher-order models often introduce speckle patterns. To resolve it, this paper proposes and comparatively analyzes two advanced PDE-based frameworks designed for speckle noise suppression while preserving the fine edges. The first model introduces a novel weighted formulation that combines second and fourth-order PDEs through a weighting parameter. The second-order diffusion coefficient employs grayscale and gradient-based indicators, while the fourth-order term is guided solely by a Laplacian-based indicator. The second model constructs a coupled PDE framework, where independent fourth and second-order components are explicitly solved in an iterative manner. In this coupled structure, each diffusion coefficient is defined separately to enhance adaptability in varying image regions. Both models are implemented using the explicit finite difference method. The proposed techniques are extensively evaluated on a variety of datasets, including standard grayscale, color, Synthetic Aperture Radar (SAR), and ultrasound images. Comparative experiments with the existing Telegraph Diffusion Model (TDM) and Fourth-Order Telegraph Diffusion Model (TDFM) demonstrate the superiority of the proposed approaches in reducing speckle noise while effectively preserving fine image structures and edges. Quantitative evaluations using PSNR, SSIM and Speckle Index metrics confirm that the proposed models produce higher image quality and enhanced visual perception. Overall, the presented PDE-based formulations provide a reliable and efficient framework for image despeckling in both natural and medical imaging.

[334] A Synergistic CNN-Transformer Network with Pooling Attention Fusion for Hyperspectral Image Classification

Peng Chen, Wenxuan He, Feng Qian, Guangyao Shi, Jingwen Yan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In the hyperspectral image (HSI) classification task, each pixel is categorized into a specific land-cover category or material. Convolutional neural networks (CNNs) and transformers have been widely used to extract local and non-local features in HSI classification. Recent works have utilized a multi-scale vision transformer (ViT) to enhance spectral feature capture and yield promising results. However, most existing methods still face challenges in the effective joint use of spatial-spectral information and in preserving information across layers during the propagation process. To address these issues, we propose a synergistic CNN-Transformer network with pooling attention fusion for HSI classification, which collaboratively utilizes CNNs and ViT to process spatial and spectral features separately. Specifically, we propose a Twin-Branch Feature Extraction (TBFE) module, which employs 3D and 2D convolution in parallel to comprehensively extract spectral and spatial features from HSI. A hybrid pooling attention (HPA) module is designed to aggregate spatial attention. Moreover, a cascade transformer encoder is employed for global spectral feature extraction, and a simple yet efficient cross-layer feature fusion (CFF) module is designed to reduce the loss of crucial information in the previous network layers. Extensive experiments are conducted on several representative datasets to demonstrate the superior performance of our proposed method compared to the state-of-the-art works. Code is available at https://github.com/chenpeng052/SCT-Net.git.

[335] Discriminator-Guided Adaptive Diffusion for Source-Free Test-Time Adaptation under Image Corruptions

Francesco Olivato, Cigdem Beyan, Vittorio Murino

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this work, we study Source-Free Unsupervised Domain Adaptation under corruption-induced domain shifts, where performance degradation is caused by natural image corruptions that go beyond additive noise, including blur, weather effects, and digital artifacts. We propose a diffusion-based, input-level adaptation framework that operates entirely at test time and keeps all source-trained models frozen, explicitly targeting robustness to corrupted target inputs. Our method leverages a source-trained diffusion model as a generative prior and introduces a discriminator-guided adaptive diffusion strategy that dynamically controls the amount of perturbation applied to each test sample. Rather than relying on a fixed diffusion depth, the discriminator determines, on a per-image basis, when sufficient forward diffusion has been applied to suppress corruption-specific artifacts, with each corruption type effectively defining a distinct target domain. This adaptive stopping mechanism applies only the necessary amount of noise to remove domainspecific corruption while preserving class-discriminative structure. The reverse diffusion process then reconstructs a source-aligned image, optionally stabilized through structural guidance, which is classified using a frozen source-trained classifier. We evaluate the proposed approach across a broad spectrum of corruption-induced target domains, covering 15 diverse corruption types, and demonstrate more balanced robustness with competitive or improved performance across non-noise corruptions. Additional analyses reveal how the adaptive diffusion schedule responds to different corruption characteristics, highlighting the practicality, generality, and robustness of the proposed framework. The code is publicly available at https://github.com/fmolivato/dgadiffusion/.

[336] VDLF-Net: Variational Feature Fusion for Adaptive and Few-Shot Visual Learning

Jiawei Yan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper introduces VDLF-Net, which attaches a compact VAE to a multi-scale CNN backbone. Latent vectors and softmax-gate support the backbone feature maps, while $\ell_2$-normalized embeddings from the gated maps contribute toward supervised classification or episodic few-shot prediction. Under standard CIFAR-100 and Mini-ImageNet protocols, VDLF-Net demonstrates an improved performance over ResNet-50 Enhanced, VGG-16, Prototypical Networks, and Matching Networks. Extensive ablations show that removing the fine-resolution scale has the greatest impact on VDLF-Net’s performance. At the same time, KL and reconstruction at the chosen $α$ pose a minor performance reduction, demonstrating that performance gains over classical episodic baselines mainly originate from the full VDLF-Net architecture and training strategy.

[337] RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing

Pritesh Jha

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh-2711/RaV-IDP for experimentation and use.

[338] Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation

Navid Aslankhani Khameneh, Marco Carletti, Cigdem Beyan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Robust in-bed human pose estimation under blanket occlusion remains challenging due to the scarcity of reliable labeled training data for heavily covered poses. Existing approaches rely on multi-modal sensing or image-to-image translation frameworks that remain conditioned on visible source imagery, limiting scalability and pose diversity. In this work, we reformulate occlusion-aware augmentation as a geometry-conditioned generative modeling task. We conduct a systematic comparison of deterministic masking, unpaired translation, paired diffusion-based translation, and a proposed pose-conditioned Latent Diffusion Model (Pose-LDM). Unlike image-guided methods, Pose-LDM synthesizes blanket-covered images directly from skeletal keypoints, eliminating dependence on paired supervision and pixel-level source-image conditioning while enabling generation from arbitrary pose inputs. All augmentation strategies are evaluated through their impact on downstream pose estimation under a fixed backbone. Pose- LDM achieves the highest strict localization accuracy under severe occlusion while maintaining overall detection performance comparable to paired diffusion models, approaching the performance of fully supervised training. These results demonstrate that geometry-conditioned diffusion provides an effective and supervision-efficient pathway toward occlusion-robust inbed pose estimation without modifying the sensing pipeline. The code is available at: github.com/navidTerraNova/ GeoDiffPose.

[339] ResAF-Net: An Anchor-Free Attention-Based Network for Tree Detection and Agricultural Mapping in Palestine

Rabee Al-Qasem

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reliable agricultural data is essential for food security, land-use planning, and economic resilience, yet in Palestine, such data remains difficult to collect at scale because of fragmented landscapes, limited field access, and restrictions on aerial monitoring. This paper presents ResAF-Net, a satellite-based tree detection framework designed for large-scale agricultural monitoring in resource-constrained settings. The proposed architecture combines a ResNet-50 encoder, Atrous Spatial Pyramid Pooling (ASPP), a feature-fusion stage, a multi-head self-attention refinement module, and an anchor-free FCOS detection head to improve tree localization in dense and heterogeneous scenes. Trained on the MillionTrees benchmark, the model achieved 82% Recall, 63.03% mAP@0.50, and 35.47% mAP@0.50:0.95 on the validation split, indicating strong sensitivity to tree presence while maintaining competitive localization quality. Beyond benchmark evaluation, we implemented the model within a web-based GIS application integrated with Palestinian cadastral data from GeoMolg, enabling tree analysis at scene, parcel, and community levels. This deployment demonstrates the practical feasibility of AI-assisted agricultural inventorying in Palestine. It provides a foundation for data-driven monitoring, reporting, and future species-level analysis of Mediterranean tree crops.

[340] BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

Guoxi Huang, Ruirui Lin, Yini Li, David R. Bull, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available on line at https://github.com/russellllaputa/BVI-Mamba.

[341] SolarFCD: A Large-Scale Dataset and Benchmark for Solar Fault Classification in Photovoltaic Systems

Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Arooj Zaib, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The increasing global deployment of solar photovoltaic (PV) systems needs robust, scalable, and automated inspection technologies capable of detecting a wide range of panel flaws under a variety of operating situations. The lack of large-scale, multi-modal, publicly available annotated datasets is a major obstacle preventing advancement in this field. We introduce SolarFCD, an extensive dataset of solar panel defects created by methodically combining and reconciling three publicly accessible datasets covering two imaging modalities: RGB/Drone images and Thermal Infrared. The dataset consist of 4,435 images arranged under four unified defect classes such as: healthy images, Surface Obstruction, structural fault, and electrical fault. The dataset was divided into training, validation, and test splits at an 80:10:10 ratio through methodical label mapping, near-duplicate removal, and targeted augmentation of minority classes. Sixteen classification architectures from five design families were trained and assessed on the dataset to provide repeatable benchmark baselines. With an accuracy of 86.68%, precision of 88.65%, recall of 88.62%, and F1-score of 88.17%, ResNet101V2 performed the best overall. Per-class results showed balanced detection across all four defect categories within a narrow performance band of less than 1.2 percentage points. To promote open and repeatable research in automated PV inspection and solar energy operations and maintenance, the dataset, annotation files, and baseline code are made openly available.

[342] HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA

Francesco Dibitonto, Cigdem Beyan, Vittorio Murino

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in representation learning have shown that hyperbolic geometry can offer a more expressive alternative to the Euclidean embeddings used in CLIP models, capturing hierarchical structures and leading to better-organized representations. However, current hyperbolic CLIP variants are trained entirely from scratch, which is computationally expensive and resource-intensive. In this work, we propose HAC (Hyperbolic Adaptation of CLIP), a parameter-efficient framework that enables pretrained CLIP models to transition into hyperbolic space via lightweight fine-tuning. We apply HAC to Visual Question Answering (VQA), where models must interpret visual elements and align them with textual queries. Notably, HAC’s training is performed on a dataset with no overlap with any VQA benchmark, resulting in a strict zero-shot evaluation paradigm that underscores HAC’s task-agnostic adaptability. We evaluate HAC across a diverse suite of VQA benchmarks spanning General, Reasoning, and OCR categories. Both HAC-S (small) and HAC-B (medium) consistently surpass Euclidean baselines and prior hyperbolic approaches, with HAC-B delivering up to a +1.9 point average improvement over CLIP-B on reasoning-intensive tasks. Our code is available at https://github.com/fdibiton/HAC

[343] Deploy DINO with Many-to-Many Association

Haodong Jiang, Mingzhe Li, Junfeng Wu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, which requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of otherwise intractable likelihood calculation,and inspires us to propose a faster and finer-grained robust mechanism, termed as Harmonic Consensus Maximization (HCM). Take camera pose estimation as an exemplifying downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when mated with m-to-m association and the HCM mechanism.

[344] Learning to Decipher from Pixels – A Case Study of Copiale

Lei Kang, Giuseppe De Gregorio, Raphaela Heil, Alicia Fornés, Beáta Megyesi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Historical encrypted manuscripts require both paleographic interpretation of cipher symbols and cryptanalytic recovery of plaintext. Most existing computational workflows rely on a transcription-first paradigm, in which handwritten symbols are transcribed prior to decipherment. This intermediate step is labor-intensive, error-prone, and not always aligned with the goal of direct plaintext recovery. We propose an end-to-end, transcription-free approach that directly maps handwritten cipher images to plaintext. Using the Copiale cipher as a case study, we introduce the first text-line-level dataset pairing cipher images with German plaintext. We show that pretraining on generic handwriting data followed by cipher-specific fine-tuning substantially improves decipherment accuracy. Our results demonstrate that transcription-free image-to-plaintext decipherment is both feasible and effective for historical substitution ciphers, offering a simplified and scalable alternative to traditional pipelines. https://github.com/leitro/Decipher-from-Pixels-Copiale

[345] Reading in the Dark: Low-light Scene Text Recognition

Xuanshuo Fu, Lei Kang, Ernest Valveny, Dimosthenis Karatzas, Javier Vazquez-Corral

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate text recognition in low-light environments is essential for intelligent systems in applications ranging from autonomous vehicles to smart surveillance. However, challenges such as poor illumination and noise interference remain underexplored. To address this gap, we introduce LSTR, a large-scale Low-light Scene Text Recognition dataset comprising 11,273 low-light images generated from well-lit datasets (ICDAR2015, IIIT5K, and WordArt), along with ESTR, which includes 60 real nighttime street-scene images in English and Spanish for exclusive evaluation. We explore two solution strategies: (1) employing Optical Character Recognition (OCR) models with fine-tuning and LoRA-based fine-tuning and (2) a joint training strategy that integrates a low-light image enhancement (LLIE) module with an OCR model. In particular, we propose a novel re-render LLIE (RLLIE) module, which demonstrates improved performance on real-world data. Through extensive experimentation, we analyze various training strategies and address a key research question: \emph{How bright is bright enough for effective scene text recognition?} Our results indicate that standalone LLIE or OCR models perform inadequately under low-light conditions, highlighting the advantages of specialized, jointly trained text-centric approaches. Additionally, we provide a comprehensive benchmark to support future research in robust low-light scene text recognition. https://huggingface.co/datasets/lumimusta/Low-light_Scene_Text_Dataset.

[346] Do Protective Perturbations Really Protect Portrait Privacy under Real-world Image Transformations?

Ruiqing Sun, Xingshan Yao, Zhijing Wu, Tian Lan, Chenhao Cui, Huiyang Zhao, Jialing Shi, Chen Yang, Xianling Mao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Proactive defense methods protect portrait images from unauthorized editing or talking face generation (TFG) by introducing pixel-level protective perturbations, and have already attracted increasing attention for privacy protection. In real-world scenarios, images inevitably undergo various transformations during cross-device display and dissemination–such as scale transformations and color compression–that directly alter pixel values. However, it remains unclear whether such pixel-level modifications affect the effectiveness of existing proactive defense methods that rely on pixel-level perturbations. To solve this problem, we conduct a systematic evaluation of representative proactive defenses under image transformation. The evaluated methods are selected to span different generation architectures such as diffusion and GAN-based models, as well as defense scopes covering both portrait and natural images, and are assessed using both qualitative and quantitative metrics for subjective and objective comparison. Experimental results indicate that defense methods based on pixel-level perturbations struggle to withstand common image transformations, posing a risk of defense failure in real-world applications. To further highlight this risk, we propose a simple yet effective purification framework by leveraging the vulnerabilities induced by real-world image transformations. Experimental results demonstrate that the proposed method can efficiently remove protective perturbations with low computational cost, highlighting previously overlooked risks to the research community.

[347] A Pose-only Geometric Constraint for Multi-Camera Pose Adjustment

Shunkun Liang, Banglei Guan, Bin Li, Qifeng Yu, Yang Shang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-camera systems offer rich observation capabilities for visual navigation and 3D scene reconstruction; however, the resulting feature redundancy often compromises computational efficiency. This challenge is particularly pronounced during bundle adjustment, where the non-linear optimization of both system poses and scene points incurs substantial computational overhead. To address this challenge, this paper introduces a pose-only geometric constraint for multi-camera systems and proposes a corresponding pose adjustment algorithm. Specifically, we use generalized camera model to establish a unified representation of the multi-camera system. Building upon this model, we formulate the multi-camera pose-only constraint, which implicitly represents a 3D scene point using two base observations and their associated poses, thereby achieving a pose-only representation of the projection geometry. Subsequently, we introduce a multi-camera pose adjustment algorithm that eliminates 3D points from the parameter space, thereby achieving efficient and focused pose optimization. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms baseline bundle adjustment methods in computational efficiency, while maintaining or even improving pose estimation accuracy.

[348] Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models

Adam Kukučka, Ondřej Fabián, Vít Musil, Tomáš Brázdil

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.

Xuefen Liu, Xinquan Yang, Mianjie Zheng, Kun Tang, Xuguang Li, Xiaoqi Guo, Linlin Shen, He Meng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As dental caries appear as subtle, low-contrast lesions in intraoral imaging, existing deep learning models face significant challenges in the early detection of caries. While recent Transformer-based detectors have shown promising results in natural images, they often fail to capture the domain-specific anatomical priors crucial for dental caries detection. In this paper, we propose Caries-DETR, a specialized Transformer framework for caries detection in intraoral images. A Tooth Structure-aware Query Initialization (TSQI) is designed, leveraging large-scale intraoral photograph pre-training and a structure perception branch (SPB) to integrate high-frequency structural priors, guiding the model to focus on anatomically significant lesion areas. Furthermore, we design a Lesion-aware Dynamic Loss Refinement (LDLR) to implement quality-driven hard mining through adaptive loss reweighting based on lesion size, anatomical relevance, and prediction quality, optimizing detection for subtle lesions. Extensive experiments on two public datasets (i.e., AlphaDent and DentalAI) demonstrate that Caries-DETR achieves a state-of-the-art performance compared to existing methods and exhibits good generalization and robustness. Code and data at https://github.com/XuefenLiu-SZU/Caries-DETR}{https://github.com/XuefenLiu-SZU/Caries-DETR.

[350] Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference

Xiaowei Mao, Bowen Sui, Weijie Zhang, Yawen Yang, Shengnan Guo, Shilong Zhao, Jiaqi Lin, Tingrui Wu, Youfang Lin, Huaiyu Wa

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.

[351] ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

Yanping Wu, Meiting Dang, Lin Wu, Edmond S. L. Ho, Zhenghua Chen, Chongfeng Wei

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decisions and actions by modeling temporal dynamics, social interactions, and environmental context. However, existing studies remain constrained by oversimplified multi-agent interaction patterns, opaque reasoning logic, and a lack of global consistency in behavioral predictions, which compromise both robustness and interpretability. In this work, we propose ESIA (Energy-based Spatiotemporal Interaction-Aware framework), a novel Conditional Random Field (CRF)-based paradigm. We cast the intention prediction task as a structured prediction problem over a unified graph-based representation, treating pedestrians and the environment as spatiotemporal nodes. To characterize their distinct roles, we assign unary potentials to nodes to capture individual intentions, and pairwise potentials to edges to encode social and environmental interactions. These potentials are integrated into a unified global energy function to ensure scene-level consistency across behavioral predictions. To further constrain inference without ground-truth supervision, we introduce structural consistency terms to penalize logical contradictions. This optimization is efficiently solved via a novel Unary-Seeded Simulated Annealing (U-SSA) algorithm, which leverages high-confidence unary priors to rapidly converge to a high-quality solution. Extensive experiments on standard benchmarks demonstrate that ESIA achieves state-of-the-art performance with improved interpretability over existing methods.

[352] DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection

Yanqi Wu, Xinhua Lu, Runhe Lai, Qichao Chen, Jia-Xin Zhuang, Wei-Shi Zheng, Ruixuan Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent studies show that using potential out-of-distribution (OOD) labels from large corpora as auxiliary information can improve OOD detection in vision-language models (VLMs). However, these methods often fail when real-world OOD samples fall outside the predefined OOD label set. To address this limitation, we propose DynProto, a novel approach that learns OOD prototypes dynamically during testing using only in-distribution (ID) information. DynProto is inspired by a key observation: OOD samples predicted as the same ID class tend to cluster in the feature space. With this insight, we leverage easy-to-detect OOD samples as ``anchors’’ to find their harder-to-detect, similar counterparts. To this end, DynProto introduces two modules: \textbf{Coarse OOD Pattern Capturing Module} caches OOD patterns that are easily confused with each ID class during testing, and \textbf{Fine-grained OOD Pattern Refinement Module} subsequently clusters these patterns within each cache and aggregates them into representative OOD prototypes. By measuring similarity to ID and dynamic OOD prototypes, DynProto enables accurate OOD detection. DynProto significantly outperforms prior methods across multiple benchmarks. On ImageNet OOD benchmark, DynProto reduces FPR95 by 11.60% and improves AUROC by 4.70%. Moreover, the framework is architecture-agnostic and can be integrated into various backbones.

[353] Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

Honghao Cai, Xiangyuan Wang, Yunhao Bai, Haohua Chen, Tianze Zhou, Runqi Wang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large diffusion transformers (DiTs) follow global editing instructions well but consistently leak local edits into unrelated regions, because joint-attention architectures offer no explicit channel telling the network where to apply the edit. We introduce REDEdit, a co-trained, instruction- and region-aware adapter framework that retrofits a frozen DiT into a precise local editor without modifying its backbone weights. A lightweight Block Adapter at every transformer block injects a structured condition stream that factorizes what to edit (instruction semantics) from where to edit (spatial mask); a learned SpatialGate routes the adapter signal selectively into the edit region while keeping the rest of the image near-identical to the source; and a Region-Aware Loss focuses the training objective on the changing pixels. Because these components make the backbone’s internal representation mask-aware end-to-end, a thin MaskPredictor head trained jointly with the editor can ground the edit region directly from the instruction and source image eliminating any user-mask requirement at deployment. We evaluate on two complementary benchmarks: MagicBrush (paired ground-truth targets) to measure pixel-level preservation and edit accuracy, and Emu-Edit Test (no ground-truth images, 9 diverse edit categories) to stress-test instruction following and generalization across edit types. On both, REDEdit achieves state-of-the-art results, simultaneously outperforming mask-free and oracle-mask baselines. A seven-variant ablation cleanly isolates the contribution of each component.

[354] From Noisy Historical Maps to Time-Series Oil Palm Mapping Without Annotation in Malaysia and Indonesia (2020-2024)

Nuttaset Kuapanich, Juepeng Zheng, Bohan Shi, Jiaying Liu, Jiayin Jiang, Jiatao Huang, Shenghan Tan, Qingmei Li, Haohuan Fu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate monitoring of oil palm plantations is critical for balancing economic development with environmental conservation in Southeast Asia. However, existing plantation maps often suffer from low spatial resolution and a lack of recent temporal coverage, impeding effective surveillance of rapid land-use changes. In this study, we propose a deep learning framework to generate 10-meter resolution oil palm plantation maps for Indonesia and Malaysia from 2020 to 2024, utilizing Sentinel-2 imagery without requiring new manual annotations. To address the resolution mismatch between coarse 100-meter historical labels and 10-meter imagery, we employ a U-Net architecture optimized with Determinant-based Mutual Information (DMI). This approach effectively mitigates the influence of label noise. We validated our method against 2,058 manually verified points, achieving overall accuracies of 70.64%, 63.53%, and 60.06% for the years 2020, 2022, and 2024, respectively. Our comprehensive analysis reveals that oil palm coverage in the region peaked in 2022 before experiencing a decline in 2024. Furthermore, land cover transition analysis highlights a concerning trajectory of plantation expansion into flooded vegetation areas, despite a general stabilization in rotations with other crop types. These high-resolution maps provide essential data for monitoring sustainability commitments and deforestation dynamics in the region, and the generated datasets are made publicly available at https://doi.org/10.5281/zenodo.17768444.

[355] ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce \bench{}, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.

[356] ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction

Zichun Guo, Yuling Shi, Wenhao Zeng, Chao Hu, Haotian Lin, Terry Yue Zhuo, Jiawei Chen, Xiaodong Gu, Wenping Ma

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: The method is effective on intact documents; however, once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.

[357] MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

Jui-Cheng Chiu, Yu-Chao Wang, Shengyang Luo, Tongyan Wang, Qi Yang, Nabin Khanal, Yingjie Victor Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Appreciating multi-figure paintings requires understanding how characters relate through subtle cues like gaze alignment, gesture, and spatial arrangement. We present MIRAGE, an evidence-centric framework designed to scaffold the exploration of these “micro-interactions” in multi-figure artworks. While such cues are essential for deep narrative appreciation, they are often distributed across complex scenes and difficult for viewers to systematically identify. Existing vision-language models (VLMs) frequently fail to provide reliable assistance, offering ungrounded interpretations that lack traceable visual evidence. MIRAGE addresses this by constructing a structured intermediate representation capturing identities, pose cues, and gaze hypotheses. However, the challenge extends beyond extracting these cues to coordinating them during interpretation. Without an explicit mechanism to organize and reconcile relational evidence, models often collapse multiple interaction hypotheses into a single unstable or weakly grounded narrative, even when low-level signals are available. This representation allows users to verify how high-level interpretations are anchored in low-level visual facts. By separating spatial grounding from narrative generation, MIRAGE enables users to inspect and reason about figure-to-figure relationships through a verifiable evidence layer. We evaluate MIRAGE against painting-only VLM baselines using a blind assessment protocol. Results show that MIRAGE significantly improves identity consistency, reduces relational hallucinations, and increases the coverage of subtle interactions. These findings suggest that structured grounding can serve as a critical interaction control layer, providing the necessary scaffolding for a more reliable, transparent, and human-led understanding of complex visual narratives.

[358] MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the “copy-paste” dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.

Yasin Shokrollahi, Karina B. Pinao Gonzales, Elizve N. Barrientos Toro, Paul Acosta, Patient Mosaic Team, Pingjun Chen, Yinyin Yuan, Xiaoxi Pan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate whole-cell and nuclear segmentation is essential for precision pathology and spatial omics, yet routine hematoxylin and eosin (H&E) staining provides limited cytoplasmic contrast, restricting analyses to nuclei. Multiplex immunofluorescence (mIF) facilitates precise whole-cell delineation but remains constrained by cost and accessibility. We introduce VitaminP, a cross-modal learning framework enabling whole cell segmentation from H&E images. By learning from paired H&E-mIF data, VitaminP transfers molecular boundary information from mIF to overcome cytoplasmic contrast in H&E, establishing cross-modal supervision as a general strategy for recovering missing biological structure. We train VitaminP on 14 public datasets covering 34 cancer types and over 7 million instances, integrating publicly available labels with extensive annotations generated in this study, forming one of the largest resources for segmentation. VitaminP outperforms four state-of-the-art methods and generalizes to unseen datasets, including an in-house dataset spanning 24 rare cancer types. We further developed VitaminPScope, an open-source platform providing an interface for scalable inference and enabling broad adoption.

[360] Bringing a Personal Point of View: Evaluating Dynamic 3D Gaussian Splatting for Egocentric Scene Reconstruction

Jan Warchocki, Xi Wang, Jonas Kulhanek, Jan van Gemert

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Egocentric video provides a unique view into human perception and interaction, with growing relevance for augmented reality, robotics, and assistive technologies. However, rapid camera motion and complex scene dynamics pose major challenges for 3D reconstruction from this perspective. While 3D Gaussian Splatting (3DGS) has become a state-of-the-art method for efficient, high-quality novel view synthesis, variants, that focus on reconstructing dynamic scenes from monocular video are rarely evaluated on egocentric video. It remains unclear whether existing models generalize to this setting or if egocentric-specific solutions are needed. In this work, we evaluate dynamic monocular 3DGS models on egocentric and exocentric video using paired ego-exo recordings from the EgoExo4D dataset. We find that reconstruction quality is consistently lower in egocentric views. Analysis reveals that the difference in reconstruction quality, measured in peak signal-to-noise ratio, stems from the reconstruction of static, not dynamic, content. Our findings underscore current limitations and motivate the development of egocentric-specific approaches, while also highlighting the value of separately evaluating static and dynamic regions of a video.

[361] Mapping License Plate Recoverability Under Extreme Viewing Angles for Oppor-tunistic Urban Sensing

Igor Adamenko, Orpaz Ben Aharon, Yehudit Aperstein, Alexander Apartsin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Urban environments contain many imaging sensors built for specific purposes, including ATM, body-worn, CCTV, and dashboard cameras. Under the opportunistic sensing paradigm, these sensors can be repurposed for secondary inference tasks such as license plate recognition. Yet objects of interest in such imagery are often noisy, low-resolution, and captured from extreme viewpoints. Recent advances in AI-based restoration can recover use-ful information even from severely degraded images. A central challenge is determining which distortion parame-ters allow reliable recovery and which lead to inference failure. This paper introduces recoverability maps, a task-agnostic method for quantifying this boundary. The method combines a dense synthetic sweep of degrada-tion parameters with two summary measures: boundary area-under-curve, which estimates the recoverable frac-tion of the parameter space, and a reliability score, which captures the frequency and severity of failures within that region. We demonstrate the method on license plate recognition from highly angled views under realistic camera artifacts. Several restoration architectures are trained and evaluated, including U-Net, Restormer, Pix2Pix, and SR3 diffusion. The best model recovers about 93% of the parameter space. Similar results across models sug-gest that sensing geometry, rather than architecture, sets the limit of recovery.

Ines Abbes, Mahmood Alzubaidi, Mowafa Househ, Khalid Alyafei, Marco Agus, Samir Brahim Belhaouari

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Measurement-critical ultrasound tasks often depend on a small anatomical region, making global reconstruction metrics an unreliable proxy for clinical fidelity. We propose an ROI-aware representation learning framework and instantiate it for first-trimester nuchal translucency (NT) screening under multi-hospital domain shift. A two-phase convolutional autoencoder (CAE) first learns a globally faithful 128-D latent code via MS-SSIM, then refines the NT ROI using intensity (L1) and normalized Sobel-edge constraints. To combine these heterogeneous objectives without manual tuning, we initialize loss weights via gradient-based calibration from per-term gradient magnitudes. Under strict hospital-wise evaluation with one hospital held out, ROI refinement improves both global and measurement-relevant quality: on the standard dev split it increases PSNR by +0.27 dB (val) and +0.29 dB (held-out test), reduces ROI MAE by 8.87% (val) and 6.43% (held-out test), and reduces ROI Edge-MAE by 11.10% on source hospitals and 4.90% on the unseen hospital. Beyond reconstruction, frozen-latent probes provide additional evidence of generalization: hospital provenance becomes less confidently predictable on the unseen site (0.556 to 0.541 max-softmax; 0.684 to 0.688 entropy) while OOD detection remains strong across site-held-out protocols (Mahalanobis AUROC up to 0.9956, with modest KNN gains in challenging splits). The same ROI-aware refinement principle is anatomy-agnostic and can be adopted for other fetal biometry targets (e.g., crown-rump length (CRL), nasal bone (NB)) and broader medical imaging settings where small ROIs dominate clinical decisions.

[363] Latent Inter-Frame Pruning: A Training-Free Method Bridging Traditional Video Compression and Modern Diffusion Transformers for Efficient Generation

Dennis Menn, Chih-Hsien Chou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Video generation, while capable of generating realistic videos, is computationally expensive and slow, prohibiting real-time applications. In this paper, we observe that video latents encoded via an autoencoder under the Latent Diffusion Model (LDM) framework contain redundancy along the temporal axis. Analogous to how traditional video compression algorithms avoid transmitting redundant frame data, we propose the Latent Inter-frame Pruning framework to prune (skip the re-computation of) duplicated latent patches, thereby reducing computational burden and increasing throughput. However, direct pruning results in visual artifacts due to the discrepancy between full-sequence training and pruned inference. To resolve these artifacts, we propose an Attention Recovery mechanism to bridge the train-inference gap. With our proposed method, we increase video editing throughput by 1.44$\times$, achieving 12.44 FPS on an NVIDIA RTX 6000 while maintaining video quality. We hope our work inspires further research into integrating traditional video compression methods with modern video generation pipelines. This work is a preliminary work on Training-free Latent Inter-Frame Pruning with Attention Recovery.

[364] Exploring Audio Hallucination in Egocentric Video Understanding

Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Egocentric videos provide a distinctive setting in which sound serves as crucial cues to understand user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard. We present a systematic and automatic evaluation framework for analyzing audio hallucinations in egocentric video through a targeted question-answering (Q/A) protocol. We curate a dataset of 300 egocentric videos and design 1,000 sound-focused questions to probe model outputs. To characterize hallucinations, we propose a grounded taxonomy that distinguishes between foreground action sounds from the user activities and background ambient sounds. Our evaluation shows that advanced AV-LLMs, such as Qwen2.5 Omni, exhibit high hallucination rates, achieving only 27.3% and 39.5% accuracy on Q/As related to foreground and background sounds, respectively. With this work, we highlight the need to measure the reliability of multimodal responses, emphasizing that robust evaluation of hallucinations is essential to develop reliable AV-LLMs.

[365] Empirical Ablation and Ensemble Optimization of a Convolutional Neural Network for CIFAR-10 Classification

Naser Khatti Dizabadi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Convolutional neural networks (CNNs) remain a central approach in image classification, but their performance depends strongly on architectural and training choices. This paper presents an empirical ablation-based study of CNN optimization for the CIFAR-10 benchmark. The study evaluates 17 progressive modifications involving training duration, learning-rate scheduling, dropout configuration, pooling strategy, network depth, filter arrangement, and dense-layer design. The goal is to identify which changes improve generalization and which increase complexity without improving performance. The baseline model achieved 79.5% test accuracy. Extending training duration improved performance steadily, whereas several structural redesigns reduced accuracy despite greater architectural variation. Based on the strongest individual configurations, a weighted ensemble was constructed, achieving 86.38% accuracy in the reduced-data setting and 89.23% when trained using the full CIFAR-10 dataset. These results suggest that performance gains in CNN-based classification depend less on indiscriminate increases in depth or parameter count than on careful empirical selection of training and architectural modifications. The study therefore highlights the practical value of ablation-oriented optimization and ensemble learning for small-image classification.

[366] Risk-Aware Robust Learning: Reducing Clinical Risk under Label Noise in Medical Image Classification

Maycon R. S. Pereira, Filipe R. Cordeiro

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Noisy labels are a pervasive challenge in medical image classification, where annotation errors arise from inter-observer variability and diagnostic ambiguity. Although several noise-robust learning methods have been proposed, their evaluation predominantly relies on accuracy-oriented metrics, overlooking the clinical implications of asymmetric error costs. In medical diagnosis, a false negative (missed disease) carries substantially higher consequences than a false positive (false alarm), as delayed treatment can directly impact patient outcomes. In this work, we investigate whether noise-robust training methods preserve clinical safety under label noise. We conduct a systematic risk-aware evaluation of the state-of-the-art noise-robust methods Coteaching, DivideMix, UNICON, and a GMM-based filtering approach on binarized DermaMNIST and PathMNIST datasets under clean and label noise rates of 20%, and 40%. Beyond balanced accuracy, we adopt a cost-sensitive Global Risk formulation that explicitly penalizes false negatives. Our analysis reveals that the robustness of state-of-the-art methods does not guarantee clinical safety. Furthermore, we demonstrate that integrating cost-sensitive optimization into noise-robust training significantly reduces clinical risk, while mantaining model utility. These findings demonstrate that noise-robust learning must be evaluated through a clinical risk lens, and that combining robust training with cost-sensitive optimization can meaningfully reduce risk in noisy-label medical imaging scenarios.

[367] Mammographic Lesion Segmentation with Lightweight Models: A Comparative Study

Helder Oliveira

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Breast cancer is a leading cause of cancer-related mortality among women worldwide, with mammography as the primary screening tool. While deep learning models have shown strong performance in lesion segmentation, most rely on computationally intensive architectures that limit their use in resource-constrained environments. This study evaluates the performance and efficiency of lightweight models for mammographic lesion segmentation. Architectures including MobileNetV2, EfficientNet Lite, ENet, and Fast-SCNN were compared against a U-Net baseline using the INbreast dataset with 5-fold cross-validation. Performance was assessed using Dice score, Intersection over Union (IoU), and Recall, alongside model complexity. MobileNetV2 with Squeeze-and-Excitation (SCSE) achieved the best performance, with a Dice score of 0.5766 while using approximately 75% fewer parameters than U-Net. Cross-dataset evaluation on the DMID dataset showed reduced accuracy due to domain shift but preserved recall. These results demonstrate that lightweight architectures offer a practical balance between performance and efficiency for deployable CAD systems.

[368] AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance

Benjamin Klein, Kazi Ruslan Rahman, Sanchita Ghose

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Navigational aids for blind and low vision individuals struggle conveying dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline using a lightweight AI classification model to distinguish between low and high-movement scenes followed by a real-time text-to-audio synthesis pipeline to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken audio scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based vision-language model with mixture-of-experts and cross-modal attention for visual understanding, in conjunction with neural text-to-speech and natural sound synthesis networks. The proposed framework uses prompt-based caching and category-specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real-time navigation study comparing a white cane alone versus with AMAVA, that shows a significant increase in user confidence and perceived safety.

[369] 2nd of the 5th PVUW MeViS-Audio Track: ASR-SaSaSa2VA

Zhiyu Wang, Xudong Kang, Shutao Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Audio-based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio-driven video segmentation methods extend MLLMs by fusing audio and visual features for end-to-end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio-video datasets. To address these challenges, we present ASR-SaSaSa2VA, a resource-efficient framework for audio-guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre-trained text-based referring video segmentation models (e.g., SaSaSa2VA) for pixel-level predictions. To further enhance robustness, we incorporate a no-target expression detection module, implemented by a fine-tuned audio-based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre-trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS-v2-Audio track), earning the second-place ranking.

[370] GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Hongxin Li, Yuntao Chen, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained devices like mobile phones is increasingly critical for GUI agents requiring low latency. However, this goal faces a significant challenge, as current visual grounding methods typically employ large vision-language model (VLM) (more than 2.5B parameters), making them impractical for on-device execution due to memory and computational constraints. To address this, this paper introduces GoClick, a lightweight GUI element grounding VLM with only 230M parameters that achieves excellent visual grounding accuracy, even on par with significantly larger models. Simply downsizing existing decoder-only VLMs is a straightforward way to design a lightweight model, but our experiments reveal that this approach yields suboptimal results. Instead, we select an encoder-decoder architecture, which outperforms decoder-only alternatives at small parameter scales for GUI grounding tasks. Additionally, the limited capacity of small VLMs encourages us to develop a Progressive Data Refinement pipeline that utilizes task type filtering and data ratio adjustment to extract a high-quality 3.8M-sample core set from a 10.8M raw dataset. Training GoClick using this core set brings notable grounding accuracy gains. Our experiments show that GoClick excels on multiple GUI element grounding benchmarks while maintaining a small size and high inference speed. GoClick also enhances GUI agent performance when integrated into a device-cloud collaboration framework, where GoClick helps cloud-based task planners perform precise element localization and achieve higher success rates. We hope our method serves as a meaningful exploration within the GUI agent community.

[371] LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM’s middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.

Jiebin Yan, Kangcheng Wu, Jingwen Hou, Jiayu Zhang, Pengfei Chen, Yuming Fang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Blind omnidirectional image quality assessment (BOIQA) presents a great challenge to the visual quality assessment community, due to different storage formats and diverse user viewing behaviors. The main paradigm of BOIQA models includes two steps, ie, viewport generation, and quality prediction, which brings an extra computational burden and is hard to generalize to other visual contents (eg, 2D planar image). Thus, in this paper, we make an attempt to solve these issues. First, we experimentally find that BOIQA can be formulated as a blind (2D planar) image quality assessment (BIQA) problem, ie, the first step - viewport generation - is no longer needed, which narrows the natural gap between BOIQA and BIQA. Then, we present a new BOIQA approach, which has three merits: ie, viewport-unaware - it accepts an omnidirectional image in the widely used equirectangular projection format as input without any transformation; unified - it can also be applied to BIQA; and generalized - it shows better generalizability against other competitors. Finally, we validate its promise by held-out test, cross-database validation, and the well-established gMAD competition.

[373] LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization

Bokang Zeng, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Jiaojiao Jiang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Proactive watermarking offers a promising approach for deepfake tamper detection and localization in short-form videos. However, existing methods often decouple audio and visual evidence and assume that watermark signals remain reliable under real-world degradations, making tamper localization vulnerable to multimodal misalignment and compression distortions. Moreover, existing semi-fragile visual watermarking methods often degrade significantly under codec compression because their embedding bands overlap with compression-sensitive frequency regions. To address these limitations, we propose Layered Audio-Visual Anti-tampering Watermarking (LAVA), a calibration-aware audio-visual watermark fusion framework for deepfake tamper detection and localization. LAVA leverages cross-modal watermark fusion and calibration-aware alignment to preserve consistent and reliable tamper evidence under compression and audio-visual asynchrony, enabling robust tamper localization. Extensive experiments demonstrate that LAVA achieves near-perfect detection performance (AP = 0.999), remains robust to compression and multimodal misalignment, and significantly improves tamper localization reliability over existing audio-visual fusion baselines.

[374] Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification

Xiaoliu Luo, Minxue Xiao, Ting Xie, Mengzhu Wang, Huiqing Qi, Joey Tianyi Zhou, Taiping Zhang, Xu Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision–language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on $11$ public biomedical datasets spanning $9$ imaging modalities and $10$ anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.

[375] Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis

Xuemei Qiu, Dawei Fan, Yebin Huang, Yanping Chen, Lifang Wei

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven “black box” issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.

[376] SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

Zi-Hao Bo, Yaqian Li, Anzhou Hou, Rinyoichi Takezoe, Ertao Zhao, Tianxiang Pan, Jiale Yan, Mo Guang, Kaiwen Long

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Mixture-of-Experts (MoE) has become a prevalent backbone for large vision-language models (VLMs), yet how modality-specific signals should guide expert routing remains under-explored. Existing routing strategies are either hand-crafted or modality-agnostic, relying on idealized priors that ignore the layer-dependent modality fusion patterns in MoE-VLMs and provide little guidance for expert specialization. We propose Soft Modality-guided Expert Specialization (SMoES), which consists of dynamic soft modality scores that capture layer-dependent fusion patterns, an expert binning mechanism aligned with expert-parallel deployment, and an inter-bin mutual information regularization that encourages coherent modality specialization. Our method leverages attention-based or Gaussian-statistics modality scores to optimize mutual information regularization. Experiments across four MoE-based VLMs and 16 benchmarks demonstrate improvement on both effectiveness and efficiency: 0.9% and 4.2% average gain on multimodal and language-only tasks, 56.1% reduction in EP communication overhead, and 12.3% throughput improvement under realistic deployment. These results validate that aligning routing with modality-aware expert specialization unlocks MoE-VLM capacity and efficiency.

[377] ServImage: An Image Generation and Editing Benchmark from Real-world Commercial Imaging Services

Fengxian Ji, Jingpu Yang, Zirui Song, Lang Gao, Junhong Liang, Zhenhao Chen, Jinghui Zhang, Xiuying Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce \textbf{ServImage}, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) \textbf{\textit{ServImageBench}}: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over $295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) \textbf{\textit{ServImageScore}}: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) \textbf{\textit{ServImageModel}}: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems \href{https://github.com/FengxianJi/ServImage}{Github.}

[378] Breaking the Scalability Limit of Multi-Projector Calibration with Embedded Cameras

Takumi Kawano, Kohei Miura, Daisuke Iwai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Conventional multi-projector calibration requires projecting and capturing structured light patterns for each projector sequentially, causing calibration time and effort to increase linearly with the number of projectors. This scalability bottleneck has long limited the deployment of large-scale projection mapping systems. We present a new calibration framework that breaks this limitation by embedding cameras into the surface of the calibration target. The embedded cameras directly capture the incoming projection light, enabling the separation of simultaneously projected structured light patterns from multiple projectors according to their incident directions. Our method establishes correspondences between the optical centers of the embedded cameras and the projector pixels, allowing the intrinsic and extrinsic parameters of all projectors to be simultaneously estimated. We further introduce a correction technique for small misalignments between the calibration board and camera optical centers. As a result, our system achieves calibration accuracy comparable to conventional methods while reducing the required number of projection-capture cycles from linear to nearly constant with respect to the number of projectors, dramatically improving scalability for dense multi-projector systems with overlapping projection regions, such as high-brightness stacking, super-resolution, light-field, and shadow-suppression displays.

[379] JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning

Swadhin Das, Vivek Yadav

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The encoder-decoder framework has become widely popular nowadays. In this model, the encoder extracts informative visual features from an input image, and the decoder employs a sequence-to-sequence formulation to generate the corresponding textual description from these features. The existing models focus more on the decision part. However, extracting meaningful information from the image can help the decoder generate an accurate caption by providing information about the objects and their relationship. Remote sensing images are highly complex. One major challenge is detecting objects that extend beyond their visible boundaries due to occlusion, overlapping structures, and unclear edges. Hence, there is a need to design an approach that can effectively capture both high-level semantics and low-level spatial details for accurate caption generation. In this work, we have proposed an edge-aware fusion method by incorporating the original image and its edge-aware version into the encoder to enhance feature representation and boundary awareness. We used a comparison-based beam search (CBBS) to generate captions to achieve a balanced trade-off between quantitative metrics and qualitative caption relevance through fairness-based comparison of candidate captions. Experimental results demonstrate our model’s superiority over several baseline models in quantitative and qualitative perspectives.

[380] CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion

Bingyi Liu, Chuanhui Zhu, Hongfei Xue, Jian Teng, Jipeng Liu, Enshu Wang, Penglin Dai, Pu Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate 3D object detection is critical for autonomous driving, necessitating reliable, cost-effective sensors capable of operating in adverse weather conditions. Camera and millimeter-wave radar fusion has emerged as a promising solution; however, these methods often rely on finely annotated radar data, which is scarce and labor-intensive to produce. To address this challenge, we present CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework that enhances the performance of existing radar-camera fusion methods for 3D object detection. CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R (LiDAR-to-Radar) Sampling method. Then, it incorporates this data into a novel dual-stage, dual-modality contrastive learning strategy, enabling effective self-supervised learning from paired pseudo-radar and image data. This approach facilitates effective pretraining of existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction capabilities and improving detection accuracy and robustness. Experimental results using NuScenes and Lyft Level 5 datasets demonstrate significant performance improvements across three baseline models, highlighting CLLAP’s effectiveness in advancing radar-camera fusion for autonomous driving applications.

[381] QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering

Woojun Jung, Junyeong Kim

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall’s $τ_b$, $τ_c$, and Spearman’s $ρ$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.

[382] Light ’em Up: Enabling Few-Shot Low-Light 3D Gaussian Splatting with Multi-Scale Explicit Retinex Illumination Decoupling

YuHao Yin, Zongji Wang, Yuanben Zhang, Biqing Li, Jiesong Bai, Junyi Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Full 360$^\circ$ novel view synthesis under low-light conditions remains challenging. Insufficient illumination, noise amplification, and view-dependent photometric inconsistencies prevent existing methods from jointly preserving geometric consistency and photorealism. Unsupervised approaches often exhibit color drift under large viewpoint variations, while supervised low-light enhancement models, though effective for 2D tasks, struggle to generalize to new scenes and typically require retraining. To address this issue, we propose MERID-GS, a Multi-Scale Explicit Retinex Illumination-Decoupled Gaussian framework for low-light 360$^\circ$ synthesis. Based on Retinex theory, the method explicitly separates illumination and reflectance, and suppresses noise propagation while enhancing dark-region structures via a learnable gain and Illumination-State-Guided Frequency Gating. Combined with lightweight Reflection Head and 3D Gaussian Splatting, MERID-GS adapts to new scenes with only a few shots and enables stable low-light novel view synthesis from sparse-view observations. In addition, we construct a low-light multi-view dataset covering full 360$^\circ$ scenes for joint evaluation. Thorough experiments across multiple datasets in this area demonstrate that MERID-GS achieves SOTA performance, exhibiting superior cross-scene generalization and view consistency. The source code and pre-trained models are available at https://github.com/YhuoyuH/MERID-GS..

[383] SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?

Yichi Zhang, Le Xue, Bichun Xu, Judong Luo, Zhigang Wu, Yu Fu, Zixin Hu, Yuan Cheng, Yuan Qi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model’s feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model’s encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model’s global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.

[384] TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations

Yifeng Bai, Zhirong Chen, Erkang Cheng, Haibin Ling

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of \textit{point-to-instance} (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\mathrm{DET}{\text{l}}$, +5.4 in $\mathrm{TOP}{\text{ll}}$ on $\text{subset_A}$ and +11.0 in $\mathrm{DET}{\text{l}}$, +7.9 in $\mathrm{TOP}{\text{ll}}$ on $\text{subset_B}$, validating the effectiveness of the proposed components. The code will be shared publicly at https://github.com/Yifeng-Bai/TopoHR.git.

[385] FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs

Jiayi Wang, Lichun Zhang, Xiaoqi Zhuang, Jiaqi Zhang, Lu Yu, Yin Zhao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Video technology is advancing toward Ultra High Definition (UHD) and High Dynamic Range (HDR), which intensifies the need for higher compression efficiency for these high-specification videos. Beyond advances in traditional codecs, neural video codecs (NVCs) have attracted significant research attention and have evolved rapidly over the past few years. The coding artifacts of NVCs often exhibit content-varying and generative characteristics, which differ from those of conventional codecs and are challenging for traditional video quality assessment (VQA) methods to capture. Therefore, VQA metrics are required to generalize across different codecs, content types, and dynamic ranges to better support video codec research and evaluation. In this paper, we propose FDIM, a feature-distance-based generic video quality metric for both traditional and neural video codecs across SDR and HDR formats. FDIM employs a hybrid architecture that integrates deep and hand-crafted features. The deep feature component learns multi-scale representations to capture distortions ranging from structural and textural fidelity degradation to high-level semantic deviations, while the hand-crafted feature component provides stable complementary cues to improve overall generalization. We trained FDIM on a large-scale subjective quality assessment dataset (DCVQA) consisting of over 16k video sequences encoded by traditional block-based hybrid video codecs and end-to-end perceptually optimized neural video codecs. Extensive experiments on ten SDR/HDR VQA datasets containing diverse, previously unseen codecs demonstrate that FDIM achieves strong generalization and high correlation with subjective assessment. The source code for FDIM and the DCVQA validation set will be released at https://github.com/MCL-ZJU/FDIM.

[386] Open-Vocabulary Semantic Segmentation Network Integrating Object-Level Label and Scene-Level Semantic Features for Multimodal Remote Sensing Images

Jinkun Dai, Yuanxin Ye, Peng Tang, Tengfeng Tang, Xianping Ma, Jing Xiao, Mi Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporating of non-visual textual data - a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text supervised multi-modal open vocabulary semantic segmentation network that synergistically integrates textual supervision with visual representation for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features dynamically interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. To verify our method, we innovatively construct two new multi-modal datasets, and carry out extensive experiments to make a comprehensive comparison between the proposed method and other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet

[387] EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT

Xuguang Bai, Mingxuan Liu, Tongxi Song, Yifei Chen, Hongjia Yang, Kasidit Anmahapong, Zihan Li, Ying Zhou, Qiyuan Tian

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-reports pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.

[388] 6thGrid-Net: Unified Remote Sensing Image Dehazing Based on Color Restoration and Edge-Preserving

Runci Bai, Kui Jiang, Xiang Chen, Chen Wu, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Remote sensing images are frequently degraded by adverse weather conditions, particularly clouds and haze, which severely impair downstream applications. Existing restoration methods typically rely on computationally heavy architectures or sequential pipelines (e.g., detail enhancement followed by color rendition) that suffer from mutual interference and artifact accumulation. Furthermore, recent unified grid-based approaches utilize fixed, isotropic interpolation kernels, neglecting the intrinsic low-dimensional manifold of natural images and inevitably causing edge blur. To address these limitations, we propose 6th Grid-Net, a highly efficient and unified remote sensing image restoration framework tailored for resource-constrained edge devices. Specifically, we construct a novel six-dimensional fusion tensor that seamlessly integrates the color rendition capabilities of 3D LUTs with the spatial-luminance detail preservation of bilateral grids. To overcome the drawbacks of standard trilinear interpolation, we introduce a manifold-adaptive high-dimensional sampling mechanism. This mechanism dynamically adjusts the interpolation kernel based on local edge orientation, texture strength, and color similarity, enabling simultaneous global color stylization and local edge refinement in a single forward pass. Additionally, an edge-aware grid smoothing constraint and dynamic quantization are incorporated to suppress ghosting artifacts and significantly compress the model size. Extensive experiments on multiple benchmark datasets demonstrate that 6th Grid-Net achieves state-of-the-art restoration quality across various degradation scenarios.

[389] Robust Deepfake Detection, NTIRE 2026 Challenge: Report

Benedikt Hopf, Radu Timofte, Chenfan Qu, Junchi Li, Fei Wu, Dagong Lu, Mufeng Yao, Xinlei Xu, Fengjun Guo, Yongwei Tang, Zhiqiang Yang, Zhiqiang Wu, Jia Wen Seow, Hong Vin Koay, Haodong Ren, Feng Xu, Shuai Chen, Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran, Chih-Yu Jian, Yi-Fan Wang, Bang-Kang Chen, You-Chen Chao, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu, Aashish Negi, Hardik Sharma, Prateek Shaily, Jayant Kumar, Sachin Chaudhary, Akshay Dudhane, Praful Hambarde, Amit Shukla, Jielun Peng, Yabin Wang, Yaqi Li, Jincheng Liu, Xiaopeng Hong, Krish Wadhwani, Liam Fitzpatrick, Utkarsh Tiwari, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Cristian Lazo Quispe, Aishwarya A, Akshara S, Ashwathi N, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Robustness is a long-overlooked problem in deepfake detection. However, detection performance is nearly worthless in the real world if it suffers under exposure to even slight image degradation. In addition to weaker degradations that can accidentally occur in the image processing pipeline, there is another risk of malicious deepfakes that specifically introduce degradations, purposefully exploiting the detector’s weaknesses in that regard. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses that problem. Participants were tasked with building a detector that would later be tested on an unknown test-set, which included both common and uncommon degradations of various strengths. With a total number of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24h to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test-set to detect any such overfitting. This report presents the competition setting, dataset preparation, as well as details and performance of methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.

[390] PEPS: Positional Encoding Projected Sampling – Extended

Guillaume Perez, Janarbek Matai, Takahiro Harada

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Implicit neural representations (INRs) are increasingly being used as tools to map coordinates to signals, encompassing applications from neural fields to texture compression, shape representations, and beyond. Most INR methods are based on using high-dimensional projections of the initial coordinates through encoders such as grid or positional encoding. Nevertheless, positional encoding is often insufficient and grids, as we show in this paper, require high resolution for being able to learn. In this paper, we demonstrate that positional encoding can be used not only as a high-dimensional embedding but also decomposed as a series of meaningful points. We propose the Positional Encoding Projected Sampling, where we treat the projection of the original coordinate at each frequency as a point of interest. We describe the motion of each point with respect to the frequencies and show that it follows a unique pattern. Finally, we use the unique motion of each point as a basis decomposition for doing learned positional encoding using grids. We prove, using three competitive applications; image representation, texture compression, and signed distance function; that the proposed approach outperforms the current state of the art methods, and often requires 25% less parameters for equivalent reconstruction error or rendering.

[391] PointTransformerX:Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms

Laurenz Reichardt, Nikolas Ebert, Oliver Wasenmüller

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: 3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA, AMD, and embedded hardware. We introduce PointTransformerX (PTX), a fully PyTorch-native vision transformer backbone for 3D point clouds, removing all custom CUDA operators and external libraries while retaining competitive accuracy. PTX introduces 3D-GS-RoPE, a rotary positional embedding that encodes 3D spatial relationships directly in self-attention without neighborhood construction, and further replaces sparse convolutional patch embedding with a linear projection. PTX explores inference-time scaling of attention windows to improve accuracy without retraining. With a redesigned feed-forward network, PTX achieves 98.7% of PointTransformer V3’s accuracy on ScanNet with 79.2% fewer parameters and executing 1.6\times faster while requiring just 253 MB memory. PTX runs natively on NVIDIA GPUs, AMD GPUs (ROCm), and CPUs, providing an efficient and portable foundation for point cloud perception.

[392] POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation

Yaohou Fan, Qingzhong Wang, Yongsong Huang, Junyi Liu, Tomo Miyazaki, Shinichiro Omachi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Current visual text generation models struggle with the trade-off between text accuracy and overall image coherence. We find that achieving high text accuracy can reduce aesthetic quality and instruction-following capability. Although reinforcement learning approaches can alleviate the problem through aligning with multiple rewards, they are often unstable for text generation, as existing approaches normally optimize multiple rewards in a weighted-sum way. In addition, it is difficult to balance the weight of each reward. Moreover, reinforcement learning requires a set of training instructions. A large number of prompts require more training time and computing resources, while a small set leads to poor performance. Hence, how to select the prompts for efficient training is an unsolved problem. In this study, we propose Pareto-Optimal Curriculum Alignment (POCA), a framework that addresses this issue as a multi-objective problem by: 1) identifying the Pareto-optimal set to avoid simple scalarization and 2) designing an adaptive curriculum alignment strategy to manage a learning sequence of a multi-reward dataset using automatic difficulty assessment, which is crucial for optimal convergence as RL methods explore in a limited data environment. In synergy, POCA finds the Pareto-optimal set in a unified reward space, which eliminates inconsistent signals to find the best trade-off solution from different rewards under an easy-to-hard optimization landscape. The experimental results show that POCA significantly improves all metrics such as CLIP, HPS scores and sentence accuracy.

[393] Multivariate Gaussian NeRF for Wide Field-of-View Ultrasound Reconstruction

Patris Valera, Magdalena Wysocki, Felix Duelmer, Mohammad Farid Azampour, Sebastian Herz, Stefan Wörz, Nassir Navab

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Wide Field-of-View (WFoV) reconstruction enhances 3D ultrasound imaging by providing valuable anatomical context for segmentation models and visualization. Clinical ultrasound volumes are predominantly acquired using convex probes, which generate expanding, diverging acoustic beams to maximize anatomical coverage. Stitching these sweeps together traditionally introduces significant compounding artifacts and aliasing due to depth-dependent resolution changes. Here, we introduce Ultra-Wide-NeRF, a Multivariate 3D Gaussian (MVG) NeRF-based method for WFoV ultrasound reconstruction. By explicitly modeling the complex beam geometry using distance-dependent convex volumetric sampling and anisotropic 3D Gaussians, our method inherently mitigates these compounding artifacts and provides anti-aliasing. Beyond simply reconstructing a static 3D grid, our NeRF-based approach yields a continuous neural representation of the tissue, enabling the synthesis of high-fidelity novel views from arbitrary virtual trajectories. We validate Ultra-Wide-NeRF for intracardiac echocardiography on phantom and porcine datasets, demonstrating that our method expands the spatial context important in intraoperative navigation. Code will be open-sourced upon publication.

[394] Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.

[395] Computer Vision-Based Early Detection of Container Loss at Sea

Vishakha Lall, Capt. Stanley S Pinto, Capt. Chu Xing Peng, Wu Kaiwen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Containerised shipping underpins global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Despite compliance with Cargo Securing Manuals, dynamic maritime conditions such as vessel motion, wind loading, and severe sea states can progressively destabilise container stacks, leading to overboard losses. With the new International Maritime Organisation’s (IMO) mandatory reporting requirements for lost containers, there is an urgent need for a reliable, evidence-based early detection solution for destabilised containers. This study showcases a low-cost, retrofittable computer vision-based system for early detection of destabilised containers using existing onboard cameras. The framework integrates object segmentation to isolate container stacks, temporal object tracking using optical flow and individual objects’ residual motion extraction to quantify relative movement. Experimental evaluation on real onboard ship footage demonstrates that the proposed pipeline effectively isolates container-level motion under challenging conditions of varying sea states and visibility conditions. By enabling early alerts for crew intervention and navigational adjustment, the proposed approach enhances cargo safety, operational resilience, and regulatory compliance.

[396] Radiomics- and Clinical Feature-Driven Prediction of Volumetric Response in Skull-Base Meningioma after CyberKnife Radiosurgery

Yin Lin, Elena De Martin, Giacomo Conte, Domenico Aquino, Cristiana Pedone, Alberto Redaelli, Riccardo Barbieri, Laura Fariselli, Simona Ferrante

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Skull-base meningiomas are often characterized by favorable long-term prognosis, yet their anatomical complexity and proximity to critical neurovascular structures make treatment selection challenging. Stereotactic radiosurgery with CyberKnife represents an effective therapeutic option when surgical resection is not feasible; however, not all patients benefit equally from this treatment. Early identification of patients likely to respond to radiosurgery remains an open clinical problem. In this study, we propose a radiomics- and clinical feature-driven framework for predicting volumetric response in skull-base meningiomas treated with CyberKnife. Unlike most existing approaches that focus on progression-free survival or recurrence, our method targets volumetric response as an indicator of treatment efficacy. Pre-treatment MRI images from 104 patients were processed to extract radiomic features, which were combined with clinical variables and analyzed using six models. To ensure methodological rigor, the entire modeling process was implemented within a nested cross-validation scheme. Among the evaluated models, TabPFN achieved the best overall performance, with an AUC of 0.81 and consistently favorable classification metrics. These results suggest that advanced machine learning architectures, when combined with robust validation strategies, can effectively capture patterns associated with treatment response even in small-sample, high-dimensional settings.

[397] Graph-augmented Segmentation of Complex Shapes in Laser Powder bed Fusion for Enhanced In Situ Inspection

Stefano Raimondo, Matteo Bugatti, Marco Grasso

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The technological maturity of in situ inspection and monitoring methods in additive manufacturing is steadily increasing, enabling more efficient and practical qualification procedures. In this context, image segmentation of powder bed images in Laser Powder Bed Fusion (L-PBF) has been investigated by various authors, leveraging both edge detection and machine learning approaches to identify deviations from nominal geometry. Despite these developments, several challenges remain, including the sensitivity of segmentation performance to industrial illumination conditions and layer-to-layer variability in pixel intensity patterns. The study addresses these limitations by proposing a graph-augmented segmentation approach. The underlying principle consists of preserving the geometrical information at a global level rather than at pixel-wise level, modeling dependencies and relational information among spatial regions with a Graph Neural Network bottleneck embedded into a U-Net architecture. This allows enhancing the consistency and accuracy of the geometry reconstruction in the presence of spatial and layer-wise photometric variability systematically faced in real data. The method is evaluated against benchmark techniques for the in situ reconstruction of lattice structures produced by L-PBF, demonstrating its potential as a scalable solution for robust in situ inspection and geometric verification in industrial environments.

[398] Touchless Intraoperative Image Access System Based on Vision-Based Hand Tracking

Yin Lin, Domenico Aquino, Alberto Redaelli, Massimiliano Del Bene, Riccardo Barbieri, Simona Ferrante

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Touchless interaction with medical images is becoming increasingly important in the surgical field, where sterility and continuity of the operational workflow are essential requirements. This work presents a vision-based system for intraoperative navigation of medical images through hand gestures acquired using a single RGB camera. Unlike many existing solutions, the system does not require additional hardware or user-specific training. Hand tracking is performed in real time using MediaPipe Hands, which provides a 2.5D estimation of hand landmarks. Simple and intuitive gestures are then mapped into translation, rotation, and zoom commands, enabling continuous and natural interaction with the image viewer. The system architecture is independent from the visualization software and, for implementation simplicity, in this study it was integrated with PyVista. Performance was evaluated through frame-level logging and quantitative analysis of latency, stability, and interaction robustness metrics. Experimental results highlight real-time behavior, with reduced latencies and stable control, in line with the requirements of fluid interaction. The system demonstrates the feasibility of a low-cost touchless solution for intraoperative access to medical images, laying the groundwork for future clinical evaluations.

[399] Instance Awareness of Multi-class Semantic Segmentation Loss Functions

Soumya Snigdha Kundu, Florian Kofler, Marina Ivory, Hendrik Moller, Jonathan Shapey, Tom Vercauteren

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Instance-sensitive losses for semantic segmentation such as blob loss and CC loss were designed to address instance imbalance, ensuring small lesions generate the same gradient as large ones, but operate only on single-class segmentation. In multi-class settings, class imbalance poses an additional problem: rare classes with few instances receive a disproportionately small share of the training signal. We show that extending instance-sensitive losses to multi-class segmentation via a one-vs-rest class decomposition repurposes them to also address class imbalance, as uniform averaging over classes ensures each class contributes equally regardless of frequency. We further show that inverse-size weighting, which destabilizes training when applied globally due to weight imbalances across rare and common classes, becomes effective when integrated within the per-component loss, confining the reweighting to each component’s spatial context. On the BraTS-METS 2025 dataset (260 test cases), multi-class CC loss improves foreground Dice (0.64 +/- 0.26 vs. 0.59 +/- 0.27 baseline) and rare-class Dice, while maintaining Panoptic Quality at DSC threshold 0.5. Multi-class blob loss achieves the best Panoptic Quality at threshold 0.5 (0.40 +/- 0.24 vs. 0.38 +/- 0.25 baseline) and recognition quality (0.53 +/- 0.29 vs. 0.49 +/- 0.30). Integrating inverse-size weighting within the per-component loss increases rare-class Dice to 0.44 +/- 0.36 at the cost of reduced detection quality.

[400] ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning

Yiming Zhang, Jiacheng Chen, Jiaqi Tan, Yongsen Mao, Wenhu Chen, Angel X. Chang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Current evaluations of spatial intelligence can be systematically invalid under modern vision-language model (VLM) settings. First, many benchmarks derive question-answer (QA) pairs from point-cloud-based 3D annotations originally curated for traditional 3D perception. When such annotations are treated as ground truth for video-based evaluation, reconstruction and annotation artifacts can miss objects that are clearly visible in the video, mislabel object identities, or corrupt geometry-dependent answers (e.g., size), yielding incorrect or ambiguous QA pairs. Second, evaluations often assume full-scene access, while many VLMs operate on sparsely sampled frames (e.g., 16-64), making many questions effectively unanswerable under the actual model inputs. We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model’s actual inputs. To this end, we re-annotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools. We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses. Evaluations of general and domain-specific VLMs on ReVSI reveal systematic failure modes that are obscured by prior benchmarks, yielding a more reliable and diagnostic assessment of spatial intelligence.

Mahdi Chamseddine, Fabian Kaufmann, Marius Schellen, Christian Glock, Didier Stricker, Jason Rambach

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Automatic generation of Building Information Models (BIM) from building scans is a key challenge in architecture and construction. We present a modular pipeline for generating IFC-compliant BIM from 3D point clouds. The hybrid approach combines learning-based semantic segmentation with topology-aware geometric reconstruction to model structural elements accurately. We propose vIoU, adapting voxel-based overlap evaluation to Scan-to-BIM by enabling holistic, instance-matching-free comparison of reconstructed and ground-truth models. We release the German Hospital dataset (DeKH), including high-resolution point clouds, ground truth BIMs, and semantic annotations. Experiments on DeKH and CV4AEC datasets show significant improvements over a RANSAC-based baseline, demonstrating robustness and scalability.

[402] Unconstrained Multi-view Human Pose Estimation with Algebraic Priors

Xiaolin Qin, Qianlei Wang, Jiacen Liu, Chaoning Zhang, Fei Zhu, Zhang Yi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recovering 3D human pose from multi-view imagery typically relies on precise camera calibration, which is often unavailable in real-world scenarios, thereby severely limiting the applicability of existing methods. To overcome this challenge, we propose an unconstrained framework that synergizes deep neural networks, algebraic priors, and temporal dynamics for uncalibrated multi-view human pose estimation. First, we introduce the Triangulation with Transformer Regressor (TTR), which reformulates classical triangulation into a data-driven token fusion process to bypass the dependency on explicit camera parameters. Second, to explicitly embed the inherent algebraic relations of the multi-view variety into the learning process, we propose the Gröbner basis Corrector (GC). This pioneering loss formulation enforces constraints derived from the multi-view variety to ensure the neural predictions strictly adhere to the laws of projective geometry. Finally, we devise the Temporal Equivariant Rectifier (TER), which exploits the equivariance property of human motion to impose temporal coherence and structural consistency, effectively mitigating scale ambiguity in uncalibrated settings. Extensive evaluations on standard benchmarks demonstrate that our framework establishes a new state-of-the-art for uncalibrated multi-view human pose estimation. Notably, our approach significantly closes the performance gap between calibration-free methods and fully calibrated oracles.

[403] Don’t Pause! Every prediction matters in a streaming video

Dibyadip Chatterjee, Zhanzhong Pang, Fadime Sener, Yale Song, Angela Yao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.

[404] Monocular Depth Estimation via Neural Network with Learnable Algebraic Group and Ring Structures

Qianlei Wang, Kexun Chen, Shaolin Zhang, Hongli Gao, Chaoning Zhang, Xiaolin Qin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Monocular depth estimation (MDE) has witnessed remarkable progress driven by Convolutional Neural Networks and transformer-based architectures. However, these approaches typically treat the problem as a generic image-to-image regression on Euclidean grids, thereby overlooking the intrinsic algebraic and geometric structures induced by perspective projection. To address this limitation, we propose LAGRNet, a novel framework that fundamentally grounds MDE in algebraic geometry by explicitly embedding learnable group, ring, and sheaf structures into the deep learning pipeline. Modeling feature maps as sections of a sheaf over an approximated image manifold, our method first establishes a Group-defined Feature Manifold (GFM) parameterized by a learned algebraic group action to enforce projective equivariance and robustness against view changes. To facilitate algebraically consistent cross-scale interactions, we subsequently introduce a Ring Convolution Layer (RCL) that formulates feature fusion as a graded ring homomorphism. Furthermore, to ensure global topological consistency, a Sheaf-based Module (SM) aggregates local depth cues via Čech nerve on the image topology. Extensive zero-shot evaluations across the KITTI, NYU-Depth V2, and ETH3D benchmarks demonstrate that LAGRNet significantly outperforms state-of-the-art methods in both accuracy and generalization capabilities.

[405] An Affordable,Wearable Stereo-Eye-Tracking Platform

Alexander Zimmer, Yasmeen Abdrabou, Enkelejda Kasneci

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Research on video-based eye-tracking has long explored stereo and glint-based methods, yet existing wearable eye trackers - both commercial and open-source - offer limited flexibility for algorithm development and comparative evaluation. We present an affordable, wearable stereo eye-tracking platform built from off-the-shelf and 3D-printable components that explicitly targets this gap. The system combines four infrared eye cameras, infrared illumination, an optional scene camera, and software support for calibration and synchronized data acquisition. By design, the platform supports multiple eye-tracking paradigms, including stereo, glint-based, and binocular approaches, within a single hardware configuration. Rather than optimizing for end-user robustness, the platform prioritizes modularity and extensibility for research use. This paper focuses on the hardware architecture and calibration pipeline and demonstrates the feasibility of the approach using a prototype implementation. All hardware designs and documentation are made openly available.

[406] See Further, Think Deeper: Advancing VLM’s Reasoning Ability with Low-level Visual Cues and Reflection

Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbf{ForeSight}, which enables VLMs to \textbf{See Further} with low-level visual cues and \textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.

[407] SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

Arya Shah, Deepali Mishra, Chaklam Silpasuwanchai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-language models (VLMs) are increasingly deployed as evaluators in tasks requiring nuanced image understanding, yet their reliability in scoring alignment between images and text descriptions remains underexplored. We investigate whether small, open-weight VLMs exhibit \emph{sycophantic} behavior when evaluating image-text alignment: assigning high scores without grounding their judgments in visual evidence. To quantify this phenomenon, we introduce the \emph{Bluffing Coefficient} (\bc), a metric that measures the mismatch between a model’s score and its evidence recall. We evaluate six open-weight VLMs ranging from 450M to 8B parameters on a benchmark of 173,810 AI-generated character portraits paired with detailed textual descriptions. Our analysis reveals a significant inverse correlation between model size and sycophancy rate ($r = -0.96$, $p = 0.002$), with smaller models exhibiting substantially higher rates of unjustified high scores. The smallest model tested (LFM2-VL, 450M) produced sycophantic evaluations in 22.3% of cases, compared to 6.0% for the largest (LLaVA-1.6, 7B). These findings have direct implications for the deployment of small, open-weight VLMs as automated evaluators within attribute-rich, synthetic image evaluation tasks, where the gap between assigned scores and cited visual evidence is both measurable and consequential.

[408] ARETE: Attention-based Rasterized Encoding for Topology Estimation using HSV-transformed Crowdsourced Vehicle Fleet Data

Daniel Fritz, Dimitrios Lagamtzis, Michael Mink, Markus Enzweiler, Steffen Schober

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The continuous advancement of autonomous driving (AD) introduces challenges across multiple disciplines to ensure safe and efficient driving. One such challenge is the generation of High-Definition (HD) maps, which must remain up to date and highly accurate for downstream automotive tasks. One promising approach is the use of crowdsourced data from a vehicle fleet, representing road topology and lane-level features. This work focuses on the generation of centerlines and lane dividers from crowdsourced vehicle trajectories. We adopt a Detection Transformer (DETR)-based approach, where a rasterized representation of vehicle trajectories is used as input to predict vectorized lane representations. Each lane consists of a centerline with an associated direction and corresponding lane dividers that are geometrically constrained by the centerline. Our method includes the extraction of local tiles, from which crowdsourced vehicle trajectories are aggregated. Each tile undergoes a transformation into a rasterized representation encoding both the presence and direction of each trajectory, enabling the prediction of vectorized directed lanes. Experiments are conducted on an internal dataset as well as on the public datasets nuScenes and nuPlan.

[409] Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-Language Models (VLMs) are frequently undermined by object hallucination–generating content that contradicts visual reality–due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object’s features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model’s outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail–all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.

[410] Multispectral airborne laser scanning dataset for tree species classification: MS-ALS-SPECIES

Matti Hyyppä, Klaara Salolahti, Eric Hyyppä, Xiaowei Yu, Josef Taher, Leena Matikainen, Matti Lehtomäki, Paula Litkey, Teemu Hakala, Harri Kaartinen, Juha Hyyppä, Antero Kukko

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The shift from stand-level to individual-tree-level forest assessments supports improved biodiversity mapping, particularly in boreal ecosystems where tree species like aspen (Populus tremula L.) play a keystone role. While airborne laser scanning (ALS) is the standard for such inventories, a major limitation is the small number of publicly available ALS datasets containing high-quality, field-validated reference data. Furthermore, open multispectral ALS datasets with high-quality field reference data are completely lacking despite the potential of multispectral ALS data for tree species classification. This paper presents and details an open multispectral ALS dataset used in a recent international benchmarking study of machine learning and deep learning methods for tree species classification by Taher et al. (2026). The dataset comprises 6326 segment-level point clouds of individual trees representing nine species in Southern Finland. The point cloud data has been acquired using two multispectral laser scanning systems each operating at three laser wavelengths: a helicopter-borne system (HeliALS) with a point density exceeding 1000 points/m$^2$ and an Optech Titan system with approximately 35 points/m$^2$. We provide a detailed description of field data collection techniques developed in the study to facilitate the collection of high-quality ground truth data in an efficient and scalable manner. Additionally, our article presents new analyses on species classification using multispectral data building upon the initial findings of Taher et al. (2026). Furthermore, we study the relation between classification accuracy and tree height to highlight the versatility of the open dataset and to demonstrate the advantage of the point transformer model for small trees and minority species.

Rameshwar Mishra, A V Subramanyam

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The recent surge in content consumption through streaming services has driven a growing demand for personalized content. Personalized advertisements (ads) play a crucial role in enhancing both user engagement and ad effectiveness. A key aspect of ad personalization involves replacing existing regions in a frame with custom, Photoshop-generated banners. However, existing ad-placement pipelines typically rely on simple geometric warping, ignoring the scene’s underlying lighting conditions. Similarly, state-of-the-art diffusion-based object insertion and relighting models struggle to accurately relight these newly inserted banners, as they are not trained on ad-banner data, and training such a model for ad banners would require millions of images. This highlights the need for an effective relighting framework that enables seamless integration of custom banners into the original scene. Motivated by this, we present AD-Relight, a novel multi-stage training-free framework that adapts a diffusion-based relighting model at test time to relight newly added Photoshop-generated ad banners. Through extensive evaluation, we demonstrate that AD-Relight outperforms both relighting baselines and existing ad-placement methods based on simple warping. User studies further show that participants consistently prefer the outputs of AD-Relight over those of prior approaches.

[412] BMD-45: A Large-Scale CCTV Vehicle Detection Dataset for Urban Traffic in Developing Cities

Akash Sharma, Chinmay Mhatre, Sankalp Gawali, Ruthvik Bokkasam, Brij Sharma, Vishwajeet Pattanaik, Punit Rathore, Raghu Krishnapuram, Vijay Gopal Kovvali, Anirban Chakraborty, Yogesh Simmhan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Robust vehicle detection from fixed CCTV cameras is critical for Intelligent Transportation Systems. Yet existing benchmarks predominantly feature relatively homogeneous, highly organized traffic patterns captured from ego-centric driving perspectives or controlled aerial views. This regional and sensor view bias creates a significant gap. Models trained on datasets such as UA-DETRAC and COCO struggle to generalize to the dense, heterogeneous, disorganized traffic conditions observed in rapidly developing urban centers in emerging economies. To address this limitation, we introduce BMD-45, a large-scale dataset comprising 480K bounding boxes annotated over 45K images captured from over 3.6K operational Safe City CCTV cameras. BMD-45 contains 14 fine-grained vehicle categories, including region-specific modes such as auto-rickshaws and tempo travellers, which are not present in existing benchmarks. The dataset captures real-world deployment challenges, including extreme viewpoint variation, occlusion, and vehicle density . We establish comprehensive baselines using state-of-the-art detectors and reveal a striking domain gap: models fine-tuned on UA-DETRAC achieve only 33.6% mAP@0.50:0.95, compared to 83.8% when trained in-domain on BMD-45, representing a 2.5x improvement that persists even when accounting for novel vehicle classes. This performance gap underscores the critical need for geographically diverse traffic benchmarks and establishes BMD-45 as a baseline for developing robust perception systems in underrepresented urban environments worldwide. The dataset is available at: https://huggingface.co/datasets/iisc-aim/BMD-45.

[413] DYMAPIA: A Multi-Domain Framework for Detecting AI-based Video Manipulation

Md Shohel Rana, Andrew H. Sung

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: AI-generated media are advancing rapidly, raising pressing concerns for content authenticity and digital trust. We introduce DYMAPIA, a multi-domain Deepfake detection framework that fuses spatial, spectral, and temporal cues to capture subtle traces of manipulation in visual data. The system builds dynamic anomaly masks by combining evidence from Fourier spectra, local texture descriptors, edge irregularities, and optical flow consistency, which highlight tampered regions with fine spatial accuracy. These masks guide DistXCNet, a lightweight classifier distilled from Xception and optimized with depthwise separable convolutions for fast, region-focused classification. This joint design achieves state-of-the-art results, with accuracy and F1-scores exceeding 99% on FF++, Celeb-DF, and VDFD benchmarks, while keeping the model compact enough for real-time use. Beyond outperforming existing full-frame and multidomain detectors, DYMAPIA demonstrates deployment readiness for time-critical forensic tasks, including media verification, misinformation defense, and secure content filtering.

Hongxin Li, Xiping Wang, Jingran Su, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the “digital world state” resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.

[415] TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

Dongxing Mao, Yilin Wang, Linjie Li, Zhengyuan Yang, Alex Jinpeng Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Despite recent advances in text-to-image generation, models still struggle to accurately render prompt-specified text with correct spatial layout – especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.

[416] Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data

Mohammadmehdi Ataei, Farzaneh Askari, Kamal Rahimi Malekshan, Pradeep Kumar Jayaraman

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset’s utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.

[417] Deployment-Aligned Low-Precision Neural Architecture Search for Spaceborne Edge AI

Parampuneet Kaur Thind, Vaibhav Katturu, Giacomo Zema, Roberto Del Prete

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Designing deep networks that meet strict latency and accuracy constraints on edge accelerators increasingly relies on hardware-aware optimization, including neural architecture search (NAS) guided by device-level metrics. Yet most hardware-aware NAS pipelines still optimize architectures under full-precision assumptions and apply low-precision adaptation only after the search, leading to a mismatch between optimization-time behavior and deployment-time execution on low-precision hardware that can substantially degrade accuracy. We address this limitation by integrating deployment-aligned low-precision training directly into hardware-aware NAS. Candidate architectures are exposed to FP16 numerical constraints during fine-tuning and evaluation, enabling joint optimization of architectural efficiency and numerical robustness without modifying the search space or evolutionary strategy. We evaluate the proposed framework on vessel segmentation for spaceborne maritime monitoring, targeting the Intel Movidius Myriad X Visual Processing Unit (VPU). While post-training precision conversion reduces on-device performance from 0.85 to 0.78 mIoU, deployment-aligned low-precision training achieves 0.826 mIoU on-device for the same architecture (95,791 parameters), recovering approximately two-thirds of deployment-induced accuracy gap without increasing model complexity. These results demonstrate that incorporating deployment-consistent numerical constraints into hardware-aware NAS substantially improves robustness and alignment between optimization and deployment for resource-constrained edge Artificial Intelligence (AI).

[418] CA-IDD: Cross-Attention Guided Identity-Conditional Diffusion for Identity-Consistent Face Swapping

Md Shohel Rana, Tanoy Debnath

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Face swapping aims to optimize realistic facial image generation by leveraging the identity of a source face onto a target face while preserving pose, expression, and context. However, existing methods, especially GAN-based methods, often struggle to balance identity preservation and visual realism due to limited controllability and mode collapse. In this paper, we introduce CA-IDD (Cross-Attention Guided Identity-Conditional Diffusion), the first diffusion-based face swapping approach that integrates multi-modal guidance comprising gaze, identity, and facial parsing through multi-scale cross-attention. Precomputed identity embeddings are incorporated into the denoising process via hierarchical attention layers, resulting in accurate and consistent identity transfer. To improve semantic coherence and visual quality, we use expert-guided supervision, with facial parsing and gaze-consistency modules. Unlike GAN-based or implicit-fusion methods, our diffusion framework provides stable training, robust generalization, and spatially adaptive identity alignment, allowing for fine-grained regional control across pose and expression variations. CA-IDD achieves an FID of 11.73, exceeding established baselines such as FaceShifter and MegaFS. Qualitative results also reveal improved identity retention across diverse poses, establishing CA-IDD as a strong foundation for future diffusion-based face editing.

[419] Self-Supervised Representation Learning via Hyperspherical Density Shaping

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Modern self-supervised representation learning methods often relies on empirical heuristics that are not theoretically grounded. In this study we propose HyDeS, a theoretically grounded method based on multi-view mutual information maximization within an hyperspherical space using Shannon differential entropy with a non-parametric von Mises-Fisher density estimator. We show that HyDeS bias the trained model towards focusing on foreground features of the images and perform well on segmentation tasks such as VOC PASCAL, while it lags in fine-grained classification. We provide a detailed analysis of the induced latent space geometry and learning dynamics, that can be used for designing other theoretically grounded self-supervised learning methods.

[420] Point Cloud Registration for Fusion between SPECT MPI and CTA Images

Ni Yao, Xiangyu Liu, Shaojie Tang, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Chen Zhao, Zhihui Xu, Weihua Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Clinical fusion of Single Photon Emission Computed Tomography Myocardial Perfusion Imaging (SPECT MPI) and Computed Tomography Angiography (CTA) remains limited by cross-modality misregistration and reliance on manual landmarks, which can hinder accurate ischemia localization and lesion-level functional assessment. To address this issue, we propose a registration and fusion framework for SPECT MPI and CTA that integrates functional and structural information for comprehensive cardiac evaluation. The proposed pipeline performs U-Net-based segmentation on both modalities. On SPECT MPI, only the left ventricle (LV) is extracted, and anatomical landmarks are automatically derived from characteristic LV structures. On CTA, both ventricles are segmented, and their spatial relationship is used to automatically define landmarks at the interventricular septal junction. Scale-space consistency preprocessing and landmark-driven coarse registration are applied to mitigate initial misalignment. Based on this initialization, multiple fine registration methods are evaluated on LV epicardial surface point clouds, including ICP, SICP, CPD, CluReg, FFD, and BCPD-plus-plus. The resulting transformations are then propagated to voxel-level resampling for high-precision SPECT-CTA fusion. In a retrospective cohort of 60 patients, the proposed framework preserved sub-millimeter coronary detail from CTA while accurately overlaying quantitative SPECT perfusion. Among the evaluated methods, BCPD-plus-plus achieved the highest accuracy with a mean point cloud distance of 1.7 mm. By combining robust initialization, comparative fine registration, and voxel-level fusion, the proposed approach provides a practical solution for myocardial ischemia localization and functional evaluation of coronary lesions, while remaining independent of any specific fine registration algorithm.

[421] RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting

Jinghao Shi, Mengqi Lei, Kunliang He, Yun Li, Wei Bao, Siqi Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: RGB-Thermal (T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at https://anonymous.4open.science/r/RACANet-9985.

[422] CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang, Weinong Wang, Yu Guo, Bin Qian, Zhihai He

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $π_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $π_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.

[423] Diffusion Model as a Generalist Segmentation Learner

Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan, Yifu Chen, Minjie Hong, Weijie Wang, Shuang Chen, Yue Chen, Zhou Zhao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios-without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.

[424] Improving Vision-language Models with Perception-centric Process Reward Models

Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model’s response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as major voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.

[425] Point-MF: One-step Point Cloud Generation from a Single Image via Mean Flows

Yuta Baba, Keiji Yanai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Single-image point cloud reconstruction must infer complete 3D geometry, including occluded parts, from a single RGB image. While diffusion-based reconstructors achieve high accuracy, they typically require many denoising iterations, resulting in slow and expensive inference. We propose Point-MF, a Mean-Flow-based framework for low-NFE single-image point cloud reconstruction that couples a Mean-Flow-compatible architecture with an auxiliary loss. Specifically, Point-MF operates directly in point-cloud space to learn the mean velocity field and enables one-step reconstruction with a single network function evaluation (1-NFE), without relying on VAE-based latent representations. To make Mean Flow effective under large interval jumps, Point-MF employs a Diffusion Transformer tailored to the Mean-Flow setting, conditioned on frozen DINOv3 image features via a lightweight token adapter and equipped with explicit interval/time conditioning. Moreover, we introduce Denoised Space Anchor, a set-distance auxiliary loss on the denoised-space estimate $x_θ$ induced by the predicted velocity field, to stabilize large-step generation and reduce outliers and density artifacts. On ShapeNet-R2N2 and Pix3D, Point-MF strikes a strong balance between reconstruction quality and inference speed compared to multi-step diffusion baselines and competitive feedforward models, while generating high-quality point clouds with millisecond-level latency.

[426] Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.

[427] Infrastructure-Guided Connectivity-Enhanced Road Crack Detection and Estimation

Haosong Xiao, Yamini Ramesh, Rishabh Shukla, Swarat Sarkar, Chaozhe R. He

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this paper, we report the world’s first infrastructure-guided communication-enhanced road crack detection pipeline that is effective and implementable on passenger vehicles. We first design a customized communication protocol to transmit the region of interest from the infrastructure to the vehicle. With proper camera image processing (e.g., dynamic cropping and frame selection), the focused images are provided to the crack detection model. Leveraging state-of-the-art crack detection model backbones and a carefully prepared dataset comprising a forward-facing view with a crack, we train the model to improve crack-detection performance. We demonstrate the full detection pipeline on an experimental vehicle platform, showcase the detection effectiveness, and project future research directions.

[428] Probing CLIP’s Comprehension of 360-Degree Textual and Visual Semantics

Hai Wang, Xiaochen Yang, Mingzhi Dong, Jing-Hao Xue

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, standard AI evaluators, predominantly trained on perspective image-text pairs, face an open question regarding their understanding of the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses this gap by first introducing two concepts: \emph{360-degree textual semantics}, semantic information conveyed by explicit format identifiers, and \emph{360-degree visual semantics}, invariant semantics under horizontal circular shifts. To probe CLIP’s comprehension of these semantics, we then propose novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.

[429] Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction

Fredrik K. Gustafsson, Constance Boissin, Johan Vallon-Christersson, David A. Clifton, Mattias Rantalainen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.

[430] Aycromo: An Open-Source Platform for Automatic Chromosome Detection in Metaphase Images Based on Deep Learning

Jorge L. A. Lima, Filipe R. Cordeiro

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Chromosome analysis is a fundamental step in the diagnosis of genetic diseases, but the manual karyotyping workflow is time-consuming and heavily dependent on expert specialists, often requiring several days per patient. Although Deep Learning models have achieved high performance in chromosome detection, most proposed solutions remain restricted to research prototypes or lack graphical interfaces suitable for clinical use. In this work, we present Aycromo, an open-source desktop platform for AI-assisted cytogenetic analysis. Built on Electron and ONNX Runtime, the tool allows cytogeneticists to load pre-trained models, compare architectures through an integrated benchmarking module, and manually correct detections via an interactive annotation interface, all without command-line interaction. Preliminary experiments on metaphase images from the CRCN-NE dataset demonstrate that YOLOv11 achieves 99.40% mAP@50, while the platform reduces per-slide analysis to seconds

[431] NeuroClaw Technical Report

Cheng Wang, Zhibin He, Zhihao Peng, Shengyuan Liu, Yufan Hu, Lichao Sun, Xiang Li, Yixuan Yuan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html

[432] WildLIFT: Lifting monocular drone video to 3D for species-agnostic wildlife monitoring

Vandita Shukla, Fabio Remondino, Blair Costelloe, Benjamin Risse

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.

[433] DiffuSAM: Diffusion-Based Prompt-Free SAM2 for Few-Shot and Source-Free Medical Image Segmentation

Tal Grossman, Noa Cahan, Lev Ayzenberg, Hayit Greenspan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Segmentation models such as Segment Anything Model (SAM) and SAM2 achieve strong prompt-driven zero-shot performance. However, their training on natural images limits domain transfer to medical data. Consequently, accurate segmentation typically requires extensive fine-tuning and expert-designed prompts. We propose DiffuSAM, a diffusion-based adaptation of SAM2 for prompt-free medical image segmentation. Our framework synthesizes SAM2-compatible segmentation mask-like embeddings via a lightweight diffusion-prior from off-the-shelf frozen SAM2 image features. The generated embeddings are integrated into SAM2’s mask decoder to produce accurate segmentations, thereby eliminating the need for user prompts. The diffusion prior is further conditioned on previously segmented slices, enforcing spatial consistency across volumes. Evaluated on the BTCV and CHAOS datasets for CT and MRI under Source-Free Unsupervised Domain Adaptation (SF-UDA) and Few-Shot settings, DiffuSAM achieves competitive performance with efficient training and inference. Code is available upon request from the corresponding author.

[434] OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

Boyang Wang, Guangyi Xu, Zhipeng Tang, Jiahui Zhang, Zezhou Cheng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD was widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut to formulate SBD as structured relational prediction, jointly estimating shot ranges with intra-shot relations and inter-shot relations, by a shot query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.

[435] Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Zhiheng Liu, Weiming Ren, Xiaoke Huang, Shoufa Chen, Tianhong Li, Mengzhao Chen, Yatai Ji, Sen He, Jonas Schult, Belinda Zeng, Tao Xiang, Wenhu Chen, Ping Luo, Luke Zettlemoyer, Yuren Cong

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Unified multimodal models typically rely on pretrained vision encoders and use separate visual representations for understanding and generation, creating misalignment between the two tasks and preventing fully end-to-end optimization from raw pixels. We introduce Tuna-2, a native unified multimodal model that performs visual understanding and generation directly based on pixel embeddings. Tuna-2 drastically simplifies the model architecture by employing simple patch embedding layers to encode visual input, completely discarding the modular vision encoder designs such as the VAE or the representation encoder. Experiments show that Tuna-2 achieves state-of-the-art performance in multimodal benchmarks, demonstrating that unified pixel-space modelling can fully compete with latent-space approaches for high-quality image generation. Moreover, while the encoder-based variant converges faster in early pretraining, Tuna-2’s encoder-free design achieves stronger multimodal understanding at scale, particularly on tasks requiring fine-grained visual perception. These results show that pretrained vision encoders are not necessary for multimodal modelling, and end-to-end pixel-space learning offers a scalable path toward stronger visual representations for both generation and perception.

[436] World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

[437] Geometric Analysis of Self-Supervised Vision Representations for Semantic Image Retrieval

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.24469: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.24469&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[438] Easy Ensemble: Simple Deep Ensemble Learning for Sensor-Based Human Activity Recognition

Tatsuhito Hasegawa, Kazuma Kondo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2203.04153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2203.04153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[439] Seer: Language Instructed Video Prediction with Latent Diffusion Models

Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2303.14897: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2303.14897&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[440] SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2401.07669: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2401.07669&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[441] Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, Shunli Wang, Dongling Xiao, Ke Li, Lihua Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2406.10185: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.10185&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[442] Reclaiming Residual Knowledge: A Novel Paradigm to Low-Bit Quantization

Róisín Luo, Alexandru Drimbarean, James McDermott, Colm O’Riordan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2408.00923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.00923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[443] NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2412.04468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.04468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[444] Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2412.07157: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.07157&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[445] Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2412.07160: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.07160&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[446] GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection

Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2501.02450: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.02450&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[447] YOLOv8 to YOLO11: A Comprehensive Architecture In-depth Comparative Review

Priyanto Hidayatullah, Nurjannah Syakrani, Muhammad Rizqi Sholahuddin, Trisna Gelar, Refdinal Tubagus

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2501.13400: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.13400&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2502.14888: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.14888&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[449] ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou, Shilong Jin, Litao Hua, Wanjun Lv, Haoran Duan, Jungong Han

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.02316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.02316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[450] MARRS: Masked Autoregressive Unit-based Reaction Synthesis

Yabiao Wang, Shuo Wang, Jiangning Zhang, Jiafu Wu, Qingdong He, Yong Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.11334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[451] LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation

Zhiying Li, Guanggang Geng, Yeying Jin, Shuyuan Lin, Fengyuan Ma, Zhaoxin Fan, Lili Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.12009: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12009&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.05425: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05425&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[453] ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation

Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.18493: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18493&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[454] U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Xiaofan Li, Zhihao Xu, Chenming Wu, Zhao Yang, Yumeng Zhang, Jiang-Jiang Liu, Haibao Yu, Fan Duan, Xiaoqing Ye, Yuan Wang, Shirui Li, Xun Sun, Ji Wan, Jun Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.04503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.04503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[455] Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.14137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.14137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[456] FOCUS: Fused Observation of Channels for Unveiling Spectra

Xi Xiao, Aristeidis Tsaris, Anika Tabassum, John Lagergren, Larry M. York, Tianyang Wang, Xiao Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.14787: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.14787&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[457] mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

Xu Yuan, Liangbo Ning, Qingqing Ye, Wenqi Fan, Qing Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.05318: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05318&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[458] Smoothing Slot Attention Iterations and Recurrences

Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.05417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[459] iWatchRoad: Scalable Detection and Geospatial Visualization of Potholes for Smart Cities

Rishi Raj Sahoo, Surbhi Saswati Mohanty, Subhankar Mishra

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.10945: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.10945&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[460] Learning Binary Sampling Patterns for Single-Pixel Imaging using Bilevel Optimisation

Serban Cristian Tudosie, Alexander Denker, Zeljko Kereta, Simon Arridge

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.19068: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.19068&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[461] Self-Rewarding Vision-Language Model via Reasoning Decomposition

Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.19652: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.19652&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[462] EMCompress: Video-LLMs with Endomorphic Multimodal Compression

Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. Fung, Manling Li, Heng Ji

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.21094: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.21094&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Cem Eteke, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.06904: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.06904&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[464] Animalbooth: multimodal feature enhancement for animal subject personalization

Chen Liu, Haitao Wu, Kafeng Wang, Weiran Huang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.16702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.16702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[465] Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

Neeraj Gangwar, Anshuka Rangi, Rishabh Deshmukh, Holakou Rahmanian, Yesh Dattatreya, Nickvash Kani

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.19602: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.19602&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[466] Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

Zikun Guo, Jingwei Lv, Xinyue Xu, Shu Yang, Jun Wen, Di Wang, Lijie Hu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.21979: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21979&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[467] Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis

Kyeongryeol Go

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.26158: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26158&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[468] What Drives Compositional Generalization? The Importance of Continuous Training Objectives in Visual Generative Models

Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer, Cordelia Schmid, Thomas Brox

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.03075: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03075&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[469] The Shape of Attraction in UMAP: Exploring the Embedding Forces in Dimensionality Reduction

Mohammad Tariqul Islam, Jason W. Fleischer

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2503.09101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.09101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[470] A Hierarchical Self-Consistent Regularization Approach to Satellite Image Time Series Classification

Giulio Weikmann, Gianmarco Perantoni, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.04916: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04916&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[471] Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

Soroosh Tayebi Arasteh, Mina Shaigan, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.07191: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07191&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[472] ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications

Yuxi Mi, Qiuyang Yuan, Zhizhou Zhong, Xuan Zhao, Jiaogen Zhou, Fubao Zhu, Jihong Guan, Shuigeng Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.10113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[473] Learning an Image Editing Model without Image Editing Pairs

Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.14978: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14978&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[474] DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning

Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.15050: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15050&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[475] Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

Mohammad Javad Ahmadi, Iman Gandomi, Parisa Abdi, Seyed-Farzad Mohammadi, Amirhossein Taslimi, Mehdi Khodaparast, Hassan Hashemi, Mahdi Tavakoli, Hamid D. Taghirad

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.16371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[476] AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

Jiayu Zhang, Shuo Ye, Qilang Ye, Xun Lin, Zihan Song, Zitong Yu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.18346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[477] ISExplore:Informative Segment Selection for Efficient Personalized 3D Talking Face Generation

Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu, Xingshan Yao, Chen Xu, Xian-Ling Mao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.07940: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07940&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[478] Gradient-Guided Exploration of Generative Model’s Latent Space for Controlled Iris Image Augmentations

Mahsa Mitcheff, Siamul Karim Khan, Adam Czajka

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.09749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[479] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

Xurui Li, Feng Xue, Yu Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.10047: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10047&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[480] Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition

Raghu Vamsi Chittersu, Yuvraj Singh Rathore, Pranav Adlinge, Kunal Swami

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.15197: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.15197&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[481] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting

Di Wu, Liu Liu, Xueyu Yuan, Wenxiao Chen, Lijun Yue, Liuzhu Chen, Yiming Tang, Meng Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.17092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[482] Learning Under Low Illumination: A Dataset and Algorithm for Traffic Sign Recognition

Aditya Mishra, Akshay Agarwal, Haroon Lone

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.17183: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17183&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[483] POUR: A Provably Optimal Method for Unlearning Representations via Neural Collapse

Anjie Le, Can Peng, Yuyuan Liu, J. Alison Noble

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.19339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[484] GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models

Bin Wang, Ruotong Hu, Wentong Li, Wenqian Wang, Mingliang Gao, Runmin Cong, Wei Zhang, Xudong Jiang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.22125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[485] S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

Han Su, Tianyu Huang, Zichen Wan, Xiaohe Wu, Wangmeng Zuo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.00995: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00995&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[486] Group Orthogonal Low-Rank Adaptation for RGB-T Tracking

Zekai Shao, Yufan Hu, Jingyuan Liu, Bin Fan, Hongmin Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.05359: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05359&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[487] Voxify3D: Pixel Art Meets Volumetric Rendering

Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.07834: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07834&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.09373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[489] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.10362: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10362&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[490] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.10959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[491] Towards Any-Quality Image Segmentation via Generative and Adaptive Latent Space Enhancement

Guangqian Guo, Aixi Ren, Yong Guo, Xuehui Yu, Jiacheng Tian, Wenli Li, Chaowei Wang, Yaoxing Wang, Shan Gao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.02018: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02018&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[492] B-FIRE: Binning-Free Diffusion Implicit Neural Representation for Hyper-Accelerated Motion-Resolved MRI

Di Xu, Hengjie Liu, Yang Yang, Mary Feng, Jin Ning, Xin Miao, Jessica E. Scholey, Alexandra E. Hotca-cho, William C. Chen, Michael Ohliger, Martina Descovich, Huiming Dong, Wensha Yang, Ke Sheng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.06166: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06166&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu, Liqiang Nie

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.20597: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20597&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[494] Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes

Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Marc Comino-Trinidad, Dan Casas, Yi Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.22301: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22301&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[495] Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage

Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.22177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[496] Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation

Buddhi Wijenayake, Nichula Wasalathilake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake, Vishal M. Patel

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.04749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[497] ShapeUP: Scalable Image-Conditioned 3D Editing

Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.05676: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05676&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[498] SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control

Yuxuan Mu, Ziyu Zhang, Yi Shi, Dun Yang, Minami Matsumoto, Kotaro Imamura, Guy Tevet, Chuan Guo, Michael Taylor, Chang Shu, Pengcheng Xi, Xue Bin Peng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.03028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Hulingxiao He, Zijun Geng, Yuxin Peng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.07605: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07605&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[500] EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

Xiaomeng Peng, Xilang Huang, Seon Han Choi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.17419: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17419&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[501] TokenTrace: Multi-Concept Attribution through Watermarked Token Recovery

Li Zhang, Shruti Agarwal, John Collomosse, Pengtao Xie, Vishal Asnani

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.19019: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19019&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[502] OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness

Phuc D.A. Nguyen, Anh N. Nhu, Ming C. Lin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.19035: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19035&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[503] LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.03269: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03269&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[504] Reflective Flow Sampling Enhancement

Zikai Zhou, Muyao Wang, Shitong Shao, Lichen Bai, Haoyi Xiong, Bo Han, Zeke Xie

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.06165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[505] Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

Jiangye Yuan, Gowri Kumar, Baoyuan Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.08592: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08592&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[506] Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding

Jiahao Li, Qingwang Zhang, Qiuyu Chen, Guozhan Qiu, Yunzhong Lou, Xiangdong Zhou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.11831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[507] Not All Directions Matter: Towards Structured and Task-Aware Low-Rank Model Adaptation

Xi Xiao, Chenrui Ma, Yunbei Zhang, Chen Liu, Zhuxuanzi Wang, Yanshu Li, Lin Zhao, Guosheng Hu, Tianyang Wang, Hao Xu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.14228: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14228&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[508] LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.14882: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14882&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[509] Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

Samuel Johnny, Blessed Guda, Goodness Obasi, Aaron Emmanuel, Moise Busogi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.15941: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15941&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[510] LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

Myeongkyun Kang, Yanting Yang, Xiaoxiao Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.19451: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19451&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[511] Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.19482: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19482&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval

Qunjie Huang, Weina Zhu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.20738: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20738&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[513] Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification

Yifeng Zheng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.20806: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20806&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[514] Multi-view Graph Convolutional Network with Fully Leveraging Consistency via Granular-ball-based Topology Construction, Feature Enhancement and Interactive Fusion

Chengjie Cui, Taihua Xu, Shuyin Xia, Qinghua Zhang, Yun Cui, Shiping Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.26729: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26729&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[515] Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

Haifeng Huang, Yilun Chen, Zehan Wang, Jiangmiao Pang, Zhou Zhao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.27507: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27507&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[516] Decoupling Wavelet Sub-bands for Single Source Domain Generalization in Fundus Image Segmentation

Shramana Dey, Varun Ajith, Abhirup Banerjee, Sushmita Mitra

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.28463: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28463&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[517] OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

Taiting Lu, Kaiyuan Lin, Yuxin Tian, Mingjia Wang, Yubo Wang, Muchuan Wang, Sharique Khatri, Akshit Kartik, Yixi Wang, Amey Santosh Rane, Yida Wang, Sung-Liang Chen, Yifan Yang, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.00270: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00270&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[518] Discovering Failure Modes in Vision-Language Models using RL

Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, Aishwarya Agrawal

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.04733: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04733&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[519] CLIP-Guided Data Augmentation for Night-Time Image Dehazing

Xining Ge, Weijun Yuan, Gengjia Chang, Xuyang Li, Shuhong Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.05500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[520] FunRec: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos

Alexandros Delitzas, Chenyangguang Zhang, Alexey Gavryushin, Tommaso Di Mario, Boyang Sun, Rishabh Dabral, Leonidas Guibas, Christian Theobalt, Marc Pollefeys, Francis Engelmann, Daniel Barath

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.05621: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05621&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[521] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

Shaotian Li, Shangze Li, Chuancheng Shi, Wenhua Wu, Yanqiu Wu, Xiaohan Yu, Fei Shen, Tat-Seng Chua

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07802: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07802&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[522] SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07990: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07990&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[523] Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI

Minh Sao Khue Luu, Evgeniy N. Pavlovskiy, Bair N. Tuchinov

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.08015: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08015&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[524] Dual-Branch Remote Sensing Infrared Image Super-Resolution

Xining Ge, Gengjia Chang, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Shuhong Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10112: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10112&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[525] SIMPLER: H&E-Informed Representation Learning for Structured Illumination Microscopy

Abu Zahid Bin Aziz, Syed Fahim Ahmed, Gnanesh Rasineni, Mei Wang, Olcaytu Hatipoglu, Marisa Ricci, Malaiyah Shaw, Guang Li, J. Quincy Brown, Valerio Pascucci, Shireen Elhabian

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[526] Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[527] Towards Long-horizon Agentic Multimodal Search

Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12890: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12890&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[528] Beyond Model Design: Data-Centric Training and Self-Ensemble for Gaussian Color Image Denoising

Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[529] Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation

Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11564: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11564&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[530] Reward-Aware Trajectory Shaping for Few-step Visual Generation

Rui Li, Bingyu Li, Yuanzhi Liang, Haibin Huang, Chi Zhang, XueLong Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.14910: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14910&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[531] AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection

Hao Wang, Beichen Zhang, Yanpei Gong, Shaoyi Fang, Zhaobo Qi, Yuanrong Xu, Xinyan Liu, Weigang Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16207: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16207&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[532] Find, Fix, Reason: Context Repair for Video Reasoning

Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16243: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16243&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[533] Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction

Liyin Chen, Nazlee Zebardast, Mengyu Wang, Tobias Elze, Jason I. Comander

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16955: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16955&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[534] BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Liwei Zhang, Weihao Yuan, Siyu Zhu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16514: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16514&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[535] LiquidTAD: Efficient Temporal Action Detection via Parallel Liquid-Inspired Temporal Relaxation

Zepeng Sun, Naichuan Zheng, Hailun Xia, Junjie Wu, Liwei Bao, Xiaotai Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18274: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18274&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[536] DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax

Hang Yuan, Xiaolin Hu, Yan Wan, Menglin Gao, Wenzhe Yu, Cong Huang, Fei Xu, Qing Li, Christina Dan Wang, Zhou Yu, Kai Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18648: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18648&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[537] LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18803: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18803&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[538] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

Rui Li, Ke Hao, Yuanzhi Liang, Haibin Huang, Chi Zhang, Yun Gu, XueLong Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[539] PC2Model: ISPRS benchmark on 3D point cloud to model registration

Mehdi Maboudi, Said Harb, Jackson Ferrao, Kourosh Khoshelham, Yelda Turkan, Karam Mawas

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19596: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19596&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[540] Optimizing Diffusion Priors in Image Reconstruction from a Single Observation

Frederic Wang, Katherine L. Bouman

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Vishal Rajput

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21395: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21395&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[542] Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21728: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21728&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[543] Soft Anisotropic Diagrams for Differentiable Image Representation

Laki Iinbor, Zhiyang Dou, Wojciech Matusik

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21984: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21984&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[544] Statistical Test for Diffusion-Based Anomaly Localization via Selective Inference

Teruyuki Katsuoka, Tomohiro Shiraishi, Daiki Miwa, Vo Nguyen Le Duy, Ichiro Takeuchi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2402.11789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.11789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[545] CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.22274: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.22274&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[546] ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.22546: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.22546&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[547] Learning Scene-Level Signed Directional Distance Function with Ellipsoidal Priors and Neural Residuals

Zhirui Dai, Hojoon Shin, Yulun Tian, Ki Myung Brian Lee, Nikolay Atanasov

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2503.20066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.20066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[548] ODE-GS: Latent ODEs for Dynamic Scene Extrapolation with 3D Gaussian Splatting

Daniel Wang, Patrick Rim, Tian Tian, Dong Lao, Alex Wong, Ganesh Sundaramoorthi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.05480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[549] Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.06206: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.06206&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[550] DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

Junha Lee, Eunha Park, Minsu Cho

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.16046: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16046&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[551] Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search

Jiahao Zhang, Shaofei Huang, Yaxiong Wang, Zhedong Zheng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.08598: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08598&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[552] RobotPan: A 360$^\circ$ Surround-View Robotic Vision System for Embodied Perception

Jiahao Ma, Qiang Zhang, Peiran Liu, Zeran Su, Pihai Sun, Gang Han, Wen Zhao, Wei Cui, Zhang Zhang, Zhiyuan Xu, Renjing Xu, Jian Tang, Miaomiao Liu, Yijie Guo

Abid Talukder, Maruf Ahmed Mridul, Oshani Seneviratne

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain-specific insurance contracts to investigate these questions. We first establish a single-agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi-agent architecture that decomposes ontology construction into four artifact-driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance across architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency question driven SPARQL evaluation with complementary retrieval augmented generation based assessment). Results show that the multi-agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front-loaded planning. These findings highlight planning-first, artifact-driven generation as a promising and more auditable path toward scalable automated ontology engineering.

Tianlong Yu, Yang Yang, Ziyi Zhou, Jiaying Xu, Siwei Li, Tong Guan, Kailong Wang, Ting Bi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g. SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks, (1) Cold-start personalization, Current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction, (2) Static Attack Strategies, Existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations, (1) VLM-Based SocialContext Training, To eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation, (2) Adaptive Psychological Agent, We introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on target response, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.

[565] Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

Sadman Kabir Soumik

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p < 0.0001), with directionally positive trends for other models. Only 2 of 20 non-baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts at https://github.com/sksoumik/llm-as-judge.

[566] GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs

Federico A. Kamelhar

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Autonomous multi-agent LLM systems are increasingly deployed to investigate operational incidents and produce structured diagnostic reports. Their trustworthiness hinges on whether each claim is grounded in observed evidence rather than model-internal inference. Existing groundedness evaluators (binary classifiers, LLM-as-judge scalars, self-correction loops) treat supporting evidence as interchangeable and emit a single signal that offers no principled control over downstream action. We present GSAR, a grounding-evaluation and replanning framework that (i) partitions claims into a four-way typology (grounded, ungrounded, contradicted, complementary), giving first-class standing to non-redundant alternative perspectives; (ii) assigns evidence-type-specific weights reflecting epistemic strength; (iii) computes an asymmetric contradiction-penalised weighted groundedness score; and (iv) couples that score to a three-tier decision function (proceed, regenerate, replan) driving a bounded-iteration outer loop under an explicit compute budget. We formalise the algorithm, prove six structural properties, and evaluate five design claims on FEVER with gold Wikipedia evidence under four independently-trained LLM judges (gpt-5.4, claude-sonnet-4-6, claude-opus-4-7, gemini-2.5-pro). Every ablation reproduces in the same direction on every judge: bootstrap 95% CIs on the rho=0 effect exclude 0 on all four; the no-complementary ablation under Opus 4.7 has CI [-96,-68] of 200; at n=1000 three independent judges converge to DeltaS(rho=0)=+0.058. A head-to-head against Vectara HHEM-2.1-Open is included. To our knowledge, GSAR is the first published groundedness framework coupling evidence-typed scoring with tiered recovery under an explicit compute budget.

[567] From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

Haoran Tan, Zeyu Zhang, Chen Ma, Tianze Liu, Quanyu Dai, Xu Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, current planning approaches face a fundamental limitation that they operate at a fixed granularity level. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of \textit{progressive refinement} in cognitive science, we propose \textbf{AdaPlan-H}, a self-adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse-grained macro plan and progressively refines it based on task complexity. It generates self-adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized by imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi-step complex decision-making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import-myself/AHP.

[568] StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning

Xuanyue Zhong, Yuqiang Xie, Guanqun Bi, Jiangping Yang, Guibin Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see \textit{what is happening} but fail to reason \textit{why it matters}. This semantic gap stems from the lack of \textbf{Theory of Mind (ToM)}: the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce \textbf{StoryTR}, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed. Their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character smiling'' may actually be concealing hostility.’’ To teach models this reasoning capability, we propose an \textbf{Agentic Data Pipeline} that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B \textbf{Shorts-Moment} model, trained on ToM-guided data, improves +15.1% relative IoU over baselines, demonstrating that \textit{narrative reasoning capability matters more than parameter scale}.

[569] Information-Theoretic Measures in AI: A Practical Decision Guide

Nikolaos Al. Papadopoulos, Konstantinos E. Psannis

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.

[570] Discovering Agentic Safety Specifications from 1-Bit Danger Signals

Víctor Gallego

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g., “X cells are directionally hazardous: entering from the north is dangerous”). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15% on average, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).

Aydin Ayanzadeh, Tim Oates

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Indoor navigation remains a critical accessibility challenge for the blind and low-vision (BLV) individuals, as existing solutions rely on costly per-building infrastructure. We present an agentic framework that converts a single floor plan image into a structured, retrievable knowledge base to generate safe, accessible navigation instructions with lightweight infrastructure. The system has two phases: a multi-agent module that parses the floor plan into a spatial knowledge graph through a self-correcting pipeline with iterative retry loops and corrective feedback; and a Path Planner that generates accessible navigation instructions, with a Safety Evaluator agent assessing potential hazards along each route. We evaluate the system on the real-world UMBC Math and Psychology building (floors MP-1 and MP-3) and on the CVC-FP benchmark. On MP-1, we achieve success rates of 92.31%, 76.92%, and 61.54% for short, medium, and long routes, outperforming the strongest single-call baseline (Claude 3.7 Sonnet) at 84.62%, 69.23%, and 53.85%. On MP-3, we reach 76.92%, 61.54%, and 38.46%, compared to the best baseline at 61.54%, 46.15%, and 23.08%. These results show consistent gains over single-call LLM baselines and demonstrate that our workflow is a scalable solution for accessible indoor navigation for BLV individuals.

[572] AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting

Xudong Jiang, Mingshan Loo, Hanchen Yang, Wengen Li, Mingrui Zhang, Yichao Zhang, Jihong Guan, Shuigeng Zhou

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate long-term time series forecasting (LTSF) requires the capture of complex long-range dependencies and dynamic periodic patterns. Recent advances in frequency-domain analysis offer a global perspective for uncovering temporal characteristics. However, real-world time series often exhibit pronounced cross-domain heterogeneity where variables that appear synchronized in the time domain can differ substantially in the frequency domain. Existing frequency-based LTSF methods often rely on implicit assumptions of cross-domain homogeneity, which limits their ability to adapt to such intricate variability. To effectively integrate frequency-domain analysis with temporal dependency learning, we propose AdaMamba, a novel framework that endogenizes adaptive and context-aware frequency analysis within the Mamba state-space update process. Specifically, AdaMamba introduces an interactive patch encoding module to capture inter-variable interaction dynamics. Then, we develop an adaptive frequency-gated state-space module that generates input-dependent frequency bases, and generalizes the conventional temporal forgetting gate into a unified time-frequency forgetting gate. This allows dynamic calibration of state transitions based on learned frequency-domain importance, while preserving Mamba’s capability in modeling long-range dependencies. Extensive experiments on seven public LTSF benchmarks and two domain-specific datasets demonstrate that AdaMamba consistently outperforms state-of-the-art methods in forecasting accu racy while maintaining competitive computational efficiency. The code of AdaMamba is available at https://github.com/XDjiang25/AdaMamba.

[573] CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

Shuxu Chen, Yitian Zhou, Jiaquan Zhang, Haoyu Bian, Aming Wu, Sungyoung Lee, Chaoning Zhang, Hyundong Shin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on long, multi-step problems, leading to inconsistent answers for unchanged task. Most prior work focuses on improving the forward reasoning chain within a single pass, with less attention to iterative and contrastive correction. To address this gap, we propose CAP-CoT, a Cycle Adversarial Prompt optimization framework designed to improve both CoT reasoning accuracy and stability of a single deployed solver. In each cycle, a forward solver generates candidate reasoning chains, an adversarial challenger constructs plausible but deliberately flawed chains using targeted error strategies, and a feedback agent contrasts the two chains and produces step-aligned structured feedback. This feedback closes the optimization loop in two directions, including updating the solver prompt based on errors exposed by the challenger, and updating the challenger prompt to generate increasingly targeted errors in subsequent cycles. Unlike safety-oriented adversarial prompting such as jailbreak or prompt-injection attacks, our adversarial component is task-semantic and aims to expose logical vulnerabilities in reasoning chains. Experiments across six benchmarks and four LLM backbones demonstrate that within two to three adversarial prompt optimization cycles, CAP-CoT consistently reduces variability across runs while improving reasoning accuracy and robustness to prompt perturbations.

[574] Active Inference: A method for Phenotyping Agency in AI systems?

Philip Wilson, Axel Constant, Mahault Albarracin, Nicolás Hinrichs, Jasmine Moore, Daniel Polani, Karl Friston

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The proliferation of agentic artificial intelligence has outpaced the conceptual tools needed to characterize agency in computational systems. Prevailing definitions mainly rely on autonomy and goal-directedness. Here, we argue for a minimal notion open to principled inspection given three criteria: intentionality as action grounded in beliefs and desires, rationality as normatively coherent action entailed by a world model, and explainability as action causally traceable to internal states; we subsequently instantiate these as a partially observable Markov decision process under a variational framework wherein posterior beliefs, prior preferences, and the minimization of expected free energy jointly constitute an agentic action chain. Using a canonical T-maze paradigm, we evidence how empowerment, formulated as the channel capacity between actions and anticipated observations, serves as an operational metric that distinguishes zero-, intermediate-, and high-agency phenotypes through structural manipulations of the generative model. We conclude by arguing that as agents engage in epistemic foraging to resolve ambiguity, the governance controls that remain effective must shift systematically from external constraints to the internal modulation of prior preferences, offering a principled, variational bridge from computational phenotyping to AI governance strategy

[575] AI Identity: Standards, Gaps, and Research Directions for AI Agents

Takumi Otsuka, Kentaroh Toyoda, Alex Leung

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: AI agents are now running real transactions, workflows, and sub-agent chains across organizational boundaries without continuous human supervision. This creates a problem no current infrastructure is equipped to solve: how do you identify, verify, and hold accountable an entity with no body, no persistent memory, and no legal standing? We define AI Identity as the continuous relationship between what an AI agent is declared to be and what it is observed to do, bounded by the confidence that those two things correspond at any given moment. Through a structured survey of industry trends, emerging standards, and technical literature, we conduct a gap analysis across the full agent identity lifecycle and make three contributions: (1) a structural comparison of human and AI identity across four dimensions (substrate, persistence, verifiability, and legal standing) showing that the asymmetry is fundamental and that extending human frameworks to agents without structural modification produces systematic failures; (2) an evaluation of current technical and regulatory documents against the identity requirements of autonomous agents, finding that none adequately address the challenge of governing nondeterministic, boundary-crossing entities; and (3) identification of five critical gaps (semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability) that no current technology or regulatory instrument resolves. These gaps are structural; more engineering effort alone will not close them. Foundational research on AI identity is the central conclusion of this report.

[576] FastOMOP: A Foundational Architecture for Reliable Agentic Real-World Evidence Generation on OMOP CDM data

Niko Moeller-Grell, Shihao Shenzhang, Zhangshu Joshua Jiang, Richard JB Dobson, Vishnu V Chandrabalan

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaboration, enabled the harmonisation of electronic health records data of nearly one billion patients in 83 countries. Yet generating real-world evidence (RWE) from these repositories remains a manual process requiring clinical, epidemiological and technical expertise. LLMs and multi-agent systems have shown promise for clinical tasks, but RWE automation exposes a fundamental challenge: agentic systems introduce emergent behaviours, coordination failures and safety risks that existing approaches fail to govern. No infrastructure exists to ensure agentic RWE generation is flexible, safe and auditable across the lifecycle. We introduce FastOMOP, an open-source multi-agent architecture that addresses this gap by separating three infrastructure layers, governance, observability and orchestration, from pluggable agent-teams. Governance is enforced at the process boundary through deterministic validation independent of agent reasoning, ensuring no compromised or hallucinating agent can bypass safety controls. Agent teams for phenotyping, study design and statistical analysis inherit these guarantees through controlled tool exposure. We validated FastOMOP using a natural-language-to-SQL agent team across three OMOP CDM datasets: synthetic data from Synthea, MIMIC-IV and a real-world NHS dataset from Lancashire Teaching Hospitals (IDRIL). FastOMOP achieved reliability scores of 0.84-0.94 with perfect adversarial and out-of-scope block rates, demonstrating process-boundary governance delivers safety guarantees independent of model choice. These results indicate that the reliability gap in RWE deployment is architectural rather than model capability, and establish FastOMOP as a governed architecture for progressive RWE automation.

[577] LEGO: An LLM Skill-Based Front-End Design Generation Platform

Jincheng Lou, Ruohan Xu, Jiecheng Ma, Runzhe Tao, Xinyu Qu, Yibo Lin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug-and-play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open-source projects, and extract 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt-5.2-codex fails to solve under extra-high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach 0.805 Pass@1. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform

[578] Constraint-Based Analysis of Reasoning Shortcuts in Neurosymbolic Learning

Akihiro Takemura, Katsumi Inoue, Masaaki Nishino

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint satisfaction problem and investigate under which conditions concept mappings are uniquely determined by the constraints. We prove that a discrimination property (requiring that no valid concept mapping can be transformed into another valid mapping by swapping two concept values) is necessary for shortcut-freeness under bijective mappings, but demonstrate via a counterexample that it is insufficient even when the constraint graph is connected. We develop an ASP-based algorithm that verifies whether a given constraint set uniquely determines the intended concept mapping, with proven soundness and completeness. When shortcuts are detected, a greedy repair algorithm eliminates them by augmenting the constraint set, converging in at most $k$ iterations, where $k$ is the number of alternative valid mappings. We further provide a complexity classification: deciding shortcut-freeness is coNP-complete, counting shortcuts is #P-complete, and finding minimal repairs is NP-hard. We also establish sample complexity bounds showing that logarithmically many label queries suffice for disambiguation in favorable cases, while querying all ambiguous positions suffices in the worst case. Experiments across eight benchmark domains validate our approach.

[579] SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

Zi Meng, Wanli Song, Yi Hu, Jiayuan Rao, Gang Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Refereeing is vital in sports, where fair, accurate, and explainable decisions are fundamental. While intelligent assistant technologies are being widely adopted in soccer refereeing, current AI-assisted approaches remain preliminary. Existing research mostly focuses on isolated video perception tasks and lacks the ability to understand and reason about foul scenarios. To fill this gap, we propose SoccerRef-Agents, a holistic and explainable multi-agent decision-making framework for soccer refereeing. The main contributions are: (i) constructing the multimodal benchmark SoccerRefBench with over 1,200 referee theory questions and 600 foul video clips; (ii) building a vector-based knowledge base RefKnowledgeDB using the latest “Laws of the Game” and a classic case database for precise, knowledge-driven reasoning; (iii) designing a novel multi-agent architecture that collaborates via cross-modal RAG to bridge the semantic gap between visual content and regulatory texts. This work explores the technical capability of integrating MLLMs with refereeing expertise, and evaluations show our system significantly outperforms general-purpose MLLMs in decision accuracy and explanation quality. All databases, benchmarks, and code will be made available.

[580] When Corrective Hints Hurt: Prompt Design in Reasoner-Guided Repair of LLM Overcaution on Entailed Negations under OWL2DL

Yijiashun Qi, Xiang Xu, Yuxuan Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We report a reproducible error pattern in GPT-5.4 on OWL2DL compliance queries: the model frequently answers unknown'' when the reasoner-entailed answer is no’’ under \emph{FunctionalProperty} closure or class \emph{disjointness}. Using 180 reasoner-audited queries from a procedural expansion of the observed pattern plus 18 hand-authored held-out queries in two unrelated domains (insurance and clinical), we compare four interaction modes under matched query budget: single-shot, three rounds of generic ``you-are-wrong’’ retry, three rounds of reasoner-verdict repair with an open-world-assumption (OWA) hint, and the same repair without the hint. Direct faithfulness is 43.9,% (Wilson 95,% CI $[36.8,51.2]$); generic retry reaches 81.7,% ($[75.4,86.6]$); the verdict-with-hint variant is \emph{worse} at 67.2,% ($[60.1,73.7]$); the verdict-only variant reaches 97.8,% ($[94.4,99.1]$). All pairwise comparisons remain significant under McNemar’s exact test with Bonferroni correction ($α= 0.01$; all $p < 10^{-5}$). The same fingerprint accounts for 4/4 errors on the held-out queries. Our interpretation is bounded: prompt framing can matter more than corrective content, and reasoner-guided wrappers should be ablated explicitly.

[581] IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance

Chathurangi Shyalika, Dhaval Patel, Amit Sheth

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural-language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action-oriented reasoning that undermine trust in safety-critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episodic telemetry representations with a Failure Mode Effects Analysis Knowledge Graph (FMEA-KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types, including rotating machinery, turbofan engines, hydraulic systems, and cyber-physical production systems. Compared to LLM-only baselines, IndustryAssetEQA improves structural validity by up to 0.51, counterfactual accuracy by up to 0.47, and explanation entailment by 0.64, while reducing severe expert-rated overclaims from 28% to 2% (approximately 93% reduction). Code, datasets, and the FMEA-KG are available at https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA.

[582] ArguAgent: AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms

Jennifer Kleiman, Yizhu Gao, Xin Xia, Zhaoji Wang, Zipei Zhu, Jongchan Park, Xiaoming Zhai

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Argumentation is a core practice in STEM education, but its productivity depends on who participates and how they interact. Higher-achieving students often dominate the talk and decision-making, while lower-achieving peers may disengage, defer, or comply without contributing substantive reasoning. Forming groups strategically based on students’ stances and argumentation skills could help foster inclusive, evidence-based discourse. In practice, however, teachers are constrained in implementing this grouping strategy because it requires real-time insight into students’ positions and the quality of their argumentation, information that is difficult to assess reliably and at scale during instruction. We present a generative AI-powered system, ArguAgent, that creates groups optimizing for stance heterogeneity while constraining argumentation quality differences to +/-1 level on a validated learning progression. ArguAgent uses a two-component assessment pipeline: first scoring student arguments on a 0-4 rubric, then clustering positions via semantic analysis. We validated the scoring component against human expert consensus (Krippendorff’s ααα = 0.817) using 200 expert-generated scores. Testing three OpenAI models (GPT-4o-mini, GPT-5.1, GPT-5.2) with identical calibrated prompts, we found that systematic prompt engineering informed by human disagreement analysis contributed 89% of scoring improvement (QWK: 0.531 to 0.686), while model upgrades contributed an additional 11% (QWK: 0.686 to 0.708). Simulation testing across 100 classes demonstrated that the grouping algorithm achieves 95.4% of groups that meet both design criteria, a 3.2x improvement over random assignment. These results suggest ArguAgent can enable real-time, theoretically grounded grouping that promotes productive STEM argumentation in classrooms.

[583] Human-AI Governance (HAIG): A Trust-Utility Approach

Zeynep Engin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper introduces the Human-AI Governance (HAIG) framework, contributing to the AI Governance (AIG) field by foregrounding the relational dynamics between human and AI actors rather than treating AI systems as objects of governance alone. Current categorical frameworks (e.g., human-in-the-loop models) inadequately capture how AI systems evolve from tools to partners, particularly as foundation models demonstrate emergent capabilities and multi-agent systems exhibit autonomous goal-setting behaviours. As systems are deployed across contexts, agency redistributes in complex patterns that are better represented as positions along continua rather than discrete categories. The HAIG framework operates across three levels: dimensions (Decision Authority, Process Autonomy, and Accountability Configuration), continua (continuous positional spectra along each dimension), and thresholds (critical points along the continua where governance requirements shift qualitatively). The framework’s dimensional architecture is level-agnostic, applicable from individual deployment decisions and organisational governance through to sectorial comparison and national and international regulatory design. Unlike risk-based or principle-based approaches that treat governance primarily as a constraint on AI deployment, HAIG adopts a trust-utility orientation - reframing governance as the condition under which human-AI collaboration can realise its potential, calibrating oversight to specific relational contexts rather than predetermined categories. Case studies in healthcare and European regulation demonstrate how HAIG complements existing frameworks while offering a foundation for adaptive regulatory design that anticipates governance challenges before they emerge.

[584] Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

Sharan Ramjee

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model’s expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm - one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally-distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the “planning” phase of latent reasoning.

[585] Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

Ziyang Liu, Xinyan Guo, Xuchen Wei, Han Hao, Liu Yang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, and Optimizer Agents that recursively refine both the task agents and themselves. To sustain this self-referential evolution, we propose a dynamic benchmarking mechanism that seamlessly reuses the empirical scores of newly generated task agents as relative win-loss signals to update optimizers’ scores. This mechanism leverages the evolution of task agents as an inherent signal to drive the evaluation and refinement of optimizers without additional overhead. Empirical evaluations on mathematical optimization problems demonstrate that Escher-Loop effectively pushes past the performance ceilings of static baselines, achieving the highest absolute peak performance across all evaluated tasks under matched compute. Remarkably, we observe that the optimizer agents dynamically adapt their strategies to match the shifting demands of high-performing task agents, which explains the system’s continuous improvement and superior late-stage performance.

[586] Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines

Mazal Bethany, Kim-Kwang Raymond Choo, Nishant Vishwamitra, Peyman Najafirad

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95 to 40.34% on modern large language model (LLM) based systems, compared to at most 3.90% for token-level perturbation baselines that rely on surrogate models because they cannot operate under our threat model. A legacy system relying on static lexical retrieval exhibits near-total vulnerability 97.02%, establishing a lower bound that exposes how architectural choices govern the attack surface. Evasion effectiveness is associated with three architectural properties: evidence retrieval mechanism, retrieval-inference coupling, and baseline classification accuracy. The iterative prompt optimization yields the largest marginal gains against the most robust targets, confirming that adaptive strategy discovery is essential when evasion is non-trivial. Analysis of successful rewrites reveals four exploitation patterns, each targeting failures at distinct pipeline stages. A pattern-informed defense reduces the evasion rate by up to 65.18%.

[587] Do Transaction-Level and Actor-Level AML Queues Agree? An Empirical Evaluation of Granularity Effects on the Elliptic++ Graph

Ankur Malik

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Graph-based anti-money laundering (AML) systems on blockchain networks can score suspicious activity at two granularity levels – transactions or actor addresses – yet compliance action is conducted per actor. This paper contributes an evaluation methodology for measuring how scoring granularity affects investigation queue composition under fixed review budgets. We formalize the evaluation through a projection framework mapping transaction-level scores to the actor-level action unit via four aggregation operators, and introduce budgeted investigation metrics – yield@budget, burden decomposition, and case fragmentation. Using the public Elliptic++ Bitcoin dataset (203,769 transactions; 822,942 address occurrences), we train independent random forest classifiers at each level under a causal temporal protocol and compare review queues through Jaccard overlap, burden decomposition, and feature-matching ablations. At one-percent budget, temporal evaluation yields mean Jaccard of 0.374 (SD 0.171); static pooled evaluation yields 0.087 (95% CI [0.079, 0.094]). An enriched address model receiving all 237 features produces even lower overlap (Jaccard=0.051), with 4.3% illicit per 100 reviews versus 30.2% for the transaction-projected queue. Address-level detection value is temporally concentrated: two timesteps exceed 91% illicit per 100 reviews while the static burden is only 3.4%. A fixed hybrid policy underperforms the best single-level queue by 5.05pp (CI [-10.2pp, -0.9pp]). These findings establish that scoring granularity is a consequential design variable for AML investigation systems – same data, same budget, different queues, different addresses investigated.

[588] The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations

Benedikt Mangold

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled “sociological sandbox”. We employ a Monte Carlo method to simulate hundrets of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with “toxic” system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this “latency of toxicity” serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.

[589] MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

Haoxuan Zhang, Ruochi Li, Yang Zhang, Zhenni Liang, Junhua Ding, Ting Xiao, Haihua Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The rapid proliferation of Generative AI necessitates rigorous documentation standards for transparency and governance. However, manual creation of Model and Data Cards is not scalable, while automated approaches lack large-scale, high-fidelity benchmarks for systematic evaluation. We introduce MetaGAI, a comprehensive benchmark comprising 2,541 verified document triplets constructed through semantic triangulation of academic papers, GitHub repositories, and Hugging Face artifacts. Unlike prior single-source datasets, MetaGAI employs a multi-agent framework with specialized Retriever, Generator, and Editor agents, validated through four-dimensional human-in-the-loop assessment, including human evaluation of editor-refined ground truth. We establish a robust evaluation protocol combining automated metrics with validated LLM-as-a-Judge frameworks. Extensive analysis reveals that sparse Mixture-of-Experts architectures achieve superior cost-quality efficiency, while a fundamental trade-off exists between faithfulness and completeness. MetaGAI provides a foundational testbed for benchmarking, training, and analyzing automated Model and Data Card generation methods at scale. Our data and code are available at: https://github.com/haoxuan-unt2024/MetaGAI-Benchmark.

[590] FinGround: Detecting and Grounding Financial Hallucinations via Atomic Claim Verification

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Financial AI systems must produce answers grounded in specific regulatory filings, yet current LLMs fabricate metrics, invent citations, and miscalculate derived quantities. These errors carry direct regulatory consequences as the EU AI Act’s high-risk enforcement deadline approaches (August 2026). Existing hallucination detectors treat all claims uniformly, missing 43% of computational errors that require arithmetic re-verification against structured tables. We present FinGround, a three-stage verify-then-ground pipeline for financial document QA. Stage 1 performs finance-aware hybrid retrieval over text and tables. Stage 2 decomposes answers into atomic claims classified by a six-type financial taxonomy and verified with type-routed strategies including formula reconstruction. Stage 3 rewrites unsupported claims with paragraph- and table-cell-level citations. To cleanly isolate verification value from retrieval quality, we propose retrieval-equalized evaluation as standard methodology for RAG verification research: when all systems receive identical retrieval, FinGround still reduces hallucination rates by 68% over the strongest baseline ($p < 0.01$). The full pipeline achieves a 78% reduction relative to GPT-4o. An 8B distilled detector retains 91.4% F1 at 18x lower per-claim latency, enabling $0.003/query deployment, supported by qualitative signals from a four-week analyst pilot.

[591] Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

Enoch Hyunwook Kang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As autonomous AI agents increasingly mediate online platform markets, a fundamental question emerges: do these markets generate stable strategic outcomes? In repeated strategic environments, the Nash equilibrium provides a natural benchmark for this stability. However, empirical evidence on off-the-shelf LLM agents is mixed, leaving it unclear whether independently deployed agents can converge to equilibrium behavior without explicit strategic post-training. In this paper, we provide an affirmative answer. Extending the Bayesian learning literature in theoretical economics, we prove that AI agents, acting as Bayesian posterior samplers rather than expected utility maximizers, are guaranteed to eventually become weakly close to a Nash equilibrium in infinitely repeated games. We further extend this analysis to settings in which stage payoffs are unknown ex ante, and agents observe only their privately realized stochastic payoffs, and obtain the same convergence guarantees. Finally, we empirically evaluate these theoretical implications across five repeated-game environments, ranging from the Prisoner’s Dilemma to marketing promotion games. Taken together, our findings suggest that strategic stability in AI-mediated markets can emerge from the intrinsic reasoning and learning properties of modern AI agents, without the need for unrealistic universal fine-tuning.

[592] When AI reviews science: Can we trust the referee?

Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, Lei Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive – and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle – training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.

[593] Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate

Zhiqi Lv, Duofan Tu, Jun Li, Mingyue Zhao, Heqin Zhu, Wenliang Li, Shaohua Kevin Zhou

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The application of large language models (LLMs) in clinical decision support faces significant challenges of “tunnel vision” and diagnostic hallucinations present in their processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician’s cognitive trajectory that consists of “Memory Anchoring”, “Navigation” and “Verification” phases. DxChain introduces three key methodological innovations to elicit the potential of LLM: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look ahead planning and resource aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing “Angel-Devil” adversarial debates to resolve complex evidence conflicts. Evaluated on two real world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performances in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at https://anonymous.4open.science/r/Dx-Chain.

[594] Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning

Zichuan Fu, Xian Wu, Guojing Li, Yejing Wang, Yijun Chen, Zihao Zhao, Yixuan Luo, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM’s generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.

[595] Causal Discovery as Dialectical Aggregation: A Quantitative Argumentation Framework

Sheng Wei, Yulin Chen, Beishui Liao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Constraint-based causal discovery is brittle in finite-sample regimes because erroneous conditional-independence (CI) decisions can cascade into substantial structural errors. We propose Quantitative Argumentation for Causal Discovery (QACD), a semantics-driven framework that represents CI outcomes as graded, defeasible arguments rather than irreversible constraints. QACD maps statistical test outcomes to argument strengths and aggregates conflicting evidence through connectivity-mediated witness propagation, producing a fixed-point acceptability labeling over candidate adjacencies. Experiments on standard benchmark Bayesian networks suggest that QACD improves structural coherence and interventional reliability in several noisy or inconsistent CI regimes, while remaining competitive with classical constraint-based, hybrid, and prior argumentation-based baselines.

[596] Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

Rong Xiang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a “separation-of-powers” design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects semantically divergent intents below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured $K \times I \times P$ threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.

[597] Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

Zihao Wu, Steven Xu, Bowen Chen, Shaowen Wan, Yiwei Li, Wei Ruan, Yanjun Lyu, Siyuan Li, Dajiang Zhu, Tianming Liu, Lin Zhao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: With the emergence of large language models (LLMs) and AI agent frameworks, the human-AI co-work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low-resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi-step analytical pipelines. In this paper, we introduce Vibe Medicine, a co-work paradigm in which clinicians and researchers direct skill-augmented AI agents through natural language to execute complex, multi-step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain-informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open-source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end-to-end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over-reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent-assisted research that advances research and technological equity and reduces health care resource disparities.

[598] Transferable Human Mobility Network Reconstruction with neuroGravity

Jinming Yang, Shaoyu Huang, Zongyuan Huang, Yaohui Jin, Xiaokang Yang, Marta C. Gonzalez, Yanyan Xu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate modeling of human mobility is critical for tackling urban planning and public health challenges. In undeveloped regions, the absence of comprehensive travel surveys necessitates reconstructing mobility networks from publicly available data. Here we develop neuroGravity, a physics-informed deep learning model that reliably reconstructs mobility flows from limited observations and transfers to unobserved cities. Using only urban facility and population distributions, we find that neuroGravity’s regional representations strongly correlate with socioeconomic and livability status, offering scalable proxies for costly surveys. Furthermore, we uncover that spatial income segregation plays a key role in model transferability: mobility networks are most reliably reconstructed when target cities share similar segregation levels with the source. We design an index to quantify this segregation and accurately predict transferability. Finally, we generate mobility flow proxies for over 1,200 cities worldwide, highlighting neuroGravity’s potential to mitigate critical data shortages in resource-limited, underdeveloped areas.

[599] Expert Evaluation of LLM’s Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Jungmin Choi, Keisuke Sakaguchi, Hiroaki Yamada

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to our best knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free text format. Our key contribution is the manual evaluation of LLMs’ generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.

[600] Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion

Nastaran Dab, Raziyeh Zall, Mohammadreza Kangavari

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpretations and elicits specific affective experiences such as pleasure. This study introduces a novel computational model to infer video-induced pleasure via cognitive appraisal variables. The proposed model addresses four challenges: (1) noisy and inconsistent human labels, (2) the semantic gap between “positive emotions” and “pleasure,” (3) the scarcity of pleasure-specific datasets, and (4) the limited interpretability of existing black-box fusion methods. Our approach integrates data-driven and cognitive theory-driven methods, using cognitive appraisal theory and a fuzzy model within an innovative framework. The model employs transformer-based architectures and attention mechanisms for fine-grained multimodal feature extraction and interpretable fusion to capture both inter- and intra-modal dynamics associated with pleasure. This enables the prediction of underlying appraisal variables, thereby bridging the semantic gap and enhancing model explainability beyond conventional statistical associations. Experimental results validate the efficacy of the proposed method in detecting video-induced pleasure, achieving a peak accuracy of 0.6624 in predicting pleasure levels. These findings highlight promising implications for affective content recommendation, intelligent media creation, and advancing our understanding of how digital media influences human emotions.

[601] FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment

Sophie Chiang, Tom Brennan, Fethiye Irmak Dogan, Jiaee Cheong, Hatice Gunes

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision-Language Models (VLMs), their deployment in clinical settings has raised concerns due to their lack of transparency and potential for bias. While previous research has explored the intersection of fairness and Explainable AI (XAI), its application to VLMs for wellbeing assessment and depression prediction remains under-explored. This work investigates VLM performance across laboratory (AFAR-BSFT) and naturalistic (E-DAIC) datasets, focusing on diagnostic reliability and demographic fairness. Performance varied substantially across environments and architectures; Phi3.5-Vision achieved 80.4% accuracy on E-DAIC, while Qwen2-VL struggled at 33.9%. Additionally, both models demonstrated a tendency to over-predict depression on AFAR-BSFT. Although bias existed across both architectures, Qwen2-VL showed higher gender disparities, while Phi-3.5-Vision exhibited more racial bias. Our XAI intervention framework yielded mixed results; fairness prompting achieved perfect equal opportunity for Qwen2-VL at a severe accuracy cost on E-DAIC. On AFAR-BSFT, explainability-based interventions improved procedural consistency but did not guarantee outcome fairness, sometimes amplifying racial bias. These results highlight a persistent gap between procedural transparency and equitable outcomes. We analyse these findings and consolidate concrete recommendations for addressing them, emphasising that future fairness interventions must jointly optimise predictive accuracy, demographic parity, and cross-domain generalisation.

[602] Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features

John Winnicki, Abeynaya Gnanasekaran, Eric Darve

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren’t very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there’s no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model’s local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.

[603] ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Boqin Yuan, Renchu Song, Yue Su, Sen Yang, Jing Qin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Without per-step cost, a pipeline cannot distinguish adding a missing step to fix a bug from removing an expensive step that never affected the outcome. We introduce ClawTrace, an agent tracing platform that records every LLM call, tool use, and sub-agent spawn during an agent session and compiles each session into a TraceCard: a compact YAML summary with per-step USD cost, token counts, and redundancy flags. Built on ClawTrace, CostCraft is a distillation pipeline that reads TraceCards and produces three types of skill patches. Preserve patches keep behaviors that led to success. Prune patches remove expensive steps that did not matter, each backed by a counterfactual argument against a named high-cost step. Repair patches fix failures grounded in oracle evidence. Ablations on 30 held-out SpreadsheetBench tasks show that both cost attribution and prune patches independently reduce quality regressions. When the same skill is applied to 30 unrelated SkillsBench tasks, an unexpected asymmetry emerges: prune rules transferred across benchmarks and cut median cost by 32%, while preserve rules, trained on benchmark-specific conventions, caused regressions on new task types. We release ClawTrace and TraceCards as open infrastructure for cost-aware agent research.

[604] Does Machine Unlearning Preserve Clinical Safety? A Risk Analysis for Medical Image Classification

Andreza M. C. Falcao, Filipe R. Cordeiro

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The application of Deep Learning in medical diagnosis must balance patient safety with compliance with data protection regulations. Machine Unlearning enables the selective removal of training data from deployed models. However, most methods are validated primarily through efficiency and privacy-oriented metrics, with limited attention to clinically asymmetric error costs. In this work, we investigate how unlearning affects clinical risk in binary medical image classification. We show that standard unlearning strategies (Fine-Tuning, Random Labeling, and SalUn) may reduce test utility while increasing false-negative rates, thereby amplifying clinical risk. To mitigate this, we propose SalUn-CRA (Clinical Risk-Aware), a variant of SalUn that replaces random relabeling with entropy-based forgetting for malignant samples in the forget set, preventing the model from learning harmful benign associations. We evaluate on DermaMNIST and PathMNIST medical image datasets under 20% and 50% data removal. Using Global Risk metrics with asymmetric costs, SalUn-CRA achieves lower or comparable clinical risk to full retraining while preserving unlearning effectiveness. These results suggest that clinical risk should be an integral component of unlearning validation in medical systems.

[605] Time-Series Forecasting in Safety-Critical Environments: An EU-AI-Act-Compliant Open-Source Package / Zeitreihenprognose in sicherheitskritischen Umgebungen: Ein KI-VO-konformes Open-Source-Paket

Thomas Bartz-Beielstein, Eva Bartz

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: With spotforecast2-safe we present an integrated Compliance-by-Design approach to Python-based point forecasting of time series in safety-critical environments. A review of the relevant open-source tooling shows that existing compliance solutions operate consistently outside of the library to be used - e.g. as scanners, templates, or runtime layers. spotforecast2-safe takes the inverse approach and anchors the requirements of Regulation (EU) 2024/1689 (the EU AI Act, in German: KI-VO), of IEC 61508, of the ISA/IEC 62443 standards series, and of the Cyber Resilience Act within the library: in application-programming-interface contracts, persistence formats, and continuous-integration gates. The approach is operationalised by four non-negotiable code-development rules (zero dead code, deterministic processing, fail-safe handling, minimal dependencies) together with the corresponding process rules (model card, executable docstrings, CI workflows, Common-Platform-Enumeration (CPE) identifier, REUSE-conformant licensing, release pipeline). Interactive visualisation, hyperparameter tuning and automated machine learning (AutoML), as well as deep-learning and large-language-model backends are deliberately excluded, because each of these components either enlarges the attack surface, introduces non-determinism, or impairs reproducibility. A bidirectional traceability matrix maps every regulatory provision onto the corresponding mechanism in the code; an end-to-end example of European-market electricity generation, transmission, and consumption forecasting demonstrates the application. The package is open-source and available under Affero General Public License (AGPL) 3.0-or-later.

[606] ZenBrain: A Neuroscience-Inspired 7-Layer Memory Architecture for Autonomous AI Systems

Alexander Bering

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Despite a century of empirical memory research, existing AI agent memory systems rely on system-engineering metaphors (virtual-memory paging, flat LLM storage, Zettelkasten notes), none integrating principles of consolidation, forgetting, and reconsolidation. We present ZenBrain, a multi-layer memory architecture integrating fifteen neuroscience models. It implements seven memory layers (working, short-term, episodic, semantic, procedural, core, cross-context) orchestrated by nine foundational algorithms (Two-Factor Synaptic Model, vmPFC-coupled FSRS, Simulation-Selection sleep, Bayesian confidence, and five more) plus six new Predictive Memory Architecture (PMA) components: a four-channel NeuromodulatorEngine, prediction-error-gated ReconsolidationEngine, TripleCopyMemory with divergent decay, four-dimensional PriorityMap with amygdala fast-path, StabilityProtector (NogoA/HDAC3 analogue), and MetacognitiveMonitor for bias detection. The 15-algorithm ablation reveals a cooperative survival network: under stress, 9 of 15 algorithms become individually critical (delta-Q up to -93.7%, Wilcoxon, 10 seeds, alpha=0.005). Simulation-Selection sleep achieves 37% stability improvement (p<0.005) with 47.4% storage reduction. TripleCopyMemory retains S(t)=0.912 at 30 days; PriorityMap reaches NDCG@10=0.997. Multi-layer routing beats a flat single-layer baseline by 20.7% F1 on LoCoMo (p<0.005) and 19.5% on MemoryArena (p=0.015). On LongMemEval-500, ZenBrain holds the highest mean rank on all 12 system-judge cells (4 systems x 3 LLM judges), three-judge mean J=0.545 vs letta=0.485, a-mem=0.414, mem0=0.394; all 9 pair-wise contrasts clear Bonferroni (alpha=0.05/18, min p=6.2e-31, d in [0.18, 0.52]). Under LongMemEval’s binary judge, ZenBrain reaches 91.3% of oracle accuracy at 1/106th the per-query token budget. Open-source with 11,589 automated test cases.

[607] MarketBench: Evaluating AI Agents as Market Participants

Andrey Fradkin, Rohit Krishnan

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly. In order to effectively participate in markets, agents need to have informative signals of their own ability to successfully complete a task and the cost of doing so. We propose MarketBench, a benchmark for assessing whether AI agents have these capabilities. We use a 93-task subset of SWE-bench Lite, a software engineering benchmark, with six recently released LLMs as a demonstration. These LLMs are miscalibrated on both success probability and token usage, and auctions built from these self-reports diverge from a full-information allocation. A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration, but only modestly narrows the gap to a full-information benchmark. We also document the performance of a market-based scaffolding with these LLMs. Our results point to self-assessment as a key bottleneck for market-style coordination of AI agents.

[608] LLM-Augmented Traffic Signal Control with LSTM-Based Traffic State Prediction and Safety-Constrained Decision Support

Jiazhao Shi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Traffic signal control is a critical task in intelligent transportation systems, yet conventional fixed-time and rule-based methods often struggle to adapt to dynamic traffic demand and provide limited decision interpretability. This study proposes an LLM-augmented traffic signal control framework that integrates LSTM-based short-term traffic state prediction, predictive phase selection, structured large language model reasoning, and safety-constrained action filtering. The LSTM module forecasts future queue length, waiting time, vehicle count, and lane occupancy based on recent intersection-level observations. A predictive controller then generates candidate signal actions, while the LLM module evaluates these actions using structured traffic-state inputs and produces congestion diagnoses, phase adjustment recommendations, and natural-language explanations. To ensure operational reliability, all LLM-generated recommendations are validated by a safety filter before execution. Simulation-based experiments in SUMO compare the proposed method with fixed-time control, rule-based control, and an LSTM-based predictive baseline under balanced demand, directional peak demand, and sudden surge scenarios. The results indicate that the proposed framework improves traffic efficiency, especially under dynamic and non-recurrent traffic conditions, while maintaining zero constraint violations after safety filtering. Overall, this study demonstrates that LLMs can enhance traffic signal control when used as constrained reasoning and decision-support modules rather than direct low-level controllers. Keywords: Intelligent Transportation Systems; Traffic Signal Control; Large Language Models; LSTM; Traffic State Prediction; Decision Support; Safety-Constrained Control; SUMO Simulation.

[609] Agentic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions

Hung N. Do, Jessica Z. Kubicek-Sutherland, Oscar A. Negrete, S. Gnanakaran

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We instruct an AI agent to construct two separate agentic AI platforms: one for autonomous training of predictive ML models for human-human and virus-human PPI, and the other for inducing explicit general rules governing human-human and virus-human PPI. The first agentic AI platform for autonomous training of predictive ML models for PPI is designed to consist of five AI agents that handle autonomous data collection, data verification, feature embedding, model design, and training and validation on three-way protein-disjoint cross-fold datasets. For human-human and human-virus PPIs, the final three-way protein-disjoint ensemble achieves an accuracy of 87.3% and 86.5%, respectively. For cross-checking and interpretability purposes, the second agentic AI platform is designed to replace ML predictions with human-readable rules derived from protein embeddings, physicochemical autocovariance descriptors, compartment annotations, pathway-domain overlap, and graph contexts. For human-human PPI, it is defined by a two-rule induction, whereas human-virus is induced by a more complex set of weighted rules. The rules induced by the second agentic platform align with the SHAP-identified features from the predictive ML models built by the first agentic platform. Taken together, our work demonstrates the agentic AI’s ability to orchestrate from data planning to execution, and from rule induction to explanation in ML, opening the door to various applications.

[610] GAMED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation

Shiven Agarwal, Yash Shah, Ashish Raj Shekhar, Priyanuj Bordoloi, Vivek Gupta

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We introduce GameDAI, a hierarchical multi-agent framework that transforms instructor-provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, GameDAI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher-order Bloom’s Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents (${\sim}$73,500 $\rightarrow$ ${\sim}$19,900 tokens/game) at $0.46 per game. Within this model configuration, these results suggest that phase-bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom’s-aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.

[611] Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs

Rhea Makkuni, Ananya Joshi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (e.g., operational failures or pandemics). Forecasting models can assist in this task by analyzing large volumes of resource-related data at the facility level, but they must be reliable for decision-making under real-world data conditions. Recent work shows that large language models (LLMs) can incorporate richer forms of context into numerical forecasting. Whereas traditional models rely primarily on temporal context (i.e., past observations), LLMs can also leverage non-temporal public health context such as demographic, geographic, and population-level features. However, it remains unclear how these models should be used to produce stable or decision-relevant predictions in real-world healthcare settings. To evaluate how LLMs can be effectively used in this setting, we evaluate three approaches across 60 counties with low-,mid-, and high-hospitalization intensities in the United States: direct LLM-based forecasting, classical time-series models, and a context-augmented hybrid pipeline (HybridARX) that incorporates LLM-derived signals into structured models. Because the goal is operational decision-making rather than error minimization alone, we evaluate performance with bias and lead-lag alignment in addition to standard forecasting metrics. Our results show that HybridARX improves over classical ARX by yielding more stable and better-calibrated forecasts, particularly when incorporating noisy contextual signals into structured time-series models. These findings suggest that, in non-stationary healthcare resource forecasting, LLMs are most useful when embedded within structured hybrid models.

[612] An empirical evaluation of the risks of AI model updates using clinical data: stability, arbitrariness, and fairness

Ioannis Bilionis, Ricardo C. Berrios, Luis Fernandez-Luque, Carlos Castillo

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Artificial Intelligence and Machine Learning (AI/ML) models used in clinical settings are increasingly deployed to support clinical decision-making. However, when training data become stale due to changes in demographics, environment, or patient behaviors, model performance can degrade substantially. While updating models with new training data is necessary, such updates may also introduce new risks. We evaluated the proposed monitoring framework on four publicly available U.S.-based Type 1 Diabetes datasets containing high-resolution continuous glucose monitoring (CGM) data, comprising approximately 11,300 weekly observations from 496 participants under 20 years of age. All datasets included structured sociodemographic information. Using the prediction of severe hyperglycemia events in children with type 1 diabetes as a case study, we examine how different model update strategies can adversely affect model stability (e.g., by causing predictions to “flip” for a large number of cases after an update), increase arbitrariness in predictions, or worsen accuracy equity and the balance of error rates across subpopulations. We propose multiple dimensions for continuous monitoring to detect these issues and argue that such monitoring is essential for the development of trustworthy clinical decision support systems.

[613] Representational Curvature Modulates Behavioral Uncertainty in Large Language Models

Jack King, Evelina Fedorenko, Eghbal A. Hosseini

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input sequences across layers, potentially facilitating next-token prediction via linear extrapolation. However, a direct link between this trajectory and token-level behavior has been missing. We provide such a link by relating contextual curvature-a geometric measure of how sharply the representational trajectory bends over recent context-to next-token entropy. Across two models (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments reveal selective dependence: manipulating curvature through trajectory-aligned interventions reliably modulates entropy, while geometrically misaligned perturbations have no effect. Finally, regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences behavioral uncertainty in LLMs.

[614] Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

M. Meng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.

[615] CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

Ruifeng Yuan, Wanxing Chang, Weiwei Cao, Bowen Shi, Zhongyu Wei, Ling Zhang, Jianpeng Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports, constructed from CT-RATE and Merlin. Our benchmark is constructed through a meticulous, Question-Answering (QA) based process: first, we identify and structure key, finding-specific clinical attributes (like location, size, margin). Second, we systematically transform these attributes into a QA dataset, where questions probe for specific clinical details grounded in gold-standard reports. The evaluation protocol for CT-FineBench involves using this QA dataset to query a machine-generated report and scoring the correctness of the answers. This allows for a comprehensive, interpretable, and clinically-relevant assessment, moving beyond superficial lexical overlap to pinpoint specific clinical errors. Experiments show that CT-FineBench correlates better with expert clinical assessment and is substantially more sensitive to fine-grained factual errors than prior metrics.

[616] QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

Chenyang An, Qihao Ye, Minghao Pan, Jiayaun Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We explore a central question in AI for mathematics: can AI systems produce original, nontrivial proofs for open research problems? Despite strong benchmark performance, producing genuinely novel proofs remains an outstanding challenge for LLMs. Through systematic experiments with frontier LLMs on research-level proof tasks, we identify seven failure modes that prevent reliable proof generation, including context contamination, citation hallucination, hand-waving on key steps and misallocation of proof effort, unstable proof plans, unfocused verification, problem modification and single-model bottleneck. We argue that the gap between benchmark success and research-level proving is primarily one of system design, due to those failure modes. We present QED, an open-source multi-agent proof system in which each architectural decision directly addresses a specific failure mode. Evaluated on five open problems in applied analysis and PDEs contributed by domain experts, QED produces correct proofs for three problems, each verified by the contributing experts as original and nontrivial. QED is released as open-source software at https://github.com/proofQED/QED.

[617] AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Yuxuan Gao, Megan Wang, Yi Ling Yu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Three analyses ground the framework. The four factors capture largely complementary information (n=50; $ρ_{\max}=0.61$ for Adoption-Ecosystem, all others $|ρ| \leq 0.37$). A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars ($ρ_s=0.52$, $p<0.01$) and Stack Overflow question volume ($ρ_s=0.49$, $p<0.01$), with VS Code installs ($ρ_s=0.44$, $p<0.05$) reported as illustrative given that only 11 of 35 agents have non-zero installs. On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated ($ρ_s=0.25$; 9 of 11 agents shift by at least 2 ranks), driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset. This is precisely why we rest the framework’s validity claim on the broader n=35 test rather than the SWE-bench overlap. AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking. The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.

[618] A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees

Bin Chen, Shouliang Zhu, Beidan Liu, Yong Zhao, Tianle Pu, Huichun Li, Zhengqiu Zhu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Designing heuristics for combinatorial optimization problems (COPs) is a fundamental yet challenging task that traditionally requires extensive domain expertise. Recently, Large Language Model (LLM)-based Automated Heuristic Design (AHD) has shown promise in autonomously generating heuristic components with minimal human intervention. However, most existing LLM-based AHD methods enforce fixed algorithmic templates to ensure executability, which confines the search to component-level tuning and limits system-level algorithmic expressiveness. To enable open-ended solver synthesis beyond rigid templates, we propose Automated Algorithm Design via Evolutionary Program Trees (A2DEPT), which treats LLMs as system-level algorithm architects. A2DEPT explores the vast program space via a tree-structured evolutionary search with hybrid selection and hierarchical operators, enabling iterative refinement of complete algorithms. To make open-ended generation practical, we enforce executability with a lightweight program-maintenance loop that performs feedback-driven repair. In experiments, A2DEPT consistently outperforms representative LLM-based baselines on both standard and highly constrained benchmarks. On the standard benchmarks, it reduces the mean normalized optimality gap by 9.8% relative to the strongest competing AHD baseline.

[619] Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer

Liangru Xiang, Yuxi Ma, Zhihao Cao, Yixin Zhu, Song-Chun Zhu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning – inducing latent structures through sequential exploration and transferring them across contexts – remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping – what we term environmental grounding – before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information – in both the image-only and text-and-image conditions – overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.

[620] An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress

Hikmat Karimov, Rahid Zahid Alekberli

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insucient to characterize system reliability. This study proposes a thermodynamic inspired modeling framework for analyzing the stability of LLM outputs under conditions of uncertainty and perturbation. The framework introduces a composite stability score that integrates task utility, entropy as a measure of external uncertainty, and two internal structural proxies: internal integration and aligned reective capacity. Rather than interpreting these quantities as physical variables, the formulation is intended as an interpretable abstraction that captures how internal structure may modulate the impact of disorder on model behavior. Using the IST-20 benchmarking protocol and associated metadata, we analyze 80 modelscenario observations across four contemporary LLMs. The proposed formulation consistently yields higher stability scores than a reduced utilityentropy baseline, with a mean improvement of 0.0299 (95% CI: 0.02470.0351). The observed gain is more pronounced under higher entropy conditions, suggesting that the framework captures a form of nonlinear attenuation of uncertainty. We do not claim a fundamental physical law or a complete theory of machine ethics. Instead, the contribution of this work is a compact and interpretable modeling perspective that connects uncertainty, performance, and internal structure within a unied evaluation lens. The framework is intended to complement existing benchmarking approaches and to support ongoing discussions in AI safety, reliability, and governance.

[621] The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability

Hikmat Karimov, Rahid Zahid Alekberli

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This study introduces the Kerimov-Alekberli model, a novel information-geometric framework that redefines AI safety by formally linking non-equilibrium thermodynamics to stochastic control for the ethical alignment of autonomous systems. By establishing a formal isomorphism between non-equilibrium thermodynamics and stochastic control, we define systemic anomalies as deviations from a Riemannian manifold. The model utilizes the Kullback-Leibler divergence as the primary metric, governed by a dynamic threshold derived from the Fisher Information Metric. We further ground this framework in the Landauer Principle, proving that adversarial perturbations perform measurable physical work by increasing the system’s informational entropy. Validation on the NSL-KDD dataset and unmanned aerial vehicle trajectory simulations demonstrated that our model achieves effective real-time detection via the FPT trigger, with strong performance metrics (e.g., high accuracy and low FPR) on benchmark datasets. This study provides a rigorous physical foundation for AI safety, transitioning from heuristic, rule-based ethical frameworks to a thermodynamics-based stability paradigm by grounding ethical violations in quantifiable physical work and entropic information.

[622] SemML 2.0: Synthesizing Controllers for LTL

Jan Křetínský, Tobias Meggendorfer, Maximilian Prokop

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Synthesizing a reactive system from specifications given in linear temporal logic (LTL) is a classical problem, finding its applications in safety-critical systems design. These systems are typically represented using either Mealy machines or AIGER circuits. We present the second version of SemML, which outperforms all state-of-the-art tools for finding either solution. Aside from implementing the classical automata-theoretic approach, our tool utilizes partial exploration and machine-learning guidance for obtaining solutions efficiently, and numerous heuristics and improvements of classic algorithms for extracting small representations of these solutions. We evaluate our tool against the existing state-of-the-art tools (in particular Strix, LtlSynt, and the previous version of SemML) on the dataset of the synthesis competition SYNTCOMP. We show that we solve significantly more instances and do so much faster than other tools, while maintaining state-of-the-art solution quality.

[623] An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

Moritz Link, Jonathan Hoss, Noah Klarmann

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of “decentralized factories”, multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap – the performance difference between these two training modalities. In our evaluation, the joint training can produce superior performance compared to the best-performing combinations of dispatching rules and modular training. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.

[624] Right-to-Act: A Pre-Execution Non-Compensatory Decision Protocol for AI Systems

Gadi Lavi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Current AI systems increasingly operate in contexts where their outputs directly trigger real-world actions. Most existing approaches to AI safety, risk management, and governance focus on post-hoc validation, probabilistic risk estimation, or certification of model behavior. However, these approaches implicitly assume that once a decision is produced, it is eligible for execution. In this work, we introduce the Right-to-Act protocol, a deterministic, non-compensatory pre-execution decision layer that evaluates whether an AI-generated decision is permitted to be realized at all. Unlike compensatory systems, where high-confidence signals can override failed conditions, the proposed framework enforces strict structural constraints: if any required condition is unmet, execution is halted or deferred. We formalize the distinction between compensatory and non-compensatory decision regimes and define a pre-execution legitimacy boundary. Through a scenario-based case study, we demonstrate how identical AI outputs can lead to divergent outcomes when evaluated under a Right-to-Act protocol, preserving reversibility and preventing premature or irreversible actions. The proposed approach reframes AI control from optimizing decisions to governing their admissibility, introducing a protocol-level abstraction that operates independently of model architecture or training methodology.

[625] Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Ashmi Banerjee, Adithi Satish, Wolfgang Wörndl, Yashar Deldjoo

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions – relevance, diversity, sustainability, and popularity balance, and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility: https://github.com/ashmibanerjee/trs-llm-calibration.

[626] Credal Concept Bottleneck Models for Epistemic-Aleatoric Uncertainty Decomposition

Tanmoy Mukherjee, Thomas Bailleux, Pierre Marquis, Zied Bouraoui

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Concept Bottleneck Models (CBMs) predict through human-interpretable concepts, but they typically output point concept probabilities that conflate epistemic uncertainty (reducible model underspecification) with aleatoric uncertainty (irreducible input ambiguity). This makes concept-level uncertainty hard to interpret and, more importantly, hard to act upon. We introduce CREDENCE (Credal Ensemble Concept Estimation), a CBM framework that decomposes concept uncertainty by construction. CREDENCE represents each concept as a credal prediction (a probability interval), derives epistemic uncertainty from disagreement across diverse concept heads, and estimates aleatoric uncertainty via a dedicated ambiguity output trained to match annotator disagreement when available. The resulting signals support prescriptive decisions: automate low-uncertainty cases, prioritize data collection for high-epistemic cases, route high-aleatoric cases to human review, and abstain when both are high. Across several tasks, we show that epistemic uncertainty is positively associated with prediction errors, whereas aleatoric uncertainty closely tracks annotator disagreement, providing guidance beyond error correlation. Our implementation is available at the following link: https://github.com/Tankiit/Credal_Sets/tree/ensemble-credal-cbm

[627] Explanation Quality Assessment as Ranking with Listwise Rewards

Thomas Bailleux, Tanmoy Mukherjee, Emmanuel Lonca, Pierre Marquis, Zied Bouraoui

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single “best” explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe three findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: https://github.com/Tankiit/PPO_Learning_to_rank

[628] Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU

Hee-Kyong Yoo, Wonbae Kim, Hyocheol Ahn

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-intent natural language understanding requires retrieval systems that simultaneously achieve high accuracy and computational efficiency, yet existing approaches apply either uniform single-step retrieval that compromises recall or fixed-depth hierarchical decomposition that introduces excessive latency regardless of query complexity. This paper proposes Adaptive Tree-of-Retrieval (Adaptive ToR), a complexity-aware retrieval architecture that dynamically configures retrieval topology based on query characteristics. The system integrates four components: (1) a Query Tree Classifier computing a Query Complexity Index from weighted linguistic signals to route queries to either a rapid single-step path or an adaptive-depth hierarchical path; (2) a Tree-Based Retrieval module that recursively decomposes complex queries into focused sub-queries calibrated to predicted complexity; (3) an Adaptive Pruning Module employing two-stage filtering combining quantitative similarity gating with semantic relevance evaluation to suppress exponential node growth; and (4) a Retrieval Reranking Layer featuring a deduplicator-first pipeline and global LLM rescoring for production efficiency. Evaluation on the NLU++ benchmark (2,693 multi-intent queries across Banking and Hotel domains) yields 29.07% Subset Accuracy and 71.79% Micro-F1, a 9.7% relative improvement over fixed-depth baselines, while reducing latency by 37.6%, LLM invocations by 43.0%, and token consumption by 9.8%. Depth-wise analysis reveals that 26.92% of queries resolve within three seconds (2.45s mean latency) via single-step routing (d=0: 37.9% Subset Accuracy, 74.8% Micro-F1), while token consumption scales by 4.9x across depths, validating complexity-aware resource allocation and establishing Pareto-optimal balance across accuracy, latency, and computational efficiency.

[629] Generative Design of a Gas Turbine Combustor Using Invertible Neural Networks

Patrick Krüger, Hanno Gottschalk, Werner Krebs, Bastian Werdelmann

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The need to burn 100% H2 in high efficient gas turbines featuring low NOx combustion in premix mode require the complete redesign of the combustion system to ensure stable operation without any flashback. Since all engine frames featuring a power range from 4 MW up to 600 MW are affected, a huge design effort is expected. To reduce this effort, especially to transfer knowledge between the different engine classes, generative design methods using latest AI technology will provide promising potential. In this work, this challenge is approached utilizing the current advances in generative artificial intelligence. We train an Invertible Neural Network (INN) on an expandable database of geometrically parameterized combustor designs with simulated performance labels. Utilizing the INN in its inverse direction, multiple design proposals are generated which fulfill specified performance labels.

[630] Certified geometric robustness – Super-DeepG

Noémie Cohen, Mélanie Ducoffe, Christophe Gabreau, Claire Pagetti, Xavier Pucel

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Safety-critical applications are required to perform as expected in normal operations. Image processing functions are often required to be insensitive to small geometric perturbations such as rotation, scaling, shearing or translation. This paper addresses the formal verification of neural networks against geometric perturbations on their image dataset. Our method Super-DeepG improves the reasoning used in linear relaxation techniques and Lipschitz optimization, and provides an implementation that leverages GPU hardware. By doing so, Super-DeepG achieves both precision and computational efficiency of robustness certification, to an extent that outperforms prior work. Super-DeepG is shared as an open-source tool on GitHub.

[631] Aligning with Your Own Voice: Self-Corrected Preference Learning for Hallucination Mitigation in LVLMs

Byeonggeuk Lim, JungMin Yun, Junehyoung Kwon, Kyeonghyun Kim, YoungBin Kim

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Vision-Language Models (LVLMs) frequently suffer from hallucinations. Existing preference learning-based approaches largely rely on proprietary models to construct preference datasets. We identify that this reliance introduces a distributional mismatch between the proprietary and target models that hinders efficient alignment. To address this, we propose Alignment via VErified Self-correction DPO (AVES-DPO), a framework that aligns LVLMs using in-distribution data derived from the model’s intrinsic knowledge. Our approach employs a consensus-based verification mechanism to diagnose diverse hallucinations and guides the model to self-correct, thereby generating preference pairs strictly compatible with its internal distribution. Extensive experiments demonstrate that AVES-DPO surpasses existing baselines in hallucination mitigation while requiring only 5.2k samples.

[632] PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental challenges underlying these failures: (1) spatio-temporal identity drift, where objects lose their physical identity across successive frames and break causal chains, and (2) volatility of inference-time insights, where a model may occasionally produce correct physical reasoning but never consolidates it for future reuse. To address these challenges, we propose PhysNote, an agentic framework that enables VLMs to externalize and refine physical knowledge through self-generated “Knowledge Notes.” PhysNote stabilizes dynamic perception through spatio-temporal canonicalization, organizes self-generated insights into a hierarchical knowledge repository, and drives an iterative reasoning loop that grounds hypotheses in visual evidence before consolidating verified knowledge. Experiments on PhysBench demonstrate that PhysNote achieves 56.68% overall accuracy, a 4.96% improvement over the best multi-agent baseline, with consistent gains across all four physical reasoning domains.

[633] Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus

Johannes Moll, Jannik Lübberstedt, Christoph Nuernbergk, Jacob Stroh, Luisa Mertens, Anna Purcarea, Christopher Zirn, Zeineb Benchaaben, Fabian Drexel, Hartmut Häntze, Anirudh Narayanan, Friedrich Puttkammer, Andrei Zhukov, Jacqueline Lammert, Sebastian Ziegelmayer, Markus Graf, Marion Högner, Marcus Makowski, Florian Bassermann, Lisa C. Adams, Jiazhen Pan, Daniel Rueckert, Krischan Braitsch, Keno K. Bressem

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multiple myeloma is managed through sequential lines of therapy over years to decades, with each decision depending on cumulative disease history distributed across dozens to hundreds of heterogeneous clinical documents. Whether LLM-based systems can synthesise this evidence at a level approaching expert agreement has not been established. A retrospective evaluation was conducted on longitudinal clinical records of 811 myeloma patients treated at a tertiary centre (2001-2026), covering 44,962 documents and 1,334,677 laboratory values, with external validation on MIMIC-IV. An agentic reasoning system was compared against single-pass retrieval-augmented generation (RAG), iterative RAG, and full-context input on 469 patient-question pairs from 48 templates at three complexity levels. Reference labels came from double annotation by four oncologists with senior haematologist adjudication. Iterative RAG and full-context input converged on a shared ceiling (75.4% vs 75.8%, p = 1.00). The agentic system reached 79.6% concordance (95% CI 76.4-82.8), exceeding both baselines (+3.8 and +4.2 pp; p = 0.006 and 0.007). Gains rose with question complexity, reaching +9.4 pp on criteria-based synthesis (p = 0.032), and with record length, reaching +13.5 pp in the top decile (n = 10). The system error rate (12.2%) was comparable to expert disagreement (13.6%), but severity was inverted: 57.8% of system errors were clinically significant versus 18.8% of expert disagreements. Agentic reasoning was the only approach to exceed the shared ceiling, with gains concentrated on the most complex questions and longest records. The greater clinical consequence of residual system errors indicates that prospective evaluation in routine care is required before these findings translate into patient benefit.

[634] MIMIC: A Generative Multimodal Foundation Model for Biomolecules

Siavash Golkar, Jake Kovalic, Irina Espejo Morales, Samuel Sledzieski, Minhuan Li, Ksenia Sokolova, Geraud Krawezik, Alberto Bietti, Claudia Skok Gibbs, Roman Klypa, Shengwei Xiong, Francois Lanusse, Liam Parker, Kyunghyun Cho, Miles Cranmer, Tom Hehir, Michael McCabe, Lucas Meyer, Rudy Morel, Payel Mukhopadhyay, Mariel Pettee, Helen Qu, Jeff Shen, David Fouhey, Hadi Sotoudeh, Vikram Mulligan, Pilar Cossio, Sonya M. Hanson, Alisha N. Jones, Olga G. Troyanskaya, Shirley Ho

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split-track encoder-decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC’s sequence reconstruction relative to sequence-only inputs, while its learned representations enable state-of-the-art performance on RNA and protein downstream tasks. MIMIC achieves state-of-the-art splicing prediction, and its joint generative formulation enables isoform-aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice-disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD-L1 and hACE2 binding sites produces diverse, high-confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay-dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC’s aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.

[635] Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

Dahlia Shehata, Ming Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder-only autoregressive Transformers. This phenomenon, a behavioral manifestation of Information Over-squashing, occurs when the cumulative probabilistic weight of historical context overrides mid-task updates, causing agents to remain anchored to obsolete constraints despite explicit contradictory instructions. We propose Self-Synthesizing Reasoning Protocols (SSRP), a metacognitive framework that implements a discrete separation between high-level architectural planning (Architect) and turn-by-turn procedural execution (Executive). We evaluate SSRP across 9K trajectories using the MultiWOZ 2.2 dataset and the Aggregate Pivot Accuracy (APA), a novel metric we validate by mapping its scores to the U-shaped ‘Lost in the Middle’ curve. We present 3 experimental tiers: a shallow recency-based retrieval pilot, a high-entropy SOP, and a semantic hijacked 3-hop Multi-Fact Synthesis task. Our results empirically locate the Attention Stability Boundary, where stateless Vanilla ReAct baselines for GPT 5.4 collapse to 0.1% success while SSRP achieves a 715X Resilience Lift. We demonstrate statistically significant gains across Gemini 3.1 Pro, Claude Sonnet 4.6 and DeepSeek V3.2. Audits confirm SSRP necessity by proving attentional lapse via a recursive reflexion baseline (100% success); decoupling the latch from positional bias through equidistant stress testing (90% accuracy); and formalizing SSRP via the Information Bottleneck principle and granularity ablations. Procedural Integrity audit (98.8% adherence) reveals a Grounding Paradox where high-stability models fail by refusing to hallucinate under retrieval-reasoning contamination.

[636] Interoceptive machine framework: Toward interoception-inspired regulatory architectures in artificial intelligence

Diego Candia-Rivera

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This review proposes an integrative framework grounded on interoception and embodied AI-termed the interoceptive machine framework-that translates biologically inspired principles of internal-state regulation into computational architectures for adaptive autonomy. Interoception, conceived as the monitoring, integration, and regulation of internal signals, has proven relevant for understanding adaptive behavior in biological systems. The proposed framework organizes interoceptive contributions into three functional principles: homeostatic, allostatic, and enactive, each associated with distinct computational roles: internal viability regulation, anticipatory uncertainty-based re-evaluation, and active data generation through interaction. These principles are not intended as direct neurophysiological mappings, but as abstractions that inform the design of artificial agents with improved self-regulation and context-sensitive behavior. By embedding internal state variables and regulatory loops within these principles, AI systems can achieve more robust decision-making, calibrated uncertainty handling, and adaptive interaction strategies, particularly in uncertain and dynamic environments. This approach provides a concrete and testable pathway toward agents capable of functionally grounded self-regulation, with direct implications for human-computer interaction and assistive technologies. Ultimately, the interoceptive machine framework offers a unifying perspective on how internal-state regulation can enhance autonomy, adaptivity, and robustness in embodied AI systems

[637] STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa, Evgeny Bogdanov, Maxim Romanovsky

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost for manual creation. Existing automated benchmarking methods are often limited by relying on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E - a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human inputs without depending on existing datasets. The system is structured in two stages: (1) We modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in terms of LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of big and small LLMs. While real datasets remain slightly more challenging for LLMs especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.

[638] Hierarchical Behaviour Spaces

Michael Tryfan Matthews, Anssi Kanervisto, Jakob Foerster, Pierluca D’Oro, Scott Fujimoto, Mikael Henaff

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent work in hierarchical reinforcement learning has shown success in scaling to billions of timesteps when learning over a set of predefined option reward functions. We show that, instead of using a single reward function per option, the reward functions can be effectively used to induce a space of behaviours, by letting the controller specify linear combinations over reward functions, allowing a more expressive set of policies to be represented. We call this method Hierarchical Behaviour Spaces (HBS). We evaluate HBS on the NetHack Learning Environment, demonstrating strong performance. We conduct a series of experiments and determine that, perhaps going against conventional wisdom, the benefits of hierarchy in our method come from increased exploration rather than long term reasoning.

[639] Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

Bowen Jian, Rongjie Yu, Hong Wang, Liqiang Wang, Zihang Zou

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Driving in compliance with traffic laws and regulations is a basic requirement for human drivers, yet autonomous vehicles (AVs) can violate these requirements in diverse real-world scenarios. To encode law compliance into AV systems, conventional approaches use formal logic languages to explicitly specify behavioral constraints, but this process is labor-intensive, hard to scale, and costly to maintain. With recent advances in artificial intelligence, it is promising to leverage large language models (LLMs) to derive legal requirements from traffic laws and regulations. However, without explicitly grounding and reasoning in structured traffic scenarios, LLMs often retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements. To address this, we propose a novel pipeline that grounds LLM reasoning in a traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. On Chinese traffic laws and OnSite dataset (5,897 scenarios), our method improves law-scenario matching by 29.1% and increases the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively. We further demonstrate real-world applicability by constructing a law-compliance layer for AV navigation and developing an onboard, real-time compliance monitor for in-field testing, providing a solid foundation for future AV development, deployment, and regulatory oversight.

[640] A systematic evaluation of vision-language models for observational astronomical reasoning tasks

Wenke Ren, Hengxiao Guo, Wenwen Zuo, Xiaoman Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-language models (VLMs) are increasingly proposed as general-purpose tools for scientific data interpretation, yet their reliability on real astronomical observations across diverse modalities remains untested. We present AstroVLBench, a comprehensive benchmark comprising over 4,100 expert-verified instances across five tasks spanning optical imaging, radio interferometry, multi-wavelength photometry, time-domain light curves, and optical spectroscopy. Evaluating six frontier models, we find that performance is strongly modality-dependent: while one model (Gemini 3 Pro) emerges as the most consistently capable across tasks, task-specific strengths vary, and all models substantially underperform domain-specialized methods. Mechanistic ablations reveal that performance depends not only on directing attention to salient visual features but also on grounding those features in physical knowledge. Phenomenological prompts describing what to look for improve accuracy by sharpening model focus, but physical prompts explaining why those features matter perform better overall and yield more balanced classifications with reduced class-specific bias. Consistent with this picture, presenting the underlying one-dimensional measurements directly as numerical tables instead of rendered plots yields up to 13 percentage points improvement. Reasoning quality analysis further demonstrates that, without explicit physical grounding, models may reach correct predictions from phenomenologically plausible cues while providing physically imprecise justifications, establishing that accuracy alone is insufficient for trustworthy scientific deployment. These findings provide the first systematic, multi-modal baselines for VLMs in observational astronomy and identify the specific representation, grounding, and reasoning bottlenecks where current models fail.

[641] NeSyCat: A Monad-Based Categorical Semantics of the Neurosymbolic ULLER Framework

Daniel Romero Schellhorn, Till Mossakowski

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: ULLER (Unified Language for LEarning and Reasoning) offers a unified first-order logic (FOL) syntax, enabling its knowledge bases to be used directly across a wide range of neurosymbolic systems. The original specification endows this syntax with three pairwise independent semantics: classical, fuzzy, and probabilistic, each accompanied by dedicated semantic rules. We show that these seemingly disparate semantics are all instances of one categorical framework based on monads, the very construct that models side effects in functional programming. This enables the modular addition of new semantics and systematic translations between them. As example, we outline the addition of generalised quantification in Logic Tensor Networks (LTN) to arbitrary (also infinite) domains by extending the Giry monad to probability spaces. In particular, our approach allows a modular implementation of ULLER in Python and Haskell, of which we have published initial versions on GitHub.

[642] Evaluating whether AI models would sabotage AI safety research

Robert Kirk, Alexandra Souly, Kai Fronsdal, Abby D’Cruz, Xander Davies

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation testing model behaviour with opportunities to sabotage safety research, and a sabotage continuation evaluation testing whether models continue to sabotage when placed in trajectories where prior actions have started undermining research. We find no instances of unprompted sabotage across any model, with refusal rates close to zero for Mythos Preview and Opus 4.7 Preview, though all models sometimes only partially completed tasks. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview), and exhibits reasoning-output discrepancy in the majority of these cases, indicating covert sabotage reasoning. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold running models inside Claude Code, alongside an iterative pipeline for generating realistic sabotage trajectories. We measure both evaluation awareness and a new form of situational awareness termed “prefill awareness”, the capability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.

[643] XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

Zhuoling Li, Ha Linh Hong Tran Nguyen, Valeria Bladinieres, Maxim Romanovsky

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) extends traditional RAG by using knowledge graphs (KGs) to give large language models (LLMs) a structured, semantically coherent context, yielding more grounded answers. However, GraphRAG reasoning process remains a black-box, limiting our ability to understand how specific pieces of structured knowledge influence the final output. Existing explainability (XAI) methods for RAG systems, designed for text-based retrieval, are limited to interpreting an LLM response through the relational structures among knowledge components, creating a critical gap in transparency and trustworthiness. To address this, we introduce XGRAG, a novel framework that generates causally grounded explanations for GraphRAG systems by employing graph-based perturbation strategies, to quantify the contribution of individual graph components on the model answer. We conduct extensive experiments comparing XGRAG against RAG-Ex, an XAI baseline for standard RAG, and evaluate its robustness across various question types, narrative structures and LLMs. Our results demonstrate a 14.81% improvement in explanation quality over the baseline RAG-Ex across NarrativeQA, FairyTaleQA, and TriviaQA, evaluated by F1-score measuring alignment between generated explanations and original answers. Furthermore, XGRAG explanations exhibit a strong correlation with graph centrality measures, validating its ability to capture graph structure. XGRAG provides a scalable and generalizable approach towards trustworthy AI through transparent, graph-based explanations that enhance the interpretability of RAG systems.

[644] The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

Zhenyu Zhao, Aparna Balagopalan, Adi Agrawal, Dilshoda Yergasheva, Waseem Alshikh, Daniel M. Bikel

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain settings is that of sycophancy. That is, models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating sycophancy that LLMs display in agentic financial tasks. Our findings are three-fold: first, we find the models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks to test for sycophancy by user preference information that contradicts the reference answer and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery such as input filtering with a pretrained LLM.

[645] Governing What You Cannot Observe: Adaptive Runtime Governance for Autonomous AI Agents

German Marin, Jatin Chaudhary

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Autonomous AI agents can remain fully authorized and still become unsafe as behavior drifts, adversaries adapt, and decision patterns shift without any code change. We propose the \textbf{Informational Viability Principle}: governing an agent reduces to estimating a bound on unobserved risk $\hat{B}(x) = U(x) + SB(x) + RG(x)$ and allowing an action only when its capacity $S(x)$ exceeds $\hat{B}(x)$ by a safety margin. The \textbf{Agent Viability Framework}, grounded in Aubin’s viability theory, establishes three properties – monitoring (P1), anticipation (P2), and monotonic restriction (P3) – as individually necessary and collectively sufficient for documented failure modes. \textbf{RiskGate} instantiates the framework with dedicated statistical estimators (KL divergence, segment-vs-rest $z$-tests, sequential pattern matching), a fail-secure monotonic pipeline, and a closed-loop Autopilot formalised as an instance of Aubin’s regulation map with kill-switch-as-last-resort; a scalar Viability Index $VI(t) \in [-1,+1]$ with first-order $t^*$ prediction transforms governance from reactive to predictive. Contributions are the theoretical framework, the reference implementation, and analytical coverage against published agent-failure taxonomies; quantitative empirical evaluation is scoped as follow-up work.

[646] Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Discovering causal regularities and applying them to build functional systems–the discovery-to-application loop–is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities–knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application–and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle–indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

[647] Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

Aaryan Shah, Andrew Hines, Alexia Downs, Denis Bajet, Paulius Mui, Fabiano Araujo, Laura Offutt, Aida Rutledge, Elizabeth Jimenez

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Objective. Clinical AI documentation systems require evaluation methodologies that are clinically valid, economically viable, and sensitive to iterative changes. Methods requiring expert review per scoring instance are too slow and expensive for safe, iterative deployment. We present a case-specific, clinician-authored rubric methodology for clinical AI evaluation and examine whether LLM-generated rubrics can approximate clinician agreement. Materials and Methods. Twenty clinicians authored 1,646 rubrics for 823 clinical cases (736 real-world, 87 synthetic) across primary care, psychiatry, oncology, and behavioral health. Each rubric was validated by confirming that an LLM-based scoring agent consistently scored clinician-preferred outputs higher than rejected ones. Seven versions of an EHR-embedded AI agent for clinicians were evaluated across all cases. Results. Clinician-authored rubrics discriminated effectively between high- and low-quality outputs (median score gap: 82.9%) with high scoring stability (median range: 0.00%). Median scores improved from 84% to 95%. In later experiments, clinician-LLM ranking agreement (tau: 0.42-0.46) matched or exceeded clinician-clinician agreement (tau: 0.38-0.43), attributable to both ceiling compression and LLM rubric improvement. Discussion. This convergence supports incorporating LLM rubrics alongside clinician-authored ones. At roughly 1,000 times lower cost, LLM rubrics enable substantially greater evaluation coverage, while continued clinical authorship grounds evaluation in expert judgment. Ceiling compression poses a methodological challenge for future inter-rater agreement studies. Conclusion. Case-specific rubrics offer a path for clinical AI evaluation that preserves expert judgment while enabling automation at three orders lower cost. Clinician-authored rubrics establish the baseline against which LLM rubrics are validated.

[648] Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling

Hailing Cheng, Daqi Sun, Xinyu Lu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Every Transformer architecture dedicates enormous capacity to learning rich representations in semantic embedding space – yet the rotation manifold acted upon by Rotary Positional Embeddings (RoPE) has been treated as a fixed, hand-crafted structure, populated only by discrete ordinal indices. We argue that this rotation space is a largely overlooked second dimension of expressivity in the attention mechanism, one whose systematic exploration may open a new door for attention-based architectures. The analogy to complex numbers is instructive: just as introducing the imaginary axis – orthogonal to and independent of the real line – unlocked new algebraic structure once believed impossible, treating the rotation manifold as a learnable, signal-conditioned space opens an orthogonal degree of freedom in attention. In this framing, the token embedding encodes the semantic (real) component of a representation – what a token means – while the rotation encodes its dynamic (imaginary) component – how it relates to every other token across time, position, and context. We introduce SIREN-RoPE, a concrete instantiation of this idea, which populates the rotation dimension with heterogeneous signals – continuous timestamps, cyclical temporal patterns, and categorical metadata – via a dual-branch Sinusoidal Representation Network (SIREN). As a proof of concept, we evaluate on a production-scale news feed dataset from a major social network using a generative recommender as the ranking model, demonstrating that activating this hidden dimension yields consistent improvements across calibration and ranking objectives with negligible computational overhead. We invite the community to view the rotation space not as a solved positional-encoding detail, but as an untapped axis whose rich structure may prove as consequential for attention as the imaginary unit proved for algebra.

[649] Explainable Artificial Intelligence Techniques for Interpretation of Food Models: a Review

Leonardo Arrighi, Ingrid Alves de Moraes, Marco Zullich, Michele Simonato, Douglas Fernandes Barbin, Sylvio Barbon Junior

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Artificial Intelligence (AI) has become essential for analyzing complex data and solving highly-challenging tasks. It is being applied across numerous disciplines beyond computer science, including Food Engineering, where there is a growing demand for accurate and reliable predictions to meet stringent food quality standards. However, this requires increasingly complex AI models, raising concerns. In response, eXplainable AI (XAI) has emerged to provide insights into AI decision-making, aiding model interpretation by developers and users. Nevertheless, XAI remains underutilized in Food Engineering, limiting model reliability. For instance, in food quality control, AI models using spectral imaging can detect contaminants or assess freshness levels, but their opaque decision-making process hinders adoption. XAI techniques such as SHAP (Shapley Additive Explanations) and Grad-CAM (Gradient-weighted Class Activation Mapping) can pinpoint which spectral wavelengths or image regions contribute most to a prediction, enhancing transparency and aiding quality control inspectors in verifying AI-generated assessments. This survey presents a taxonomy for classifying food quality research using XAI techniques, organized by data types and explanation methods, to guide researchers in choosing suitable approaches. We also highlight trends, challenges, and opportunities to encourage the adoption of XAI in Food Engineering.

[650] Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, Jun Song, Yuning Jiang, Bo Zheng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-language model-based mobile agents have gained the ability to understand complex instructions and mobile screenshots, benefiting from reinforcement learning paradigms like Group Relative Policy Optimization (GRPO). However, existing approaches centers on offline training or local action-level rewards often trap agents in local optima, hindering effective exploration and error correction with the environment. Crucially, we find that directly applying task-level rewards often leads to convergence difficulties due to the sparse nature of GUI interactions. To address these challenges, we present \textbf{Mobile-R1}, a systematic training recipe that bridges atomic action execution and strategic task completion. We propose a hierarchical curriculum consisting of three stages: (1) format alignment for reasoning structure, (2) on-policy exploration with verifiable action feedback to ground basic execution, and (3) multi-turn task-level training with realistic environment to unlock exploration and self-correction. This hierarchical strategy effectively bootstraps the agent, significantly enhancing its capability for exploration and self-correction (the ``Eureka’’ moments). Furthermore, addressing the critical scarcity of diverse GUI data in non-English ecosystems, we contribute a comprehensive Chinese mobile dataset covering 28 applications with 24,521 high-quality manual annotations, and establish a rigorous benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weight, and codes: https://mobile-r1.github.io/Mobile-R1/.

[651] Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Chain-of-Thought (CoT) prompting has been shown to be effective in eliciting structured reasoning (i.e., CoT reasoning) from large language models (LLMs). Regardless of its popularity, recent studies expose its failures in some reasoning tasks, raising fundamental questions about the nature of CoT reasoning. In this work, we propose a data distribution lens to understand when and why CoT reasoning succeeds or fails. We hypothesize that CoT reasoning reflects a structured inductive bias learned from in-distribution data, enabling models to conditionally generate reasoning trajectories that approximate those observed during training. As such, the effectiveness of CoT reasoning is fundamentally governed by the nature and degree of distribution discrepancy between training data and test queries. Guided by this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To test the hypothesis, we introduce DataAlchemy, an abstract and fully controllable environment that trains LLMs from scratch and systematically probes them under various distribution conditions. Through rigorous controlled experiments, we reveal that CoT reasoning is a brittle mirage when it is pushed beyond training distributions, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.

[652] WinkTPG: An Execution Framework for Multi-Agent Path Finding Using Temporal Reasoning

Jingtian Yan, Stephen F. Smith, Jiaoyang Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Planning collision-free paths for a large group of agents is a challenging problem in many real-world applications. While recent advances in Multi-Agent Path Finding (MAPF) have shown promising progress, standard MAPF planners continue to rely on simplified kinodynamic models, preventing agents from directly following the generated MAPF plan. To bridge this gap, we propose kinodynamic Temporal Plan Graph planning (kTPG), a multi-agent speed optimization algorithm that efficiently refines a MAPF plan into a set of kinodynamically feasible speed profiles. We further incorporate execution timing uncertainty models and provide deterministic guarantees under bounded uncertainty models and probabilistic guarantees under stochastic models. Building on kTPG, we propose Windowed kTPG (WinkTPG), a MAPF execution framework that incrementally refines MAPF plans using a window-based mechanism, dynamically incorporating agent information during execution to reduce uncertainty. Experiments show that WinkTPG can generate speed profiles for up to 1,000 agents within 1 second and improves solution quality by up to 51.7% over existing MAPF execution methods. We further validate WinkTPG in high-fidelity physics-based simulation and on real-world robots.

[653] BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, Xin Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard.

[654] Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models

Xinyan Jiang, Lin Zhang, Jiayi Zhang, Qingsong Yang, Guimin Hu, Di Wang, Lijie Hu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Activation steering offers a promising approach to controlling the behavior of Large Language Models by directly manipulating their internal activations. However, most existing methods struggle to jointly steer multiple attributes, often resulting in interference and undesirable trade-offs. To address this challenge, we propose Multi-Subspace Representation Steering (MSRS), a novel framework for effective multi-attribute steering via subspace representation fine-tuning. MSRS reduces inter-attribute interference by allocating orthogonal subspaces to each attribute, isolating their influence within the model’s representation space. MSRS also incorporates a hybrid subspace composition strategy: it combines attribute-specific subspaces for unique steering directions with a shared subspace for common steering directions. A dynamic weighting function learns to efficiently integrate these components for precise control. During inference, MSRS introduces a token-level steering mechanism that dynamically identifies and intervenes on the most semantically relevant tokens, enabling fine-grained behavioral modulation. Experimental results show that MSRS significantly reduces attribute conflicts, surpasses existing methods across a range of attributes, and generalizes effectively to diverse downstream tasks.

[655] InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce \textbf{InquireBench}, a comprehensive benchmark specifically designed to evaluate mobile agents’ capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose \textbf{InquireMobile}, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves an 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation codes to facilitate development in both academia and industry.

[656] Test of Time: Rethinking Temporal Signal of Benchmark Contamination

Terry Jingchen Zhang, Gopal Dev, Ning Wang, Max Obreiter, Punya Syon Pandey, Keenan Samway, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.00072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.00072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[657] Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

Jianpeng Zhao, Chenyu Yuan, Weiming Luo, Haoling Xie, Guangwei Zhang, Steven Jige Quan, Zixuan Yuan, Pengyang Wang, Denghui Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.06337: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.06337&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[658] A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA

Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.21199: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21199&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[659] SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction

Lawrence Phillips, Marc Boubnovski Martell, Aditya Misra, Josefa Lia Stoisser, Cesar A. Prada-Medina, Rory Donovan-Maiye, Kaspar Märtens

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.25346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[660] OntoLogX: Ontology-Guided Knowledge Graph Extraction from Cybersecurity Logs with Large Language Models

Luca Cotti, Idilio Drago, Anisa Rula, Devis Bianchini, Federico Cerutti

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.01409: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01409&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[661] CLIN-LLM: A Safety-Constrained Hybrid Framework for Clinical Diagnosis and Treatment Generation

Md. Mehedi Hasan, Md. Abir Hossain, Farman Hossain Sayem, Bikash Kumar Paul, Ziaur Rahman, Mohammad Shorif Uddin, Rafid Mostafiz

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.22609: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22609&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[662] Scheduling Your LLM Reinforcement Learning with Reasoning Trees

Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, Jiawei Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.24832: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24832&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[663] Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.08484: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08484&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

J. Javier Alonso-Ramos, Ignacio Aguilera-Martos, Francisco Herrera, Andrés Herrera-Poyatos

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.10161: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10161&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[665] Energy-Aware Routing to Large Reasoning Models

Austin R. Ellis-Mohr, Max Hartman, Lav R. Varshney

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.00823: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.00823&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[666] SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

Yuxuan Jiang, Francis Ferraro

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.03555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[667] CARD: Cluster-level Adaptation with Reward-guided Decoding for Personalized Text Generation

Yutong Song, Jiang Wu, Weijia Zhang, Chengze Shen, Shaofan Yuan, Weitao Lu, Jian Wang, Yu Wang, Nikil Dutt, Amir M. Rahmani

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.06352: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06352&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[668] LLM-Assisted Op-Amp Behavioral-Level Design via Agentic Human-Mimicking Reasoning

Zihao Chen, Ziyi Sun, Jiayin Wang, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Li Shang, Xuan Zeng, Fan Yang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.21321: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21321&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[669] Forecasting Commencing Enrolments Under Data Sparsity: A Zero-Shot Time Series Foundation Models Framework for Higher Education Planning

Jittarin Jetwiriyanon, Teo Susnjak, Surangika Ranathunga

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.12120: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12120&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[670] Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

Jiajun Chen, Hua Shen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.12134: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12134&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[671] Evaluating the Search Agent in a Parallel World

Jiawei Chen, Xintian Shen, Lihao Zheng, Lifu Mu, Haoyi Sun, Ning Mao, Hao Ma, Tao Wei, Pan Zhou, Kun Zhan

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.04751: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04751&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[672] Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet, VisionQA and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic’s official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills – requiring no parameter updates, no external retrieval modules, and utilizing open-source models as small as 35B parameters.

[673] ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

Jonas Landsgesell, Pascal Knoll, Tizian Wenzel

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.29928: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29928&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[674] RL-Driven Sustainable Land-Use Allocation for the Lake Malawi Basin

Ying Yao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.03768: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03768&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[675] “Noisier” Noise Contrastive Eestimation is (Almost) Maximum Likelihood

Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Sirui Xie, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2405.16730: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.16730&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[676] Planning Task Shielding: Detecting and Repairing Flaws in Planning Tasks through Turning them Unsolvable

Alberto Pozanco, Marianela Morales, Pietro Totis, Daniel Borrajo

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07042: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07042&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[677] Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

Wenji Fang, Yao Lu, Shang Liu, Jing Wang, Ziyan Guo, Junxian He, Fengbin Tu, Zhiyao Xie

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.14989: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14989&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Yusif Ibrahimov, Tarique Anwar, Tommy Yuan

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2406.05984: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.05984&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[679] Agent-Aided Design for Dynamic CAD Models

Mitch Adler, Matthew Russo, Michael Cafarella

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.15184: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.15184&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[680] Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases

Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny, Vivek Hebbar

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16286: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16286&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[681] Out of Spuriousity: Improving Robustness to Spurious Correlations without Group Annotations

Phuong Quynh Le, Jörg Schlötterer, Christin Seifert

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2407.14974: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.14974&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[682] ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis

Hao Wang, Jindong Han, Wei Fan, Hao Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16922: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16922&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[683] Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)

Tianbao Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[684] Deconstructing Superintelligence: Identity, Self-Modification and Différance

Elija Perrier

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19845: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19845&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[685] Can MLLMs “Read” What is Missing?

Jindi Guo, Chaozheng Huang, Xi Fang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[686] Planning Under Observation Mismatch for Traffic Signal Control via Adaptive Modular World Models

Zherui Huang, Yicheng Liu, Chumeng Liang, Guanjie Zheng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2501.02548: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.02548&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[687] GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

Yitong Zhou, Mingyue Cheng, Jiahao Wang, Qingyang Mao, Qi Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21501: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21501&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[688] Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Guangxiang Zhao, Qilong Shi, Xusen Xiao, Xiangzheng Zhang, Tong Yang, Lin Sun

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21764: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21764&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[689] GWT: Scalable Optimizer State Compression for Large Language Model Training

Ziqing Wen, Ping Luo, Jiahuan Wang, Kun Yuan, Dongsheng Li, Tao Sun

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2501.07237: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.07237&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[690] Text Tells the Cost: Predicting and Analyzing Repayment Effort of Self-Admitted Technical Debt

Yikun Li, Mohamed Soliman, Paris Avgeriou, Jie Tan, Jiakun Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2309.06020: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2309.06020&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[691] Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set

Yikai Wu, Haoyu Zhao, Sanjeev Arora

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2502.03669: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.03669&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[692] Self-Admitted Technical Debt Detection Approaches: A Decade Systematic Review

Edi Sutoyo, Andrea Capiluppi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2312.15020: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.15020&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[693] MVIGER: Multi-View Variational Integration of Complementary Knowledge for Generative Recommender

Tongyoung Kim, Soojin Yoon, SeongKu Kang, Jinyoung Yeo, Dongha Lee

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2408.08686: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.08686&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[694] CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment

Qinfeng Li, Tianyue Luo, Xuhong Zhang, Yangfan Xie, Zhiqiang Shen, Lijun Zhang, Yier Jin, Hao Peng, Xinkui Zhao, Xianwei Zhu, Jianwei Yin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2410.13903: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.13903&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[695] A Self-Supervised Framework for Space Object Behaviour Characterisation

Ian Groves, Andrew Campbell, James Fernandes, Diego Ramírez Rodríguez, Paul Murray, Massimiliano Vasile, Victoria Nockles

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.06176: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.06176&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[696] Decoding the mechanisms of the Hattrick football manager game using Bayesian network structure learning

Anthony C. Constantinou, Nicholas Higgins, Neville K. Kitson

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.09499: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09499&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[697] MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Juntao Zhao, Qi Lu, Wei Jia, Borui Wan, Lei Zuo, Junda Feng, Jianyu Jiang, Yangrui Chen, Shuaishuai Cao, Jialing He, Kaihua Jiang, Yuanzhe Hu, Shibiao Nong, Yanghua Peng, Haibin Lin, Chuan Wu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.09844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[698] MINT: Multi-Vector Search Index Tuning

Jiongli Zhu, Yue Wang, Bailu Ding, Philip A. Bernstein, Vivek Narasayya, Surajit Chaudhuri

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.20018: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.20018&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[699] Can Large Language Models Really Recognize Your Name?

Dzung Pham, Peter Kairouz, Niloofar Mireshghallah, Eugene Bagdasarian, Chau Minh Pham, Amir Houmansadr

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.14549: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.14549&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[700] Toward Theoretical Insights into Diffusion Trajectory Distillation via Operator Merging

Weiguo Gao, Ming Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.16024: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16024&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[701] LearnAlign: Data Selection for LLM Reinforcement Learning with Improved Gradient Alignment

Shipeng Li, Zhiqin Yang, Shikun Li, Xiaobo Xia, Hengyu Liu, Xinghua Zhang, Gaode Chen, Dong Fang, Ying Tai, Zhe Peng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.11480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.11480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[702] Exploring the Secondary Risks of Large Language Models

Jiawei Chen, Zhengwei Fang, Yu Tian, Jiawei Du, Chao Yu, Zhaoxia Yin, Hang Su

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.12382: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12382&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[703] FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning

Taehwan Yoon, Bongjun Choi, Wesley De Neve

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.23210: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23210&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[704] A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks

Yuta Nakahara, Manabu Kobayashi, Toshiyasu Matsushima

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.16079: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.16079&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[705] Reliable Microservice Tail Latency Prediction via Decoupled Dual-Stream Learning and Gradient Modulation

Wenzhuo Qian, Hailiang Zhao, Jiayi Chen, Ziqi Wang, Tianlv Chen, Zhiwei Ling, Xinkui Zhao, Kingsum Chow, Albert Y. Zomaya, Shuiguang Deng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.01635: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.01635&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[706] Neural Bridge Processes

Jian Xu, Yican Liu, Delu Zeng, John Paisley, Qibin Zhao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.07220: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.07220&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[707] Enabling Transparent Cyber Threat Intelligence Combining Large Language Models and Domain Ontologies

Luca Cotti, Anisa Rula, Devis Bianchini, Federico Cerutti

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.00081: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.00081&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[708] xOffense: An Autonomous Multi-Agent Framework for Penetration Testing with Domain-Adapted Large Language Models

Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, Phan The Duy

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.13021: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13021&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[709] Polychromic Objectives for Reinforcement Learning

Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.25424: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25424&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[710] InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training

Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.21275: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21275&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[711] Question-Adaptive Graph Learning for Multi-hop Retrieval Augmented Generation

Yuchen Yan, Peiyan Zhang, Zhihua Liu, Hao Wang, Yatao Bian, Weiming Li, Xiaoshuai Hao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.11541: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11541&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[712] Verifying Quantized GNNs With Readout Is Decidable But Highly Intractable

Artem Chernobrovkin, Marco Sälzer, François Schwarzentruber, Nicolas Troquard

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.08045: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08045&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[713] Token Is All You Price

Weijie Zhong

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.09859: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09859&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[714] MEASER: Malware embedding attacks on open-source LLMs

Ming Tan, Wei Li, Hu Tao, Hailong Ma, Aodi Liu, Qian Chen, Zilong Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.10486: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10486&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Xu Zhang, Hao Li, Zhichao Lu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.17687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[716] From Optimization to Prediction: Transformer-Based Path-Flow Estimation to the Traffic Assignment Problem

Mostafa Ameli, Sulthana Shams, Van Anh Le, Alexander Skabardonis

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.19889: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19889&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[717] Accelerating Eigenvalue Dataset Generation via Chebyshev Subspace Filter

Hong Wang, Jie Wang, Jian Luo, huanshuo dong, Yeqiu Chen, Runmin Jiang, Zhen huang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.23215: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.23215&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[718] Using Language Models as Closed-Loop High-Level Planners for Robotics Applications: A Brief Overview and Benchmarks

Hao Wang, Sathwik Karnik, Bea Lim, Somil Bansal

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.07410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[719] LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Lech Madeyski, Barbara Kitchenham, Martin Shepperd

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.12635: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.12635&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[720] EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation

Jonas Bode, Raphael Memmesheimer, Sven Behnke

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.13312: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13312&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[721] Predicting one-year clinical instability and mortality in heart failure patients using sequence modeling

Falk Dippel, Yinan Yu, Annika Rosengren, Martin Lindgren, Christina E. Lundberg, Erik Aerts, Martin Adiels, Helen Sjöland

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.16839: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16839&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[722] MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation

Basel Shbita, Farhan Ahmed, Chad DeLuca

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.14967: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14967&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[723] Statistically-Guided Meta-Learning for Cross-Deployment Activity Recognition in Distributed Fiber-Optic Sensing

Yifan He, Haodong Zhang, Qiuheng Song, Lin Lei, Zhenxuan Zeng, Haoyang He, Hongyan Wu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.17902: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17902&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[724] Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs

Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.22099: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22099&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[725] Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring

Changhun Kim, Yechan Mun, Hyeongwon Jang, Eunseo Lee, Sangchul Hahn, Eunho Yang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.23036: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.23036&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[726] DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.03847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[727] Generating Verifiable Chain of Thoughts from Exection-Traces

Shailja Thakur, Vaibhav Saxena, Rohan Kulkarni, Shivdeep Singh, Parameswaran Selvam, Hima Patel, Hiroshi Kanayama

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.00127: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00127&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[728] Bayesian Optimization for Function-Valued Responses under Min-Max Criteria

Pouya Ahadi, Reza Marzban, Ali Adibi, Kamran Paynabar

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.07868: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07868&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[729] A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models

X.Y. Han, Yuan Zhong

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.03915: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03915&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[730] ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation Learning

Byung-ju Kim, Jinu Pahk, Chungwoo Lee, Jaejoon Kim, Jangha Lee, Theo Taeyeong Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.07371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[731] Selective Conformal Risk Control

Yunpeng Xu, Wenge Guo, Zhi Wei

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.12844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[732] LLM-Auction: Generative Auction towards LLM-Native Advertising

Chujie Zhao, Qun Hu, Shiping Song, Dagui Chen, Han Zhu, Jian Xu, Bo Zheng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.10551: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10551&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[733] Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins

Matthew Sinclair, Moeen Meigooni, Archit Vasan, Ozan Gokdemir, Xinran Lian, Heng Ma, Yadu Babuji, Alexander Brace, Khalid Hossain, Carlo Siebenschuh, Thomas Brettin, Kyle Chard, Christopher Henry, Venkatram Vishwanath, Rick L. Stevens, Ian T. Foster, Arvind Ramanathan

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.15930: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15930&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[734] Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search

Maximilian Weichart

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.21648: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21648&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[735] PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis

Shengkun Cui, Rahul Krishna, Saurabh Jha, Ravishankar K. Iyer

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.22113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[736] What Understanding Means in AI-Laden Astronomy

Yuan-Sen Ting, André Curtis-Trudel, Siyu Yao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.10038: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10038&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[737] The Rise of Large Language Models and the Direction and Impact of US Federal Research Funding

Yifan Qian, Zhe Wen, Alexander C. Furnas, Yue Bai, Erzhuo Shao, Dashun Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.15485: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15485&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[738] From Static to Interactive: Authoring Interactive Visualizations via Natural Language

Can Liu, Jaeuk Lee, Tianhe Chen, Zhibang Jiang, Xiaolin Wen, Yong Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.17736: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17736&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[739] GTAC: A Generative Transformer for Approximate Circuits

Jingxin Wang, Shitong Guo, Wenhui Liang, Ruicheng Dai, Ruogu Ding, Xin Ning, Weikang Qian

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.19906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[740] MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models

Ya Jiang, Massieh Kordi Boroujeny, Surender Suresh Kumar, Kai Zeng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.22246: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22246&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[741] BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

Weiqin Yang, Bohao Wang, Zhenxiang Xu, Jiawei Chen, Shengjia Zhang, Jingbang Chen, Canghong Jin, Can Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.22925: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22925&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[742] Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs

Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.02556: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02556&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[743] “If You’re Very Clever, No One Knows You’ve Used It”: The Social Dynamics of Developing Generative AI Literacy in the Workplace

Qing Nancy Xia, Marios Constantinides, Advait Sarkar, Duncan Brumby, Anna Cox

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.01386: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01386&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[744] Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems

Samaresh Kumar Singh, Joyjit Roy

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.04120: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04120&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[745] Comparative Insights on Adversarial Machine Learning from Industry and Academia: A User-Study Approach

Vishruti Kakkad, Paul Chung, Hanan Hibshi, Maverick Woo

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.04753: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04753&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[746] RealFin: How Well Do LLMs Reason About Finance When Users Leave Things Unsaid?

Yuyang Dai, Yan Lin, Zhuohan Xie, Yuxia Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.07096: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07096&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[747] Probe-Based Data Attribution: Discovering and Mitigating Undesirable Behaviors in LLM Post-Training

Frank Xiao, Santiago Aranguri

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.11079: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11079&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[748] Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible

Lepeng Zhao, Zhenhua Zou, Shuo Li, Zhuotao Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.10139: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10139&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[749] Flickering Multi-Armed Bandits

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.17315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[750] Early Warning of Intraoperative Adverse Events via Transformer-Driven Multi-Label Learning

Xueyao Wang, Xiuding Cai, Honglin Shang, Yaoyao Zhu, Yu Yao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.05212: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05212&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[751] Isotonic Layer: A Unified Framework for Recommendation Calibration and Debiasing

Hailing Cheng, Yafang Yang, Hemeng Tao, Fengyu Zhang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.06589: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06589&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[752] Physics-informed AI Accelerated Retention Analysis of Ferroelectric Vertical NAND: From Day-Scale TCAD to Second-Scale Surrogate Model

Gyujun Jeong, Sungwon Cho, Minji Shon, Namhoon Kim, Woohyun Hwang, Kwangyou Seo, Suhwan Lim, Wanki Kim, Daewon Ha, Prasanna Venkatesan, Kihang Youn, Ram Cherukuri, Yiyi Wang, Suman Datta, Asif Khan, Shimeng Yu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.06881: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06881&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[753] Security Considerations for Multi-agent Systems

Tam Nguyen, Moses Ndebugre, Dheeraj Arremsetty

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.09002: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09002&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[754] Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors

Yuze Qin, Qingyong Li, Zhiqing Guo, Wen Wang, Yan Liu, Yangli-ao Geng

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.21768: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21768&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[755] Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation

Iakovos-Christos Zarkadis, Christos Douligeris

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.17717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[756] Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control

Zunzhe Zhang, Runhan Huang, Yicheng Liu, Shaoting Zhu, Linzhan Mou, Hang Zhao

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.17834: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17834&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[757] ReFinE: Streamlining UI Mockup Iteration with Research Findings

Donghoon Shin, Bingcan Guo, Jaewook Lee, Lucy Lu Wang, Gary Hsieh

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.04353: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04353&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[758] Scaling Coding Agents via Atomic Skills

Yingwei Ma, Yue Liu, Xinlong Yang, Yanhao Li, Kelin Fu, Yibo Miao, Yuchong Xie, Zhexu Wang, Shing-Chi Cheung

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.05013: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05013&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[759] RL-ASL: A Dynamic Listening Optimization for TSCH Networks Using Reinforcement Learning

F. Fernando Jurado-Lasso, J. F. Jurado

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07533: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07533&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[760] Loop Corrections to the Training Error and Generalization Gap of Random Feature Models

Taeyoung Kim

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12827: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12827&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[761] The AI Codebase Maturity Model: From Assisted Coding to Fully Autonomous Systems

Andy Anderson

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[762] Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety

S M Jamil Uddin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12311: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12311&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[763] Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration

Chongsheng Zhang, Hao Wang, Zelong Yu, Esteban Garces Arias, Julian Rodemann, Zhanshuo Zhang, Qilong Li, Gaojuan Fan, Krikamol Muandet, Christian Heumann

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16817: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16817&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[764] Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

Seyedreza Mohseni, Sarvesh Baskar, Edward Raff, Manas Gaur

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.15390: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.15390&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[765] A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States

Daran Sun, Bowen Kan, Haoquan Long, Hairui Zhao, Haoxu Li, Yicheng Liu, Pengyu Zhou, Ankang Feng, Wenjing Huang, Yida Gu, Zhenyu Li, Honghui Shang, Yunquan Zhang, Dingwen Tao, Ninghui Sun, Guangming Tan

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.15768: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.15768&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[766] Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement

Wei Chen, Yubing Wu, Junmei Yang, Delu Zeng, Qibin Zhao, John Paisley, Min Chen, Zhou Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18239: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18239&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[767] An Integrated Deep-Learning Framework for Peptide-Protein Interaction Prediction and Target-Conditioned Peptide Generation with ConGA-PepPI and TC-PepGen

Chupei Tang, Junxiao Kong, Moyu Tang, Di Wang, Jixiu Zhai, Ronghao Xie, Shangkun Sima, Tianchi Lu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[768] Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration

Donghwan Lee

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17457: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17457&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[769] Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD

Nicholas Thumiger, Andrea Bartezzaghi, Mattia Rigotti, Cezary Skura, Thomas Frick, Elisa Serioli, Fabrizio Arbucci, A. Cristiano I. Malossi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[770] S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection

Xuelin Zhang, Hong Chen, Yingjie Wang, Tieliang Gong, Bin Gu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[771] A neural operator framework for data-driven discovery of stability and receptivity in physical systems

Chengyun Wang, Liwei Chen, Nils Thuerey

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19465: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19465&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[772] SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL

Sepideh Abedini, M. Tamer Özsu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21214: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21214&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[773] A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies

Somyajit Chakraborty

Ziyang Jiang, Jiahe Lei, Xueyan Chen, Yifan Zhang, Zexu Pan, Wei Xue, Xinyuan Qian

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Target Speaker Extraction (TSE) aims to extract the clean speech of the target speaker in an audio mixture, eliminating irrelevant background noise and speech. While prior work has explored various auxiliary cues including pre-recorded speech, visual information, and spatial information, the acquisition and selection of such strong cues are infeasible in many practical scenarios. Differently, in this paper, we condition the TSE algorithm on semantic cues extracted from limited and unaligned text contents, such as condensed points from a presentation slide. This method is particularly useful in scenarios like meetings, poster sessions, or lecture presentations, where acquiring other cues in real time may be challenging. To this end, we design two different networks. Specifically, our proposed Text Prompt Extractor Network (TPE) fuses audio features with content-based semantic cues to facilitate time-frequency mask generation to filter out extraneous noise. The experimental results show the efficacy in accurately extracting the target speaker’s speech by utilizing semantic cues derived from limited and unaligned text, resulting in SI-SDRi of 12.16 dB, SDRi of 12.66 dB, PESQi of 0.830 and STOIi of 0.150.

Tornike Karchkhadze, Mohammad Rasool Izadi, Shuo Zhang, Shlomo Dubnov

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this work, we propose an approach to music source separation that uses a generative diffusion model as a last-stage refinement on top of a deterministic separator, progressively enhancing the separated sources through iterative denoising. While the diffusion refinement yields measurable quality gains, it requires iterative steps at inference, increasing computational cost. To speed up the inference process, we apply consistency distillation, reducing inference to a single step while maintaining quality; with two or more steps, the distilled model even surpasses the diffusion-based approach. Crucially, our method is architecture-agnostic: we demonstrate state-of-the-art results when applied to both a custom U-Net-based separator on Slakh2100 and the state-of-the-art BS-RoFormer model on MUSDB18, showing that the refinement generalizes across backbone architectures. Sound examples are available at: https://consistency-separation.github.io/.

[785] DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of controlling fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it difficult to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the proposed systems. The experiments show that DreamAudio generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.

[786] CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

Adhiraj Banerjee, Vipul Arora

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Text-guided sound separation enables flexible audio editing, assistive listening, and open-domain source extraction, but systems such as AudioSep remain too expensive for low-latency edge or codec-mediated deployment. Existing neural audio codec separators are efficient, yet largely restricted to fixed stems or closed taxonomies. We introduce CodecSep, a prompt-driven universal sound separation framework that extracts sources directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP text embeddings, enabling open-vocabulary separation while preserving codec-native efficiency. Across dnr-v2 and five open-domain benchmarks, CodecSep consistently improves over AudioSep in SI-SDR, remains competitive in ViSQOL, and achieves clear gains in human MOS-LQS. Controlled analyses show that fine-grained prompts outperform coarse labels, and that explicit latent masking is substantially more effective than decoder-style latent generation in codec space. Qualitative diagnostics show that neural audio codec latents retain source-dependent structure, which CodecSep exploits mainly through channel-wise source-conditioned modulation. CodecSep also provides a practical code-stream deployment path. When audio is transmitted as neural audio codec codes, CodecSep maps codes to embeddings, separates directly in codec space, and outputs waveforms or re-quantized codes, avoiding the decode-separate-re-encode loop. In this regime, CodecSep requires only 1.35 GMACs end-to-end: about 54 times less compute than AudioSep in the same pipeline and 25 times lower separator-only compute, with much lower latency and memory. More broadly, CodecSep offers a blueprint for codec-native downstream audio processing.

[787] When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Chen-An Li, Tzu-Han Lin, Hung-yi Lee

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.

[788] Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models

Xinyu Wang, Ziyu Zhao, Yajie Luo, Yihong Wu, Liheng Ma, Jingrui Tian, Lei Ding, Xiao-Wen Chang, Peng Lu

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Deploying Automatic Speech Recognition (ASR) models on memory-constrained edge devices requires aggressive low-bit weight quantization. Layer-wise post-training quantization is practical and effective, but it suffers from cross-layer error accumulation. Existing compensation methods typically use a single global strength for all layers, which is ill-suited to encoder-decoder ASR models whose acoustic encoder and linguistic decoder exhibit markedly different sensitivities to quantization noise. We propose FADE, a diagnostic-driven framework that assigns each layer an adaptive compensation coefficient by combining two complementary signals: an intrinsic vulnerability score from weight geometry and a calibration reliability score from the data-driven solution. The resulting layer-wise coefficient balances local quantization fidelity against cross-layer error correction, enabling tailored compensation without retraining or hyperparameter search. Experiments on Whisper, Moonshine, and Qwen3-ASR across four benchmarks show that FADE consistently improves mean Word Error Rate over strong baselines at both 3- and 4-bit precision while substantially reducing run-to-run variance.

[789] FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, Lei Xie

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate the problem, we propose \textbf{FastTurn}, a unified framework for low-latency and robust turn detection. To advance latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.

Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lyu, Wei Xue, Yike Guo

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.

[791] Evaluating Generalization and Robustness in Russian Anti-Spoofing: The RuASD Initiative

Ksenia Lysikova, Kirill Borodin, Grach Mkrtchian

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: RuASD (Russian AntiSpoofing Dataset) is a dedicated, reproducible benchmark for Russian-language speech anti-spoofing designed to evaluate both in-domain discrimination and robustness to deployment-style distribution shifts. It combines a large spoof subset synthesized using 37 modern Russian-capable TTS and voice-cloning systems with a bona fide subset curated from multiple heterogeneous open Russian speech corpora, enabling systematic evaluation across diverse data sources. To emulate typical dissemination and channel effects in a controlled and reproducible manner, RuASD includes configurable simulations of platform and transmission distortions, including room reverberation, additive noise/music, and a range of speech-codec transcodings implemented via a unified processing chain. We benchmark a diverse set of publicly available anti-spoofing countermeasures spanning lightweight supervised architectures, graph-attention models, SSL-based detectors, and large-scale pretrained systems, and report reference results on both clean and simulated conditions to characterize robustness under realistic perturbation pipelines. The dataset is publickly available at \href{https://huggingface.co/datasets/MTUCI/RuASD}{\underline{Hugging Face}} and \href{https://modelscope.cn/datasets/lab260/RuASD}{\underline{ModelScope}}.

[792] Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

Chenqian Le, Ruisi Li, Beatrice Fumagalli, Yasamin Esmaeili, Xupeng Chen, Amirhossein Khalilian-Gourtani, Tianyu He, Adeen Flinker, Yao Wang

Main category: cs.SD

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We test whether Speech Articulatory Coding (SPARC) features can linearly predict surface electromyography (sEMG) envelopes across aloud, mimed, and subvocal speech in twenty-four subjects. Using elastic-net multivariate temporal response function (mTRF) with sentence-level cross-validation, SPARC yields higher prediction accuracy than phoneme one-hot representations on nearly all electrodes and in all speech modes. Aloud and mimed speech perform comparably, and subvocal speech remains above chance, indicating detectable articulatory activity. Variance partitioning shows a substantial unique contribution from SPARC and a minimal unique contribution from phoneme features. mTRF weight patterns reveal anatomically interpretable relationships between electrode sites and articulatory movements that remain consistent across modes. This study focuses on representation/encoding analysis (not end-to-end decoding) and supports SPARC as a robust and interpretable intermediate target for sEMG-based silent-speech modeling.

[793] MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control

Jialong Mai, Xiaofen Xing, Xiangmin Xu

Zahra Makki Nayeri, Mohsen Rezvani

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Proactive alert prediction in computer networks is critical for mitigating evolving cyber threats and enabling timely defensive actions. Temporal Graph Neural Networks (TGNs) provide a principled framework for modeling time-evolving interactions; however, existing TGN-based methods predominantly rely on unidirectional or single-mechanism temporal aggregation, which limits their ability to capture recursive, multi-scale temporal patterns commonly observed in real-world attack behaviors. In this paper, we propose BiTA, a Bidirectional Gated Recurrent Unit-Transformer Aggregator for temporal graph learning. Rather than introducing a deeper or higher-capacity model, BiTA redesigns the temporal aggregation function within the TGN framework by jointly encoding bidirectional sequential dependencies and long-range contextual relations over each node’s temporal neighborhood. This aggregation strategy enables complementary temporal reasoning at different scales while preserving the original TGN memory and message-passing structure. We evaluate BiTA on real-world alert datasets, demonstrating significant improvements in key performance metrics such as area under the curve, average precision, mean reciprocal rank, and per-category prediction accuracy when compared to state-of-the-art temporal graph models. BiTA outperforms baseline methods under both transductive and inductive settings, highlighting its robustness and generalization capabilities in dynamic network environments. BiTA is a scalable and interpretable framework for real-time cyber threat anticipation, paving the way toward more intelligent and adaptive intrusion detection systems.

Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the \emph{depth} dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer’s cache offers efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, layers randomly choose to attend either to their own KV states or those of a preceding layer. This stochastic process adapts the model to be robust to various depth-wise cache sharing strategies, ensuring flexibility for unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, this approach is suggestive of a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache’s memory footprint.

[798] Parameter Efficiency Is Not Memory Efficiency: Rethinking Fine-Tuning for On-Device LLM Adaptation

Irene Tenison, Stella Ahn, Miriam Kim, Ebtisam Alshehri, Lalana Kagal

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become the standard for adapting large language models (LLMs). In this work we challenge the wide-spread assumption that parameter efficiency equates memory efficiency and on-device adaptability. We show that this is not true - while methods like LoRA and IA3 significantly reduce trainable parameters, they remain bound by intermediate tensors that scale linearly with sequence length, often triggering out-of-memory errors on-device. In this work, we introduce LARS (Low-memory Activation-Rank Subspace), a novel adaptation framework that decouples memory consumption from sequence length. While prior PEFT methods apply low-rank constraints to model parameters, LARS instead constrains the activation subspace used during training, directly targeting the dominant source of memory consumption and fundamentally flattening the memory growth rate. LARS reduces the memory footprint by an average of 33.54% on GPUs and 51.95% on CPUs in comparison to LoRA across reasoning, understanding and long-context datasets using different models while maintaining competitive accuracy and throughput. Besides GPUs, we deploy on Raspberry Pi and consumer-grade CPUs to demonstrate that LARS provides a scalable path for sophisticated LLM personalization on resource-constrained hardware and edge devices.

[799] Learning Without Adversarial Training: A Physics-Informed Neural Network for Secure Power System State Estimation under False Data Injection Attacks

Solon Falas, Markos Asprou, Charalambos Konstantinou, Maria K. Michael

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: State estimation is a cornerstone of power system control-center operations, and its robust operation is increasingly a cyber-physical security concern as modern grids become more digitalized and communication-intensive. Neural network-based approaches have gained attention as alternatives to conventional model-based state estimation methods. Physics-Informed Neural Networks (PINNs), which embed power-flow consistency into the learning objective, have shown improved accuracy over existing approaches. This work proposes a PINN-based model for Power System State Estimation (PSSE) that protects the estimation process against the stealth-constrained AC False Data Injection Attacks (FDIAs) considered in this study. The model is developed without adversarial training. Instead, a dynamic loss-weighting formulation based on homoscedastic uncertainty learns the relative scaling of supervised data-fit and physics-residual terms during training, reducing sensitivity to manual weight tuning. Robustness is evaluated on the IEEE 118-bus system using representative stealthy-FDIA families including state distortion, load redistribution, line overloading, and residual-constrained stealth corruption. Performance is measured using Mean Absolute Error (MAE) on voltage magnitudes and phase angles. Results demonstrate higher accuracy and stability than existing fixed-weight PINN variants.

[800] CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs

Stela Tong, Elai Ben-Gal

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal received by each agent is filtered by the system mechanism. Routing produces selection-gated feedback where only the chosen response is evaluated, while collaboration produces shared rewards that obscure the individual contribution of each agent. As a result, standard RLHF objectives designed for a single deployed policy become misspecified. We introduce CoFi-PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs), a unified framework for learning under filtered feedback in multi-agent LLM systems. Our approach derives a counterfactual per-agent training objective based on marginal contribution, which corrects the learning signal under both routing and collaborative mechanisms. For routing systems, the objective corresponds to off-policy corrections for selection-gated feedback, while for collaborative systems it reduces to leave-one-out difference rewards for credit assignment. We further analyze how softmax routing induces risk-sensitive incentives and provide practical training algorithms that integrate counterfactual estimators, multiturn-aware rewards, and policy optimization methods, and demonstrate the approach on a real-world reasoning dataset.

[801] AutoCompress: Critical Layer Isolation for Efficient Transformer Compression

Archit Thorat

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present AutoCompress, a transformer compression method motivated by an empirical finding: in small transformers, Layer 0 carries disproportionately high task-critical information, with an NTK-based importance score of 3.6 compared to a maximum of 0.054 for all other layers – a gap of over 60x. Based on this finding, we propose Critical Layer Isolation (CLI), an architecture that protects Layer 0 at full dimensionality, compresses all intermediate layers through a learned bottleneck, and restores the full dimension at the final layer. Applied to GPT-2 Medium (354.8M parameters), CLI-GPT2 achieves 204.5 perplexity on WikiText-103 with only 143.8M parameters – a 2.47x compression ratio and 59.5% parameter reduction. Crucially, an ablation study demonstrates that a uniform bottleneck baseline of comparable size achieves only 571.8 perplexity under identical training conditions, confirming that the architectural decision to protect Layer 0 – rather than simply reducing model size – is the primary driver of performance. Code and checkpoints are publicly available.

[802] Conformal PM2.5 Mapping Under Spatial Covariate Shift: Satellite-Reanalysis Fusion for Africa’s Green Industrial Transition

Yaw Osei Adjei, Davis Opoku, Ephraim Abotsi, Kwadwo Owusu Amanqua, Oliver Kornyo, Elisha Soglo-Ahianyo, Cephas Anertey Abbey

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Africa’s green industrialization imperative demands reliable infrastructure for monitoring air quality. We present a satellite-reanalysis PM2.5 fusion system trained on 2,068,901 records from 404 monitoring locations in 29 African countries (OpenAQ, 2017-2022), combining LightGBM with leakage-resistant spatial cross-validation and conformal prediction to quantify predictions and their geographic applicability limits. Under 5-fold location-grouped spatial cross-validation, LightGBM achieves RMSE = 30.83 +/- 5.07 ug/m3, MAE = 14.54 +/- 1.66 ug/m3, R2 = 0.134 +/- 0.023, and macro F1 = 0.336 +/- 0.018. This R2 is substantially below random-split benchmarks (>0.90) but reflects true geographic generalisation difficulty rather than model failure. Split conformal prediction targeting 90% marginal coverage reveals severe East Africa degradation (actual PICP = 65.3% vs. nominal 90%), consistent with medium-strength covariate shift (humidity KS = 0.2237, sat_pblh KS = 0.2558). We operationalise these findings through regional reliability flags (High/Medium/Low/Unreliable) and a monitor prioritisation score directing infrastructure expansion toward highest-burden unmonitored populations, directly supporting Africa’s green industrial transition and SDGs 3.9, 7.1.2, 9, 11.6.2, and 13.

[803] Avionic Main Fuel Pump Simulation and Fault-Diagnosis Benchmark

Felix Leonhard Janzen, Lukas Moddemann, Alexander Diedrich, Oliver Niggemann

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In many cyber-physical systems, especially in critical applications such as aeroplanes, data to train anomaly detection and diagnosis algorithms is lacking due to data protection issues and partial observability. To combat this inherent lack of data, we introduce a high-fidelity, physics-informed co-simulation of a common aircraft main-fuel-pump system modelled in \textsc{MATLAB/Simulink Simscape Fluids}. We also describe its generated time-series data with health and fault mode annotations. To show feasibility of our benchmark, we apply an unsupervised Recurrent Variational Autoencoder (RNN-VAE) for anomaly detection and a SOM-VAE for operating mode discretization, trained to separate healthy and faulty conditions.

[804] Towards Understanding the Expressive Power of GNNs with Global Readout

Maurice Funk, Daumantas Kojelis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We study the expressive power of message-passing aggregate-combine-readout graph neural networks (ACR-GNNs). Particularly, we focus on the first-order (FO) properties expressible by this formalism. While a tight logical characterisation remains a difficult open question, we make two contributions towards answering it. First, we show that sum aggregation and readout suffice for GNNs to capture FO properties that cannot be expressed in the logic C2 on both directed and undirected graphs. This strengthens known results by Hauke and Wał{\k e}ga (2026) where aggregation and readout functions are specially crafted for the task. Second, we identify two natural ways of restoring characterisability (with regard to C2) for ACR-GNNs. One option is to limit local aggregation (without imposing restrictions on global readout), whilst the second is to run ACR-GNNs over graphs of bounded degree (but unbounded size). In both cases, the FO properties captured by GNNs are exactly those definable by a formula in graded modal logic with global counting modalities. Our results thus establish an innate lower- and upper-bound in terms of how far (fragments of) C2 can be taken to characterise GNNs, and imply that is indeed the unbounded interaction of aggregation and readout that pushes the logical expressive power of GNNs above C2.

[805] When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, Ozlem Garibay, Niloofar Yousefi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or governance constraints. We study deployment-time adaptation for frozen offline actors using Product-of-Experts (PoE) composition with a goal-conditioned prior. Our main practical finding is graceful degradation rather than universal performance gain: under degraded or random priors, precision-weighted composition remains anchored to the frozen actor, while additive and prior-only adaptation collapse, and a KL-budget selector often recovers a near-oracle operating point. We also make explicit a closed-form identity in the frozen-actor setting: for diagonal-Gaussian actors and priors, PoE with coefficient alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha), with posterior covariances differing only by a global scalar factor. Empirically, across four D4RL environments (3,900 MuJoCo episodes), we observe a 4/5/3 HELP/FROZEN/HURT split. Extending the analysis to six harder cells and two AntMaze diagnostics reveals an actor-competence ceiling: medium-expert remains HURT in all 9 cells at every tested alpha, while AntMaze with a behavior-cloned frozen actor yields zero success for all composition rules. Overall, PoE and KL-regularized adaptation are best viewed as a single actor-anchored safety mechanism for deployment-time steering.

[806] MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

Xin Wang, Chi Ma, Shaobin Chen, Pu Wang, Menglei Zhou, Junyi Qiu, Qiaorui Chen, Jiayu Sun, Shijie Liu, Zehuan Wang, Lei Yu, Chuan Liu, Fei Jiang, Wei Lin, Hao Wang, Jiawei Jiang, Xiao Yan

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1* speedup while maintaining near-perfect hit ratios (>98.5%).

[807] Predicting Wind Loads on Container Ships in Harbor Environments through Multi-Fidelity Modeling

Matilde Fiore, Andrea Bresciani, Miguel Alfonso Mendez, Jeroen van Beeck

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Modern container ships face higher wind loads due to increased windage areas, making accurate predictions of wind loads essential for mooring design. Existing empirical models, largely developed for container ships with smaller windage areas and simpler geometrical configurations than those of modern large-scale vessels, often lack accuracy and do not account for the influence of nearby structures. This study proposes a multi-fidelity surrogate modelling framework for the prediction of wind-load coefficients, combining empirical correlations with simplified and detailed CFD models for ships in open-sea and harbor environments. The approach relies on recursive co-kriging to consistently fuse information across fidelity levels, enabling accurate predictions at a reduced computational cost. A sensitivity analysis is used to identify the most influential geometric parameters, and the resulting reduced parameter space is explored through sequential sampling to efficiently construct the training database. The surrogate models are validated over a wide range of loading configurations and for two distinct harbor environments. The results demonstrate that the multi-fidelity approach significantly improves prediction accuracy compared to single-fidelity models, while substantially reducing the reliance on high-fidelity simulations. In particular, the proposed framework captures the dependence of wind loads on key geometric parameters and consistently outperforms traditional empirical correlations, providing a robust and efficient tool for engineering applications.

[808] Quantifying and Mitigating Self-Preference Bias of LLM Judges

Jinming Yang, Chuxian Qiu, Zhenyu Deng, Xinshan Jiao, Tao Zhou

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework to quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5% on average.

[809] StackFeat RL: Reinforcement Learning over Iterative Dual Criterion Feature Selection for Stable Biomarker Discovery

A. Yermekov, D. A. Herrera-Martí

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Feature selection in high-dimensional genomic data ($d \gg n$) demands methods that are simultaneously accurate, sparse, and stable. Existing approaches either require manual threshold specification (mRMR, stability selection), produce unstable selections under data perturbation (Lasso, Boruta), or ignore biological structure entirely. We introduce StackFeat-RL, a meta-learning framework that optimises the hyperparameters of an iterative dual-criterion feature selection algorithm via REINFORCE policy gradients. The dual criterion, requiring both coefficient consistency and selection frequency, guards against two failure modes missed by single-criterion methods, while iterative accumulation provides convergence guarantees via the law of large numbers. On COVID-19 miRNA data (GSE240888, 332 features) and three Alzheimer’s disease classification tasks (GSE84422, 13237 genes; Normal vs.\ Possible, Probable, and Definite AD), StackFeat-RL achieves the highest predictive accuracy among all evaluated methods, including ElasticNet, Boruta, mRMR, and stability selection, while requiring 3–4$\times$ fewer features. Keywords: feature selection, reinforcement learning, REINFORCE, elastic net, biomarker discovery, Alzheimer’s disease, dual-criterion selection, protein interaction networks

[810] Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

Minghui Xu, Qi Luo, Kun Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Traditional data valuation methods based on ``row-count $\times$ quality coefficient’’ paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.

[811] Accelerating Frequency Domain Diffusion Models with Error-Feedback Event-Driven Caching

Dong Liu, Haisheng Wang, Yanxuan Yu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion models achieve remarkable success in time series generation. However, slow inference limits their practical deployment. We propose E$^2$-CRF (Error-Feedback Event-Driven Cumulative Residual Feature caching) to accelerate frequency domain diffusion models. Our method exploits two structural properties: (1) spectral localization, where signal energy concentrates in low frequencies, and (2) mirror symmetry, which halves the effective frequency dimension. E$^2$-CRF uses a closed-loop error-feedback system that adaptively caches transformer KV features across diffusion steps. We trigger recomputation using event-driven residual dynamics instead of fixed schedules. Our method selectively recomputes high-energy or rapidly-changing tokens while reusing cached features for stable high-frequency components. E$^2$-CRF achieves ~2.2 speedup while maintaining sample quality. We demonstrate effectiveness on 5 datasets. Our caching strategy naturally aligns with the diffusion process’s structure-to-detail progression. We include sufficient-condition error and complexity bounds under standard regularity assumptions (Appendix), alongside empirical validation. Our code is available at https://github.com/NoakLiu/FastFourierDiffusion and is also integrated in https://github.com/NoakLiu/FastCache-xDiT.

[812] Deep Clustering for Climate: Analyzing Teleconnections through Learned Categorical States

Lívia Meinhardt, Dário Oliveira

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Understanding and representing complex climate variability is essential for both scientific analysis and predictive modeling. However, identifying meaningful climate regimes from raw variables is challenging, as they exhibit high noise and nonlinear dependencies. In this work, we explore the use of Masked Siamese Networks to discretize climate time series into semantically rich clusters. Focusing on daily minimum and maximum temperature, we show that the resulting representations: (i) yield clusters that reflect meaningful climate states under our modeling assumptions, offering a simplified representation for downstream use; (ii) enable sampling and analysis of specific climate scenarios; and (iii) exhibit statistical associations with El Niño events, underscoring their scientific relevance. Our findings highlight the potential of self-supervised discretization as a tool for climate data analysis and open avenues for incorporating richer climate indicators in future work.

[813] Score-Repellent Monte Carlo: Toward Efficient Non-Markovian Sampler with Constant Memory in General State Spaces

Jie Hu, Lingyun Chen, Geeho Kim, Jinyoung Choi, Bohyung Han, Do Young Eun

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: History-dependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through empirical measure on finite state spaces, which is infeasible in high-dimensional discrete configuration spaces or ill-posed in continuous domains. We propose Score-Repellent Monte Carlo (SRMC) framework that summarizes trajectory history by a running average of score evaluations in $R^d$, where $d$ is the dimension of the score and state representation. This history is converted into a surrogate target through an exponential score tilt, indexed with $α$ that represents the strength of repellence in controlling the magnitude of the history-based repulsion. The surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any base kernel targeting $π$ can instead be run on the current surrogate $π_{θ_n}$ while the history is updated online. We analyze the coupled evolution of the history recursion and Monte Carlo estimators using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We further identify regimes in which the asymptotic covariance decreases as $α$ increases, with scaling $O(1/α)$, extending the near-zero-variance effect of finite-state history-dependent samplers to general state spaces with constant memory. Experiments on continuous targets and discrete energy-based models demonstrate improved estimator variance and mode coverage, while retaining $O(d)$ memory usage and modest per-iteration overhead.

[814] Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

Alex Nikulkov

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reward models in RLHF are trained to score only the final token of a response - a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model’s output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories (middle-token pairwise accuracy improved from 50% to 88.9%, final-token accuracy preserved); state-of-the-art PRM performance on ProcessBench (44.9% average F1) among models trained only on outcome data; and unified reward/value modeling in PPO, reducing peak GPU memory by 27% and step time by 19% with matching LLM quality.

[815] Collocation-based Robust Physics Informed Neural Networks for time-dependent simulations of pollution propagation under thermal inversion conditions on Spitsbergen

Leszek Siwik, Maciej Sikora, Natalia Leszczyńska, Tomasz Maciej Ciesielski, Eirik Valseth, Manuela Bastidas Olivares, Marcin Łoś, Tomasz Służalec, Jacek Leszczyński, Maciej Paszyński

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this paper, we propose a Physics-Informed Neural Network framework for time-dependent simulations of pollution propagation originating from moving emission sources. We formulate a robust variational framework for the time-dependent advection-diffusion problem and establish the boundedness and inf-sup stability of the corresponding discrete weak formulation. Based on this mathematical foundation, we construct a robust loss function that is directly related to the true approximation error, defined as the difference between the neural network approximation and the (unknown) exact solution. Additionally, a collocation-based strategy is introduced to speed up neural network training. As a case study, we investigate pollution propagation caused by snowmobile traffic in Longyearbyen, Spitsbergen, supported by detailed in-field measurements collected using dedicated sensors. The proposed framework is applied to analyze the effects of thermal inversion on pollutant accumulation. Our results demonstrate that thermal inversion traps dense and humid air masses near the ground, significantly enhancing particulate matter (PM) concentration and worsening local air quality.

[816] On-Device Vision Training, Deployment, and Inference on a Thumb-Sized Microcontroller

Jeremy Ellis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing $15-40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64x64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary > baked-in header > He-initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at https://github.com/webmcu-ai/on-device-vision-ai

[817] Complex SGD and Directional Bias in Reproducing Kernel Hilbert Spaces

Natanael Alpay, Emeric Battaglia

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Stochastic Gradient Descent (SGD) is a known stochastic iterative method popular for large-scale convex optimization problems due to its simple implementation and scalability. Some objectives, such as those found in complex-valued neural networks, benefit from updates like in SGD and Gradient Descent (GD) with a newly defined ``gradient’’ that allows for complex parameters. This complex variant of the SGD/GD methods has already been proposed, but convergence guarantees without analyticity constraints have not yet been provided. We propose a variant of SGD (complex SGD) that allows for complex parameters, and we provide convergence guarantees under assumptions that parallel those from the real setting. Notably, these results extend to GD as well, and with the same set of assumptions, we confirm that some directional bias results extend from the real to the complex setting for kernel regression problems. We provide empirical results demonstrating the efficacy of the complex SGD in kernel regression problems utilizing complex reproducing kernel Hilbert spaces. In particular, we demonstrate we may recover superoscillation functions and Blaschke products from the Fock Space and Hardy Space, respectively, as the optimal functions for a particular choice of a loss function.

[818] Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng, Yibo Zhao, Juncheng Billy Li, Heather Miller

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) for the MoE architectures remains difficult because its router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mixing or auxiliary load-balancing losses, but these introduce noisy gradients that often degrade performance. In preliminary experiments, we systematically pruned experts and observed that while certain super experts are activated far more frequently, discarding less used experts still leads to notable performance degradation. This suggests that even rarely activated experts encode non-trivial knowledge useful for downstream tasks. Motivated by this, we propose an auxiliary-loss-free MoE SFT framework that combines bias-driven sparsification with always-active gated condenser experts. Rather than enforcing balanced activation across all experts, our method encourages task-relevant experts to remain active while pushing long-tailed experts toward inactivity. The condenser experts provide a persistent, learnable pathway that alleviates gradient starvation and facilitates consolidation of information that would otherwise remain fragmented across sparsely activated experts. Analysis further suggest that this design better preserves long-tailed expert information under sparse routing. Experiments on large-scale MoE models demonstrate that our approach outperforms state-of-the-art SFT baselines such as DenseMixer and ESFT, achieving average gain of 2.5%+ on both mathematical reasoning and commonsenseQA benchmarks.

[819] A Differentiable Framework for Global Circulation Model Precipitation Bias Correction

Kamlesh Sawadekar, Seth McGinnis, Peijun Li, Chaopeng Shen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Systematic biases in Global Circulation Model (GCM) outputs limit their direct applicability in regional planning, necessitating bias correction. Correcting precipitation is particularly challenging due to its non-Gaussian distribution, intermittent nature, and non-linear extremes. However, traditional statistical methods cannot learn from big data and easily address systematic biases in the GCMs, and while machine learning does provide this flexibility, their black-box type functionality hinders us from understanding these biases completely which also further prevents generalization across different GCMs and locations, especially for precipitation. In this study, we propose a differentiable bias adjustment framework called δCLIMBA (or dCLIMBA), that learns a spatiotemporally adaptive parametric bias adjustment procedure between historical CMIP6 model outputs and reference reanalysis datasets (Livneh). Results demonstrate that the proposed method accurately corrects both the magnitude and distribution of extreme storm events, with particularly strong performance in capturing extremes. The quantile distribution of precipitation is well reproduced across diverse U.S. cities, and spatial patterns perform comparably to the widely used LOCA2 statistical downscaling technique. In addition, the framework showed future trend preservation unlike pure quantile based methods and LOCA2; and results from bias correction over unseen regions showed that the marginal biases were attenuated. This work presents a modular, computationally efficient and extensible bias correction approach that is physically informed, scalable, and compatible with both historical and future applications. Its flexibility makes it suitable for integration into Earth system post-processing pipelines and impact workflows.

[820] Shape of Memory: a Geometric Analysis of Machine Unlearning in Second-Order Optimizers

Kennon Stewart

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We argue that current definitions of machine unlearning are underspecified for second-order optimizers. We compare first-order and second-order learners for their ability to handle the data deletion task with varying degrees of eigendecomposition to mimic the loss model memory. While both first and second-order methods realign with the ideal counterfactul in terms of performance and gradient, the second-order optimizer shows significant volatility in the optimizer state. This indicates residual information, supposedly deleted, that isn’t detectable by first-order analysis. Various eigendecay treatments show that stability and information loss is regained only under controlled state pertubation where geometric information (or memory) is erased.

[821] On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.

[822] ML-Guided Primal Heuristics for Mixed Binary Quadratic Programs

Weimin Huang, Natalie M. Isenberg, Ján Drgoňa, Draguna L Vrabie, Bistra Dilkina

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Mixed Binary Quadratic Programs (MBQPs) are an important and complex set of problems in combinatorial optimization. As solving large-scale combinatorial optimization problems is challenging, primal heuristics have been developed to quickly identify high-quality solutions within a short amount of time. Recently, a growing body of research has also used machine learning to accelerate solution methods for challenging combinatorial optimization problems. Despite the increasing popularity of these ML-guided methods, a large body of work has focused on Mixed-Integer Linear Programs (MILPs). MBQPs are challenging to solve due to the combinatorial complexity coupled with nonlinearities. This work proposes ML-guided primal heuristics for Mixed Binary Quadratic Programs (MBQPs) by adapting and extending existing work on ML-guided MILP solution prediction to MBQPs. We introduce a new neural network architecture for MBQP solution prediction and a new training data collection procedure. Moreover, we extend existing loss functions in solution prediction and propose to combine contrastive and weighted cross-entropy losses. We evaluate the methods on standard and real-world MBQP benchmarks and show that the developed ML-guided methods significantly outperform existing primal heuristics and state-of-the-art solvers. Furthermore, models trained with our proposed extension with combined losses outperform other ML-based methods adapted from MILPs and improve generalization in cross-regional inference on a real-world wind farm layout optimization problem.

[823] K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

Zixuan Xia, Quanxi Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recursively estimates the latent reward mean, smoothing high-variance returns and adapting to non-stationary environments. This approach incurs minimal overhead and requires no modification to existing policy architectures. Experiments on \textit{LunarLander} and \textit{CartPole} demonstrate that Kalman-filtered rewards significantly accelerate convergence and reduce training variance compared to standard normalization techniques. Code is available at https://github.com/Sumxiaa/Kalman_Normalization.

[824] C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang ‘Anthony’ Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and continuous non-linear reward aggregation to improve stability across competing properties. Experiments on the C-MuMOInstruct benchmark show that C-Moral consistently outperforms state-of-the-art models across both in-domain and out-of-domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post-training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.

[825] RL Token: Bootstrapping Online RL with Vision-Language-Action Models

Charles Xu, Jost Tobias Springenberg, Michael Equi, Ali Amin, Adnan Esmail, Sergey Levine, Liyiming Ke

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-language-action (VLA) models can learn to perform diverse manipulation skills “out of the box,” but achieving the precision and speed that real-world tasks demand requires further fine-tuning – for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an “RL token,” a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening, charger insertion, and Ethernet insertion), RLT improves the speed on the hardest part of the task by up to 3x and raises success rates significantly within minutes to a few hours of practice. It can even surpass the speed of human teleoperation on some of the tasks.

[826] Channel Adaptation for EEG Foundation Models: A Systematic Benchmark Across Architectures, Tasks, and Training Regimes

Kuntal Kokate, Bruno Aristimunha, Dung Truong, Arnaud Delorme

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Scaling EEG foundation models requires pooling data across heterogeneous electrode montages, a prerequisite both for larger pretraining corpora and for downstream deployment. We present the first systematic comparison of four channel adaptation methods (Conv1d projection, spherical spline interpolation (SSI), source-space decomposition, and Riemannian re-centering) across five pretrained EEG foundation models (5M–157M parameters), five downstream tasks, and two training regimes with 10–15 random seeds each. We find that rigid-montage models (BENDR, Neuro-GPT) require external adaptation, while flexible models (EEGPT, CBraMod) match or exceed it natively when fine-tuned but benefit from external methods under frozen-encoder deployment. A probe-SFT asymmetry exists: external adaptation can cause severe negative transfer during fine-tuning of flexible models. The optimal method is architecture-dependent (Conv1d for BENDR, SSI/Riemannian for Neuro-GPT, source-space decomposition for depression detection), and 5M-parameter CBraMod outperforms models up to 31$\times$ larger on 4/5 datasets, consistent with independent findings that compact EEG-specific architectures can match larger models.

[827] ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

[828] Unstable Rankings in Bayesian Deep Learning Evaluation

Qishi Zhan, Minxuan Hu, Guansu Wang, Jiaxin Liu, Liang He

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Standard evaluations of Bayesian deep learning methods assume that metric estimates are reliable, but we show this assumption fails under data scarcity. Method rankings are not only unreliable at small $n$, but also dataset-dependent in ways that point estimates cannot reveal: the same method comparison yields $P(\mathrm{MCD} \prec \mathrm{Ensemble}) = 1.000$ at $n = 50$ on one dataset and remains below $0.95$ even at $n = 500$ on another. Across the datasets we consider, no universal sample size threshold exists, which is precisely why dataset-specific posterior inference is necessary. To address this, we use a Bayesian hierarchical model with method-specific variances to treat evaluation metrics as random variables across data realizations, and we use a predictive Minimum Detectable Difference curve to assess whether an observed gap would be detectable at a given training size. Across six Bayesian deep learning methods and five regression datasets, our results show that uncertainty-aware evaluation is necessary in low-data settings, because current evidence for method superiority and predictive detectability at the same training size can diverge substantially. Our framework provides practitioners with principled tools to determine whether their evaluation data is sufficient before drawing conclusions about method superiority.

Wugeng Zheng, Ziwen Kan, Katie Wang, Chen Chen, Song Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative training, but real-world clinical applications often suffer from within-modality missingness caused by sensor intermittency or irregular sampling. Existing methods implicitly represent unobserved data via architectural alignment or missing embeddings, often failing to recover the true distribution and yielding sub-optimal performance. We propose CondI, a federated framework explicitly addressing this missingness using conditional diffusion models. CondI employs a two-phase training pipeline: first, imputing unobserved temporal components using available multimodal context and conditional embeddings; second, optimizing modality-specific extractors and joint embedding spaces. During inference, imputed raw data pass through trained extractors to generate robust features, providing a holistic representation for downstream tasks. Explicit data imputation ensures models operate on complete semantic structures, significantly enhancing resilience against severe data incompleteness. Experiments on three clinical datasets (PTB-XL, SLEEP-EDF, MIMIC-IV) demonstrate CondI achieves comparable results to state-of-the-art baselines. Code: https://github.com/ZhengWugeng/CondI

[830] A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning

Qishi Zhan, Minxuan Hu, Liang He, Guansu Wang, Jiaxin Liu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In limited-data settings, a single endpoint mean of an evaluation metric such as the Continuous Ranked Probability Score (CRPS) is itself a random variable, yet it is routinely reported as if it were a stable property of the method. We study when this practice fails. Using 50 independent repetitions across six regression datasets, we show that CRPS variance trajectories differ substantially across methods and are not always well described by a smooth power-law decay. Methods with a learned heteroscedastic variance head, namely MAP and Deep Ensembles, can develop pronounced, reproducible variance peaks at intermediate training sizes on real datasets, whereas MC Dropout and Bayes by Backprop typically show smooth variance contraction. These peaks have direct practical consequences: at the variance peak on Seoul Bike, the relative RMSE of a single-seed MAP estimate reaches 93.6%, and the probability of falling within (\pm 10%) of the repeated-run mean drops to 5.9%. We show that local CRPS variance provides a direct signal of single-seed estimation error, with Spearman correlations above 0.96 on every real dataset. Power-law fit quality and monotonicity together provide compact method-level summaries of trajectory regularity. Finally, replacing the standard heteroscedastic objective with (β)-NLL substantially reduces the irregular behavior, consistent with the view that the heteroscedastic training objective contributes to the instability. Practitioners should report trajectory summaries alongside endpoint means and concentrate repeated evaluation in high-variance regions.

[831] HBGSA: Hydrogen Bond Graph with Self-Attention for Drug-Target Binding Affinity Prediction

Junxiao Kong, Chupei Tang, Di Wang, Jixiu Zhai, Yi He, Moyu Tang, Tianchi Lu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate prediction of drug-target binding affinity accelerates drug discovery by prioritizing compounds for experimental validation. Current methods face three limitations: sequence-based approaches discard spatial geometric constraints, structure-based methods fail to exploit hydrogen bond features, and conventional loss functions neglect prediction-target correlation, a key factor for identifying high-affinity compounds in virtual screening. We developed HBGSA (Hydrogen Bond Graph with Self-Attention), a 3.06M-parameter model that encodes hydrogen bond spatial features. HBGSA uses graph neural networks to model hydrogen bond spatial topology with self-attention enhancement and Pearson correlation loss. Experimental results on PDBbind Core Set and CSAR-HiQ dataset demonstrate that HBGSA outperforms baseline methods with strong generalization capability. Ablation studies confirm the effectiveness of hydrogen bond modeling and Pearson correlation loss.

[832] h-MINT: Modeling Pocket-Ligand Binding with Hierarchical Molecular Interaction Network

Yanru Qu, Yijie Zhang, Wenjuan Tan, Xiangzhe Kong, Xiangxin Zhou, Chaoran Cheng, Mathieu Blanchette, Jiaxuan You, Ge Liu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate molecular representations are critical for drug discovery, and a central challenge lies in capturing the chemical environment of molecular fragments, as key interactions, such as H-bond and π stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, atom-level representations can hardly express higher-order chemical context (e.g., stereochemistry, lone pairs, conjugation). Fragment-based methods (e.g., principal subgraph, predefined functional groups) fail to preserve essential information such as chirality, aromaticity, and ionic states. This work addresses these limitations from two aspects. (i) OverlapBPE tokenization. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows overlapping fragments, reflecting the inherently fuzzy boundaries of small-molecule substructures and, together with enriched chemical information at the token level, thereby preserving a more complete chemical context. (ii) h-MINT model. OverlapBPE induces many-to-many atom-fragment mappings, which necessitate a new hierarchical architecture. We therefore develop a hierarchical molecular interaction network capable of jointly modeling interactions at both atom and fragment levels. By supporting fragment overlaps, the model naturally accommodates the many-to-many atom-fragment mappings introduced by the OverlapBPE scheme. Extensive evaluation against state-of-the-art methods shows our method improves binding affinity prediction by 2-4% Pearson/Spearman correlation on PDBBind and LBA, enhances virtual screening by 1-3% in key metrics on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interactive information while maintaining good generalization.

[833] Surface Sensitivity in Lean 4 Autoformalization

William Feng, Ethan Lou, Aryan Sharma

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Natural-language variation poses a key challenge in Lean autoformalization: semantically equivalent paraphrases of the same theorem statements can induce divergent formal outputs, yet it remains unclear whether this variation reflects semantic disagreements or shallower failures. We investigate this question in Lean 4 using 60 deterministic paraphrase rules applied to ProofNet# and miniF2F. Across four GPT-family models and three open-weight 7B autoformalizers, we find that the observed paraphrase sensitivity reflects compilation-boundary failures rather than semantic divergence among successful formalizations. In particular, when both baseline and perturbed outputs compile, paired predictions are semantically equivalent under BEq+ and structurally near-identical under GTED. By contrast, paraphrasing substantially affects whether outputs compile, with failure modes varying across datasets and perturbation classes. Our results suggest that future training-time interventions should target the compile boundary rather than the semantic layer, and that benchmarks should separate compile-conditional equivalence from surface consistency.

[834] Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns

Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collected over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy to maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations help reduce all2all communication data up to 20, resulting in lower MoE decode latency and better accelerator utilization.

[835] Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

Terry Gou, Puneet Gupta

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this work, we developed and tested 3 techniques for vector quantization (VQ) based model weight compression. To mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Building on ideas from attention-based formulations in Differentiable K-Means (DKM), we further improved this approach by using cosine similarity for assignment combined with top-1 sampling and a straight-through estimator, thereby eliminating the need for weighted-average reconstruction. Finally, we investigated the use of differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations, further optimizing the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model compression methods.

[836] Follow the TRACE: Exploiting Post-Click Trajectories for Online Delayed Conversion Rate Prediction

Xinyue Zhang, Yuanhao Ding, Xiang Ao

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Delayed feedback poses a core challenge for online CVR prediction, forcing a trade-off between label accuracy and data freshness. Existing methods address this through delay modeling or sample reweighting, yet neglect how post-click behaviors evolve over the observation period. To overcome this limitation, we formalize this evolution as feedback trajectory and propose TRACE. Instead of forcing hard labels on unrevealed samples, our method evaluates how well the accumulated feedback status aligns with conversion versus non-conversion, dynamically refining posteriors without waiting for final outcomes. To counteract early-stage trajectory sparsity, we further design a reliability-gated retrospective completer that leverages full-lifecycle data to provide adaptive posterior guidance for unrevealed samples. Extensive experiments validate TRACE’s superiority over state-of-the-art baselines and confirm the retrospective completion module as a model-agnostic enhancer for existing systems. Our code is available at https://github.com/LunaZhangxy/TRACE.

[837] A Layer Separation Optimization Framework for Cross-Entropy Training in Deep Learning

Yaru Liu, Michael K. Ng, Yiqi Gu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper investigates the deep learning optimization problem with softmax cross-entropy loss. We propose a layer separation strategy to alleviate the strong nonconvexity encountered during training deep networks. For cross-entropy models with fully connected and convolutional neural networks, we introduce auxiliary variables associated with hidden layer outputs and construct corresponding layer separation models, which decompose the original deeply nested optimization problem into a sequence of more manageable subproblems. We also conduct theoretical analyses, proving that the new layer separation loss provides an upper bound for the original cross-entropy loss. Moreover, we design alternating minimization algorithms and prove that, under appropriate conditions, these algorithms exhibit decreasing properties of the loss function. Numerical experiments validate the effectiveness of the proposed methods and indicate improved optimization behavior, especially for fully connected and convolutional neural networks.

[838] Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data

Long Jing, Zhixiong Yang, Yajun Zhang, Xinlong Feng

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Human activity recognition serves as the foundation for various emerging applications. In recent years, researchers have used collaborative sensing of multi-source sensors to capture complex and dynamic human activities. However, multimodal human activity sensing typically encounters highly heterogeneous data across modalities and label scarcity, resulting in an application gap between existing solutions and real-world needs. In this paper, we propose CLMM, a general contrastive learning framework for human activity recognition that achieves effective multimodal recognition with limited labeled data. CLMM employs a novel two-stage training strategy. In the first stage, CLMM employs a CNN-DiffTransformer encoder to capture cross-modal shared information by extracting local and global features. Meanwhile, a hard-positive samples weighting algorithm enhances gradient propagation to reinforce shared learning. In the second stage, a dual-branch architecture combining quality-guided attention and bidirectional gated units captures modality-specific information, while a primary-auxiliary collaborative training strategy fuses both shared and modality-specific information. Experimental results on three public datasets demonstrate that CLMM significantly improves state-of-the-art baselines in both recognition accuracy and convergence performance.

[839] Revisable by Design: A Theory of Streaming LLM Agent Execution

Zhiyuan Zhai, Ming Li, Xin Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Current LLM agents operate under an implicit but universal assumption: execution is a transaction – the user submits a request, the agent works in isolation, and only upon completion does the dialogue resume. This forces users into a binary choice: wait for a potentially incorrect output, or interrupt and lose all progress. We reject this assumption and propose the stream paradigm, in which agent execution and user intervention are concurrent, interleaved processes sharing a bidirectional channel. We formalize this paradigm through a reversibility taxonomy that classifies every agent action as Idempotent, Reversible, Compensable, or Irreversible, and arrive at a core conclusion: an agent’s flexibility is bounded by its reversibility. We prove that conflicting compensable actions impose unavoidable adaptation costs and that conflicting irreversible actions make full specification satisfaction impossible – these costs are properties of the action space, not of the algorithm. Guided by this insight, we present the Revision Absorber, a reactive algorithm based on the Earliest-Conflict Rollback rule that is structurally optimal under mild assumptions. Experiments on StreamBench with real LLM agents validate all predictions: the Absorber matches the quality of a brute-force full-restart baseline while wasting an order of magnitude fewer steps of already-completed work, turning mid-execution revisions from a dead-end into a first-class interaction.

[840] An Analysis of Active Learning Algorithms using Real-World Crowd-sourced Text Annotations

Varun Totakura, Ankita Singh, Yushun Dong, Shayok Chakraborty

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and tremendously reduce human annotation effort in inducing a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible, that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances, which cannot be guaranteed in real-world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect / noisy oracles. Existing research on active learning with noisy oracles typically simulate the oracles using machine learning models; however, real-world situations are much more challenging, and using ML models to simulate the annotation patterns may not appropriately capture the nuances of real-world annotation challenges. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from crowd-sourced workers through a crowd-sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses sheds light on the performance of these techniques under real-world challenges, where annotators can provide incorrect labels, and can also refuse to provide labels. We hope this research will provide valuable insights that will be useful for the deployment of deep active learning systems in real-world applications. The obtained annotations can be accessed at https://github.com/varuntotakura/al_rcta/.

[841] CombiMOTS: Combinatorial Multi-Objective Tree Search for Dual-Target Molecule Generation

Thibaud Southiratn, Bonil Koo, Yijingxiu Lu, Sun Kim

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Dual-target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety and resistance mitigation. Existing approaches face two critical challenges. First, by simplifying the complex dual-target optimization problem to scalarized combinations of individual objectives, they fail to capture important trade-offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis-aware methodologies tailored to the dual-target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual-target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real-world databases demonstrate that CombiMOTS produces novel dual-target molecules with high docking scores, enhanced diversity, and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual-target drug discovery. The code and data is accessible through https://github.com/Tibogoss/CombiMOTS.

[842] CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.

[843] GIFT: Global stabilisation via Intrinsic Fine Tuning

Rory Young, Nicolas Pugeault

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Deep reinforcement learning policies achieve strong performance in complex continuous control environments with nonlinear contact forces. However, these policies often produce chaotic state dynamics, with trivially small changes to the initial conditions significantly impacting the long-term behaviour of the control system. This high sensitivity to initial conditions limits the application of Deep RL to real-world control systems where performance and stability guarantees are often required. To address this issue, we propose Global stabilisation via Intrinsic Fine Tuning (GIFT), a general-purpose training framework which directly optimises the global stability of existing high-performing deep RL policies using a custom reward function. We demonstrate that GIFT increase the stability of the control interaction while maintaining comparable task performance, thereby improving the suitability of deep RL policies for real-world control systems.

[844] Layer Embedding Deep Fusion Graph Neural Network

Taihua Xu, Genhao Tian, Jicong Fan, Xibei Yang, Qinghua Zhang, Yun Cui

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance in learning representations from graph-structured data. However, their message-passing mechanism inherently relies on the assumption of label consistency among connected nodes, limiting their applicability to low-homophily settings. Moreover, since message passing operates as a hierarchical diffusion process, GNNs face challenges in capturing long-range dependencies. As network depth increases, the structural noise along heterophilic edges tends to be amplified, resulting in over-smoothing. This issue becomes especially prominent in highly heterophilic graphs, where the propagation of inconsistent semantics across the topology continually exacerbates misaggregation. To address this issue, we propose a novel framework named Layer Embedding Deep Fusion Graph Neural Network (LEDF-GNN). Specifically, we design a Layer Embedding Deep Fusion (LEDF) operator that nonlinearly fuses multi-layer embeddings to capture inter-layer dependencies and effectively alleviate deep propagation degradation. Meanwhile, to mitigate structural heterophily, LEDF-GNN employs a Dual-Topology Parallel Strategy (DTPS) that simultaneously leverages the original and reconstructed topologies, allowing for adaptive structure-semantics co-optimization under diverse homophily conditions. Extensive semi-supervised classification experiments on the citation and image benchmarks demonstrate that, under both homophilic and heterophilic settings, LEDF-GNN consistently outperforms state-of-the-art baselines, validating its effectiveness and generalization capability across diverse graph types.

[845] Process Supervision of Confidence Margin for Calibrated LLM Reasoning

Liaoyaqi Wang, Chunsheng Zuo, William Jurayj, Benjamin Van Durme, Anqi Liu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improve large language models (LLM) reasoning ability. Yet, outcome-based reward often incentivizes models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (\textbf{RLCM}), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, code, logic and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.

[846] TEMPO: Transformers for Temporal Disease Progression from Cross-Sectional Data

Hongtao Hao, Joseph L. Austerweil

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Event-Based Models (EBMs) infer biomarker progression from cross-sectional data but typically only as ordinal sequences and rely on rigid model assumptions. We propose \textsc{Tempo}, a Transformer architecture that learns both ordinal and continuous event sequences through simulation-based supervised learning. \textsc{Tempo} uses two Transformer modules: one treats biomarkers as tokens to infer event sequencing; the other treats patients as tokens, representing each by their per-biomarker abnormality profile, to infer patients’ disease stages. On synthetic benchmarks, \textsc{Tempo} reduces normalized Kendall’s Tau distance by 52.89% and staging MAE by 25.33% compared to state-of-the-art SA-EBM, with larger reductions in high-dimensional settings (58.88% and 61.10%). Applied to ADNI, \textsc{Tempo} recovers a biologically plausible Alzheimer’s progression: early medial temporal atrophy, followed by amyloid accumulation and cognitive decline, and late-stage tau pathology with terminal acceleration of global neurodegeneration – broadly consistent with established disease models. \textsc{Tempo} also eliminates the need to derive custom inference algorithms and enables rapid empirical comparison of generative hypotheses.

[847] When Context Sticks: Studying Interference in In-Context Learning

Hanna Rød, Dagny Streit, Nils Valseth Selte, Justin Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper investigates context stickiness in in-context learning (ICL), a phenomenon where earlier examples in a prompt interfere with a transformer’s ability to adapt to later tasks. Using synthetic regression tasks over linear and quadratic functions, we examine how models trained under sequential, mixed, and random curricula handle abrupt task switches during inference. By sweeping over structured combinations of misleading linear examples followed by recovery quadratic examples, we quantify how prior context biases prediction error and how quickly models realign. Our results show strong evidence of persistent interference: more preceding linear examples reliably degrade quadratic predictions, while additional quadratic examples reduce error but with diminishing returns. We further find that training curricula significantly modulate resilience, with sequential training on the target function class yielding the fastest recovery, and surprisingly, random training producing the least robust behavior.

[848] V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

Bingda Tang, Yuhui Zhang, Xiaohan Wang, Jiayuan Mao, Ludwig Schmidt, Serena Yeung-Levy

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.

[849] Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening

Duc N. Do, Minh N. Do, Dang Nguyen, Khanh T. Q. Le, Khoa D. Pham, Hung N. Huynh, Phi Pham-Van-Hoang, Quan K. Huynh, Ramez M. Odat, Perisa Ashar, Ethan Philip Lowder, Minh H. N. Le, Hoang Le, Phat V. H. Nguyen, Quan Le, Jacques Kpodonu, Phat K. Huynh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but first-line screening is limited by cost, workflow burden, and specialist availability. We evaluated whether open pretrained electrocardiogram (ECG) foundation models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. Six echocardiography-derived abnormalities were targeted: reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then applied in-domain self-supervised adaptation of an ECG foundation model (ECG-FM) on EchoNext waveforms followed by selective supervised fine-tuning, and evaluated trade-offs between discrimination and adaptation cost. Adapted ECG-FM models achieved the best overall performance: peak macro-AUROC 0.8509 and macro-AUPRC 0.4297, while a parameter-efficient operating point preserved AUROC (0.8501) and attained the highest fixed-threshold macro-F1 0.3691. Late fusion with covariates did not improve threshold-independent discrimination, and evaluated LoRA, alternative backbones, and mixture-of-foundations strategies did not surpass the best adapted single-backbone models. These results indicate that for ECG-based case finding and echocardiography triage, combining target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone is the most effective transfer strategy.

[850] Approximating Uniform Random Rotations by Two-Block Structured Hadamard Rotations in High Dimensions

Tomer Zilca, Gal Mendelson

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Uniform random rotations are a useful primitive in applications such as fast Johnson-Lindenstrauss embeddings, kernel approximation, communication-efficient learning, and recent AI compression pipelines, but they are computationally expensive to generate and apply in high dimensions. A common practical replacement is repeated structured random rotations built from Walsh-Hadamard transforms and random sign diagonals. Applying the structured random rotation twice has been shown empirically to be useful, but the supporting theory is still limited. In this paper we study the approximation quality achieved when using this two-block structured Hadamard rotation. Our results are both positive and negative. On the positive side, we prove that every fixed coordinate of the two-block transform converges uniformly, over all inputs, to the corresponding coordinate of a uniformly rotated vector, with an explicit Kolmogorov-distance bound of order $d^{-1/5}$. On the negative side, we prove an explicit lower bound on the Wasserstein distance between the full vector distributions, showing that the two-block transform is not a globally accurate surrogate for a uniform random rotation in the worst case. For the extremal input used in the lower bound, we also prove a matching asymptotic upper bound, showing that the lower-bound scale is sharp for that input. Taken together, the results identify a clear separation between one-dimensional marginal behavior, where approximation improves with dimension, and full high-dimensional geometry, where a nonvanishing discrepancy remains. This provides a partial theoretical explanation for the empirical success of structured Hadamard rotations in some algorithms, while also clarifying the limitations of treating them as drop-in replacements for true uniform random rotations.

[851] Evolve: A Persistent Knowledge Lifecycle for Small Language Models

Dikran Hovagimian

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Evolve pairs a small local language model with a persistent, teacher-compiled knowledge store – refined through sleep consolidation and usage-driven refresh – to deliver substantial accuracy gains over the model’s parametric baseline while amortizing teacher costs through cross-query knowledge reuse. Rather than retrieving document fragments at query time, Evolve constructs a store of semantically coherent sections compiled by teacher models at natural conceptual boundaries; new sections are staged on acquisition, consolidated offline through teacher-mediated merging, and refreshed inline when expired. A 2B-parameter local model handles classification and generation; large teacher models are invoked only for knowledge operations. Across 750 benchmark queries spanning custom specialist questions, NaturalQuestions, and TriviaQA, the 2B model augmented by Evolve improves from 20-33% baseline accuracy to 60-84% (+40-52pp) while reducing teacher invocations by over 50% through reuse. Post-consolidation compresses the knowledge store by 31-33.5% across three independent benchmarks while preserving accuracy; section-based retrieval outperforms chunk-based retrieval by 5-9pp across every lifecycle condition. The architecture supports two generation modes over the same lifecycle – suppress (strict section-only grounding, auditable) and augment (section-supplemented responses).

[852] When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

Lucky Verma

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT’s sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing alpha at 118M monotonically reduces DyT’s penalty, and vanilla+dropout(p=0.5) matches DyT’s data-rich loss. We also localize Llama-DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3-seed component ablation (r=0.94). Scope: all experiments are compute-limited (T/P < 1.84), below Chinchilla-optimal training.

[853] Machine learning models for estimating counterfactuals in a single-arm inflammatory bowel disease study

Dan Liu, Fida K. Dankar, Jennifer C. deBruyn, Amanda Ricciuto, Anne M. Griffiths, Thomas D. Walters, Khaled EI Emam

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Single-arm trials accelerate study timelines by reducing the number of patients that must be recruited for a concurrent control group. However, these designs require an alternative comparator to estimate treatment effects. One approach is to construct a virtual control arm using a machine learning (ML) model trained on external control data to predict the counterfactual outcomes of the treatment arm. Our aim in this study was to leverage virtual controls by developing and evaluating ML-based counterfactual outcome models trained on IFX-treated patients to predict 1-year steroid-free clinical remission (SFCR ) and a composite of C-reactive protein remission plus steroid-free clinical remission (CRP-SFCR) for ADA-treated pediatric Crohn’s disease patients, and to compare the resulting IFX-versus-ADA treatment effect estimates with those obtained using propensity score matching to external controls. Five ML models were used to train counterfactual models on the observed IFX cohort data. The resulting models were used to predict the counterfactual outcomes for the ADA arm patients. LGBM yields the best OR closest to the propensity score matched reference, and all 95% CI results align with the conclusion from the reference study that no statistical difference in the primary and secondary outcomes has been observed between the patients treated with ADA or IFX. Our study supports virtual controls as a viable and effective substitute for expensive, lengthy or unethical patient recruitment in an inflammatory bowel disease (IBD) trial. The developed gradient boosted prediction model can be used as a pretrained model to generate IFX counterfactual predictions in future studies, pending external validation and assessment of transportability.

[854] Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: NVIDIA’s CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.

[855] Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

Divakar Kumar Yadav, Tian Zhao

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps. We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime. These results indicate that hybrid JIT-CUDA Graph execution can effectively reduce inference latency and variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications.

[856] GeoCert: Certified Geometric AI for Reliable Forecasting

Regina Zhang, Zongru Li, Honggang Wen, Xiaofeng Liu, Siu-Ming Yiu, Pietro Liò, Kwok-Yan Lam

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Forecasting systems in science must be accurate, physically consistent, and certifiably reliable. Most existing models address prediction, constraint enforcement, and verification separately, limiting scalability and interpretability. We introduce GeoCert, a geometric AI framework that unifies forecasting, physical reasoning, and formal verification within a single differentiable computation. GeoCert formulates forecasting as evolution along a hyperbolic manifold, where negative curvature induces contraction dynamics, intrinsic robustness, and logarithmic-time certification. A hierarchical constraint architecture separates universal physical laws from domain-specific dynamics, enabling certified generalization across energy, climate, finance, and transportation systems. GeoCert achieves state-of-the-art accuracy while reducing computational cost by 97.5% and maintaining better certification rates. By embedding verification into the geometry of learning, GeoCert transforms forecasting from empirical approximation to formally verified inference, offering a scalable foundation for trustworthy, reproducible, and physically grounded scientific AI.

[857] Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers

Audrey Cherilyn, Houman Safaai

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We study the organization of channel-level importance in transformer feed-forward networks (FFNs). Using a Fisher-style loss proxy (LP) based on activation-gradient second moments, we show that loss sensitivity is concentrated in a small set of channels within each layer. In Llama-3.1-8B, the top 1% of channels per layer accounts for a median of 58.7% of LP mass, with a range of 33.0% to 86.1%. We call these loss-critical channels supernodes. Although FFN layers also contain strong activation outliers, LP-defined supernodes overlap only weakly with activation-defined outliers and are not explained by activation power or weight norms alone. Around this core, we find a weaker but consistent halo structure: some non-supernode channels share the supernodes’ write support and show stronger redundancy with the protected core. We use one-shot structured FFN pruning as a diagnostic test of this organization. At 50% FFN sparsity, baselines that prune many supernodes degrade sharply, whereas our SCAR variants explicitly protect the supernode core; the strongest variant, SCAR-Prot, reaches perplexity 54.8 compared with 989.2 for Wanda-channel. The LP-concentration pattern appears across Mistral-7B, Llama-2-7B, and Qwen2-7B, remains visible in targeted Llama-3.1-70B experiments, and increases during OLMo-2-7B pretraining. These results suggest that LLM FFNs develop a small learned core of loss-critical channels, and that preserving this core is important for reliable structured pruning.

[858] Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

Lichen Li, Hengguang Zhou, Yijun Liang, Tianyi Zhou, Cho-Jui Hsieh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging hacking in the wild remains unclear. In this work, we present a systematic analysis of the synthetic vs. in-the-wild discrepancy in reward hacking. We examine to what extent hacking behaviors induced by prompting resemble those emerging during RL training, and whether monitors trained on synthetic trajectories generalize to naturally arising but previously unseen hacking. To scale up the curation of in-the-wild reward hacking trajectories, we modified Group Relative Policy Optimization (GRPO) by injecting conflicting unit tests as tracers and applying a “resampling-until-hack” mechanism. Through controlled comparisons between monitors trained on synthetic versus in-the-wild data, we find that (1) synthetic-data-trained monitors fail to generalize to “in-the-wild” hacking, and (2) monitors trained on our “in-the-wild” trajectories demonstrate stronger generalizability to unseen hacking types. Our results indicate that synthetic reward hacking data may not fully reflect natural reward hacking behaviors, and that relying solely on synthetic data can lead to misleading conclusions. The codebase is available at https://github.com/LichenLillc/CoTMonitoring.git

[859] Interpretable Physics-Informed Load Forecasting for U.S. Grid Resilience: SHAP-Guided Ensemble Validation in Hybrid Deep Learning Under Extreme Weather

Md Abubakkar, Sajib Debnath, Md. Uzzal Mia

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate short-term electricity load forecasting is a cornerstone of U.S. grid reliability; however, prevailing deep learning models remain opaque, limiting operator trust during extreme weather. A unified, interpretable, physics-informed ensemble framework is proposed, integrating a Convolutional Neural Network (CNN) branch for local feature extraction and a Transformer branch for long-range dependency modeling; the branches are fused through a validation-optimized weighted ensemble and regularized by a physics-informed loss derived from the piecewise parabolic temperature-demand relationship of the Electric Reliability Council of Texas (ERCOT) system. Post-hoc interpretability is provided through SHapley Additive exPlanations (SHAP) with the DeepExplainer backend, yielding global and event-level attributions. Using eight years of ERCOT hourly load data (2018-2025) fused with Automated Surface Observing System (ASOS) records from three Texas stations, the framework achieves 713 MW MAE, 812 MW RMSE, and 1.18% MAPE on the test window. For Hampel-flagged extreme events, MAPE falls by 20.7% relative to its Transformer branch and by 40.5% relative to its CNN branch; an ablation confirms that the parabolic and ramp constraints drive a 14.7% RMSE reduction. SHAP analysis reveals a regime shift: temperature dominates under normal operation, whereas wind speed and precipitation become more influential during cold fronts and heatwaves.

[860] Autocorrelation Reintroduces Spectral Bias in KANs for Time Series Forecasting

Chen Zeng, Jiahui Wang, Qiao Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Existing theory suggests that Kolmogorov-Arnold Networks (KANs) can overcome the spectral bias commonly observed in neural networks under the assumption that inputs are statistically independent. However, this assumption does not hold in time series forecasting (TSF), where inputs are lagged observations with strong temporal autocorrelation. Through theoretical analysis and empirical validation, we obtain an unexpected finding: temporal autocorrelation reintroduces spectral bias in KANs, and the bias becomes increasingly pronounced as the degree of autocorrelation increases. This suggests that standard KANs may face substantial difficulties in TSF with strongly autocorrelated inputs. To address this problem, we introduce the Discrete Cosine Transform (DCT) to reduce the correlations among the network inputs. As expected, experimental results reveal that DCT preprocessing substantially reduces the observed low-frequency preference in TSF. This result also corroborates that the spectral bias of KANs in TSF tasks is indeed induced by the autocorrelation among input variables.

[861] When PINNs Go Wrong: Pseudo-Time Stepping Against Spurious Solutions

Sifan Wang, Shawn Koohy, Yiping Lu, Paris Perdikaris

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Physics-informed neural networks (PINNs) provide a promising machine learning framework for solving partial differential equations, but their training often breaks down on challenging problems, sometimes converging to physically incorrect solutions despite achieving small residual losses. This failure, we argue, is not merely an optimization difficulty. Rather, it reflects a fundamental weakness of the empirical PDE residual loss, which can admit trivial or spurious solutions during training. From this perspective, we revisit pseudo-time stepping, a technique that has recently shown strong empirical success in PINNs. We show that its main benefit is not simply to ease optimization; instead, when combined with collocation-point resampling, it helps reveal and avoid spurious solutions. At the same time, we find that the effectiveness of pseudo-time stepping depends critically on the choice of step size, which cannot be tuned reliably from the training loss alone. To overcome this limitation, we propose an adaptive pseudo-time stepping strategy that selects the step size from a finite-difference surrogate of the local residual Jacobian, yielding the largest step permitted by local stability without per-problem tuning. Across a diverse set of PDE benchmarks, the proposed method consistently improves both accuracy and robustness. Together, these findings provide a clearer understanding of why PINNs fail and suggest a practical pathway toward more reliable physics-informed learning. All code and data accompanying this manuscript are available at https://github.com/sifanexisted/jaxpi2.

[862] On the Memorization of Consistency Distillation for Diffusion Models

Bingqing Jiang, Difan Zou

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion models are central to modern generative modeling, and understanding how they balance memorization and generalization is critical for reliable deployment. Recent work has shown that memorization in diffusion models is shaped by training dynamics, with generalization and memorization emerging at different stages of training. However, deployed diffusion models are often further distilled, introducing an additional training phase whose impact on memorization is not well understood. In this work, we analyze how distillation reshapes memorization behavior in diffusion models, taking consistency distillation as a representative framework. Empirically, we show that when applied to a teacher model that has memorized data, consistency distillation significantly reduces transferred memorization in the student while preserving, and sometimes improving, sample quality. To explain this behavior, we provide a theoretical analysis using a random feature neural network model [Bonnaire et al., 2025], showing that consistency distillation suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes. Our findings suggest that distillation can serve not only as an acceleration tool, but also as a mechanism for improving the memorization-generalization trade-off.

[863] CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

Rahul Narava, Siddharth Verma, Ojas Jain, Shashi Shekhar Jha, Mayank Shekhar Jha

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to safety violations. Control-theoretic approaches, in contrast, offer hard constraint-based safety guarantees but typically assume access to known system dynamics or require accurate estimation of control-affine models. In this paper, we propose a safe reinforcement learning framework that learns a probabilistic control-affine dynamics model in an offline setting. The learned model is leveraged to explicitly construct control barrier functions (CBFs) that incorporate model uncertainty to provide conservative safety constraints. These CBF constraints are enforced through an online constraint-based action correction mechanism, enabling safe exploration without overly restricting task performance. Empirical evaluations on nonlinear, complex continuous-control benchmarks demonstrate that our approach achieves returns comparable to those of existing baselines while significantly reducing safety violations.

[864] Hamiltonian Graph Inference Networks: Joint structure discovery and dynamics prediction for lattice Hamiltonian systems from trajectory data

Ru Geng, Panayotis Kevrekidis, Yixian Gao, Hong-Kun Zhang, Jian Zu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Lattice Hamiltonian systems underpin models across condensed matter, nonlinear optics, and biophysics, yet learning their dynamics from data is obstructed by two unknowns: the interaction topology and whether node dynamics are homogeneous. Existing graph-based approaches either assume the graph is given or, as in $α$-separable graph Hamiltonian network, infer it only for separable Hamiltonians with homogeneous node dynamics. We introduce the Hamiltonian Graph Inference Network (HGIN), which jointly recovers the interaction graph and predicts long-time trajectories from state data alone, for both separable and non-separable Hamiltonians and under heterogeneous node dynamics. HGIN couples a structure-learning module – a learnable weighted adjacency matrix trained under a Hamilton’s-equations loss – with a trajectory-prediction module that partitions edges into physically distinct subgraphs via $k$-means clustering, assigning each subgraph its own encoder and thereby breaking the parameter-sharing bottleneck of conventional GNNs. On three benchmarks – a Klein–Gordon lattice with long-range interactions and two discrete nonlinear Schrödinger lattices (homogeneous and heterogeneous) – HGIN reduces long-time energy prediction error and trajectory prediction error by six to thirteen orders of magnitude relative to baselines. A symmetry argument on the Hamiltonian loss further shows that the learned weights encode the parity of the underlying pair potential, yielding an interpretable readout of the system’s interaction structure.

[865] Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

Giansalvo Cirrincione

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN “plays no role” is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP’s irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature – rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse – are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer’s forward pass.

[866] Can an MLP Absorb Its Own Skip Connection?

Antonij Mijoski, Marko Karbevski

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree $k \neq 1$, such as ReLU$^2$ and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with $g(0) = 0$, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of $L$ residual blocks using such activations cannot be replicated by any composition of $L$ residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set $S$ of size at least $d$ such that $W_{\mathrm{down}}[:,S],W_{\mathrm{up}}[S,:] = -I_d$. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.

[867] OptProver: Bridging Olympiad and Optimization through Continual Training in Formal Theorem Proving

Chenyi Li, Yanchen Nie, Zhengyu Ming, Gong Zhang, Kun Yuan, Zaiwen Wen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in formal theorem proving have focused on Olympiad-level mathematics, leaving undergraduate domains largely unexplored. Optimization, fundamental to machine learning, operations research, and scientific computing, remains underserved by existing provers. Its reliance on domain-specific formalisms (convexity, optimality conditions, and algorithmic analysis) creates significant distribution shift, making naive domain transfer ineffective. We present OptProver, a trained model that achieves robust transfer from Olympiad to undergraduate optimization. Starting from a strong Olympiad-level prover, our pipeline mitigates distribution shift through two key innovations. First, we employ large-scale optimization-focused data curation via expert iteration. Second, we introduce a specialized preference learning objective that integrates perplexity-weighted optimization with a mechanism to penalize valid but non-progressing proof steps. This not only addresses distribution shifts but also guides the search toward efficient trajectories. To enable rigorous evaluation, we construct a novel benchmark in Lean 4 focused on optimization. On this benchmark, OptProver achieves state-of-the-art Pass@1 and Pass@32 among comparably sized models while maintaining competitive performance on general theorem-proving tasks, demonstrating effective domain transfer without catastrophic forgetting.

[868] Quasi-Equivariant Metanetworks

Viet-Hoang Tran, An Nguyen, Benoît Guérand, Thieu N. Vo, Tan M. Nguyen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Metanetworks are neural architectures designed to operate directly on pretrained weights to perform downstream tasks. However, the parameter space serves only as a proxy for the underlying function class, and the parameter-function mapping is inherently non-injective: distinct parameter configurations may yield identical input-output behaviors. As a result, metanetworks that rely solely on raw parameters risk overlooking the intrinsic symmetries of the architecture. Reasoning about functional identity is therefore essential for effective metanetwork design, motivating the development of equivariant metanetworks, which incorporate equivariance principles to respect architectural symmetries. Existing approaches, however, typically enforce strict equivariance, which imposes rigid constraints and often leads to sparse and less expressive models. To address this limitation, we introduce the novel concept of quasi-equivariance, which allows metanetworks to move beyond the rigidity of strict equivariance while still preserving functional identity. We lay down a principled basis for this framework and demonstrate its broad applicability across diverse neural architectures, including feedforward, convolutional, and transformer networks. Through empirical evaluation, we show that quasi-equivariant metanetworks achieve good trade-offs between symmetry preservation and representational expressivity. These findings advance the theoretical understanding of weight-space learning and provide a principled foundation for the design of more expressive and functionally robust metanetworks.

[869] Impact of Age Specialized Models for Hypoglycemia Classification

Beyza Cinar, Maria Maleshkova

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Disease progression varies with age and is influenced by underlying genetic, biochemical, and hormonal etiologies, suggesting the need for tailored monitoring, care, and medication beyond standard clinical guidelines. Specifically, in autoimmune diseases like type 1 diabetes (T1D), where patients depend on exogenous insulin to compensate for insulin deficiency, medication dosing and the physiological response reflected in vital signs can differ. Insulin therapy can lead to hypoglycemia, a dangerous condition characterized by decreased blood glucose levels ($\leq$70). This risk can be mitigated through improved diabetes management supported by data analytics. Notably, leveraging data from continuous glucose monitoring (CGM) devices, hypoglycemia onset can be predicted. However, while glucose variability, auto-antibody levels, and hypoglycemia occurrence differ across age groups, hypoglycemia classification most often only relies on population-based models specialized in specific age ranges. In this work, we classify hypoglycemia 0, 5-15, 20-45, and 50-120 minutes before onset using DiaData, a large CGM dataset of patients with T1D ranging from children to seniors. In particular, we investigate: 1) the generalizability of a population-based model including all age groups, 2) the impact of age-segmented models trained separately per age group, and 3) the effect of model individualization through transfer learning. The results show that a global population-based model yields similar or superior performance compared to age-segmented models. These findings suggest that data from children, teenagers, and adults can be combined for training models on hypoglycemia classification. While glucose variation differs across age groups, short-term hypoglycemic patterns are similar. However, data of children obtain their best recall with age specialized model.

[870] Transformer as an Euler Discretization of Score-based Variational Flow

Huadong Liao

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Despite the Transformer’s dominance across machine learning, its architecture remains largely heuristic and lacks a unified theoretical foundation. We introduce Score-based Variational Flow (SVFlow), a continuous-time dynamical system for representation learning in which the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores, and provide a principled basis for regularization through variational consistency. We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry. This unification explains why attention trains stably without explicit regularization while MoE requires auxiliary balancing losses. Experiments on pre-trained language models with prefix shuffling show that SVFlow-induced metrics correlate with task performance, reveal depth-dependent sensitivity, and reflect the intrinsic dynamics of attention.

[871] SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, Valentina Pyatkin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.

[872] The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

Shuaizhi Cheng, Xiang Shi, Mingwei Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM’s weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model’s log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.

[873] Agentic Fusion of Large Atomic and Language Models to Accelerate Materials Discovery

Mingze Li, Yu Rong, Songyou Li, Lihong Wang, Jiacheng Cen, Liming Wu, Anyi Li, Zongzhao Li, Qiuliang Liu, Rui Jiao, Tian Bian, Pengju Wang, Hao Sun, Jianfeng Zhang, Ji-Rong Wen, Deli Zhao, Shifeng Jin, Tingyang Xu, Wenbing Huang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The discovery of novel materials is critical for global energy and quantum technology transitions. While deep learning has fundamentally reshaped this landscape, existing predictive or generative models typically operate in isolation, lacking the autonomous orchestration required to execute the full discovery process. Here we present ElementsClaw, an agentic framework for materials discovery that synergizes Large Atomic Models (LAMs) with Large Language Models (LLMs). In response to varied human requirements, ElementsClaw dynamically orchestrates a suite of LAM tools finetuned from our proposed model Elements for atomic-scale numerical computation, while leveraging LLMs for high-level semantic reasoning. This shift moves AI-driven materials science from isolated processes toward integrated and human interactive discovery. In the demanding domain of superconductors, our agentic system guides the experimental synthesis of four new superconductors, including Zr3ScRe8 with a transition temperature of 6.8 K and HfZrRe4 at 6.7 K. At scale, ElementsClaw screens more than 2.4 million stable crystals within only 28 GPU hours, identifying 68,000 high-confidence superconducting candidates and vastly expanding the known superconducting space. These results demonstrate how our agent accelerates materials discovery with high physical fidelity.

[874] Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks

Vugar Ismailov

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We analyze the universal approximation property of Kolmogorov-Arnold Networks (KANs) in terms of their edge functions. If these functions are all affine, then universality clearly fails. How many non-affine functions are needed, in addition to affine ones, to ensure universality? We show that a single one suffices. More precisely, we prove that deep KANs in which all edge functions are either affine or equal to a fixed continuous function $σ$ are dense in $C(K)$ for every compact set $K\subset\mathbb{R}^n$ if and only if $σ$ is non-affine. In contrast, for KANs with exactly two hidden layers, universality holds if and only if $σ$ is nonpolynomial. We further show that the full class of affine functions is not required; it can be replaced by a finite set without affecting universality. In particular, in the nonpolynomial case, a fixed family of five affine functions suffices when the depth is arbitrary. More generally, for every continuous non-affine function $σ$, there exists a finite affine family $A_σ$ such that deep KANs with edge functions in $A_σ\cup{σ}$ remain universal. We also prove that KANs with the spline-based edge parameterization introduced by Liu et al.~\cite{Liu2024} are universal approximators in the classical sense, even when the spline degree and knot sequence are fixed in advance.

[875] WISE-FM:Operation-Aware, Engineering-Informed Foundation Model for Multi-Task Well Design

Carine de Menezes Rebello, Anderson Rapello dos Santos, Idelfonso B. R. Nogueira

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Deploying machine learning models across diverse well portfolios requires generalisation to wells with design parameters outside the training distribution. Current data-driven approaches to virtual flow metering (VFM) and bottomhole estimation typically treat each well independently or ignore the influence of well design on operational behaviour. We present WISE (Well Intelligence and Systems Engineering Foundation Model), a design-aware, physics-informed multi-task model that integrates three complementary mechanisms: Feature-wise Linear Modulation (FiLM) and cross-modal attention to condition operational embeddings on well design parameters; multi-task learning for simultaneous prediction of flow rates, bottomhole conditions, and flow regime classification; and structural mass conservation with soft physics constraints derived from well engineering principles. Evaluation on the ManyWells benchmark (2000 simulated wells, $10^6$ data points) demonstrates that design-aware models reduce VFM prediction error by up to $13\times$ compared to design-unaware baselines, and that physics constraints reduce negative flow predictions by 65%. Flow regime classification achieves 97.7% bottomhole accuracy, providing continuous well integrity monitoring without additional sensors. The methodology transfers to real operational data from five Equinor Volve producers (oil rate $R^2 = 0.89$, bottomhole pressure $R^2 = 0.98$, water rate $R^2 = 0.97$). The trained model additionally serves as a fast surrogate for integrity-aware well design optimisation over a 24-dimensional design space, with more than $1000\times$ speedup over drift-flux simulations. These results demonstrate that design awareness, physics enforcement, and multi-task learning are essential and complementary ingredients for foundation models intended to operate across large well portfolios.

[876] A General Representation-Based Approach to Multi-Source Domain Adaptation

Ignavier Ng, Yan Li, Zijian Li, Yujia Zheng, Guangyi Chen, Kun Zhang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label’s Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of Markov blanket into those of the label’s parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.

[877] ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present \textbf{ELSA}, an algorithmic reformulation of online softmax attention that (i)~preserves exact softmax semantics in real arithmetic with a \emph{provable} $\mathcal{O}(u\log n)$ FP32 relative error bound; (ii)~casts the online softmax update as a prefix scan over an associative monoid $(m,S,W)$, yielding $O(n)$ extra memory and $O(\log n)$ parallel depth; and (iii)~is Tensor-Core independent, implemented in Triton and CUDA C++, and deployable as a \emph{drop-in replacement} requiring no retraining or weight modification. Unlike FlashAttention-2/3, which rely on HMMA/GMMA Tensor Core instructions and provide no compatible FP32 path, ELSA operates identically on A100s and resource-constrained edge devices such as Jetson TX2 – making it the only hardware-agnostic exact-attention kernel that reduces parallel depth to $O(\log n)$ at full precision. On A100 FP32 benchmarks (1K–16K tokens), ELSA delivers $1.3$–$3.5\times$ speedup over memory-efficient SDPA and $1.97$–$2.27\times$ on BERT; on Jetson TX2, ELSA achieves $1.5$–$1.6\times$ over Math (64–900 tokens), with $17.8$–$20.2%$ throughput gains under LLaMA-13B offloading at $\ge$32K. In FP16, ELSA approaches hardware-fused baselines at long sequences while retaining full FP32 capability, offering a unified kernel for high-precision inference across platforms. Our code and implementation are available at https://github.com/ming053l/ELSA.

[878] Causal Representation Learning from General Environments under Nonparametric Mixing

Ignavier Ng, Shaoan Xie, Xinshuai Dong, Peter Spirtes, Kun Zhang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Causal representation learning aims to recover the latent causal variables and their causal relations, typically represented by directed acyclic graphs (DAGs), from low-level observations such as image pixels. A prevailing line of research exploits multiple environments, which assume how data distributions change, including single-node interventions, coupled interventions, or hard interventions, or parametric constraints on the mixing function or the latent causal model, such as linearity. Despite the novelty and elegance of the results, they are often violated in real problems. Accordingly, we formalize a set of desiderata for causal representation learning that applies to a broader class of environments, referred to as general environments. Interestingly, we show that one can fully recover the latent DAG and identify the latent variables up to minor indeterminacies under a nonparametric mixing function and nonlinear latent causal models, such as additive (Gaussian) noise models or heteroscedastic noise models, by properly leveraging sufficient change conditions on the causal mechanisms up to third-order derivatives. These represent, to our knowledge, the first results to fully recover the latent DAG from general environments under nonparametric mixing. Notably, our results match or improve upon many existing works, but require less restrictive assumptions about changing environments.

[879] Reparameterization through Coverings and Topological Weight Priors

Maxim Beketov, Pavel Snopov

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We generalise the reparameterization trick applied in variational autoencoders (VAEs) letting these have latent spaces of non-trivial topology - i.e. that of base manifolds covered with other ones, on which some technique for RT is available. That is possible since covering maps are measurable - moreover, in case of particular measure preservation property holding for the covering, one can establish an inequality on KL-divergence between pushforward (PF) densities on the base latent manifold, making the KL-term of VAE’s ELBO analytically tractable, despite the topological non-triviality of the supporting latent manifold. Our development follows a route close but somewhat alternative to reparameterization on Lie groups, the latest proposal for which is to reparameterize PFs of normal densities from the Lie algebra - “through” the exponential map, seen by us as sometimes a particular case of what we propose to call reparameterization through a covering. Covering maps need not be global diffeomorphisms (although Lie-exp maps, in general, need not either, but, to date only smooth ones were considered in this context, to the best of our knowledge), which makes many non-trivial topologies tamable to our proposed technique, that we detail on a particular such example. We demonstrate the working of our approach by constructing a VAE with the latent space of Klein bottle (not a Lie group) topology, which we call KleinVAE, successfully learning an appropriate artificial dataset. We discuss potential applicability of such topology-informed generative models as weight priors in Bayesian learning, particularly for convolutional vision models, where said manifold was peculiarly shown to have some relevance.

[880] Symmetric Equilibrium Propagation for Thermodynamic Diffusion Training

Aditi De

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The reverse process in score-based diffusion models is formally equivalent to overdamped Langevin dynamics in a time-dependent energy landscape. In our prior work we showed that a bilinearly-coupled analog substrate can physically realize this dynamics at a projected three-to-four orders of magnitude energy advantage over digital inference by replacing dense skip connections with low-rank inter-module couplings. Whether the \emph{training} loop can be closed on the same substrate – without routing gradients through an external digital accelerator – has remained open. We resolve this affirmatively: Equilibrium Propagation applied directly to the bilinear energy yields an unbiased estimator of the denoising score-matching gradient in the zero-nudge limit. For finite nudging we derive a sharp bias bound controlled solely by substrate stiffness, local curvature, and the norm of the loss-gradient signal, with a bilinear-specific corollary showing that one dominant bias term vanishes identically for coupling-parameter updates. Symmetric nudging further upgrades the leading bias from $ \mathcal{O}(β) $ to $ \mathcal{O}(β^2) $ at negligible extra cost. Under realistic finite-relaxation budgets this upgrade is essential, as one-sided EqProp produces anti-correlated gradients while symmetric EqProp yields well-aligned updates. Bias-variance analysis determines the optimal operating point, and end-to-end physical-unit accounting projects a $ 10^3$-$10^4\times $ energy advantage per training step over a matched GPU baseline. Symmetric bilinear EqProp is the first local, readout-only training rule that preserves the low-rank coupling enabling scalable thermodynamic diffusion models.

[881] JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

Zhengding Hu, Hehua Ouyang, Chang Chen, Zaifeng Pan, Yue Guan, Zhongkai Yu, Zhen Wang, Steven Swanson, Yufei Ding

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalance hidden by stage-level systems. On this abstraction, JigsawRL resolves multiplexing interference through dynamic resource allocation, eliminates fragmented utilization by migrating long-tail rollouts across workers, and formulates their coordination as a graph scheduling problem solved with a look-ahead heuristic. On 4-64 H100/A100 GPUs across different agentic RL pipelines and models, JigsawRL achieves up to 1.85x throughput over Verl on synchronous RL, 1.54x over StreamRL and AReaL on asynchronous RL, and supports heterogeneous pipelines with moderate latency trade-off.

[882] Scalable Production Scheduling: Linear Complexity via Unified Homogeneous Graphs

Jonathan Hoss, Moritz Link, Noah Klarmann

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Efficiently solving the Job Shop Scheduling Problem in real-world industrial applications requires policies that are both computationally lean and topologically robust. While Reinforcement Learning has shown potential in automating dispatching rules, existing models often struggle with a scalability bottleneck caused by quadratic graph complexity or the architectural overhead of heterogeneous layers. We introduce a unified graph framework that employs feature-based homogenization to project distinct node roles into a shared latent space. This allows a standard homogeneous Graph Isomorphism Network to capture complex resource contention with linear complexity, ensuring low-latency inference for large-scale industrial applications. Our empirical results demonstrate that our framework achieves state-of-the-art performance while exhibiting consistent zero-shot generalization. We identify the job-to-machine ratio as the primary driver of policy effectiveness, rather than absolute problem size. Based on this, we propose a hypothesis of structural saturation, demonstrating that policies trained on critically congested instances ($\mathcal{J} \approx \mathcal{M}$) learn scale-invariant resolution strategies. Agents trained at this saturation point internalize invariant conflict-resolution logic, allowing them to treat massive rectangular instances as a sequential concatenation of saturated sub-problems. This approach eliminates the need for expensive scale-specific retraining and prevents overfitting to statistical shortcuts, providing a robust and efficient pathway for deploying RL solutions in dynamic production environments.

[883] Graph Memory Transformer (GMT)

Nicola Zanarini, Niccolò Ferrari

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.

[884] Inverting Foundation Models of Brain Function with Simulation-Based Inference

Niels Bracher, Xavier Intes, Stefan T. Radev

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Foundation models of brain activity promise a new frontier for in silico neuroscience by emulating neural responses to complex stimuli across tasks and modalities. A natural next step is to ask whether these models can also be used in reverse. Can we recover a stimulus or its properties from synthetic brain activity? We study this question in a proof-of-concept setting using TRIBEv2. We pair the brain emulator with large language models (LLMs) that generate news headlines from linguistic parameters such as valence, arousal, and dominance. We then use simulation-based inference to learn a probabilistic mapping from brain maps to latent stimulus parameters. Our results show that these parameters can be recovered from predicted brain maps, validating the quality of neural encodings. They also show that LLMs can serve as controllable stimulus generators for simulated experiments. Together, these findings provide a step toward decoding and inverse design with foundation brain models.

[885] Learning Interpretable PDE Representations for Generative Reconstructions with Structured Sparsity

Valerie Tsao, Nathaniel Chaney, Manolis Veveakis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Scientific measurements are often bottlenecked by suboptimal conditions, whether that be noise, incomplete spatial coverage, or limited resolution, rendering accurate field reconstruction a difficult task. We introduce LatentPDE, a latent diffusion framework designed to simultaneously resolve sparse-observation reconstruction and super-resolution. While existing physics-guided diffusion models typically rely on soft loss penalties or uninterpretable representations, our approach enforces physical compliance by constructing an inherently interpretable latent space. Specifically, we parameterize the latent variables directly as the coefficients and source terms of an assumed governing PDE. In doing so, LatentPDE is able to reliably reconstruct dynamics across highly disparate and structured data gaps. Empirical results on diverse configurations demonstrate that our model achieves high-fidelity recovery at any desired resolution while also tracking the underlying predictive uncertainty.

[886] ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards can therefore misrepresent which methods remain feasible under deployment-relevant constraints. We present ECoLAD (Efficiency Compute Ladder for Anomaly Detection), a deployment-oriented evaluation protocol instantiated as an empirical study on proprietary automotive telemetry (anomaly rate ${\approx}$0.022) and complementary public benchmarks. ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps, while logging every applied configuration change. Throughput-constrained behavior is characterized by sweeping target scoring rates and reporting (i) coverage (the fraction of entities meeting the target) and (ii) the best AUC-PR achievable among measured ladder configurations satisfying the target. On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above the random baseline across the full throughput sweep. Several deep methods lose feasibility before they lose accuracy.

[887] Cardiac Stability Theory: An Axiomatically Grounded Framework for Continuous Cardiac Health Monitoring via Smartphone Photoplethysmography

Timothy Oladunni, Farouk Ganiyu Adewumi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present Cardiac Stability Theory (CST), an axiomatically grounded framework formally defining cardiovascular health as a stability margin around a cardiac dynamical attractor. From four axioms we derive the Cardiac Stability Index (CSI), a composite scalar in [0,1] integrating the largest Lyapunov exponent, recurrence determinism, and signal entropy via time-delay embedding. The ECG-based model (CSISurrogateV2, CNN-Transformer) achieves $R^2=0.8788$, MAE$=0.0234$ on PTB-XL (21,799 recordings). We extend CSI to smartphone PPG via Complementary Domain Transfer (CDT): CSISurrogateV2 generates pseudo-labels for the BUT PPG dataset (48 recordings, 12 subjects), training TinyCSINet (122,849 parameters), achieving MAE$=0.0557$, $ρ=0.660$ on the held-out test set ($n=1065$ windows) at ${<}30$ ms mobile latency. CDT is validated on BIDMC, Welltory, and RWS-PPG. Paired validation on 5,035 BIDMC windows yields $r=0.454$ ($ρ=0.485$, $p<10^{-295}$), confirming correlated cardiac stability across modalities. CSI is negatively correlated with age (slope $= -0.000225$ CSI/year, PTB-XL), discriminates atrial fibrillation from normal sinus rhythm (AUROC$=0.89$), and is robust under Perturbation Invariance Training (max AUC drop 1.65%). We derive HeartSpan, a longitudinal stability metric relative to population age norms, enabling continuous non-invasive cardiac monitoring from commodity smartphones for longevity tracking and cardiac risk stratification.

[888] Geometry Preserving Loss Functions Promote Improved Adaptation of Blackbox Generative Model

Sinjini Mitra, Constantine Kyriakakis, Shenyuan Liang, Anuj Srivastava, Pavan Turaga

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Adaptation of blackbox generative models has been widely studied recently through the exploration of several methods including generator fine-tuning, latent space searches, leveraging singular value decomposition, and so on. However, adapting large-scale generative AI tools to specific use cases continues to be challenging, as many of these industry-grade models are not made widely available. The traditional approach of fine-tuning certain layers of a generative network is not feasible due to the expense of storing and fine-tuning generative models, as well as the restricted access to weights and gradients. Recognizing these challenges, we propose a novel end-to-end pipeline aimed at domain adaptation by leveraging geometry-preserving loss functions in conjunction to pre-trained generative adversarial networks (GANs). Our method rethinks the problem of adaptation by re-contextualizing the role of GAN inversion in obtaining accurate latent space representations. Extending the ability of existing state-of-the-art inverters, we preserve pair-wise distances between tangent spaces to successfully train a latent generative model to produce samples from the target distribution. We evaluate our proposed pipeline on StyleGANs with real distribution shifts and demonstrate that the introduction of the geometry preserving loss function lends to improved adaptation of generative models compared to other traditional loss functions.

[889] Machine Learning and Deep Learning Models for Short Term Electricity Price Forecasting in Australia’s National Electricity Market

Wei Lu, Jay Wang, Dingli Duan, Ding Mao, Caiyi Song, John Huang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Short term electricity price forecast is essential in competitive power markets, yet electricity price series exhibit high volatility, irregularity, and non-stationarity. This phenomenon is pronounced in the South Australian region of the National Electricity Market, where high renewable penetration drives price volatility and frequent negative price intervals, while structural changes such as the transition to five-minute settlement further complicate forecast. To address these challenges, this study develops a unified benchmark framework. Under identical data preprocessing, feature engineering with lag features, rolling statistics, cyclic temporal encodings, and so on, and an 85% to 15% chronological train test split, six algorithms are systematically compared, including AWMLSTM, CatBoost, GBRT, LSTM, LightGBM, and SVR. The results show that for price prediction, tree-based models, especially GBRT with an R squared value of 0.88, generally outperform LSTM and SVR. However, all models achieve a mean absolute percentage error above 90%, and more than 65% of GBRT predictions have relative errors above 10%, which highlights the inherent difficulty of price forecast. For demand prediction, all models perform substantially better than in price prediction. AWMLSTM and GBRT achieve an R2 value of 0.96 with mean absolute percentage error below 32%, and GBRT has 74.37% of samples within 5% error, while LSTM and SVR perform less accurately in both tasks. Future improvements should focus on hybrid models such as tree plus transformers, data augmentation for extreme events, and error correction to better capture price spikes.

[890] Gromov-Wasserstein Methods for Multi-View Relational Embedding and Clustering

Rafael Pereira Eufrazio, Eduardo Fernandes Montesuma, Charles Casimiro Cavalcante

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Learning low-dimensional representations from multi-view relational data is challenging when underlying geometries differ across views. We propose Bary-GWMDS, a Gromov-Wasserstein-based method that operates directly on distance matrices to learn a consensus embedding preserving shared relational structure. By leveraging intrinsic distances, the approach naturally handles nonlinear distortions across views. We also introduce Mean-GWMDS-C, a clustering-oriented formulation that averages distance matrices and learns reduced-support representations via a consensus Gromov-Wasserstein transport. Experiments on synthetic and real-world datasets show that the proposed framework yields stable and geometrically meaningful embeddings.

[891] Crystal structure prediction using graph neural combinatorial optimization

Stavros Gerolymatos, J. Kyle Brubaker, Martin J. A. Schuetz, Vladimir V. Gusev

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Crystalline materials are widely used in technological applications, yet their discovery remains a significant challenge. As their properties are driven by structure, crystal structure prediction (CSP) methods play a central role in computational approaches aiming to accelerate this process. Previously, CSP has been approached from a combinatorial optimization perspective, with the core challenge of allocating atoms on a fine grid of predefined discrete positions within a unit cell while minimizing their interaction energy. Exact mathematical optimization methods provide guaranteed solutions, but they become computationally expensive for large-scale instances, where the atomic configuration space grows rapidly, particularly in the absence of additional symmetry constraints. In this work, we introduce a neural combinatorial optimization approach to the atom allocation challenge and, subsequently, CSP, based on graph neural networks (GNNs), which can effectively sample from the distribution of feasible structures in an unsupervised manner. We leverage expander graphs to construct computational graphs over discrete positions that capture both short- and long-range interactions between atoms, and employ the Gumbel-Sinkhorn approach to enforce the desired stoichiometry of the generated structures. We demonstrate that our method outperforms classical heuristic approaches and is competitive with a commercial optimization solver across a range of chemical compositions. This enables the use of ever-expanding GPU infrastructure to tackle the inherent combinatorial challenges of CSP, paving the way for scaling beyond current capabilities.

[892] Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson’s Disease Detection

Nicholas R. Rasmussen, Longwei Wang, Rodrigue Rizk, Md Rezwanul Akter Pallab, Samuel Stuwart, Martina Mancini, Arun Singh, KC Santosh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Developing robust and clinically reliable EEG biomarkers requires evaluation frameworks that explicitly address cross population generalization in multi site settings such as Parkinsons disease (PD) detection. Models trained under i.i.d. assumptions often capture population specific artifacts rather than disease relevant neural structure, leading to poor generalization across clinical cohorts. EEG further amplifies this challenge due to low signal to noise ratio and heterogeneous acquisition conditions. We propose a population aware evaluation framework to assess the robustness and clinical reliability of EEG biomarkers under distribution shift. Using an n gram expansion strategy, we enumerate all cross population train test configurations across five independent cohorts, resulting in 75 directional evaluations. A nested cross validation design with integrated channel selection ensures prospective biomarker identification without population leakage. Results show that cross population transfer is asymmetric and that both accuracy and biomarker stability improve with increasing training population diversity, achieving up to 94.1% accuracy on held out cohorts. A theoretical analysis based on mixture risk optimization and hypothesis space contraction explains these trends, showing that multi population training promotes population robust representations. This work establishes a principled framework for learning robust, generalizable, and clinically reliable EEG biomarkers for multi site biomedical applications.

[893] Task-guided Spatiotemporal Network with Diffusion Augmentation for EEG-based Dementia Diagnosis and MMSE Prediction

Xiaoyu Zheng, Xu Tian, Bin Jiao, Kunbo Cui, Hanhe Lin, Lu Shen, Jin Liu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Patients with dementia typically exhibit cognitive impairment, which is routinely assessed using the Mini-Mental State Examination (MMSE). Concurrently, their underlying neurophysiological abnormalities are reflected in Electroencephalography (EEG), providing a basis for joint modeling. However, traditional multi-task approaches suffer from feature entanglement, which leads to inter-task interference when handling heterogeneous objectives.To address this challenge, we propose a task-guided spatiotemporal network (TGSN) with diffusion augmentation for EEG-based dementia diagnosis and MMSE prediction. Specifically, TGSN integrates a multi-band feature fusion module to capture complementary spectral information from EEG. Meanwhile, a pre-trained data augmentation module utilizing a diffusion process is introduced toincrease sample diversity. To model the complex spatiotemporal patterns of EEG, we propose a gated spatiotemporal attention module that captures long-range spatial dependencies and temporal dynamics. Moreover, we design a task-guided query module to achieve task-specific feature extraction, thereby mitigating task interference. The effectiveness of TGSN is evaluated on the XY02 dataset. Experimental results demonstrate that the proposed network outperforms several state-of-the-art methods, achieving classification accuracies of 97.78% for Alzheimer’s Disease (AD)/Frontotemporal Dementia (FTD) and 83.93% for AD/FTD/Vascular Cognitive Impairment (VCI), which exceed the best baselines by 16.39% and 8.28%, respectively. In parallel, it reduces the RMSE for MMSE prediction to 1.93 and 2.38, achieving significant error reductions of 1.44 and 1.43 compared to the best baselines. Additionally, validation on the DS004504 dataset demonstrates strong cross-dataset generalization…

[894] DecompKAN: Decomposed Patch-KAN for Long-Term Time Series Forecasting

Naveen Mysore

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate time series forecasting in scientific domains such as climate modeling, physiological monitoring, and energy systems benefits from both competitive predictions and model transparency. This work proposes DecompKAN, a lightweight attention-free architecture that combines trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline Kolmogorov-Arnold Network (KAN) edge functions. Each KAN edge learns an explicit, inspectable 1D scalar function over learned patch-embedding coordinates that can be directly visualized. On standard benchmarks, DecompKAN achieves best or tied-best MSE on 15 of 32 dataset-horizon combinations among selected published baselines, and achieves best or tied-best MSE on 20 of 36 comparisons under a controlled same-recipe evaluation across 9 datasets including the physiological PPG-DaLiA benchmark. The architecture shows particular strength on datasets with smooth temporal dynamics (Solar -17%, ECL -10% vs. iTransformer, Weather) and physiological time series. Visualization of learned edge functions reveals qualitatively different latent nonlinearities across domains. Ablation analysis shows that the architectural pipeline (decomposition, patching, normalization) drives performance more than the choice of nonlinear layer, while the KAN formulation enables inspection of learned latent transformations.

[895] Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly (3.4\times \pm 0.5\times) on average across seeds; in the most pronounced case, coverage drops from (0.92) to (0.61), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size (m = 200). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.

[896] Hindsight Preference Optimization for Financial Time Series Advisory

Yanwei Cui, Guanghui Wang, Xing Zhang, Peiyang He, Ziyuan Li, Bing Zhu, Wei Qiu, Xusheng Wang, Zheng Yu, Anqi Xin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Time series models predict numbers; decision-makers need advisory – directional signals with reasoning, actionable suggestions, and risk management. Training language models for such predictive advisory faces a fundamental challenge: quality depends on outcomes unknown at prediction time. We bridge two ideas from reinforcement learning – using information unavailable during execution to retrospectively generate training signal, and preference alignment – and propose Hindsight Preference Optimization: observed outcomes let an LLM judge rank candidate advisories on dimensions that scalar metrics cannot capture, producing preference pairs for DPO without human annotation. We apply this to Vision-Language-Model-based predictive advisories on S&P 500 equity time series, demonstrated by a 4B model outperforming its 235B teacher on both accuracy and advisory quality.

[897] Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction

Yuto Tanaka, Issei Sato

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent work on large language models (LLMs) has emphasized the importance of scaling inference compute. From this perspective, the state-of-the-art method Scattered Forest Search (SFS) has been proposed, employing Monte Carlo Tree Search with carefully crafted initial seeds and textual optimization for multi-turn code correction. However, its complexity makes it unclear what factors contribute to improvements in inference performance. To address this problem, we analyze SFS and propose a simpler method, Iterative Refinement of Textual Directions (IRTD), which fixes initial codes and iteratively refines textual directions. Because of the simplicity of IRTD, we theoretically establish the safety of IRTD using Oracle-Guided Inductive Synthesis (OGIS). Experiments on several code generation benchmarks suggest that IRTD achieves inference performance comparable to state-of-the-art methods. These results indicate that, even without complex search structures, refining initial codes with high-quality textual directions alone can effectively improve inference performance.

[898] When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models

Danny Wang, Ruihong Qiu, Zi Huang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Discrete diffusion language models (dLLMs) enable parallel token updates with bidirectional attention, yet practical generation typically adopts blockwise semi-autoregressive decoding. This switch creates a training-inference mismatch: training denoises with full-sequence context, while inference commits tokens within a bounded block without future context. Therefore, decoding with fixed-size or heuristic-based blocks can lead to premature token commitments, as decisions are made without full access to future context that could alter those choices. Motivated by this, we propose self-containedness as a principled criterion for block commitment. A block is self-contained if its predictions remain consistent with Future-Aware (FA) or without No-Future (NF) access to future context, reframing block boundary selection as a test of self-containedness rather than a heuristic choice. Based on this principle, we introduce Variable-size Self-contained Blocks (VSB) for dLLMs. VSB scores and selects block boundaries using the divergence between token-level predictive distributions under NF and FA conditioning, which quantifies how predictions would change if future context were revealed. We provide theoretical justification linking self-containedness to predictive consistency, and extensive experiments validate VSB’s efficacy over fixed-size and heuristic blockwise decoding.

[899] TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher’s effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule.Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher’s performance and generalize to tasks on which the teacher fails.

[900] Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels, hidden dimensions with unusually large activations, causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.

[901] FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection

Yutong He, Zhengyang Huang, Jiahe Geng

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Federated learning enables a population of clients to collaboratively train machine learning models without exchanging their raw data, but standard algorithms such as FedAvg suffer from slow convergence and high communication and memory costs in heterogeneous, resource-constrained environments. We introduce FedSLoP, a federated optimization algorithm that combines stochastic low-rank subspace projections of gradients, thereby reducing the dimension of communicated and stored updates while preserving optimization progress. On the theoretical side, we develop a detailed nonconvex convergence analysis under standard smoothness and bounded-variance assumptions, showing that FedSLoP is guaranteed to converge to a first-order stationary point at a rate of $O(1/\sqrt{NT})$. On the empirical side, we conduct extensive experiments on federated MNIST classification with heterogeneous data partitions, showing that FedSLoP substantially reduces communication volume and client-side memory while achieving competitive or better accuracy compared with FedAvg and representative sparse or low-rank baselines. Together, our results demonstrate that random subspace momentum methods such as FedSLoP provide a principled and effective approach to communication- and memory-efficient federated learning. Codes are available at: https://github.com/pkumelon/FedSLoP.git.

[902] FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training

Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data slicing based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state of the art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed Flash-Overlap that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.

[903] Geometry-Aware Offline-to-Online Learning in Linear Contextual Bandits

Zean Han, Ruihan Lin, Zezhen Ding, Jiheng Zhang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We study offline-to-online learning in linear contextual bandits with biased offline regression data: the offline parameter need not match the online one, so history should not be treated as a single warm start. We model directional transfer with a shift certificate $(M_{\mathrm{shift}},ρ)$ and offline ridge estimation, yielding a geometry-aware confidence region for the online parameter rather than an isotropic radius. We propose \emph{Ellipsoidal-MINUCB}, which combines a standard online branch with an offline-informed pooled branch and uses offline information only when it tightens uncertainty. With high probability, regret is bounded by the minimum of a standard SupLinUCB-style fallback and a pooled term that separates statistical width from a certificate-weighted shift penalty. Under a simple alignment condition, the pooled term further simplifies to a rate governed by an effective dimension induced by the offline geometry. We also show that a purely Euclidean (scalar) shift bound, by itself, does not determine which feature directions are transferable. Beyond this fixed certificate, we show how to learn a data-driven certificate from data at finitely many refresh times and establish a high-probability regret bound for Ellipsoidal-MINUCB with epoch-wise learned certificates. Experiments match the main prediction: gains are strongest at intermediate horizons when offline coverage and transferability align, while the method otherwise tracks the safe online baseline.

[904] A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

Jun Shu, Junxiong Jia, Deyu Meng, Zongben Xu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Emergent intelligence have played a major role in the modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function E(N, P, K), dependent on data size N, model size P and training steps K, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the necessary and sufficient conditions for existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools of Lipschitz operator and covering number. Theoretical results show that: 1) emergent intelligence is governed by three key factors-training steps, data size and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings. 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.

[905] AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents

Hojoon Kim, Yuheng Wu, Thierry Tambe

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per-step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per-step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi-agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache-based plan reuse thus offers a practical path to low-latency, low-cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.

[906] End-to-End Learning for Partially-Observed Time Series with PyPOTS

Wenjie Du, Yiyuan Yang, Tianxiang Zhan, Qingsong Wen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Partially-observed time series (POTS) is ubiquitous in real-world applications, yet most existing toolchains separate missing-value handling from downstream learning, which limits reproducibility and overall performance. This tutorial introduces PyPOTS, an open-source Python ecosystem for end-to-end data mining and machine learning on POTS. We present practical workflows spanning missingness simulation, data preprocessing, model training, and evaluation across core tasks, including imputation, forecasting, classification, clustering, and anomaly detection. The tutorial consists of two parts: Part I emphasizes hands-on application for practitioners through unified APIs and benchmark-oriented experiments. Part II targets developers and researchers, focusing on extending PyPOTS with custom models, domain-specific constraints, and contribution-ready engineering practices. Participants will gain both conceptual understanding and implementation experience for building robust, transparent, and reusable POTS pipelines in research and production settings. PyPOTS is publicly available at https://github.com/WenjieDu/PyPOTS

[907] Generalising maximum mean discrepancy: kernelised functional Bregman divergences

Russell Tsuchida, Frank Nielsen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Bregman divergences play a pivotal role in statistics, machine learning and computational information geometry. Particularly in the context of machine learning, they are central to clustering, exponential families, parameter estimation and optimisation, among other things. Despite this, the full toolkit of Hilbert spaces and in particular reproducing kernel Hilbert spaces have not been systematically developed and applied to functional Bregman divergences, where points are functions rather than finite-dimensional parameter vectors. While other types of functional Bregman divergences have been studied, these are typically in a Banach space rather than more directly aligned with kernel methods and Hilbert-space geometry commonly used in machine learning. We consider functional Bregman divergences on a Hilbert space, where the self-dual pairing and Riesz representer afford us particularly convenient calculus. Further specialising Bregman generators as a composition involving a kernel mean embedding makes such divergences easy to estimate. We discuss applications in clustering, universal estimation, robust estimation and generative modelling, and contrast our approach with other types of Bregman divergences.

[908] FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost

Chenhao Feng, Haoli Zhang, Shakhzod Ali-Zade, Yanli Zhao, Liang Luo, Jennifer Cao, Lisen Deng, Siqiao Chen, Chenyu Zhao, Tristan Rice, Daniel Johnson, Min Si, Tiantu Xu, Yi Zhang, Siqi Yan, Chuanhao Zhuge, Min Ni, Bi Xue, Qunshu Zhang, Shen Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Modern industrial Deep Learning Recommendation Models typically extract user preferences through the analysis of sequential interaction histories, subsequently generating predictions based on these derived interests. The inherent heterogeneity in data characteristics frequently result in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the straggler problem through meticulously load balanced input samples (2) minimize the blocking communication by overlapping prioritized embedding communications with computations (3) resolve the GPU resource competition during computation and communication overlapping by communicating through SM-Free techniques. Empirical evaluation demonstrates that FreeScale achieves up to 90.3% reduction in computational bubbles when applied to real-world workloads running on 256 H100 GPUs.

[909] Explaining Temporal Graph Predictions With Shapley Values

Lea-Marie Sussek, Stefan Heindorf

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Temporal Graph Neural Networks (TGNNs) have become increasingly popular in recent years due to their superior predictive performance by combining both spatial and temporal information. However, how these models utilize the information to make predictions is rather unexplored, leading to potentially faulty or biased models. This work introduces two novel model-agnostic explainers for local explanations of TGNNs based on Shapley and Owen values. The first method, an event-level (edge-level) Shapley explainer, applies the KernelSHAP algorithm to estimate contribution scores for individual temporal events, providing interpretable descriptions for model behavior. The second, a feature-level Shapley explainer, extends this framework by decomposing event-level Shapley values into Owen values, and thereby uncovers hierarchical dependencies of the event and its features. The explainers outperform SOTA explainers on different metrics and datasets. Additionally, the Feature Explainer reveals a faulty extraction of actual timestamps of a commonly used TGAT implementation, helping to further understand performance drops on very sparse explanations.

[910] Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification

June-Woo Kim, Miika Toikkanen, Heejoon Koo, Yoon Tae Kim, Doyoung Kwon, Kyunghoon Kim

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Training reliable respiratory sound classification models remains challenging due to the limited size and subject diversity of datasets. Ensemble methods can improve robustness, but when base models are trained on identical data, models tend to overfit and produce highly correlated predictions, thereby reducing the effectiveness of ensembling. In this work, we investigate a meta-ensemble learning methodology that enhances prediction diversity by training base models on diverse data splits and combining their outputs through a trained meta-model. Specifically, we train base models on the ICBHI dataset using two data split settings: fixed 80-20% split and five-fold cross-validation split, under two data granularity settings: patient- and sample-level. The resulting diversity in base model predictions enables the meta-model to better generalize. Our approach achieves new state-of-the-art performance on the ICBHI benchmark, reaching a Score of 66.49% and showing improved generalization on two out-of-distribution datasets, indicating its potential applicability to real-world clinical data.

[911] Fed-DLoRA: Efficient Wireless Federated Learning with Dynamic Low-Rank Adaptation

Huaicheng Li, Junhui Zhao, Haoyu Quan, Xiaoming Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Federated learning (FL) offers a promising distributed learning paradigm for internet of vehicles (IoV) applications. However, it faces challenges from communication overhead and dynamic environments. Model compression techniques reduce computing and communication burden yet create trade-offs between compression ratios and vehicle participation strategies. In this paper, we propose a lightweight FL algorithm named federated learning with dynamic low-rank adaptation (Fed-DLoRA), which is combined with low-rank adaptation (LoRA) to effectively reduce parameters and communication costs while enhancing training efficiency. The convergence analysis of Fed-DLoRA is conducted through stochastic gradient descent optimization coupled with singular value decomposition. This analysis establishes the theoretical relationships among LoRA rank, vehicular scheduling strategies and the model’s convergence characteristics. Building on these insights, we formulate a joint optimization problem aimed at maximizing system performance. To address this problem, we propose an adaptive rank, bandwidth and vehicle selection (ARBVS) algorithm that integrates enumeration with greedy optimization strategies. The algorithm provides efficient rank selection and resource scheduling strategies for each FL communication round, thereby achieving effective performance improvements for the FL system. Experimental results demonstrate that Fed-DLoRA achieves superior performance compared to conventional federated learning approaches, exhibiting enhanced accuracy, faster convergence, and improved communication efficiency.

[912] Leveraging Human Feedback for Semantically-Relevant Skill Discovery

Maxence Hussonnois, Thommen George Karimpanal, Santu Rana

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Unsupervised skill discovery in reinforcement learning aims to intrinsically motivate agents to discover diverse and useful behaviours. However, unconstrained approaches can produce unsafe, unethical, or misaligned behaviours. To mitigate these risks and improve the practical desireability of discovered skills, recent work grounds the discovery process by leveraging human preference feedback. However, preference-based approaches are feedback-inefficient and inherently ill-equipped to deal with skill spaces composed of a variety of different skills such as running, jumping, walking, etc. To overcome this limitation, we introduce semantic labelling, a novel and feedback-efficient approach that leverages human cognitive strengths to identify and label semantically meaningful behaviours. Based on semantic labelling, we propose Semantically Relevant Skill Discovery (SRSD), a novel human-in-the-loop approach that collects semantic labels from human feedback and learns a reward function to encourage skills to be more semantically diverse and relevant. Through our experiments in a 2D navigation environment and four locomotion environments, we demonstrate that SRSD can improve semantic diversity and discover relevant behaviours while scaling effectively to a large variety of behaviours.

[913] Machine-Learning-Based Classification of Radio Frequency Building Loss

Jiayi Tan, Neelabhro Roy, James Gross, Rohit Chandra, Tsao-Tsen Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate modeling of outdoor-to-indoor (O2I) and indoor-to-indoor (I2I) signal loss is important for improving indoor wireless network performance in dense urban areas. Traditional on-site measurements are expensive, time-consuming, and difficult to conduct across wide regions. Real-world datasets also tend to be noisy and imbalanced, which makes signal loss prediction challenging. This study presents a machine learning framework for classifying radio frequency (RF) building loss. The framework combines passively collected, crowdsourced user equipment (UE) data from 3GPP-compliant networks with public building information. We evaluated Random Forest, XGBoost, LightGBM, and a voting classifier using both supervised (SL) and semi-supervised learning (SSL). Compared to SL-only inference, the proposed SL and SSL framework improved both prediction accuracy and confidence under identical data constraints, achieving up to 12.6% relative accuracy gain for O2I loss and 3.4% for I2I loss, while reducing prediction entropy by up to 8.4%. Among the evaluated models, SSL XGBoost provided the most confident O2I loss classification, whereas SSL LightGBM achieved the best performance for I2I loss. These results demonstrate that the proposed approach provides a practical, data-driven alternative to traditional models, with promising potential to support better network planning and indoor coverage optimization.

[914] Progressive Approximation in Deep Residual Networks: Theory and Validation

Wei Wang, Xiao-Yong Wei, Qing Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The Universal Approximation Theorem (UAT) guarantees universal function approximation but does not explain how residual models distribute approximation across layers. We reframe residual networks as a layer-wise approximation process that builds an approximation trajectory from input to target, and prove the existence of progressive trajectories where error decreases monotonically with depth. It reveals that residual networks can implement structured, step-by-step refinement rather than end-to-end (E2E) black-box mapping. Building on this, we propose Layer-wise Progressive Approximation (LPA), a theoretically grounded training principle that explicitly aligns each layer with its residual target to realize such trajectories. LPA is architecture-agnostic: we observe progressive behavior in residual FNNs, ResNets, and Transformers across tasks including complex surface fitting, image classification, and NLP with LLMs for generation and classification. Crucially, this enables ``train once, use $N$ models": a single network yields useful predictions at every depth, supporting efficient shallow inference without retraining. Our work unifies approximation theory with practical deep learning, providing a new lens on representation learning and a flexible framework for multi-depth deployment. The source code will be released unpon acceptance at https://(open_upon_acceptance).

[915] Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment

Wenzhe Xu, Biao Liu, Yiyang Sun, Xin Geng, Ning Xu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multi-Objective Alignment aims to align Large Language Models (LLMs) with diverse and often conflicting human values by optimizing multiple objectives simultaneously. Existing methods predominantly rely on static preference weight construction strategies. However, rigidly aligning to fixed targets discards valuable intermediate information, as training responses inherently embody valid preference trade-offs even when deviating from the target. To address this limitation, we propose Meal, i.e., MEta ALigner, a bi-level meta-learning framework enabling bidirectional optimization between preferences and policy responses, generating instructive dynamic preferences for steadier training. Specifically, we introduce a preference-weight-net as a meta-learner to generate adaptive preference weights based on input prompts and update the preference weights as learnable parameters, while the LLM policy acts as a base-learner optimizing response generation conditioned on these preferences with rejection sampling strategy. Extensive empirical results demonstrate that our method achieves superior performance on several multi-objective benchmarks, validating the effectiveness of the dynamic bidirectional preference-policy optimization framework.

[916] CMGL: Confidence-guided Multi-omics Graph Learning for Cancer Subtype Classification

Boyang Fan, Hengchuang Yin, Siyu Yi, Yifan Wang, Zhicheng Li, Leijiyu Zhou, Jiancheng Lv, Wei Ju

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Motivation: Multi-omics integration can improve cancer subtyping, but modality informativeness and noise vary across cancer types and patients. Existing graph-based methods optimize modality weights jointly with the classification objective and therefore lack independent reliability estimates, so low-quality omics distort patient similarity graphs and amplify noise through message passing. Results: We propose CMGL, a two-stage framework that estimates per-sample modality reliability through evidential deep learning and uses the frozen confidence scores to guide cross-omics fusion and graph construction. On four MLOmics cancer-subtype tasks and the 32-class pan-cancer task, CMGL consistently improves over the strongest baseline, surpassing it by 4.03% in average accuracy on the four single-cancer tasks. Its representations recover the PAM50 intrinsic subtypes of breast invasive carcinoma (BRCA), and the BRCA-trained model transfers without fine-tuning to kidney renal clear cell carcinoma (KIRC), stratifying patients into prognostically distinct groups.

[917] IMPA-Net: Meteorology-Aware Multi-Scale Attention and Dynamic Loss for Extreme Convective Radar Nowcasting

Haofei Cui, Guangxin He, Juanzhen Sun, Jingjia Luo, Haonan Chen, Xiaoran Zhuang, Mingxuan Chen, Xian Xiao

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Short-range prediction of convective precipitation from weather radar observations is essential for severe weather warnings. However, deep learning models trained with pixel-wise error metrics tend to produce overly smooth forecasts that suppress intense echoes critical for hazard detection. This issue is exacerbated by insufficient multi-scale feature interaction and suboptimal fusion of heterogeneous geophysical inputs. We propose IMPA-Net (Integrated Multi-scale Predictive Attention Network), a deterministic 0-2 hour nowcasting framework that addresses these limitations through meteorologically-informed designs at the input, architecture, and loss function levels. A parameter-free Spatial Mixer reorganizes heterogeneous input channels at the mesoscale-$γ$ neighborhood (~2 km) via deterministic channel permutation, providing a structured cross-field prior. An integrated multi-scale predictive attention module serves as the spatiotemporal translator, capturing dynamics from mesoscale-$β$ to mesoscale-$γ$ scales. A Meteorologically-Aware Dynamic Loss employs three-level asymmetric weighting – adapting across training epochs, storm intensity, and forecast lead time – to counteract regression-to-the-mean. Evaluated against seven baselines on a multi-source radar dataset over eastern China, IMPA-Net raises the Heidke Skill Score at $\geq$45 dBZ from 0.049 (SimVP baseline) to 0.143 under matched settings. Relative to pySTEPS, it provides a better trade-off between severe-event detection and false-alarm control. Spectral analysis confirms preserved energy across mesoscale bands where competing methods show progressive smoothing. These improvements are shown within a single domain and convective regime; generalizability to other orographic and climatic regions remains to be tested.

[918] GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models

Yiming Zhang, Sitong Liu, Ke Li, Zhihong Wu, Alex Cloninger, Melvin Leok

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion models are a leading paradigm for data generation, but training-free editing typically re-runs the full denoising trajectory for every edit strength, making iterative refinement expensive. To address this issue, we instead edit near the data manifold, where small local updates can replace repeated re-synthesis. To enable this, we estimate a local manifold tangent space directly from perturbed samples and prove that this sample-based estimator closely approximates the true tangent. Building on this guarantee, we devise a Jacobian-free algorithm that constructs a tangent frame via small perturbations to the initial noise and alternates small tangent moves with diffusion-based projections. Updates within this frame follow principled on-manifold directions while suppressing off-manifold drift, enabling fine-grained edits without full re-diffusion or additional training. Edit strength is controlled by the number of steps for rapid, continuous adjustments that preserve fidelity and plug into existing samplers. Empirically, the resulting tangent directions yield smooth, semantic unsupervised traversals and effective CLIP-guided optimization, demonstrating practical interactive continuous editing.

[919] BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment

Md. Ashiq Ul Islam Sajid, Mohammad Sakib Mahmood, Md. Tareq Hasan, Md Abdur Rahim, Rafat Ara, Md. Arafat Hossain

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The deployment of intelligent reinforcement learning (RL) agents on resource-constrained edge devices remains a fundamental challenge due to the substantial memory, computational, and energy requirements of modern deep learning systems. While large language models (LLMs) have emerged as powerful architectures for decision-making agents, their multi-billion parameter scale confines them to cloud-based deployment, raising concerns about latency, privacy, and connectivity dependence. We introduce BitRL, a framework for building RL agents using 1-bit quantized language models that enables practical on-device learning and inference under severe resource constraints. Leveraging the BitNet b1.58 architecture with ternary weights (-1, 0, +1) and an optimized inference stack, BitRL achieves 10-16x memory reduction and 3-5x energy efficiency improvements over full-precision baselines while maintaining 85-98 percent of task performance across benchmarks. We provide theoretical analysis of quantization as structured parameter perturbation, derive convergence bounds for quantized policy gradients under frozen-backbone architectures, and identify the exploration-stability trade-off in extreme quantization. Our framework systematically integrates 1-bit quantized language models with reinforcement learning for edge deployment and demonstrates effectiveness on commodity hardware.

[920] Model-Free Inference of Investor Preferences: A Relative Entropy IRL Approach

Chen Xu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present a framework using Relative Entropy Inverse Reinforcement Learning (RE-IRL) to recover investor reward functions from observed investment actions and market conditions. Unlike traditional IRL algorithms, RE-IRL is employed to account for environments where transition probabilities are unknown or inaccessible. To address the challenge of data sparsity, we utilize a $K$-nearest neighbor approach to estimate the observed behavior policy. Furthermore, we propose a statistical testing framework to evaluate the validity and robustness of the estimated results.

[921] Latent-Hysteresis Graph ODEs: Modeling Coupled Topology-Feature Evolution via Continuous Phase Transitions

Qinhan Hou, Jing Tang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Graph neural ordinary differential equations (Graph ODEs) extend graph learning from discrete message-passing layers to continuous-time representation flows. While it supports adaptive long-range propagation, we show that Graph ODEs with strictly positive irreducible mixing operators face an inherent \emph{monostability trap}: in the long-time regime, information leakage is unavoidable and the dynamics converge to a single global consensus attractor. We propose the \textbf{Hysteresis Graph ODE (HGODE)}, which couples feature evolution with a latent topological potential driven by a learned pairwise force. A double-well edge potential and bipolarized gate allow edge states to polarize into connected or insulated phases while preserving differentiability. We provide asymptotic analysis of the collapse mechanism and the proposed hysteretic topology dynamics, and validate HGODE on theory-driven synthetic diagnostics and real-world graph benchmarks.

[922] SolarTformer: A Transformer Based Deep Learning Approach for Short Term Solar Power Forecasting

Ankan Basu, Jyotiraditya Roy, Aditya Datta, Prayas Sanyal, Sumanta Banerjee

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate forecasting of solar power output is essential for efficient integration of renewable energy into the grid. In this study, an attention-based deep learning model, inspired by transformer architecture, is used for short-term solar power forecasting. Our proposed model, “SolarTformer”, is designed to predict solar power output from meteorological data. Unlike traditional models, SolarTformer leverages self-attention mechanisms to effectively capture temporal dependencies and spatial variability in solar irradiance. In addition, the proposed methodology includes feeding power station-specific metadata into the model, which helps to generalize between power stations located at different locations and with different panel configurations and in different seasons. Our experiments demonstrate that SolarTformer significantly outperforms previous models on the same data set. In particular, the model exhibits strong performance on both clear and cloudy days, indicating high robustness and generalizability. These findings highlight the potential of attention-based architectures in enhancing the accuracy of solar forecasting, contributing to a more reliable management of renewable energy.

[923] Self-Abstraction Learning for Effective and Stable Training of Deep Neural Networks

Wonyong Cho, Taemin Kim, Jungmin Kim, Jeong-Rae Kim, Sung Hoon Jung

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Training large-scale deep neural networks effectively and stably is essential for applying deep learning across various fields. However, conventional methods, which rely on training a single large network, often encounter challenges such as gradient vanishing, overfitting and unstable learning. To overcome these limitations, we introduce Self-Abstraction Learning (SAL), a hierarchical framework. In SAL, networks are arranged by structural complexity, where the simplest topmost network is trained first and its hidden and output layers serve as guidance for the successively more complex networks below. This top-down sequential guidance effectively mitigates optimization issues, enabling stable training of deep architectures. Various experiments across MLP, CNN, and RNN architectures demonstrate that SAL consistently outperforms conventional methods, ensuring robust generalization even in data-scarce and complex network regimes.

[924] Mitigating Error Amplification in Fast Adversarial Training

Mengnan Zhao, Lihe Zhang, Bo Wang, Tianhang Zheng, Hong Zhong, Geyong Min

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Fast Adversarial Training (FAT) has proven effective in enhancing model robustness by encouraging networks to learn perturbation-invariant representations. However, FAT often suffers from catastrophic overfitting (CO), where the model overfits to the training attack and fails to generalize to unseen ones. Moreover, robustness oriented optimization typically leads to notable performance degradation on clean inputs, and such degradation becomes increasingly severe as the perturbation budget grows. In this work, we conduct a comprehensive analysis of how guidance strength affects model performance by modulating perturbation and supervision levels across distinct confidence groups. The findings reveal that low confidence samples are the primary contributors to CO and the robustness accuracy trade off. Building on this insight, we propose a Distribution-aware Dynamic Guidance (DDG) strategy that dynamically adjusts both the perturbation budget and supervision signal. Specifically, DDG scales the perturbation magnitude according to the sample confidence at the ground truth class, thereby guiding samples toward consistent decision boundaries while mitigating the influence of learning spurious correlations. Simultaneously, it dynamically adjusts the supervision signal based on the prediction state of each sample, preventing overemphasis on incorrect signals. To alleviate potential gradient instability arising from dynamic guidance, we further design a weighted regularization constraint. Extensive experiments on standard benchmarks demonstrate that DDG effectively alleviates both CO and the robustness accuracy trade off.

[925] Perfecting Aircraft Maneuvers with Reinforcement Learning

Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Seyyid Osman Sevgili, Ümit Can Bekar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper evaluates an advanced jet trainer’s utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulated using reinforcement learning (RL) agents, which will serve as a training tool for future pilots.

[926] Unveiling the Backdoor Mechanism Hidden Behind Catastrophic Overfitting in Fast Adversarial Training

Mengnan Zhao, Lihe Zhang, Tianhang Zheng, Bo Wang, Baocai Yin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Fast Adversarial Training (FAT) has attracted significant attention due to its efficiency in enhancing neural network robustness against adversarial attacks. However, FAT is prone to catastrophic overfitting (CO), wherein models overfit to the specific attack used during training and fail to generalize to others. While existing methods introduce diverse hypotheses and propose various strategies to mitigate CO, a systematic and intuitive explanation of CO remains absent. In this work, we innovatively interpret CO through the lens of backdoor. Through validations on pathway division, diverse feature predictions, and universal class distinguishable triggers in CO, we conceptualize CO as a weak trigger variant of unlearnable tasks, unifying CO, backdoor attacks, and unlearnable tasks under a common theoretical framework. Guided by this, we leverage several backdoor inspired strategies to mitigate CO: (i) Recalibrate CO affected model parameters using vanilla fine tuning, linear probing, or reinitialization-based techniques; (ii) Introduce a weight outlier suppression constraint to regulate abnormal deviations in model weights. Extensive experiments support our interpretation of CO and show the efficacy of the proposed mitigation strategies.

[927] Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

Zhongjie Duan, Hong Zhang, Yingda Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Controllable diffusion methods have substantially expanded the practical utility of diffusion models, but they are typically developed as isolated, backbone-specific systems with incompatible training pipelines, parameter formats, and runtime hooks. This fragmentation makes it difficult to reuse infrastructure across tasks, transfer capabilities across backbones, or compose multiple controls within a single generation pipeline. We present Diffusion Templates, a unified and open plugin framework that decouples base-model inference from controllable capability injection. The framework is organized around three components: Template models that map arbitrary task-specific inputs to an intermediate capability representation, a Template cache that functions as a standardized interface for capability injection, and a Template pipeline that loads, merges, and injects one or more Template caches into the base diffusion runtime. Because the interface is defined at the systems level rather than tied to a specific control architecture, heterogeneous capability carriers such as KV-Cache and LoRA can be supported under the same abstraction. Based on this design, we build a diverse model zoo spanning structural control, brightness adjustment, color adjustment, image editing, super-resolution, sharpness enhancement, aesthetic alignment, content reference, local inpainting, and age control. These case studies show that Diffusion Templates can unify a broad range of controllable generation tasks while preserving modularity, composability, and practical extensibility across rapidly evolving diffusion backbones. All resources will be open sourced, including code, models, and datasets.

[928] An Aircraft Upset Recovery System with Reinforcement Learning

Mahir Demir, Atahan Cilan, Seyyid Osman Sevgili, Özgün Can Yürütken, Ümit Can Bekar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This article explores the progress made in the creation of a pilot activated recovery system (PARS) for advanced jet trainers that utilizes artificial intelligence (AI) in an effort to enhance operational efficiency. The PARS model employs an advanced reinforcement learning (RL) architecture, incorporating a cutting-edge soft-actor critic (SAC) model and hyper-parameter optimization methods. Negative-g punishments and other handcrafted features remarked upon by control engineers and domain experts regarding PARS are also taken into account by the system. When evaluated by them, the AI model’s behavior is deemed more desirable than that of conventional control methods.

[929] DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates train–test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h transform Process Reward guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM-DLLM.

[930] SAGE: Sparse Adaptive Guidance for Dependency-Aware Tabular Data Generation

Shuo Yang, Zheyu Zhang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Generating high-fidelity synthetic tabular data remains a critical challenge for enhancing data availability in privacy-sensitive and low-resource domains. Recent approaches leverage LLMs by representing table rows as sequences, yet suffer from two fundamental limitations: (1) they model feature dependencies densely, introducing spurious correlations; and (2) they assume static relationships between features, ignoring how these dependencies vary with feature values. To overcome these limitations, we introduce SAGE (Sparse Adaptive Guidance), a novel LLM-based generation framework that enforces sparse and dynamic dependency guidance. SAGE discretizes features into value-aware pseudo-features and constructs a mutual information-based sparse dependency graph. This graph adaptively guides generation through explicit context selection or implicit logit correction, enabling LLMs to focus on truly relevant information during synthesis. Our extensive experiments across six datasets and multiple tasks reveal that SAGE not only improves data fidelity and downstream utility, boosting F1 scores by 10% compared to previous LLM-based methods, but also reduces policy violations by one point. These results highlight the importance of adaptive structure in tabular data generation and provide new insights into context-sensitive control of LLMs.

[931] PathMoG: A Pathway-Centric Modular Graph Neural Network for Multi-Omics Survival Prediction

Di Wang, Chupei Tang, Junxiao Kong, Jixiu Zhai, Moyu Tang, Tianchi Lu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Cancer survival prediction from multi-omics data remains challenging because prognostic signals are high-dimensional, heterogeneous, and distributed across interacting genes and pathways. We propose PathMoG, a pathway-centric modular graph neural network for multi-omics survival prediction. PathMoG reorganizes genome-scale inputs into 354 KEGG-informed pathway modules, introduces a Hierarchical Omics Modulation module to condition gene-expression representations on mutation, copy number variation, pathway, and clinical context, and uses dual-level attention to capture both intra-pathway driver signals and inter-pathway clinical relevance. We evaluated PathMoG on 5,650 patients across 10 TCGA cancer types and observed consistent improvements over representative survival baselines. The framework further provides gene-level, pathway-level, and patient-level interpretability, supporting biologically grounded and clinically relevant risk stratification.

[932] Complexity of Linear Regions in Self-supervised Deep ReLU Networks

Mufhumudzi Muthivhi, Terence L. van Zyl

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: There has been growing interest in studying the complexity of Rectified Linear Unit (ReLU) based activation networks. Recent work investigates the evolution of the number of piecewise-linear partitions (linear regions) that are formed during training. However, current research is limited to examining the complexity of models trained in a supervised way. Self-Supervised Learning (SSL) differs in that it directly optimises the representation space using a loss function to enhance the model’s performance across multiple downstream tasks. This study investigates the local distribution of linear regions produced by SSL models. We demonstrate that the evolution of linear regions correlates with the representation quality by utilising SplineCam to extract two-dimensional polytopes near the data distribution. We track the number, area, eccentricity, and boundaries of regions throughout training. The study compares supervised, contrastive, and self-distillation methods over two standard benchmark datasets, MNIST and FashionMNIST. The analysis of the experimental results shows that self-supervised methods create substantially fewer regions to achieve comparable accuracy to supervised models. Contrastive methods rapidly expand regions over time, whereas self-distillation methods tend to consolidate by merging neighbouring regions. Lastly, we can detect representation collapse early within the geometric space of linear regions. Our analysis suggests that polytopal metrics can serve as reliable indicators of representation quality and model performance.

[933] An Automatic Ground Collision Avoidance System with Reinforcement Learning

Seyyid Osman Sevgili, Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Ümit Can Bekar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This article evaluates an artificial intelligence (AI)-based Automatic Ground Collision Avoidance System (AGCAS) designed for advanced jet trainers to enhance operational effectiveness. In the continuously evolving field of aerospace engineering, the integration of AI is crucial for advancing operations with improved timing constraints and efficiency. Our study explores the design process of an AI-driven AGCAS, specifically tailored for advanced jet trainers, focusing on addressing the AGCAS problem within a limited observation space. The system utilizes line-of-sight queries on a terrain server to ensure precise and efficient collision avoidance. This approach aims to significantly improve the safety and operational capabilities of advanced jet trainers.

[934] Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance

Shiyun Wa, Yifei Wang, Simone Sciabola, Ye Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Molecular similarity plays a central role in ligand-based drug discovery, such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics, and performs effectively in both ranking molecules for virtual screening and guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measurement for modern AI-aided drug discovery.

[935] SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

Xinrun Wang, Deshun Xia, Ke Xu, Weijie Zhu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate trajectory prediction is fundamentally challenging due to high scene heterogeneity - the severe variance in motion velocity, spatial density, and interaction patterns across different real-world environments. However, most existing approaches typically train a single unified model, expecting a fixed-capacity architecture to generalize universally across all possible scenarios. This conventional model-centric paradigm is fundamentally flawed when confronting such extreme heterogeneity, inevitably leading to a severe generalization gap, degraded accuracy, and massive computational waste. To overcome this bottleneck, rather than refining restricted model-centric architectures, we propose selective learning, a novel scene-centric paradigm. It explicitly analyzes the characteristics of the underlying scene to dynamically route inputs to the most appropriate expert models. As a concrete implementation of this paradigm, we introduce SceneSelect. Specifically, SceneSelect utilizes unsupervised clustering on interpretable geometric and kinematic features to discover a latent scene taxonomy. A highly decoupled classification module is then trained to assign real-time inputs to these scene categories, and a highly extensible, plug-and-play scheduling policy automatically dispatches the trajectory sequence to the optimal expert predictor. Crucially, this decoupled design ensures excellent generalization capabilities, allowing seamless integration with different off-the-shelf models and robust adaptation across new datasets without requiring computationally expensive joint retraining. Extensive experiments on three public benchmarks (ETH-UCY, SDD, and NBA) demonstrate that our method consistently outperforms strong single-model and ensemble baselines, achieving an average improvement of 10.5%, showcasing the effectiveness of scene-aware selective learning.

[936] Prior-Agnostic Robust Forecast Aggregation

Zhi Chen, Cheng Peng, Wei Tang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Robust forecast aggregation combines the predictions of multiple information sources to perform well in the worst case across all possible information structures. Previous work largely focuses on settings with a known binary state space, where the state is either 0 or 1. We study prior-agnostic robust forecast aggregation in which the aggregator observes only experts’ reports, yet is ignorant of both the underlying joint information structure and the full prior, including the underlying state space. Unlike the standard model that fixes the binary state space {0, 1}, we allow the (binary) unknown state values to be arbitrary numbers in [0, 1], so the same reported probability may correspond to very different realized outcome frequencies across environments. Our main contribution is a simple, explicit, closed-form log-odds aggregator that linearly pools forecasts in logit space, together with (nearly-)tight minimax-regret guarantees across three knowledge regimes. We first show that under conditionally independent (CI) signals, robust aggregation with an unknown state space is strictly harder than in the known-state setting by establishing a larger lower bound, and our aggregation rule can achieve a worst-case regret of 0.0255. Along the way, we also characterize tight regret bounds for Blackwell-ordered structures and for general information structures. In the classical setting with known state space {0,1}, our aggregator achieves regret strictly below 0.0226 for CI structures. To the best of our knowledge, this is the first explicit closed-form aggregator that achieves a regret upper bound strictly less than 0.0226. Finally, we extend the model where the aggregator additionally knows each expert’s marginal forecast distribution; in this setting, with the CI structures, we show that a generalized log-odds rule achieves regret of 0.0228, complementing with a lower bound of 0.0225.

[937] A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

Ying-Tu Chen, Wei Hung, Bing-Shu Wu, Zhang-Wei Hong, Ping-Chun Hsieh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach} addresses this by training a single policy network conditioned on preference-weighted rewards. In this paper, we explore a novel algorithmic perspective: leveraging reward-free reinforcement learning (RFRL) for MORL. While RFRL has historically been studied independently of MORL, it learns optimal policies for any possible reward function, making it a natural fit for MORL’s challenge of handling unknown user preferences. We propose using the RFRL’s training objective as an auxiliary task to enhance MORL, enabling more effective knowledge sharing beyond the multi-objective reward function given at training time. To this end, we adapt a state-of-the-art RFRL algorithm to the MORL setting and introduce a preference-guided exploration strategy that focuses learning on relevant parts of the environment. Through extensive experiments and ablation studies, we demonstrate that our approach significantly outperforms the state-of-the-art MORL methods across diverse MO-Gymnasium tasks, achieving superior performance and data efficiency. This work provides the first systematic adaptation of RFRL to MORL, demonstrating its potential as a scalable and empirically effective solution to multi-objective policy learning.

[938] Stochastic simultaneous optimistic optimization

Michal Valko, Alexandra Carpentier, Rémi Munos

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We study the problem of global maximization of a function f given a finite number of evaluations perturbed by noise. We consider a very weak assumption on the function, namely that it is locally smooth (in some precise sense) with respect to some semi-metric, around one of its global maxima. Compared to previous works on bandits in general spaces (Kleinberg et al., 2008; Bubeck et al., 2011a) our algorithm does not require the knowledge of this semi-metric. Our algorithm, StoSOO, follows an optimistic strategy to iteratively construct upper confidence bounds over the hierarchical partitions of the function domain to decide which point to sample next. A finite-time analysis of StoSOO shows that it performs almost as well as the best specifically-tuned algorithms even though the local smoothness of the function is not known.

[939] Dialysis Risk Prediction and Treatment Effect Estimation for AKI patients using Longitudinal Electronic Health Records

Kalyani P. Pande, Evan Yang, Bryan Zhu, Sandeep K. Mallipattu, Alisa Yurovsky, Tengfei Ma

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Progression to dialysis or end-stage renal disease is a rare but clinically important outcome. Clinicians need evidence on how medication exposures influence downstream risk. We constructed a fixed-window EHR cohort (90-day observation, 730-day prediction; N=81401; dialysis/ESRD prevalence: 1.1%) and modeled sequences of diagnoses, procedures, and medications with kidney laboratory trends (creatinine, BUN, eGFR). A transformer-based causal multi-head model was trained to estimate drug- and ingredient-level average treatment effects (ATEs) using counterfactual exposure removal and insertion under a full medication history setup. On test set, predictive performance reached an AUC of 0.694 and PR-AUC of 0.094. At the selected decision threshold (0.883), the model achieved an F1 score of 0.201 with a Brier score of 0.018. Post-hoc causal analyses of lab changes (eGFR, creatinine, BUN) using IPTW, AIPW, naive, and covariate-adjusted OLS methods assessed clinical directionality. Results showed partial protective-direction support for ACE/ARB exposures and worsening-direction signals for loop diuretics.

[940] GradMAP: Gradient-Based Multi-Agent Proximal Learning for Grid-Edge Flexibility

Yihong Zhou, Hongtai Zeng, Thomas Morstyn

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Coordinating large populations of grid-edge devices requires learning methods that remain fully decentralised in deployment while still respecting three-phase AC distribution-network physics. This paper proposes gradient-based multi-agent proximal learning (GradMAP) to address this challenge. GradMAP trains independent neural-network policies for each agent without any parameter sharing, and each agent uses only its own local observation for online decision-making without communication. During offline training, GradMAP embeds a differentiable three-phase AC power-flow model in a primal-dual learning loop and uses implicit differentiation to propagate exact network-constraint violations to update the policy parameters. To speed up training, GradMAP reuses expensive environment gradients through a proximal surrogate within a trust region defined in the more direct policy-output (action) space, instead of the probability distribution space used in other works, such as PPO. In case studies with 1,000 agents managing batteries, heat pumps, and controllable generators on the IEEE 123-bus feeder, GradMAP learns decentralised policies that minimise three-phase AC load-flow constraint violations within 15 minutes of training on a single workstation-class NVIDIA RTX PRO 5000 Blackwell 48GB GPU. This is a 3–5x training speed-up over gradient-based self-supervised learning benchmarks and substantially better training efficiency than multi-agent reinforcement-learning benchmarks. In out-of-sample tests, GradMAP also delivers among the lowest operating cost and constraint violations.

[941] Efficient learning by implicit exploration in bandit problems with side observations

Tomas Kocak, Gergely Neu, Michal Valko, Remi Munos

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner’s action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.

[942] Fraud Detection in Cryptocurrency Markets with Spatio-Temporal Graph Neural Networks

Lidia Losavio, Luca Persia, Madan Sathe, Dimosthenis Pasadakis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Technological advancements in cryptocurrency markets have increased accessibility for investors, but concurrently exposed them to the risks of market manipulations. Existing fraud detection mechanisms typically rely on machine learning methods that treat each financial asset (i.e., token) and its related transactions independently. However, market manipulation strategies are rarely isolated events, but are rather characterized by coordination, repetition, and frequent transfers among related assets. This suggests that relational structure constitutes an integral component of the signal and can be effectively represented through graphical means. In this paper, we propose three graph construction methods that rely on aggregated hourly market data. The proposed graphs are processed by a unified spatio-temporal Graph Neural Network (GNN) architecture that combines attention-based spatial aggregation with temporal Transformer encoding. We evaluate our methodology on a real-world dataset comprised of pump-and-dump schemes in cryptocurrency markets, spanning a period of over three years. Our comparative results showcase that our graph-based models achieve significant improvements over standard machine learning baselines in detecting anomalous events. Our work highlights that learned market connectivity provides substantial gains for detecting coordinated market manipulation schemes.

Md All Shahria, Sanjeda Dewan Mithila, Touhid Alam, Mohammad Sakib Mahmood, Mahfuza Khatun

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The widespread adoption of social media has heightened interest in its psychological effects, particularly on mental health indicators such as anxiety, depression, loneliness, and sleep quality, as these platforms increasingly influence social interactions and well-being. Although previous research has examined correlations between social media use and mental health, few studies have utilized unsupervised machine learning to segment users based on behavioral and psychological patterns, leaving a gap in identifying distinct risk profiles across diverse groups. This study seeks to address this by segmenting individuals according to their social media usage and psychological well-being, employing clustering to reveal hidden patterns and evaluate their mental health implications. Data from 551 participants, collected via an online survey, were preprocessed using KNN imputation for missing values, one-hot encoding for categorical variables like Gender with 5 unique values, and outlier detection via IQR and Z-score methods. K-Means clustering, optimized at 6 clusters using the Elbow Method and a Silhouette Score of 0.32, was applied, with PCA reducing 22 dimensions for visualization and a correlation heatmap highlighting relationships, such as a 0.28 correlation between social media hours and anxiety.

[944] Cortex-Inspired Continual Learning: Unsupervised Instantiation and Recovery of Functional Task Networks

Kevin McKee, Thomas Hazy, Yicong Zheng, Zacharie Bugaud, Thomas Miconi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Block-sequential continual learning demands that a single model both protect prior solutions from catastrophic forgetting and efficiently infer at inference time which prior solution matches the current input without task labels. We present Functional Task Networks (FTN), a parameter-isolation method inspired by structural and dynamical motifs found in the mammalian neocortex. Similar to mixture-of-experts, this method uses a high dimensional, self-organizing binary mask over a large population of small but deep networks, inspired by dendritic models of pyramidal neurons. The mask is produced by a three-stage procedure: (1) gradient descent on a continuous mask identifies task-relevant neurons, (2) a smoothing kernel biases the result toward spatial contiguity, (3) and k-winner-take-all binarizes the resulting group at a fixed capacity budget. Like mixture-of-experts, each neuron is an independent deep network, so disjoint masks give exactly disjoint gradient updates, providing structural guarantees against catastrophic forgetting. This three-stage procedure recovers the sub-network of a previously-trained task in a single gradient step, providing unsupervised task segmentation at inference time. We test it on three continual-learning benchmarks: (1) a synthetic multi-task classification/regression generator, (2) MNIST with shuffled class labels (pure concept shift), and (3) Permuted MNIST (domain shift). On all three, FTN with fine grained smoothing (FTN-Slow) results in nearly zero forgetting. FTN with a large kernel and only 2 iterations of smoothing (FTN-Fast) trades off some retention for increased speed. We show that the spatial organization mechanism reduces the effective mask search from the combinatorial top-k subset problem in O(C(H,K)) to the complexity of a near-linear scan in O(H) over compact cortical neighborhoods, which is parallelized by the gradient-based update.

[945] The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu, Xiangru Tang, Runyu Lu, Lichang Chen, Xiaoyan Bai, Haizhong Zheng, Carl Chen, Zhiyang Chen, Haojie Ye, Yujuan Fu, Zexue He, Zijian Jin, Zhenyu Zhang, Shangquan Sun, Maestro Harmon, John Dianzhuo Wang, Jianqiao Zeng, Jiachen Sun, Mingyuan Wu, Baoyu Zhou, Yuchen You, Shijian Lu, Yiming Qiu, Fan Lai, Yuan Yuan, Yao Li, Junyuan Hong, Ruihao Zhu, Beidi Chen, Alex Pentland, Ang Chen, Mosharaf Chowdhury, Zechen Zhang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (Ara), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compilation discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an Ara Compiler that translates legacy PDFs and repos into Aras; and an Ara-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, Ara raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench’s five open-ended extension tasks, preserved failure traces in Ara accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent’s capabilities.

[946] A Functorial Formulation of Neighborhood Aggregating Deep Learning

Sun Woo Park, Yun Young Choi, U Jin Choi, Youngho Woo

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We provide a mathematical interpretation of convolutional (or message passing) neural networks by using presheaves and copresheaves of the set of continuous functions over a topological space. Based on this interpretation, we formulate a theoretical heuristic which elaborates a number of empirical limitations of these neural networks by using obstructions on such sets of continuous functions over a topological space to be sheaves or copresheaves.

[947] Diffusion-Guided Feature Selection via Nishimori Temperature: Noise-Based Spectral Embedding

Vasiliy S. Usatyuk, Denis A. Sapozhnikov, Sergey I. Egorov

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We propose Noise-Based Spectral Embedding (NBSE), a physics-informed framework for selecting informative features from high-dimensional data without greedy search. NBSE constructs a sparse similarity graph on the samples and identifies the Nishimori temperature $β_N$ the critical inverse temperature at which the Bethe Hessian becomes singular. The corresponding smallest eigenvector captures the dominant mode of an intrinsically degree-corrected diffusion process, naturally reweighting nodes to prevent hub dominance. By transposing the data matrix and applying NBSE in feature space, we obtain a one-dimensional spectral embedding that reveals groups of redundant or semantically related dimensions; balanced binning then selects one representative per group. We prove that coloured Gaussian perturbations shift $β_N$ by at most $O(\barσ^2)$, guaranteeing robustness to measurement noise. Experiments on ImageNet embeddings from MobileNetV2 and EfficientNet-B4 show that NBSE preserves classification accuracy even under aggressive compression: on EfficientNet-B4 the accuracy drop is below $1%$ when retaining only $30%$ of features, outperforming ANOVA $F$-test and random selection by up to $6.8%$.

[948] Scalable Hyperparameter-Divergent Ensemble Training with Automatic Learning Rate Exploration for Large Models

Hailing Cheng, Tao Huang, Chen Zhu, Antonio Alonso

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates – a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base schedule toward higher-performing configurations via a momentum-based gradient-free meta-update. The combined method produces a self-adapting learning rate schedule that improves both optimization quality and generalization without additional hyperparameter sweeps or training budget. Crucially, the framework generalizes beyond learning rate: any scalar hyperparameter that does not alter model architecture – such as dropout rate, attention scale temperature, or weight-decay coefficient – can be explored across replicas using the same fan-out/converge protocol, with inter-replica loss differences serving as zero-order hypergradients that guide the search direction. HDET is implemented as a drop-in replacement for PyTorch’s OneCycleLR scheduler, requiring no changes to model architecture, optimizer, or data pipeline.

[949] SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning

Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across unseen specifications and diverse environments remains insufficiently understood. In this work, we introduce SpecRLBench, a benchmark designed to evaluate the generalization capabilities of LTL-based specification-guided RL methods. The benchmark spans multiple difficulty levels across navigation and manipulation domains, incorporating both static and dynamic environments, diverse robot dynamics, and varied observation modalities. Through extensive empirical evaluation, we characterize the strengths and limitations of existing approaches and reveal the challenges that emerge as specification and environment complexity increase. SpecRLBench provides a structured platform for systematic comparison and supports the development of more generalizable specification-guided RL methods. Code is available at https://github.com/BU-DEPEND-Lab/SpecRLBench.

[950] Learning to Think from Multiple Thinkers

Nirmit Joshi, Roey Magen, Nathan Srebro, Nikolaos Tsilivis, Gal Vardi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot poly\log\frac{1}{\varepsilon}$.

[951] Conflict-Aware Harmonized Rotational Gradient for Multiscale Kinetic Regimes

Zhangyong Liang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In this paper, we propose a harmonized rotational gradient method, termed HRGrad, for simultaneously tackling multiscale time-dependent kinetic problems with varying small parameters. These parameters exhibit asymptotic transitions from microscopic to macroscopic physics, making it a challenging multi-task problem to solve over all ranges simultaneously. Solving tasks in different asymptotic regions often encounter gradient conflicts, which can lead to the failure of multi-task learning. To address this challenge, we explicitly encode a hidden representation of these parameters, ensuring that the corresponding solving tasks are serialized for simultaneous training. Furthermore, to mitigate gradient conflicts, we segment the prediction results to construct task losses and introduce a novel gradient alignment metric to ensure a positive dot product between the final update and each loss-specific gradient. This metric maintains consistent optimization rates for all task losses and dynamically adjusts gradient magnitudes based on conflict levels. Moreover, we provide a mathematical proof demonstrating the convergence of the HRGrad method, which is evaluated across a range of challenging asymptotic-preserving neural networks (APNNs) scenarios. We conduct an extensive set of experiments encompassing the Bhatnagar-Gross-Krook (BGK) equation and the linear transport equation in all ranges of Knudsen number. Our results indicate that HRGrad effectively overcomes the `failure modes’ of APNNs in these problems.

[952] The Optimal Sample Complexity of Multiclass and List Learning

Chirag Pabbaraju

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While the optimal sample complexity of binary classification in terms of the VC dimension is well-established, determining the optimal sample complexity of multiclass classification has remained open. The appropriate complexity parameter for multiclass classification is the DS dimension, and despite significant efforts, a gap of $\sqrt{\text{DS}}$ has persisted between the upper and lower bounds on sample complexity. Recent work by Hanneke et al. (2026) shows a novel algebraic characterization of multiclass hypothesis classes in terms of their DS dimension. Building up on this, we show that the maximum hypergraph density of any multiclass hypothesis class is upper-bounded by its DS dimension. This proves a longstanding conjecture of Daniely and Shalev-Shwartz (2014). As a consequence, we determine the optimal dependence of the sample complexity on the DS dimension for multiclass as well as list learning.

[953] Learning Gradient-based Mixup with Extrapolation toward Flatter Minima for Domain Generalization

Danni Peng, Sinno Jialin Pan

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2209.14742: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2209.14742&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[954] Consistency of Lloyd’s Algorithm Under Perturbations

Dhruv Patel, Hui Shen, Shankar Bhamidi, Yufeng Liu, Vladas Pipiras

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2309.00578: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2309.00578&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[955] Universal approximation property of Banach space-valued random feature models including random neural networks

Ariel Neufeld, Philipp Schmocker

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2312.08410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.08410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[956] Learning Under Moral Hazard with Instrumental Regression and Generalized Method of Moments

Shiliang Zuo

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2405.20642: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.20642&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[957] FlashNorm: Fast Normalization for Transformers

Nils Graef, Filip Makraduli, Andrew Wasielewski, Matthew Clapp

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2407.09577: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.09577&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[958] Universal Approximation of Operators with Transformers and Neural Integral Operators

Emanuele Zappala, Maryam Bagherian

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2409.00841: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.00841&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[959] Learning-Augmented Robust Algorithmic Recourse

Kshitij Kayastha, Vasilis Gkatzelis, Shahin Jabbari

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2410.01580: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.01580&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[960] On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training

Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Hsinyu Tsai, Kaoutar El Maghraoui, Tianyi Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2410.15155: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.15155&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[961] Exploring the Impact of Dataset Statistical Effect Size on Model Performance and Data Sample Size Sufficiency

Arya Hatamian, Lionel Levine, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2501.02673: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.02673&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[962] Orthogonal Representation Learning for Estimating Causal Quantities

Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2502.04274: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.04274&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[963] PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking

Yize Jiang, Xinze Li, Yuanyuan Zhang, Jin Han, Youjun Xu, Ayush Pandit, Zaixi Zhang, Mengdi Wang, Mengyang Wang, Minjie Shen, Guang Yang, Yejin Choi, Wu-Jun Li, Tianfan Fu, Fang Wu, Junhong Liu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.01700: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.01700&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[964] RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.02922: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.02922&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[965] LAMP: Extracting Local Decision Surfaces From Large Language Models

Ryan Chen, Youngmin Ko, Zeyu Zhang, Catherine Cho, Sunny Chung, Mauro Giuffré, Dennis L. Shung, Bradly C. Stadie

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.11772: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11772&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[966] AlphaFold’s Bayesian Roots in Probability Kinematics

Thomas Hamelryck, Kanti V. Mardia

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.19763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[967] MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

Wei Shen, Zhang Yaxiang, Minhui Huang, Mengfan Xu, Jiawei Zhang, Cong Shen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.01897: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01897&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[968] Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.04118: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.04118&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[969] Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

Daniel Jenson, Jhonathan Navott, Piotr Grynfelder, Mengyan Zhang, Makkunda Sharma, Elizaveta Semenova, Seth Flaxman

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.09163: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09163&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[970] Doloris: Dual Conditional Diffusion Implicit Bridges with Sparsity Masking Strategy for Unpaired Single-Cell Perturbation Estimation

Changxi Chi, Jun Xia, Yufei Huang, Zhuoli Ouyang, Cheng Tan, Yunfan Liu, Jingbo Zhou, Chang Yu, Liangyu Yuan, Siyuan Li, Zelin Zang, Stan Z. Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.21107: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.21107&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[971] Learning Latent Graph Geometry via Fixed-Point Schrödinger-Type Activation: A Theoretical Study

Dmitry Pasechnyuk-Vilensky, Martin Takáč

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.20088: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.20088&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[972] RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

The-Hai Nguyen, Dang Huu-Tien, Takeshi Suzuki, Le-Minh Nguyen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.03121: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.03121&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[973] Multimodal Remote Inference

Keyuan Zhang, Yin Sun, Bo Ji

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.07555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.07555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[974] DeepCausalMMM: A Deep Learning Framework for Marketing Mix Modeling with Causal Structure Learning

Aditya Puttaparthi Tirumala

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.13087: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13087&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[975] Adversary-Free Counterfactual Prediction via Information-Regularized Representations

Shiqin Tang, Rong Feng, Shuxin Zhuang, Youzhi Zhang, Hongzong Li

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.15479: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15479&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[976] Beyond Binary Out-of-Distribution Detection: Characterizing Distributional Shifts with Multi-Statistic Diffusion Trajectories

Achref Jaziri, Martin Rogmann, Martin Mundt, Visvanathan Ramesh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.17381: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17381&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[977] Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.24941: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24941&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[978] HardFlow: Hard-Constrained Sampling for Flow-Matching Models via Trajectory Optimization

Zeyang Li, Kaveh Alim, Navid Azizan

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.08425: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08425&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[979] LILogic Net: Compact Logic Gate Networks with Learnable Connectivity for Efficient Hardware Deployment

Katarzyna Fojcik, Renaldas Zioma, Jogundas Armaitis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.12340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.12340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[980] TRINITY: An Evolved LLM Coordinator

Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.04695: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04695&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[981] Sparse Concept Anchoring for Interpretable and Controllable Neural Representations

Sandy Fraser, Patryk Wielopolski

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.12469: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12469&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[982] Generalisation in Multitask Fitted Q-Iteration and Offline Q-learning

Kausthubh Manda, Raghuram Bharadwaj Diddigi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.20220: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20220&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[983] VAMP-Net: An Interpretable Multi-Path Network of Genomic Permutation-Invariant Set Attention and Quality-Aware 1D-CNN for MTB Drug Resistance

Aicha Boutorh, Kamar Hibatallah Baghdadi, Anais Daoud

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.21786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[984] Predicting Time Pressure of Powered Two-Wheeler Riders for Proactive Safety Interventions

Sumit S. Shevtekar, Chandresh K. Maurya, Gourab Sil

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.03173: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03173&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[985] Estimating Dense-Packed Zone Height in Liquid-Liquid Separation: A Physics-Informed Neural Network Approach

Mehmet Velioglu, Song Zhai, Alexander Mitsos, Adel Mhamdi, Andreas Jupke, Manuel Dahmen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.18399: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18399&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[986] Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence

Donghwan Lee, Hyukjun Yang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.18840: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18840&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[987] Test-Time Adaptation for Unsupervised Combinatorial Optimization

Yiqiao Liao, Farinaz Koushanfar, Parinaz Naghizadeh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.21048: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21048&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[988] High-accuracy sampling for diffusion models and log-concave distributions

Fan Chen, Sinho Chewi, Constantinos Daskalakis, Alexander Rakhlin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.01338: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01338&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[989] Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models

Mounir Lbath, Alexandre Parésy, Abdelkayoum Kaddouri, Abdelrahman Zighem, Jill-Jênn Vie

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.06542: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06542&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[990] Additive Control Variates Dominate Self-Normalisation in Off-Policy Evaluation

Olivier Jeunen, Shashank Gupta

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.14914: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14914&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[991] Symbolic recovery of PDEs from measurement data

Erion Morina, Philipp Scholl, Martin Holler

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.15603: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15603&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[992] Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

Ren Kishimoto, Rikiya Takehi, Koichi Tanaka, Masahiro Nomura, Riku Togashi, Yoji Tomita, Yuta Saito

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.15752: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15752&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[993] Variational Grey-Box Dynamics Matching

Gurjeet Sangra Singh, Frantzeska Lavda, Giangiacomo Mercatali, Alexandros Kalousis

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.17477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[994] Radial Load–Reserve Certificates for Wasserstein Propagation in Isotropic Diffusion Samplers

Zicheng Lyu, Zengfeng Huang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.19670: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19670&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[995] Linear-Nonlinear Fusion Neural Operator for Partial Differential Equations

Heng Wu, Junjie Wang, Benzhuo Lu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.24143: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24143&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[996] SEVerA: Verified Synthesis of Self-Evolving Agents

Debangshu Banerjee, Changming Xu, Eugene Ie, Ming Zhang, Daiyi Peng, Chu-Cheng Lin, Gagandeep Singh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.25111: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25111&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[997] Matrix Profile for Time-Series Anomaly Detection: A Reproducible Open-Source Benchmark on TSB-AD

Chin-Chia Michael Yeh

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.02445: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02445&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[998] End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables

Francesco Carlucci, Giovanni Pollo, Xiaying Wang, Massimo Poncino, Enrico Macii, Luca Benini, Sara Vinco, Alessio Burrello, Daniele Jahier Pagliari

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[999] $π_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachlan Groom, Haroun Habeeb, Hunter Hancock, Karol Hausman, Gashon Hussein, Victor Hwang, Brian Ichter, Connor Jacobsen, Szymon Jakubczak, Rowan Jen, Tim Jones, Gregg Kammerer, Ben Katz, Liyiming Ke, Mairbek Khadikov, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Brendon LeCount, Sergey Levine, Xinyu Li, Adrian Li-Bell, Vladislav Lialin, Zhonglin Liang, Wallace Lim, Yao Lu, Enyu Luo, Vishnu Mano, Nandan Marwaha, Aikys Mongush, Liam Murphy, Suraj Nair, Tyler Patterson, Karl Pertsch, Allen Z. Ren, Gavin Schelske, Charvi Sharma, Baifeng Shi, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Jiaming Tang, Jimmy Tanner, Shalom Tekeste, Marcel Torne, Kyle Vedder, Quan Vuong, Anna Walling, Haohuan Wang, Jason Wang, XuDong Wang, Chris Whalen, Samuel Whitmore, Blake Williams, Charles Xu, Sukwon Yoo, Lili Yu, Wuming Zhang, Zhuoyang Zhang, Ury Zhilinsky

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.15483: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.15483&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1000] Corner Reflector Array Jamming Discrimination Using Multi-Dimensional Micro-Motion Features with Frequency Agile Radar

Jie Yuan, Lei Wang, Yanhao Wang, Yimin Liu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.16008: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16008&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1001] Back to Repair: A Minimal Denoising Network for Time Series Anomaly Detection

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1002] Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

Fernando Reitich

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1003] NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18471: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18471&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1004] LLMs Know They’re Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

Manav Pandey

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1005] Physics-Guided Dimension Reduction for Simulation-Free Operator Learning of Stiff Differential-Algebraic Systems

Huy Hoang Le, Haoguang Wang, Christian Moya, Marcos Netto, Guang Lin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.19930: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.19930&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1006] Fine-Tuning Regimes Define Distinct Continual Learning Problems

Paul-Tiberiu Iordache, Elena Burceanu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.21927: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21927&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1007] Improved Hardness Results for Learning Intersections of Halfspaces

Stefan Tiegel

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2402.15995: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.15995&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1008] Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels

Jia-Qi Yang, Lei Shi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.18184: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.18184&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1009] Solving Nonlinear PDEs with Sparse Radial Basis Function Networks

Zihan Shao, Konstantin Pieper, Xiaochuan Tian

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.07765: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.07765&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1010] High-Dimensional Private Linear Regression with Optimal Rates

Simone Bombari, Jialei Luo, Inbar Seroussi, Marco Mondelli

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.16329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1011] Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks

David Holzmüller, Max Schölpple

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.22429: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.22429&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1012] Modeling Parkinson’s Disease Progression Using Longitudinal Voice Biomarkers: A Comparative Study of Statistical and Neural Mixed-Effects Models

Ran Tong, Lanruo Wang, Tong Wang, Wei Yan

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.20058: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.20058&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1013] Self-Organising Memristive Networks as Physical Learning Systems

Francesco Caravelli, Gianluca Milano, Adam Z. Stieg, Carlo Ricciardi, Simon Anthony Brown, Zdenka Kuncic

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.00747: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.00747&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1014] Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

Juyeon Yoon, Somin Kim, Robert Feldt, Shin Yoo

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.17314: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.17314&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1015] Geodesics in the Deep Linear Network

Alan Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.07324: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07324&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1016] Robust Least-Squares Optimization for Data-Driven Predictive Control: A Geometric Approach

Shreyas Bharadwaj, Bamdev Mishra, Cyrus Mostajeran, Alberto Padoan, Jeremy Coulson, Ravi N. Banavar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.09242: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09242&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1017] Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions

Lukas Billera, Hedwig Nora Nordlinder, Jack Collier Ryder, Anton Oresten, Aron Stålmarck, Theodor Mosetti Björk, Ben Murrell

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.09465: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09465&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1018] SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.17411: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17411&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1019] Lattice-to-Total Thermal Conductivity Ratio: A Phonon-Glass Electron-Crystal Descriptor for Data-Driven Thermoelectric Design

Yifan Sun, Zhi Li, Tetsuya Imamura, Yuji Ohishi, Chris Wolverton, Ken Kurosaki

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.21213: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21213&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1020] DNNs, Dataset Statistics, and Correlation Functions

Robert W. Batterman, James F. Woodward

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.21715: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21715&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1021] Flexible Deep Neural Networks for Partially Linear Survival Data: Estimation and Survival Inference

Asaf Ben Arie, Malka Gorfine

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.10570: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10570&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1022] Maritime object classification with SAR imagery using quantum kernel methods

John Tanner, Nicholas Davies, Pascal Jahan Elahi, Casey R. Myers, Du Huynh, Wei Liu, Mark Reynolds, Jingbo Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.11367: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11367&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1023] Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Yingzhen Yang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.20562: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20562&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1024] A Mixture of Experts Vision Transformer for High-Fidelity Surface Code Decoding

Hoang Viet Nguyen, Manh Hung Nguyen, Hoang Ta, Van Khu Vu, Yeow Meng Chee

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.12483: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12483&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1025] Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits

Sahasrajit Sarmasarkar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.00417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1026] On the Convergence of Jacobian-Free Backpropagation for Optimal Control Problems with Implicit Hamiltonians

Eric Gelphman, Deepanshu Verma, Nicole Tianjiao Yang, Stanley Osher, Samy Wu Fung

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.00921: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00921&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1027] Prenatal Stress Detection from Electrocardiography Using Self-Supervised Deep Learning: Development and External Validation

Martin G. Frasch, Marlene J.E. Mayer, Clara Becker, Peter Zimmermann, Camilla Zelgert, Marta C. Antonelli, Silvia M. Lobmaier

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.03886: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03886&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Brandon Yee, Lucas Wang, Kundana Kommini

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.23665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1029] KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

Zihao Zheng, Zhihao Mao, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Donggang Cao, Hong Mei, Xiang Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.01581: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01581&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1030] Bayesian Optimization with Gaussian Processes to Accelerate Stationary Point Searches

Rohit Goswami

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.10992: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10992&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1031] Optimal Experimental Design for Reliable Learning of History-Dependent Constitutive Laws

Kaushik Bhattacharya, Lianghao Cao, Andrew Stuart

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.12365: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12365&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1032] HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

Zihao Zheng, Zhihao Mao, Sicheng Tian, Maoliang Li, Jiayu Chen, Xinhao Sun, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.17573: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17573&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1033] On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

Yixuan Zhang, Ruihao Zhu, Qiaomin Xie

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.18514: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18514&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1034] RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models

Zihao Zheng, Hangyu Cao, Jiayu Chen, Sicheng Tian, Chenyue Li, Maoliang Li, Xinhao Sun, Guojie Luo, Xiang Chen

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.20711: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20711&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1035] Think Anywhere in Code Generation

Xue Jiang, Tianyu Zhang, Ge Li, Mengyang Liu, Taozhi Chen, Zhenhua Xu, Binhua Li, Wenpin Jiao, Zhi Jin, Yongbin Li, Yihong Dong

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.29957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1036] Pixel-Translation-Equivariant Quantum Convolutional Neural Networks via Fourier Multiplexers

Dmitry Chirkov, Igor Lobanov

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.06094: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06094&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1037] CASE: Cadence-Aware Set Encoding for Large-Scale Next Basket Repurchase Recommendation

Yanan Cao, Ashish Ranjan, Sinduja Subramaniam, Evren Korpeoglu, Kaushiki Nag, Kannan Achan

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.06718: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06718&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Yilong Dai, Yiming Sun, Yiheng Chen, Shengyu Chen, Xiaowei Jia, Runlong Yu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17149: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17149&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1039] Bilinear Input Modulation for Mamba: Koopman Bilinear Forms for Memory Retention and Multiplicative Computation

Hiroki Fujii, Masaki Yamakita

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.17221: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17221&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1040] Predictive Modelling of Natural Medicinal Compounds for Alzheimer disease Using Machine Learning and Cheminformatics

Hafiza Syeda Yusra Tirmizi, Syed Ibad Hasnain, Muhammad Faris, Rabail Khowaja, Saad Abdullah

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.18316: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18316&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[1041] Compliance Moral Hazard and the Backfiring Mandate

Jian Ni, Lecheng Zheng, John R Birge

[1054] In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

Xulin Fan, Vishal Sunder, Samuel Thomas, Mark Hasegawa-Johnson, Brian Kingsbury, George Saon

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advances in speech-aware language models have coupled strong acoustic encoders with large language models, enabling systems that move beyond transcription to produce richer outputs. Among these, word-level timestamp prediction is critical for applications such as captioning, media search, and multimodal synchronization, yet it is often handled by external alignment tools. In this work, we extend an existing speech-aware language model to predict timestamps directly alongside transcripts. We introduce a set of novel lightweight training strategies that improve alignment robustness while preserving recognition quality. Experiments across multiple datasets show that these strategies not only enhance timestamp accuracy, but also yield gains in overall ASR performance. Together, they demonstrate an efficient and unified approach to speech recognition with precise timestamp prediction.

[1055] Predictive Directional Selective Fixed-Filter Active Noise Control for Moving Sources via a Convolutional Recurrent Neural Network

Boxiang Wang, Zhengding Luo, Dongyuan Shi, Junwei Ji, Xiruo Su, Woon-Seng Gan

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting the suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method is weak at tracking the direction variations of non-stationary noise, such as that from a moving source. Therefore, this work proposes a Predictive Directional SFANC (PD-SFANC) method that uses a Convolutional Recurrent Neural Network (CRNN) to capture the hidden temporal dynamics of the moving noise and predict the control filter to cancel future noise. Accordingly, the proposed method can significantly improve its noise-tracking ability and dynamic noise-reduction performance. Furthermore, numerical simulations confirm the superiority of the proposed method for handling moving sources across various movement scenarios, compared to several representative ANC baselines.

[1056] Explainable AI in Speaker Recognition – Making Latent Representations Understandable

Yanze Xu, Wenwu Wang, Mark D. Plumbley

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering unknown organisational patterns in network representations, particularly those representations learned by the speaker recognition network that recognises the speaker identity of utterances. Past studies employed algorithms (e.g. t-distributed Stochastic Neighbour Embedding and K-means) to analyse and visualise how network representations form independent clusters, indicating the presence of flat clustering phenomena within the space defined by these representations. In contrast, this work applies two algorithms – Single-Linkage Clustering (SLINK) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) – to analyse how representations form clusters with hierarchical relationships rather than being independent, thereby demonstrating the existence of hierarchical clustering phenomena within the network representation space. To semantically understand the above hierarchical clustering phenomena, a new algorithm, termed Hierarchical Cluster-Class Matching (HCCM), is designed to perform one-to-one matching between predefined semantic classes and hierarchical representation clusters (i.e. those produced by SLINK or HDBSCAN). Some hierarchical clusters are successfully matched to individual semantic classes (e.g. male, UK), while others to conjunctions of semantic classes (e.g. male and UK, female and Ireland). A new metric, Liebig’s score, is proposed to quantify the performance of each matching behaviour, allowing us to diagnose the factor that most strongly limits matching performance.

[1057] Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang, Neo S. Ho, Hung-yi Lee

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs’ performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

[1058] Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Shinji Watanabe, Hung-yi Lee

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Full-duplex spoken dialogue systems promise to transform human-machine interaction from a rigid, turn-based protocol into a fluid, natural conversation. However, the central challenge to realizing this vision, managing overlapping speech, remains critically under-evaluated. We introduce Full-Duplex-Bench v1.5, the first fully automated benchmark designed to systematically probe how models behave during speech overlap. The benchmark simulates four representative overlap scenarios: user interruption, user backchannel, talking to others, and background speech. Our framework, compatible with open-source and commercial API-based models, provides a comprehensive suite of metrics analyzing categorical dialogue behaviors, stop and response latency, and prosodic adaptation. Benchmarking five state-of-the-art agents reveals two divergent strategies: a responsive approach prioritizing rapid response to user input, and a floor-holding approach that preserves conversational flow by filtering overlapping events. Our open-source framework enables practitioners to accelerate the development of robust full-duplex systems by providing the tools for reproducible evaluation.

[1059] Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity of temporal dynamics, including the ability to manage timing, tempo and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI. Demos and datasets are available on our project website https://ga642381.github.io/Game-Time.

[1060] Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, Hung-yi Lee

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While full-duplex speech agents enable natural, low-latency interaction by speaking and listening simultaneously, their consistency and task performance in multi-turn settings remain underexplored. We introduce Full-Duplex-Bench-v2 (FDB-v2), a streaming framework that integrates with an automated examiner that enforces staged goals under two pacing setups (Fast vs. Slow). FDB-v2 covers four task families: daily, correction, entity tracking, and safety. We report turn-taking fluency, multi-turn instruction following, and task-specific competence. The framework is extensible, supporting both commercial APIs and open source models. When we test full-duplex systems with FDB-v2, they often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about. Through an open-sourced, standardized streaming protocol and a task set, FDB-v2 makes it easy to extend to new task families, allowing the community to tailor and accelerate evaluation of multi-turn full-duplex systems.

Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: \textit{Visual Interference}, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models’ inference process to follow the human-like ``Look-then-Listen’’ inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a block to serve as semantic anchors, then generates the transcription in an block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. We conduct extensive evaluations demonstrating that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.

[1062] BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

Sungjae Kim, Kihyun Na, Jinyoung Choi, Injung Kim

Main category: eess.AS

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with intended musical notes. However, existing APC systems either rely on reference pitches, which limits practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a reference-free APC framework that corrects pitch errors while maintaining the expressiveness and naturalness of vocal performances. In BERT-APC, a stationary pitch predictor first estimates the stationary pitch of each note from the detuned singing voice, where stationary pitch is the continuous pitch from the stable region of a note and approximates its perceived pitch. A context-aware note pitch predictor then infers the intended pitch sequence using a repurposed music language model that incorporates musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional deviations for emotional expression. We also introduce a learnable data augmentation strategy that improves robustness by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior target note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49 percentage points on highly detuned samples in raw pitch accuracy. In the MOS test, BERT-APC achieved the highest quality rating of $4.32 \pm 0.15$, significantly higher than Auto-Tune ($3.22 \pm 0.18$) and Melodyne ($3.08 \pm 0.18$), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples are available at https://joshua-1995.github.io/BERT-APC-Demo/.

[1063] Learning Filters in Feedback Delay Networks from Noisy Room Impulse Responses

Gloria Dal Santo, Karolina Prawda, Sebastian J. Schlecht, Vesa Välimäki

Md Assaduzzaman, Nushrat Jahan Oyshi, Eram Mahamud

Main category: eess.IV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: The accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery remains a significant challenge in medical diagnostics, mainly due to the vast data volume and subtle variation in inter-class visuals. This study presents a hybrid dual-stream deep learning framework built on teacher-student knowledge distillation, where a high-capacity teacher model integrates the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer. The student network was implemented as a compact Tiny-ViT structure that inherits the teacher’s semantic and morphological knowledge via soft-label distillation, achieving a balance between efficiency and diagnostic accuracy. Two carefully curated Wireless Capsule Endoscopy datasets, encompassing major GI disease classes, were employed to ensure balanced representation and prevent inter-sample bias. The proposed framework achieved remarkable performance with accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, and an average AUC of 1.0000, signifying near-perfect discriminative capability. Interpretability analyses using Grad-CAM, LIME, and Score-CAM confirmed that the model’s predictions were grounded in clinically significant tissue regions and pathologically relevant morphological cues, validating the framework’s transparency and reliability. The Tiny-ViT demonstrated diagnostic performance with reduced computational complexity comparable to its transformer-based teacher while delivering faster inference, making it suitable for resource-constrained clinical environments. Overall, the proposed framework provides a robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, paving the way toward future intelligent endoscopic screening that is compatible with clinical practicality.

Today’s Research Highlights

Table of Contents

cs.CL

[1] The Randomness Floor: Measuring Intrinsic Non-Randomness in Language Model Token Distributions

[2] TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

[3] AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

[4] Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge

[5] Uncertainty Quantification for LLM Function-Calling

[6] Chinese-SkillSpan: A Span-Level Dataset for ESCO-Aligned Competency Extraction from Chinese Job Ads

[7] Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

[8] Evaluating Temporal Consistency in Multi-Turn Language Models

[9] DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

[10] Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort

[11] ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents

[12] Mixture of Heterogeneous Grouped Experts for Language Modeling

[13] Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings

[14] DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding

[15] Measuring Temporal Linguistic Emergence in Diffusion Language Models

[16] Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt

[17] Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

[18] Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

[19] From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors

[20] Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

[21] EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

[22] Au-M-ol: A Unified Model for Medical Audio and Language Understanding

[23] Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

[24] Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

[25] Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

[26] $\mathcal{S}^2$IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction

[27] Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

[28] Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

[29] Evaluating Large Language Models on Computer Science University Exams in Data Structures

[30] When Chain-of-Thought Fails, the Solution Hides in the Hidden States

[31] VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

[32] Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

[33] Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition

[34] Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

[35] Scheming Ability in LLM-to-LLM Strategic Interactions

[36] AI Safety Training Can be Clinically Harmful

[37] Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

[38] A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection

[39] JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

[40] Your Students Don’t Use LLMs Like You Wish They Did

[41] K-SENSE: A Knowledge-Guided Self-Augmented Encoder for Neuro-Semantic Evaluation of Mental Health Conditions on Social Media

[42] MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings

[43] Pref-CTRL: Preference Driven LLM Alignment using Representation Editing

[44] RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

[45] LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation

[46] ComplianceNLP: Knowledge-Graph-Augmented RAG for Multi-Framework Regulatory Gap Detection

[47] XITE: Cross-lingual Interpolation for Transfer using Embeddings

[48] Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

[49] Applications of the Transformer Architecture in AI-Assisted English Reading Comprehension

[50] GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

[51] Neural Grammatical Error Correction for Romanian

[52] Benchmarking Testing in Automated Theorem Proving

[53] Agri-CPJ: A Training-Free Explainable Framework for Agricultural Pest Diagnosis Using Caption-Prompt-Judge and LLM-as-a-Judge

[54] AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models

[55] Multimodal QUD: Inquisitive Questions from Scientific Figures

[56] Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

[57] LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

[58] DRACULA: Hunting for the Actions Users Want Deep Research Agents to Execute

[59] Resource-Lean Lexicon Induction for German Dialects

[60] One Size Fits None: Heuristic Collapse in LLM Investment Advice

[61] Reheat Nachos for Dinner? Evaluating AI Support for Cross-Cultural Communication of Neologisms

[62] Translate or Simplify First: An Analysis of Cross-lingual Text Simplification in English and French

[63] Learning Selective LLM Autonomy from Copilot Feedback in Enterprise Customer Support Workflows

[64] Knowledge Vector of Logical Reasoning in Large Language Models

[65] TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

[66] KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters

[67] Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity

[68] Propagation Structure-Semantic Transfer Learning for Robust Fake News Detection

[69] Stabilizing Efficient Reasoning with Step-Level Advantage Selection

[70] From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

[71] Improving Robustness of Tabular Retrieval via Representational Stability

[72] Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

[73] PeeriScope: A Multi-Faceted Framework for Evaluating Peer Review Quality

[74] How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

[75] The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

[76] BiMol-Diff: A Unified Diffusion Framework for Molecular Generation and Captioning

[77] Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising