Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 130]
- cs.CV [Total: 143]
- cs.AI [Total: 89]
- cs.SD [Total: 10]
- cs.LG [Total: 134]
- cs.MA [Total: 5]
- cs.MM [Total: 4]
- eess.AS [Total: 2]
- eess.IV [Total: 7]
cs.CL
[1] Qwen3.5-Omni Technical Report
Qwen Team
Main category: cs.CL
TL;DR: Qwen3.5-Omni scales the Qwen-Omni family to hundreds of billions of parameters with a 256k context, achieving SOTA on 215 audio and audio-visual benchmarks and introducing ARIA for stable streaming speech synthesis.
Details
Motivation: Advance omni-modality models with longer context, stronger audio-visual understanding, and stable, natural streaming speech synthesis.
Method: A Hybrid Attention Mixture-of-Experts framework for both Thinker and Talker, trained on heterogeneous text-vision pairs and over 100 million hours of audio-visual content; ARIA dynamically aligns text and speech units during streaming synthesis.
Result: SOTA across 215 audio and audio-visual subtasks and benchmarks, surpassing Gemini-3.1 Pro on key audio tasks; supports over 10 hours of audio understanding, 400 seconds of 720P video at 1 FPS, and speech in 10 languages.
Conclusion: Qwen3.5-Omni delivers robust omni-modality capabilities and exhibits an emergent skill, Audio-Visual Vibe Coding: coding directly from audio-visual instructions.
Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
[2] Applied Explainability for Large Language Models: A Comparative Study
Venkata Abhinandan Kancharla
Main category: cs.CL
TL;DR: A comparative study of Integrated Gradients, Attention Rollout, and SHAP on a fine-tuned DistilBERT SST-2 classifier finds gradient-based attribution most stable, attention methods cheap but less faithful, and SHAP flexible but costly.
Details
Motivation: LLM decision processes are hard to interpret, which creates challenges for trust, debugging, and real-world deployment; practical guidance on existing explainability methods is needed.
Method: Apply three explainability techniques to a fine-tuned DistilBERT model for SST-2 sentiment classification under a consistent, reproducible setup.
Result: Gradient-based attribution yields more stable, intuitive explanations; attention-based methods are efficient but less aligned with prediction-relevant features; model-agnostic SHAP is flexible but computationally costly and variable.
Conclusion: Explainability methods involve clear trade-offs and should be treated as diagnostic tools rather than definitive explanations.
Abstract: Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review.
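Integrated Gradients, one of the three methods compared above, attributes a prediction by accumulating gradients along a straight-line path from a baseline input to the actual input. Below is a minimal pure-Python sketch on a toy differentiable function; the function `f` is a hypothetical stand-in, not the paper's DistilBERT setup.

```python
def integrated_gradients(f, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    attribution_i = (x_i - baseline_i)
                    * integral_0..1 of df/dx_i(baseline + a*(x - baseline)) da
    Gradients are estimated by forward finite differences.
    """
    n = len(x)
    attributions = [0.0] * n
    eps = 1e-5
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th subinterval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            bumped = point[:]
            bumped[i] += eps
            grad_i = (f(bumped) - f(point)) / eps
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions

# Toy scalar "model" standing in for a classifier logit.
f = lambda v: v[0] ** 2 + 2.0 * v[1]
attr = integrated_gradients(f, x=[3.0, 1.0], baseline=[0.0, 0.0])
```

By construction the attributions sum to f(x) − f(baseline) (the completeness axiom), which is part of what makes gradient-path methods comparatively stable.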
[3] Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch
Eleanor M. Lin, David Jurgens
Main category: cs.CL
TL;DR: A data-efficient fine-tuning framework identifies beneficial code-switching behaviors in reasoning models and teaches models to code-switch more effectively for reasoning.
Details
Motivation: Prior work treats code-switching in reasoning models as an error to suppress, or controls it narrowly via prompts and decoding; whether it can be harnessed to improve reasoning is unexplored.
Method: Build and systematically analyze a dataset of reasoning traces from diverse models, languages, tasks, and domains, then develop fine-tuning interventions that teach models the helpful code-switching behaviors observed.
Result: The framework significantly increases beneficial code-switched reasoning in a data-efficient manner; fine-tuning on tasks that do not demonstrate code-switched reasoning (e.g., machine translation) also modifies code-switching behavior.
Conclusion: Data-efficient interventions can instill helpful forms of code-switching behavior in reasoning models.
Abstract: Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-switching as an undesirable error, attempted to control code-switching through modifications to input prompts or the output decoding process, or focused on narrow subsets of languages, domains, tasks, and models. We address these gaps by introducing the first linguistically and behaviorally motivated fine-tuning framework for identifying beneficial code-switched reasoning behaviors in large language models and teaching these models to code-switch more effectively for reasoning. First, we create and systematically analyze a dataset of reasoning traces from diverse models, languages, tasks, and domains to understand the types of code-switching behaviors found in existing reasoning models. Then, we develop fine-tuning interventions that teach reasoning models to code-switch based on our observations of helpful behaviors in existing models. We find that our framework can significantly increase beneficial code-switched reasoning behaviors in a data-efficient manner. Interestingly, we also find that code-switching behaviors in reasoning models can be modified by fine-tuning for tasks that do not directly demonstrate code-switching in reasoning (e.g., machine translation). Our work suggests that data-efficient interventions can instill helpful forms of code-switching behavior in reasoning models.
[4] Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
Jingnong Qu, Ashvin Ranjan, Shane Steinert-Threlkeld
Main category: cs.CL
TL;DR: Brain Score is high for LMs trained on any of many natural languages, and even on structured non-linguistic sequences, suggesting the metric tracks shared structure rather than human-like processing per se.
Details
Motivation: High Brain Score has been used to argue that LMs process language like humans; it is unclear what the metric is actually sensitive to.
Method: Train LMs on varied inputs (natural languages from many families, the human genome, Python, and nested parentheses) and evaluate each on Brain Score, i.e., predicting fMRI activations during reading from LM activations.
Result: Models trained on different natural languages score very similarly; models trained on other structured data also perform reasonably well, in some cases close to natural languages.
Conclusion: Brain Score highlights LMs' ability to extract structure common across languages, but a high score alone cannot establish human-like processing.
Abstract: Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models’ processing to human language processing? Results using a framework called Brain Score (BS) – predicting fMRI activations during reading from LM activations – have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data – the human genome, Python, and pure hierarchical structure (nested parentheses) – also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models’ ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.
[5] MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, Irwin King
Main category: cs.CL
TL;DR: MTR-DuplexBench evaluates full-duplex speech language models over multi-round conversations, covering conversational features, dialogue quality, instruction following, and safety.
Details
Motivation: Existing FD-SLM benchmarks evaluate only single-round interactions and focus on conversational features alone, neglecting multi-round complexities such as blurred turn boundaries and context inconsistency during inference.
Method: Segment continuous full-duplex dialogues into discrete turns for turn-by-turn assessment and incorporate multiple evaluation aspects.
Result: Current FD-SLMs struggle to maintain consistent performance across multiple rounds and evaluation dimensions.
Conclusion: Multi-round evaluation is both necessary and effective; code and data are released.
Abstract: Full-Duplex Speech Language Models (FD-SLMs) enable real-time, overlapping conversational interactions, offering a more dynamic user experience compared to traditional half-duplex models. However, existing benchmarks primarily focus on evaluating single-round interactions, neglecting the complexities of multi-round communication. Evaluating FD-SLMs in multi-round settings poses significant challenges, including blurred turn boundaries in communication and context inconsistency during model inference. Also, existing benchmarks often focus solely on evaluating conversational features, neglecting other critical aspects. To address these gaps, we introduce MTR-DuplexBench, a novel benchmark designed for a comprehensive multi-round evaluation of FD-SLMs. MTR-DuplexBench not only segments continuous full-duplex dialogues into discrete turns for turn-by-turn assessment but also incorporates various evaluation aspects, including conversational features, dialogue quality, instruction following, and safety. Experimental results reveal that current FD-SLMs face difficulties in maintaining consistent performance across multiple rounds and evaluation dimensions, highlighting the necessity and effectiveness of our benchmark. Code and data are available at: https://github.com/ZhangHe0918/MTR-DuplexBench
[6] PolicyBank: Evolving Policy Understanding for LLM Agents
Jihye Choi, Jinsung Yoon, Long T. Le, Somesh Jha, Tomas Pfister
Main category: cs.CL
TL;DR: PolicyBank lets LLM agents iteratively refine tool-level policy interpretations from pre-deployment corrective feedback, closing up to 82% of the gap toward a human oracle on policy-gap scenarios.
Details
Motivation: Natural-language authorization policies contain ambiguities and gaps that make agents "compliant but wrong"; existing memory mechanisms treat the policy as immutable ground truth and reinforce these errors.
Method: A memory mechanism that maintains structured, tool-level policy insights and refines them through interaction and feedback, plus a testbed extending a popular tool-calling benchmark with controlled policy gaps that isolate alignment failures from execution failures.
Result: Existing memory mechanisms achieve near-zero success on policy-gap scenarios; PolicyBank closes up to 82% of the gap toward a human oracle.
Conclusion: Agents can autonomously refine their policy understanding to close specification gaps.
Abstract: LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent’s behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre-deployment testing, can it autonomously refine its interpretation to close specification gaps? We propose PolicyBank, a memory mechanism that maintains structured, tool-level policy insights and iteratively refines them – unlike existing memory mechanisms that treat the policy as immutable ground truth, reinforcing “compliant but wrong” behaviors. We also contribute a systematic testbed by extending a popular tool-calling benchmark with controlled policy gaps that isolate alignment failures from execution failures. While existing memory mechanisms achieve near-zero success on policy-gap scenarios, PolicyBank closes up to 82% of the gap toward a human oracle.
[7] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez
Main category: cs.CL
TL;DR: MoshiRAG couples a compact full-duplex speech model with asynchronous selective retrieval, matching the factuality of the best non-duplex speech LMs while preserving real-time interactivity.
Details
Motivation: Full-duplex speech LMs lag in factuality, and scaling model size to close the gap would make real-time inference prohibitively expensive.
Method: An asynchronous framework in which the model identifies knowledge-demanding queries and grounds responses in external information, completing retrieval within the natural temporal gap between response onset and delivery of core information.
Result: Factuality comparable to the best publicly released non-duplex speech language models, plug-and-play retrieval without retraining, and strong performance on out-of-domain mathematical reasoning.
Conclusion: Modular retrieval lets compact full-duplex models be factual without sacrificing interactivity.
Abstract: Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
[8] Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
Sharookh Daruwalla, Nitin Mayande, Shreeya Verma Kathuria, Nitin Joglekar, Charles Weber
Main category: cs.CL
TL;DR: The SSAS framework pre-processes noisy review data into sentiment-dense prompts, improving LLM sentiment-prediction data quality by up to 30% and stabilizing predictions.
Details
Motivation: LLM stochasticity combined with noisy datasets makes sentiment predictions too volatile for strategic business decisions.
Method: Hierarchical classification (Themes, Stories, Clusters) and an iterative Summary-of-Summaries context computation enforce a bounded attention mechanism, producing high-signal, sentiment-dense prompts.
Result: Evaluated with Gemini 2.0 Flash Lite on Amazon Product, Google Business, and Goodreads reviews, SSAS improves data quality by up to 30% via noise removal and better sentiment estimation.
Conclusion: Consistent context estimation yields a stable, reliable evidence base for decision-making.
Abstract: The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs’ inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern datasets, renders sentiment predictions too volatile for strategic business decisions. To resolve this, we present a Syntactic & Semantic Context Assessment Summarization (SSAS) framework for establishing context. Context established by SSAS functions as a sophisticated data pre-processing framework that enforces a bounded attention mechanism on LLMs. It achieves this by applying a hierarchical classification structure (Themes, Stories, Clusters) and an iterative Summary-of-Summaries (SoS) based context computation architecture. This endows the raw text with high-signal, sentiment-dense prompts that effectively mitigate both irrelevant data and analytical variance. We empirically evaluated the efficacy of SSAS, using Gemini 2.0 Flash Lite, against a direct-LLM approach across three industry-standard datasets - Amazon Product Reviews, Google Business Reviews, Goodreads Book Reviews - and multiple robustness scenarios. Our results show that our SSAS framework is capable of significantly improving data quality, up to 30%, through a combination of noise removal and improvements in sentiment-prediction estimation. Ultimately, consistency in our context-estimation capabilities provides a stable and reliable evidence base for decision-making.
[9] Why Fine-Tuning Encourages Hallucinations and How to Fix It
Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner, Yuval Reif, Swabha Swayamdipta, Derek Hoiem, Roy Schwartz
Main category: cs.CL
TL;DR: SFT-induced hallucinations are driven mainly by interference among overlapping semantic representations; self-distillation-based SFT and selective parameter freezing mitigate them.
Details
Motivation: Supervised fine-tuning exposes models to new facts and increases hallucinations w.r.t. pre-trained knowledge; since these arise from knowledge degradation during training, continual-learning tools may help.
Method: A self-distillation-based SFT method that regularizes output-distribution drift; freezing parameter groups when new knowledge acquisition is unnecessary; and tests of three mechanistic hypotheses (capacity limitations, behavior cloning, localized interference).
Result: Self-distillation enables effective factual learning with fewer hallucinations; freezing preserves task performance while reducing hallucinations; interference among overlapping semantic representations emerges as the main driver.
Conclusion: Self-distillation succeeds by mitigating representation interference.
Abstract: Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
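The self-distillation idea, regularizing output-distribution drift against the pre-SFT model, can be pictured as a cross-entropy term plus a KL penalty toward a frozen teacher. The function below is a hypothetical single-token illustration of that kind of objective, not the paper's actual training loss.

```python
import math

def kl_regularized_loss(student_probs, teacher_probs, target_idx, lam=1.0):
    """SFT loss with a self-distillation regularizer (minimal sketch).

    Cross-entropy on the new-fact target token plus a KL penalty that keeps
    the student's output distribution close to the frozen pre-SFT teacher,
    limiting drift w.r.t. pre-existing knowledge.
    """
    ce = -math.log(student_probs[target_idx])
    kl = sum(t * math.log(t / s)
             for t, s in zip(teacher_probs, student_probs) if t > 0)
    return ce + lam * kl

# When student == teacher the KL term vanishes and the loss is plain CE.
loss = kl_regularized_loss([0.5, 0.5], [0.5, 0.5], target_idx=0)
```

Raising `lam` trades factual plasticity for retention of pre-existing knowledge, which is the drift/learning tension the paper studies.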
[10] “Excuse me, may I say something…” CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations
Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao, Xiaozhong Liu
Main category: cs.CL
TL;DR: CoLabScience is a proactive LLM assistant that decides when and how to intervene in streaming biomedical discussions, trained via the PULI framework and evaluated on the new BSDD benchmark.
Details
Motivation: Reactive LLMs respond only when prompted, limiting their effectiveness in collaborative settings that demand foresight and autonomous engagement.
Method: PULI (Positive-Unlabeled Learning-to-Intervene) trains intervention decisions with a reinforcement learning objective, leveraging the team's project proposal and long- and short-term conversational memory; BSDD provides simulated research dialogues with intervention points derived from PubMed articles.
Result: PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility.
Conclusion: Proactive LLMs show strong potential as intelligent scientific assistants.
Abstract: The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team’s project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.
[11] KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context
Nahyun Lee, Guijin Son, Hyunwoo Ko, Chanyoung Kim, JunYoung An, Kyubeen Han, Il-Youp Kwak
Main category: cs.CL
TL;DR: KMMMU is a native Korean multimodal benchmark of 3,466 exam questions across nine disciplines; the strongest open model reaches only 42.05% on the full set and the best proprietary model 52.42% on the hard subset.
Details
Motivation: Translated or English-centric benchmarks miss information-dense problems shaped by Korean conventions, official standards, and discipline-specific visual formats.
Method: Collect questions natively written in Korean across nine disciplines and nine visual modality categories, with a 300-item Korean-specific subset and a 627-question hard subset.
Result: Performance varies widely across disciplines, with gaps of up to 13.43% on Korean-specific questions; error analysis attributes failures less to reasoning depth than to weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding.
Conclusion: KMMMU provides a testbed beyond English-centric benchmarks for building reliable systems for expert real-world tasks.
Abstract: We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.
[12] LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance
Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Justin K. W. Yeoh
Main category: cs.CL
TL;DR: Perturbation-based attribution analysis shows full fine-tuning yields more focused attribution patterns than LoRA and QLoRA, and that larger models develop distinct interpretive strategies for automated code compliance.
Details
Motivation: Prior work on LLMs for automated code compliance focuses on performance, treating models as black boxes and overlooking how training decisions affect interpretive behavior.
Method: Compare attribution patterns across full fine-tuning (FFT), low-rank adaptation (LoRA), and quantized LoRA, and across model scales, using perturbation-based attribution analysis.
Result: FFT attributions are statistically different and more focused than those of parameter-efficient methods; larger models prioritize numerical constraints and rule identifiers, with semantic-similarity gains plateauing beyond 7B parameters.
Conclusion: The findings advance explainability and transparency for regulation-based tasks in the Architecture, Engineering, and Construction industry.
Abstract: Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.
[13] DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
Chao Li
Main category: cs.CL
TL;DR: DALM replaces unconstrained token generation with three-phase structured denoising over a domain lattice, resolving domain, then relation, then concept uncertainty under explicit algebraic constraints.
Details
Motivation: LLMs compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation.
Method: A three-phase encoder-decoder built from three ingredients: a domain lattice with computable meet, join, and implication; a typing function over relations controlling cross-domain inheritance; and a fiber partition localizing knowledge to domain-specific subsets. The framework is instantiated with the CDC knowledge representation system.
Result: Generation is confined to a domain fiber; cross-domain contamination is structurally prevented in closed-vocabulary mode and auditably bounded in open-vocabulary mode; a single query can yield a domain-indexed multi-perspective answer space.
Conclusion: Language generation is reframed as algebraically constrained structured denoising rather than unconstrained decoding over a flat token space.
Abstract: Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it first resolves domain uncertainty, then relation uncertainty, and finally concept uncertainty, so each stage operates under explicit algebraic constraints. The framework requires only three ingredients: a lattice of domains with computable meet, join, and implication; a typing function over relations that controls inheritance across domains; and a fiber partition that localizes knowledge to domain-specific subsets. Given these ingredients, DALM yields a three-phase encoder-decoder architecture in which generation is confined to a domain fiber, cross-domain contamination is structurally prevented in closed-vocabulary mode and auditably bounded in open-vocabulary mode, and a single query can produce a domain-indexed multi-perspective answer space. We instantiate the framework with the CDC knowledge representation system and outline training and evaluation on validated domain-annotated crystal libraries. DALM reframes language generation as algebraically constrained structured denoising rather than unconstrained decoding over a flat token space.
[14] LLMs Corrupt Your Documents When You Delegate
Philippe Laban, Tobias Schnabel, Jennifer Neville
Main category: cs.CL
TL;DR: DELEGATE-52 shows current LLMs silently corrupt documents during long delegated workflows: even frontier models degrade an average of 25% of document content across 52 professional domains.
Details
Motivation: Delegated work requires trusting that an LLM will faithfully execute tasks without introducing errors into documents, but this readiness is untested.
Method: Simulate long delegated workflows requiring in-depth document editing across 52 professional domains (e.g., coding, crystallography, music notation) and evaluate 19 LLMs, with and without agentic tool use.
Result: Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely; tool use does not help, and degradation worsens with document size, interaction length, and distractor files.
Conclusion: Current LLMs are unreliable delegates, introducing sparse but severe errors that compound over long interactions.
Abstract: Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
[15] BlasBench: An Open Benchmark for Irish Speech Recognition
Jyoutir Raj, John Conway
Main category: cs.CL
TL;DR: BlasBench is an open, Irish-aware ASR evaluation harness; all Whisper variants exceed 100% WER through insertion hallucination, while the best systems reach 22-39% WER.
Details
Motivation: Multilingual benchmarks include Irish without Irish-aware text normalisation, making reliable and reproducible ASR comparison impossible.
Method: A standalone Irish-aware normaliser preserving fadas, lenition, and eclipsis, plus a reproducible scoring harness with per-utterance predictions released for all runs; 12 systems across four architecture families are benchmarked on Common Voice ga-IE and FLEURS ga-IE.
Result: All Whisper variants exceed 100% WER via insertion-driven hallucination; Microsoft Azure reaches 22.2% WER on Common Voice and 57.5% on FLEURS; the best open model, Omnilingual ASR 7B, reaches 30.65% and 39.09%; Common Voice fine-tunes degrade 33-43 points on FLEURS versus 7-10 for massively multilingual models.
Conclusion: Single-dataset evaluation misses a substantial generalisation gap.
Abstract: Existing multilingual benchmarks include Irish among dozens of languages but apply no Irish-aware text normalisation, making reliable and reproducible ASR comparison impossible. We introduce BlasBench, an open evaluation harness that provides a standalone Irish-aware normaliser preserving fadas, lenition, and eclipsis, together with a reproducible scoring harness and per-utterance predictions released for all evaluated runs. We pilot this by benchmarking 12 systems across four architecture families on Common Voice ga-IE and FLEURS ga-IE. All Whisper variants exceed 100% WER through insertion-driven hallucination. Microsoft Azure reaches 22.2% WER on Common Voice and 57.5% on FLEURS; the best open model, Omnilingual ASR 7B, reaches 30.65% and 39.09% respectively. Models fine-tuned on Common Voice degrade 33-43 points moving to FLEURS, while massively multilingual models degrade only 7-10 - a generalisation gap that single-dataset evaluation misses.
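Since the headline numbers are word error rates, it may help to recall why insertion-driven hallucination can push WER above 100%: WER divides substitutions + insertions + deletions by the reference word count, so a hypothesis with many inserted words can accumulate more errors than there are reference words. A minimal sketch of the metric follows; the benchmark's Irish-aware normalisation (fadas, lenition, eclipsis) is not modelled here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A hypothesis that repeats hallucinated words, e.g. five hypothesis words against a two-word reference, yields three insertions and a WER of 150%.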
[16] GroupDPO: Memory efficient Group-wise Direct Preference Optimization
Jixuan Leng, Si Si, Hsiang-Fu Yu, Vinod Raman, Inderjit S. Dhillon
Main category: cs.CL
TL;DR: A memory-efficient group-wise DPO algorithm decouples samples during backpropagation while preserving gradients, enabling larger group sizes and consistently outperforming single-pair training.
Details
Motivation: Most preference optimization trains on a single positive-negative pair per prompt, discarding supervision from datasets with multiple candidate responses; group-wise objectives remain underexplored because of their memory overhead.
Method: A group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage and enabling scalable training with larger groups.
Result: Leveraging multiple responses consistently beats single-pair training in both offline and online alignment; an NLL term on positive responses is critical for performance gains and training stability.
Conclusion: Memory-efficient group-wise training makes multi-response preference optimization practical at scale.
Abstract: Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
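A group-wise objective of the kind described can be pictured as a pairwise DPO loss summed over every (preferred, dispreferred) pair in a ranked group of responses. The sketch below is illustrative only, using the standard DPO parameterization of implicit rewards; it does not model the paper's memory-efficient gradient decoupling.

```python
import math

def groupwise_dpo_loss(policy_logps, ref_logps, beta=0.1):
    """Group-wise DPO sketch: contrast every (preferred, dispreferred) pair.

    `policy_logps` / `ref_logps` are log-likelihoods of the group's candidate
    responses under the policy and reference models, ordered best-first.
    """
    # Implicit reward of each response under the DPO parameterization.
    margins = [beta * (p, r).__getitem__(0) - beta * r for p, r in zip(policy_logps, ref_logps)]
    margins = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    loss, pairs = 0.0, 0
    for i in range(len(margins)):
        for j in range(i + 1, len(margins)):  # response i is preferred over j
            diff = margins[i] - margins[j]
            loss += -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log sigmoid
            pairs += 1
    return loss / pairs

# With a tied group the loss is -log(0.5); rewarding the preferred response lowers it.
tied = groupwise_dpo_loss([-1.0, -1.0], [-1.0, -1.0])
separated = groupwise_dpo_loss([-1.0, -3.0], [-2.0, -2.0])
```

With a group size of two this reduces to standard pairwise DPO; larger groups add one term per response pair, which is exactly where the paper's memory savings matter.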
[17] Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies
Myke C. Cohen, Mingqian Zheng, Neel Bhandari, Hsien-Te Kao, Xuhui Zhou, Daniel Nguyen, Laura Cassani, Maarten Sap, Svitlana Volkova
Main category: cs.CL
TL;DR: Comparing 2,000 simulations against a 290-participant study of imperfectly cooperative human-AI scenarios, AI attributes, especially transparency, dominate outcomes with real users, diverging sharply from simulation.
Details
Motivation: The relative and joint impacts of AI design characteristics and human personality traits are underexplored in scenarios where people and AI have only partially aligned goals.
Method: Parallel simulated and human-subject experiments on hiring negotiations and information-concealing transactions, varying user Extraversion and Agreeableness alongside AI Adaptability, Expertise, and chain-of-thought Transparency, analyzed via causal discovery over scenario outcomes, communication, and questionnaires.
Result: Simulated and human datasets diverge: personality traits and AI attributes were comparably influential in simulation, but with actual human subjects AI attributes, particularly transparency, were far more impactful.
Conclusion: These divergences vary across interaction contexts, offering crucial insights for human-centered AI agents.
Abstract: AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes – particularly transparency – were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.
[18] Theory of Mind in Action: The Instruction Inference Task in Dynamic Human-Agent Collaboration
Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Main category: cs.CL
TL;DR: The Instruction Inference task tests Theory of Mind in collaborative agents; the Tomcat agent with few-shot chain-of-thought prompting on GPT-4o or DeepSeek-R1 performs comparably to human participants.
Details
Motivation: Agents must infer the unspoken intentions behind incomplete or ambiguous instructions, exercising Theory of Mind about their human principal.
Method: Introduce Instruction Inference, in which an agent helps a principal reach a goal despite ambiguous instructions; implement Tomcat in Fs-CoT (few-shot chain-of-thought) and CP (commonsense prompt) variants on GPT-4o, DeepSeek-R1, and Gemma-3-27B; compare against 52 human participants on intent accuracy, action optimality, and planning optimality.
Result: Tomcat with Fs-CoT, particularly on GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants.
Conclusion: LLM agents show real ToM potential for human-agent collaboration.
Abstract: Successful human-agent teaming relies on an agent being able to understand instructions given by a (human) principal. In many cases, an instruction may be incomplete or ambiguous. In such cases, the agent must infer the unspoken intentions from their shared context, that is, it must exercise the principal’s Theory of Mind (ToM) and infer the mental states of its principal. We consider the prospects of effective human-agent collaboration using large language models (LLMs). To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting incomplete or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal’s instructions. We implemented two variants of Tomcat. The first, dubbed Fs-CoT (Fs for few-shot, CoT for chain-of-thought), is based on a small number of examples demonstrating the requisite structured reasoning. The second, dubbed CP (commonsense prompt), relies on commonsense knowledge and information about the problem. We realized both variants of Tomcat on three leading LLMs, namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-agent collaboration.
[19] FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
Suparno Roy Chowdhury, Tejas Anvekar, Manan Roy Choudhury, Muhammad Ali Khan, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta
Main category: cs.CL
TL;DR: FD-NL2SQL is a feedback-driven clinical NL2SQL assistant for oncology trial databases that improves with use through clinician-approved edits and logic-based SQL augmentation.
Details
Motivation: Clinicians need ad-hoc, multi-constraint queries over oncology trial repositories (biomarkers, endpoints, interventions, time), but writing SQL requires schema expertise.
Method: A schema-aware LLM decomposes questions into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL with validity checks; the exemplar bank grows from approved clinician edits and single-atomic-mutation SQL variants paired with LLM-generated questions and decompositions.
Result: The demo interface exposes decomposition, retrieval, synthesis, and execution results, supporting interactive refinement and continuous improvement.
Conclusion: Feedback loops let clinical NL2SQL systems expand their exemplar bank and improve without additional annotation.
Abstract: Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
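The exemplar-retrieval step described above, finding expert-verified NL2SQL examples semantically similar to the incoming question, reduces to nearest-neighbour search over sentence embeddings. A minimal cosine-similarity sketch follows; the questions and two-dimensional vectors are toy stand-ins for real sentence-embedding output.

```python
import math

def top_k_exemplars(query_vec, exemplar_bank, k=2):
    """Return the k exemplar questions most similar to the query embedding."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # Rank exemplars by cosine similarity, highest first.
    scored = sorted(exemplar_bank.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [question for question, _ in scored[:k]]

# Hypothetical exemplar bank: question -> embedding vector.
bank = {
    "Which trials test EGFR inhibitors?": [1.0, 0.0],
    "How many patients enrolled per site?": [0.0, 1.0],
    "Which trials test KRAS inhibitors?": [0.9, 0.1],
}
nearest = top_k_exemplars([1.0, 0.05], bank, k=2)
```

In the described system the retrieved exemplars are then placed in the prompt alongside the schema and predicate decomposition before SQL synthesis.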
[20] CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
Ming-Bin Chen, Jey Han Lau, Lea Frermann
Main category: cs.CL
Abstract: Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize CIG, we model an evolving semantic memory of the discussion: the system extracts atomic claims from utterances and incrementally consolidates them into a structured memory state. Using this memory, we score each utterance along three interpretable dimensions: Novelty, Relevance, and Implication Scope. We annotate 80 segments from two moderated deliberative settings (TV debates and community discussions) with these dimensions and show that memory-derived dynamics (e.g., the number of claim updates) correlate more strongly with human-perceived CIG than traditional heuristics such as utterance length or TF–IDF. We develop effective LLM-based CIG predictors, paving the way for information-focused analysis of conversation quality and deliberative success in dialogues.
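At its simplest, the Novelty dimension can be read off the claim memory as a set difference. This is only a proxy: the paper consolidates claims with an LLM and also scores Relevance and Implication Scope. The claims below are invented.

```python
def novelty(new_claims, memory):
    """Novelty of an utterance: the fraction of its atomic claims
    not already present in the consolidated memory state."""
    if not new_claims:
        return 0.0
    return len(new_claims - memory) / len(new_claims)

memory = {"the tax is regressive", "turnout fell in 2020"}
utterance = {"the tax is regressive", "rural turnout rose"}
print(novelty(utterance, memory))  # one of two claims is new
memory |= utterance                # a claim-update event in the memory
```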
[21] HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
Yanbin Wei, Chun Kang, Siwei Li, Haoxuan Che, Yang Chen, Hua Liu, Jian Liu, Zhuang Liu, Can Ouyang, Fei Xing, Lei Sha, Rui Liu, Yu Zhang, James Kwok
Main category: cs.CL
Abstract: Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there remains a lack of a benchmark to delineate the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, in this paper, we introduce $\texttt{HyperGVL}$, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. $\texttt{HyperGVL}$ provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The hypergraphs involved span multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router $\texttt{WiseHyGR}$ that improves LVLMs on hypergraph tasks by learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.
[22] C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment
Pufan Zeng, Yilun Liu, Mingchen Dai, Mengyao Piao, Chunguang Zhao, Lingqi Miao, Shimin Tao, Weibin Meng, Minggui He, Chenxin Liu, Zhenzhen Qin, Li Zhang, Hongxia Ma, Boxing Chen, Daimeng Wei
Main category: cs.CL
Abstract: Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this “quantification gap” by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.
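The geometric signal at the heart of C-Mining, cross-lingual misalignment of a concept's embeddings, reduces in miniature to a cosine-distance check between a concept's vectors in two (aligned) embedding spaces. The vectors below are invented; the real framework additionally measures linguistic exclusivity and geometric isolation before accepting a Culture Point.

```python
import math

def misalignment(vec_a, vec_b):
    """1 - cosine similarity between a concept's embeddings in two
    languages; high values flag candidate culture-specific concepts."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(a * a for a in vec_a))
            * math.sqrt(sum(b * b for b in vec_b)))
    return 1 - dot / norm

shared = ([1.0, 0.0, 0.5], [0.9, 0.1, 0.5])    # well-aligned concept
cultural = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.2])  # culture-bound concept
print(misalignment(*shared) < misalignment(*cultural))  # -> True
```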
[23] Preference Estimation via Opponent Modeling in Multi-Agent Negotiation
Yuta Konishi, Kento Yamamoto, Eisuke Sonomoto, Rikuho Takeda, Ryo Furukawa, Yusuke Muraki, Takafumi Shimizu, Kazuma Fukumura, Yuya Kanemoto, Takayuki Ito, Shiyao Ding
Main category: cs.CL
Abstract: Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLMs) enable rich semantic understanding of utterances, it remains challenging to quantitatively incorporate such information into a consistent opponent model. To tackle this issue, we propose a novel preference estimation method integrating natural language information into a structured Bayesian opponent modeling framework. Our approach leverages LLMs to extract qualitative cues from utterances and converts them into probabilistic formats for dynamic belief tracking. Experimental results on a multi-party benchmark demonstrate that our framework improves the full agreement rate and preference estimation accuracy by integrating probabilistic reasoning with natural language understanding.
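The belief-tracking step can be illustrated with a plain discrete Bayes update, assuming the LLM has already converted an utterance into per-hypothesis likelihoods. The hypotheses and numbers below are invented for illustration, not taken from the paper.

```python
def update_belief(prior, likelihood):
    """One Bayesian update over opponent-preference hypotheses.
    likelihood[h] is P(observed cue | hypothesis h), e.g. an LLM's
    reading of an utterance like "price matters most to me"."""
    posterior = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

# Hypotheses: which issue the opponent values most.
belief = {"price": 1/3, "delivery": 1/3, "quality": 1/3}
cue = {"price": 0.8, "delivery": 0.1, "quality": 0.1}
belief = update_belief(belief, cue)
print(belief)  # posterior mass shifts toward "price"
```

Repeating the update over successive utterances gives the dynamic belief tracking the abstract describes.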
[24] Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu
Main category: cs.CL
Abstract: The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers’ dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher’s stepwise attention on key information to the student model. This establishes structured guidance for the student’s progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.
[25] The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
Jon-Paul Cacioli
Main category: cs.CL
Abstract: We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1-T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson-Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar-based evidence is the primary support). Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type-2 SDT approach, providing preliminary cross-method construct validity. All items, data, and code: https://github.com/synthiumjp/metacognitive-monitoring-battery.
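The critical metric, the withdraw delta, is straightforward to compute from per-item (correct, withdrew) outcomes; the toy records below are invented.

```python
def withdraw_delta(records):
    """Withdrawal rate on incorrect items minus withdrawal rate on
    correct items. records: list of (correct: bool, withdrew: bool)."""
    def rate(group):
        return sum(w for _, w in group) / len(group)
    incorrect = [r for r in records if not r[0]]
    correct = [r for r in records if r[0]]
    return rate(incorrect) - rate(correct)

# A selectively sensitive model withdraws mostly when it is wrong.
toy = [(True, False), (True, False), (True, True),
       (False, True), (False, True), (False, False)]
print(withdraw_delta(toy))  # positive: withdrawal tracks errors
```

A blanket-confidence profile would score near zero (it never withdraws), as would blanket withdrawal; only selective sensitivity yields a large positive delta.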
[26] Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou, Bingni Zhang, Weiguo Feng, Taifeng Wang, Cihang Xie, Fengze Liu
Main category: cs.CL
Abstract: Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLMs. Concretely, we quantify neuron impact and select the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling, and also outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis on why and how our NAG works, e.g., deactivating NAG-selected neurons (only 0.12% of all) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse “functional backbone” for learning target features. We release the code at https://github.com/asillycat/NAG.
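A stripped-down stand-in for NAG-based ranking: characterize each input by its top-k high-impact neuron indices and rank candidates by set overlap with the target. The real method builds a cross-layer graph with its own similarity; the impact scores below are invented.

```python
def top_neurons(impact, k):
    """Indices of the k highest-impact neurons (one score per neuron)."""
    return set(sorted(range(len(impact)), key=lambda i: -impact[i])[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

target = top_neurons([0.9, 0.1, 0.8, 0.05, 0.7], k=3)        # {0, 2, 4}
cands = {
    "doc_a": top_neurons([0.85, 0.2, 0.9, 0.1, 0.6], k=3),   # {0, 2, 4}
    "doc_b": top_neurons([0.1, 0.9, 0.05, 0.8, 0.2], k=3),   # {1, 3, 4}
}
ranked = sorted(cands, key=lambda d: jaccard(cands[d], target),
                reverse=True)
print(ranked)  # doc_a shares more high-impact neurons with the target
```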
[27] GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Jize Wang, Xuanxuan Liu, Yining Li, Songyang Zhang, Yijun Wang, Zifei Shan, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao
Main category: cs.CL
Abstract: The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.
[28] Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Anh Tuan Luu
Main category: cs.CL
Abstract: Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at https://github.com/ponhvoan/internal-variance.
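SIVR's core assumption, that uncertainty shows up as dispersion of a token's representation across layers, can be sketched without any model. The hidden states below are toy values; in the paper a supervised detector is trained on the full per-token feature sequence.

```python
from statistics import pvariance

def sivr_features(hidden):
    """Per-token dispersion across layers: for each token position,
    the variance of each hidden dimension across layers, averaged
    over dimensions. hidden is indexed [layer][token][dim]."""
    layers, tokens, dim = len(hidden), len(hidden[0]), len(hidden[0][0])
    feats = []
    for t in range(tokens):
        dims = [pvariance([hidden[l][t][d] for l in range(layers)])
                for d in range(dim)]
        feats.append(sum(dims) / dim)
    return feats

# Token 0 is identical across layers; token 1 disperses strongly.
hidden = [[[1.0, 2.0], [float(l), -float(l)]] for l in range(4)]
print(sivr_features(hidden))  # low then high dispersion
```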
[29] Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand
Sidney Wong
Main category: cs.CL
Abstract: This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.
[30] TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models
Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu, Jiaxin Zhuang, Zhiyong Gan, Ruixuan Wang
Main category: cs.CL
Abstract: Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress the noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects only reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at https://github.com/figec/TTL.
[31] Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han, Jingcheng Niu, Fan Yang
Main category: cs.CL
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose – leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills – query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases – to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.
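The two-part control flow, probe then route, can be sketched as follows. The logistic probe weights and threshold are invented; in the paper the prober is trained on hidden states and the router is a prompted LLM choosing among the four skills.

```python
import math

SKILLS = ["query_rewriting", "question_decomposition",
          "evidence_focusing", "exit"]

def failure_prob(hidden, w, b):
    """Lightweight probe: logistic regression on a hidden-state
    vector, returning the probability of a failure state."""
    z = sum(h * wi for h, wi in zip(hidden, w)) + b
    return 1 / (1 + math.exp(-z))

def route(hidden, w, b, pick_skill, threshold=0.5):
    """Gate: only when the probe flags a failure does the router
    choose a corrective skill; otherwise proceed to generation."""
    if failure_prob(hidden, w, b) < threshold:
        return "generate"
    return pick_skill(hidden)   # an LLM prompt in the paper

w, b = [1.0, -0.5, 0.25], -0.1
print(route([2.0, 0.0, 0.0], w, b, lambda h: SKILLS[0]))
```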
[32] MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents
Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren
Main category: cs.CL
Abstract: Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.
[33] PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection
Pritesh Jha
Main category: cs.CL
Abstract: We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and annotated financial-domain text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.
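In miniature, the normalization step maps source-specific labels onto canonical types and emits BIO tags. The label map below is a tiny invented subset of the 80+ variants; the real pipeline also applies frequency-based suppression before producing splits.

```python
# Illustrative subset of a source-label -> canonical-type map.
CANONICAL = {"EMAIL_ADDRESS": "EMAIL", "email": "EMAIL",
             "PERSON_NAME": "PERSON", "per": "PERSON"}

def to_bio(tokens, spans):
    """spans: (start_tok, end_tok_exclusive, source_label) triples.
    Emits BIO tags over tokens using the canonical label map;
    unmapped labels are dropped (standing in for suppression)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        canon = CANONICAL.get(label)
        if canon is None:
            continue
        tags[start] = f"B-{canon}"
        for i in range(start + 1, end):
            tags[i] = f"I-{canon}"
    return tags

tokens = ["Contact", "Jane", "Doe", "at", "jane@x.com"]
spans = [(1, 3, "PERSON_NAME"), (4, 5, "email")]
print(to_bio(tokens, spans))
```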
[34] A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang
Main category: cs.CL
Abstract: As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model’s information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs without the need for additional training.
[35] CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents
Hyunseok Park, Jihyeon Kim, Jongeun Kim, Dongsik Yoon
Main category: cs.CL
Abstract: Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.
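The metadata-prefixing idea can be sketched directly from the CNM-Extractor's three signature fields; the header format below is illustrative, not the paper's.

```python
def prefix_chunk(chunk, signature):
    """Prepend a compact per-chunk signature (category, key nouns,
    model names) so the retriever can tell similar documents apart."""
    header = (f"[category: {signature['category']} | "
              f"nouns: {', '.join(signature['key_nouns'])} | "
              f"models: {', '.join(signature['model_names'])}]")
    return header + "\n" + chunk

sig = {"category": "benchmark results",
       "key_nouns": ["accuracy", "GSM8K"],
       "model_names": ["Llama-3-8B"]}
print(prefix_chunk("The model scores 79.6 on GSM8K...", sig))
```

The Continuity Decision Module would then decide, per consecutive chunk pair, whether the same signature (and document flow) carries over.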
[36] CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu, Yong Wang, Xiangxiang Chu
Main category: cs.CL
Abstract: Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent’s evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.
[37] Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language
Dianqing Lin, Tian Lan, Jiali Zhu, Jiang Li, Wei Chen, Xu Liu, Aruukhan, Xiangdong Su, Hongxu Hou, Guanglai Gao
Main category: cs.CL
Abstract: While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on six NLP tasks involving Chouxiang Language. Experimental results show that current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. In addition, we further discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.
[38] Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
Tanja Baeumel, Josef van Genabith, Simon Ostermann
Main category: cs.CL
Abstract: Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.
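Early decoding (logit-lens style) projects each layer's hidden state through the unembedding matrix to see when the prediction forms across depth. The sketch below uses a 3-word vocabulary and invented 2-d states and weights; real models use the trained unembedding.

```python
def early_decode(layer_states, unembed, vocab):
    """Argmax token per layer after projecting the layer's hidden
    state through the unembedding rows (one row per vocab entry)."""
    preds = []
    for h in layer_states:
        logits = [sum(hi * wi for hi, wi in zip(h, row))
                  for row in unembed]
        preds.append(vocab[logits.index(max(logits))])
    return preds

vocab = ["7", "12", "operator"]
unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Invented states: the answer "12" dominates only at the last layer,
# mirroring the paper's finding that correct results emerge late.
states = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.9]]
print(early_decode(states, unembed, vocab))  # -> ['7', '7', '12']
```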
[39] CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
Junyi Li, Yongqiang Chen, Ningning Ding
Main category: cs.CL
TL;DR: CiPO reframes unlearning in Large Reasoning Models as targeted intervention on chain-of-thought reasoning, using iteratively updated counterfactual traces for preference optimization.
Details
Motivation: Existing unlearning methods either fail to remove knowledge from CoT traces or degrade reasoning performance by interfering with the reasoning process.
Method: Given an unlearning target, the LRM generates logically valid counterfactual reasoning traces for preference tuning, and the preference data is iteratively updated to increase divergence from the original model.
Result: On challenging benchmarks, CiPO completely removes knowledge from both intermediate CoT steps and final answers.
Conclusion: Iterative counterfactual preference optimization achieves thorough unlearning while preserving the reasoning abilities of LRMs.
Abstract: Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.
[40] DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition
Siun Kim, Hyung-Jin Yoon
Main category: cs.CL
TL;DR: DiZiNER simulates a pilot-annotation process in which multiple LLM annotators label shared texts and a supervisor refines task instructions from their disagreements, reaching zero-shot NER SOTA on 14 of 18 benchmarks.
Details
Motivation: Zero-shot NER with LLMs shows persistent, systematic errors reminiscent of the inconsistencies in early-stage human annotation.
Method: Heterogeneous LLMs annotate shared texts; a supervisor model analyzes inter-model disagreements to refine the task instructions.
Result: Zero-shot SOTA on 14 of 18 datasets (+8.0 F1 over prior bests, and over 11 points less gap to supervised systems), also outperforming its GPT-5 mini supervisor.
Conclusion: The gains stem from disagreement-guided instruction refinement rather than model capacity.
Abstract: Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.
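The pairwise-agreement signal the abstract correlates with NER performance can be illustrated as entity-level F1 between two models' span annotations. The span format `(start, end, label)` and the examples are assumptions for illustration, not the paper's exact formulation.

```python
# Entity-level micro F1 between two annotators' span sets, a common way
# to quantify inter-model agreement on shared texts.
def entity_f1(pred, gold):
    """Micro F1 between two sets of (start, end, label) entity spans."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

a = {(0, 5, "PER"), (10, 14, "ORG")}
b = {(0, 5, "PER"), (20, 25, "LOC")}
print(entity_f1(a, b))   # one shared span out of two on each side
```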
[41] How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
Judith Sieker, Sina Zarrieß
Main category: cs.CL
TL;DR: Many LLMs are substantially better at judging pragmatic appropriateness than at generating pragmatically appropriate language, revealing a listener-speaker asymmetry.
Details
Motivation: LLMs are evaluated both as generators and as judges of language, but the two roles are rarely compared directly.
Method: Multiple open-weight and proprietary LLMs are evaluated as pragmatic listeners (judging appropriateness of linguistic outputs) and as pragmatic speakers (generating appropriate language) across three pragmatic settings.
Result: A robust asymmetry: many models perform substantially better as listeners than as speakers.
Conclusion: Pragmatic judging and generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.
Abstract: Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs’ performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.
[42] MUSCAT: MUltilingual, SCientific ConversATion Benchmark
Supriti Sinhamahapatra, Thai-Binh Nguyen, Yiğit Oğuz, Enes Ugan, Jan Niehues, Alexander Waibel
Main category: cs.CL
TL;DR: MUSCAT benchmarks ASR on bilingual multi-speaker discussions of scientific papers, covering mixed multilingual input, specialized vocabulary, and code-switching.
Details
Motivation: No existing dataset benchmarks ASR on mixed multilingual input, specific vocabulary, and code-switching together.
Method: Bilingual discussions on scientific papers between speakers of different languages, with a standard evaluation framework beyond Word Error Rate for consistent cross-language comparison.
Result: State-of-the-art ASR systems still struggle on the dataset.
Conclusion: MUSCAT remains an open challenge for current ASR systems.
Abstract: The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval. Keywords: multilingual, speech recognition, audio segmentation, speaker diarization
[43] RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration
Fabian Ridder, Laurin Lessel, Malte Schilling
Main category: cs.CL
TL;DR: Integrating a lightweight hallucination-detection head into an LLM and training it jointly with language modeling reduces closed-domain hallucinations in RAG while achieving SOTA token-level detection.
Details
Motivation: RAG models still generate content unsupported by the retrieved context, and detection is usually treated as a post-hoc problem over frozen internal representations.
Method: RAGognize, a token-annotated dataset of naturally occurring closed-domain hallucinations, and RAGognizer, hallucination-aware fine-tuning that jointly optimizes language modeling and a detection head.
Result: State-of-the-art token-level hallucination detection and substantially lower hallucination rates during generation, without degrading language quality or relevance.
Conclusion: Internal-state hallucination detection can serve as a direct training signal rather than only a post-hoc probe.
Abstract: Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.
[44] SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification
Ke Xiong, Qian Wu, Wangjie Gan, Yuke Li, Xuhong Zhang
Main category: cs.CL
TL;DR: SCHK-HTC pairs hierarchical knowledge-aware prompt tuning with sibling contrastive learning to separate confusable sibling classes in few-shot hierarchical text classification.
Details
Motivation: Existing few-shot HTC methods enforce parent-child prediction consistency but struggle to distinguish semantically similar sibling classes.
Method: A hierarchical knowledge extraction module plus a sibling contrastive learning mechanism that encodes discriminative features at each hierarchy level.
Result: Superior performance on three benchmark datasets, surpassing existing state-of-the-art methods in most cases.
Conclusion: Sharpening sibling-class distinctions matters beyond merely enforcing hierarchical rules.
Abstract: Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck: the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model’s perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides the model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Our code is available at https://github.com/happywinder/SCHK-HTC.
[45] AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: Agentic Verifier turns reward modeling into a multi-turn, tool-augmented deliberation by forward and backward agents; trained with AgentV-RL, a 4B variant surpasses SOTA outcome reward models by 25.2%.
Details
Motivation: Verifiers for test-time scaling suffer from error propagation and, lacking external grounding, are unreliable on computation- or knowledge-intensive tasks.
Method: A forward agent traces solutions from premises to conclusions while a backward agent re-checks conclusions against their premises; AgentV-RL uses proactive exploration and reinforcement learning so the verifier interleaves tool use with internal reasoning.
Result: Consistent gains under both parallel and sequential test-time scaling; the 4B variant surpasses state-of-the-art ORMs by 25.2%.
Conclusion: Agentic verification is a promising paradigm for reward modeling.
Abstract: Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
[46] Where does output diversity collapse in post-training?
Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras
Main category: cs.CL
TL;DR: Tracing three post-training lineages of Olmo 3 shows that output-diversity collapse is driven by training-data composition, embedded in the model weights, and not fixable at inference time.
Details
Motivation: Prior work blames specific post-training methods for diversity collapse without separating data composition from method, or generation format from weights.
Method: Output diversity is traced through the Think, Instruct, and RL-Zero lineages of Olmo 3 across 15 tasks and four diversity metrics, including CoT suppression at inference and a decomposition of diversity loss into quality-control and residual components.
Result: Collapse location co-varies with data composition; suppressing CoT drops accuracy but leaves answer-level diversity unchanged; Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate.
Conclusion: Diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
Abstract: Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
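One common family of text-diversity metrics the kind of study above relies on can be sketched as distinct-n: the ratio of unique n-grams to total n-grams across a set of sampled outputs. This is an illustrative standard metric, not necessarily one of the four used in the paper.

```python
# distinct-n: unique n-grams / total n-grams over a sample set.
# Values near 1.0 indicate varied outputs; low values indicate collapse.
from collections import Counter

def distinct_n(samples, n=2):
    """Fraction of n-grams that are unique across all samples."""
    ngrams = Counter()
    for text in samples:
        toks = text.split()
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

varied = ["the cat sat", "a dog ran fast", "birds fly south"]
collapsed = ["the cat sat", "the cat sat", "the cat sat"]
print(distinct_n(varied), distinct_n(collapsed))  # 1.0 vs 0.333...
```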
[47] Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang
Main category: cs.CL
TL;DR: A systematic taxonomy of path pruning for parallel reasoning motivates STOP, a learnable internal pruning signal that outperforms baselines and lifts GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% at fixed compute.
Details
Motivation: Parallel reasoning wastes compute on futile paths caused by early errors, and existing pruning research lacks a standardized framework.
Method: A taxonomy by signal source (internal vs. external) and learnability motivates STOP (Super TOken for Pruning), a learnable internal method for prefix-level path pruning.
Result: Superior effectiveness and efficiency across LRMs from 1.5B to 20B parameters, with validated scalability across compute budgets.
Conclusion: Learnable internal pruning is effective; the findings are distilled into empirical deployment guidelines.
Abstract: Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP
[48] Stochasticity in Tokenisation Improves Robustness
Sophie Steger, Rui Li, Sofiane Ennadir, Anya Sims, Arno Solin, Franz Pernkopf, Martin Trapp
Main category: cs.CL
TL;DR: Pre-training and fine-tuning with uniformly sampled stochastic tokenisations make LLMs more robust to random and adversarial perturbations without extra inference cost.
Details
Motivation: Models trained on a single deterministic canonical tokenisation can be brittle to tokenisation-level adversarial attacks.
Method: A systematic study of stochastic tokenisation across pre-training, supervised fine-tuning, and in-context learning, over multiple datasets and model architectures.
Result: Stochastic-tokenisation training improves robustness; evaluating a canonically trained Llama-1b on non-canonical tokenisations reduces its accuracy by 29.8%.
Conclusion: Stochastic tokenisation preserves accuracy while improving robustness at no additional inference cost.
Abstract: The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerability to perturbations of the input tokenisation indicates that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.
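The idea of uniformly sampled stochastic tokenisation can be sketched with a toy subword vocabulary: instead of one canonical segmentation, sample uniformly among all valid segmentations of a word. The vocabulary and word below are made up for illustration and are not the paper's tokenizer.

```python
# Stochastic tokenisation sketch: enumerate every valid subword
# segmentation of a word and sample one uniformly at random.
import random

VOCAB = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}

def segmentations(word, vocab):
    """Enumerate all ways to split `word` into in-vocabulary subwords."""
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            out += [[piece] + rest for rest in segmentations(word[i:], vocab)]
    return out

def sample_tokenisation(word, vocab, rng):
    """Draw one segmentation uniformly at random."""
    return rng.choice(segmentations(word, vocab))

rng = random.Random(0)
print(sample_tokenisation("unbelievable", VOCAB, rng))
```

Training on such samples exposes the model to non-canonical tokenisations, which is the mechanism the paper credits for the robustness gains.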
[49] Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures
Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan
Main category: cs.CL
TL;DR: A systematic survey of intrinsic interpretability for LLMs, organized into five design paradigms from functional transparency to latent sparsity induction.
Details
Motivation: Existing explainable-AI surveys focus on post-hoc explanation, whereas intrinsic interpretability builds transparency directly into model architectures and computations.
Method: A systematic review categorizing approaches into functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction.
Result: A structured taxonomy of recent advances, together with open challenges.
Conclusion: Intrinsic interpretability is a promising emerging direction for trustworthy LLM deployment.
Abstract: While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.
[50] Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors
Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner, Michel Cukier, Joseph B. Richardson
Main category: cs.CL
TL;DR: Open-source LLMs can surface some important codes when inductively coding interviews with firearm-violence survivors, but relevance is low, preprocessing-sensitive, and guardrails erase narratives.
Details
Motivation: Manual thematic analysis of survivor interviews is labor-intensive, and it is unclear whether LLMs can code such narratives accurately and ethically.
Method: Open-source LLMs inductively code in-depth interviews with 21 Black men who survived community firearm violence.
Result: Some configurations identify important codes, but overall relevance remains low and highly sensitive to data processing, and guardrails cause substantial narrative erasure.
Conclusion: LLM-assisted qualitative coding shows potential but poses ethical challenges for research involving marginalized communities.
Abstract: Firearm violence is a pressing public health issue, yet research into survivors’ lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.
[51] Sentiment Analysis of German Sign Language Fairy Tales
Fabrizio Nunnari, Siddhant Jain, Patrick Gebhard
Main category: cs.CL
TL;DR: An explainable XGBoost model predicts three-level sentiment of German Sign Language fairy-tale segments from MediaPipe face and body features, showing the body matters as much as the face.
Details
Motivation: A dataset and model for sentiment analysis of German Sign Language (DGS) video are needed.
Method: Four LLMs with majority voting label German fairy-tale text segments for valence (Krippendorff's alpha 0.781); MediaPipe extracts face and body motion features from aligned DGS video segments; an XGBoost model predicts sentiment.
Result: Average balanced accuracy of 0.631, with hips, elbows, and shoulders contributing alongside eyebrows and mouth.
Conclusion: Face and body are equally important for sentiment communication in sign language.
Abstract: We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy tales text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 Krippendorff’s alpha. Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of the hips, elbows, and shoulders also contributes considerably to the discrimination of the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.
[52] On the Rejection Criterion for Proxy-based Test-time Alignment
Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
Main category: cs.CL
TL;DR: Implicit-reward and nudging test-time alignment both reduce to sampling from similar graphical models that differ only in their rejection criterion; a conservative confidence-bet criterion outperforms both.
Details
Motivation: Proxy-based test-time alignment methods differ in ways not formally understood, and the confidence criterion is ill-motivated under linguistic phenomena like ambiguous phrasing.
Method: A unifying graphical-model view of the implicit reward and nudging approaches, plus a novel rejection criterion based on a conservative confidence bet.
Result: The new criterion outperforms previous work on several datasets.
Conclusion: The rejection criterion is the key design choice in proxy-based test-time alignment.
Abstract: Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
[53] AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency
Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu
Main category: cs.CL
TL;DR: A differentiable additive attention mask identifies CoT tokens crucial to correct answers, yielding a saliency reward that trains LLMs toward more faithful reasoning within GRPO.
Details
Motivation: Ensuring that a reasoning trace genuinely drives, rather than merely accompanies, a model's final answer remains challenging.
Method: An additive attention mask trained to find influential CoT tokens provides a saliency reward, combined with outcome-based rewards in the GRPO framework.
Result: On GSM8K and MMLU with Llama-3.2-3B-Instruct, the approach identifies influential reasoning tokens.
Conclusion: Attention-saliency rewards enable training more transparent reasoning models.
Abstract: Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model’s final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
[54] Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations
Yanli Wang, Peng Kuang, Xiaoyu Han, Kaidi Xu, Haohan Wang
Main category: cs.CL
TL;DR: Layer-Wise Information (LI) scores computed from internal representations serve as conformal nonconformity scores for LLM question answering, improving the validity-efficiency trade-off, especially under domain shift.
Details
Motivation: Output-level uncertainty signals become brittle under calibration-deployment mismatch, and the usefulness of conformal prediction depends on the quality of the nonconformity score.
Method: LI scores measure how conditioning on the input reshapes predictive entropy across model depth and are used within a standard split conformal pipeline.
Result: A better validity-efficiency trade-off than strong text-level baselines on closed-ended and open-domain QA, with the clearest gains under cross-domain shift.
Conclusion: Internal representations yield more informative conformal scores when surface-level uncertainty is unstable under distribution shift.
Abstract: Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration–deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity–efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribution shift.
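The split conformal pipeline that the paper plugs its LI scores into can be sketched generically: calibrate a threshold on held-out nonconformity scores, then keep every candidate answer below it. The scores here are synthetic stand-ins; the paper's layer-wise scores would replace them.

```python
# Split conformal prediction sketch: calibrate a finite-sample-valid
# quantile on held-out nonconformity scores, then form prediction sets.
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1                       # target miscoverage rate

# one nonconformity score per held-out (input, true answer) pair
cal_scores = rng.normal(size=1000)

# conformal quantile: the ceil((n+1)(1-alpha))-th smallest score
n = len(cal_scores)
q = np.sort(cal_scores)[int(np.ceil((n + 1) * (1 - alpha))) - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose score is below the threshold."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]

test_scores = rng.normal(size=20)
print("threshold:", q, "set:", prediction_set(test_scores, q))
```

Under exchangeability this construction guarantees that the true answer lands in the set with probability at least 1 - alpha, regardless of which nonconformity score is used.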
[55] Optimizing Korean-Centric LLMs via Token Pruning
Hoyeol Kim, Hyeonwoo Kim
Main category: cs.CL
TL;DR: Pruning vocabulary tokens irrelevant to Korean-centric applications from multilingual LLMs improves generation stability and often machine translation while sharply reducing memory use.
Details
Motivation: Multilingual LLMs carry large vocabularies irrelevant to a target language, wasting embedding parameters and causing language confusion.
Method: Qwen3, Gemma-3, Llama-3, and Aya are benchmarked under Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh) vocabularies on general aptitude, cultural literacy, instruction following, and machine translation.
Result: Token pruning eliminates language confusion, frequently improves Korean translation, and shows architecture-dependent effects on instruction following.
Conclusion: Token pruning is a highly effective optimization for memory-constrained, domain-specific deployments, despite modest inference-latency gains.
Abstract: This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in inference latency.
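The mechanics of token pruning can be sketched simply: keep only the embedding rows for a retained token-id set and remap the ids. The sizes and retained ids below are toy stand-ins, not the actual vocabularies used in the paper.

```python
# Token-pruning sketch: slice the embedding table down to a retained
# token-id set and build an old-id -> new-id remapping.
import numpy as np

rng = np.random.default_rng(0)
full_vocab_size, d_model = 10_000, 64
embeddings = rng.normal(size=(full_vocab_size, d_model))

# ids actually used by the target languages (toy example)
kept_ids = np.array(sorted({0, 1, 2, 17, 256, 9001}))

pruned = embeddings[kept_ids]                 # smaller embedding table
old_to_new = {int(old): new for new, old in enumerate(kept_ids)}

print("embedding params before:", embeddings.size, "after:", pruned.size)
assert np.allclose(pruned[old_to_new[256]], embeddings[256])
```

In a real model the same slicing would be applied to the tied or separate output projection, and the tokenizer's vocabulary would be remapped consistently.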
[56] BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron, David Robinson, Emmanuel Chemla, Sara Keen, Gagan Narula, Mathieu Laurière, Matthieu Geist, Olivier Pietquin
Main category: cs.CL
TL;DR: BAGEL is a closed-book benchmark of animal knowledge spanning taxonomy, morphology, habitat, behavior, vocalization, distribution, and species interactions, built from bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia.
Details
Motivation: It is unclear how well LLMs handle specialized animal-related knowledge under a unified closed-book evaluation protocol.
Method: Curated and automatically generated closed-book question-answer pairs from diverse scientific and reference sources, with fine-grained analysis across source domains, taxonomic groups, and knowledge categories.
Result: A testbed that measures animal knowledge without external retrieval and characterizes model strengths and systematic failure modes.
Conclusion: BAGEL supports studying domain-specific knowledge generalization and improving reliability in biodiversity-related applications.
Abstract: Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
[57] SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe
Main category: cs.CL
TL;DR: For SemEval-2026 Task 5, commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human plausibility judgments of word senses in narratives, with ensembling adding a small gain.
Details
Motivation: Benchmarks suggest LLMs disambiguate well, but their applicability to word senses in real-world narrative contexts is underexplored.
Method: An LLM-based plausibility-scoring framework with structured reasoning, comparing fine-tuned low-parameter LLMs using diverse reasoning strategies against dynamic few-shot prompting for large-parameter models.
Result: Dynamic few-shot prompting best replicates human-like judgments; ensembling better simulates five-annotator agreement than single-model predictions.
Conclusion: Dynamic few-shot prompting with ensembling is an effective recipe for narrative sense-plausibility estimation.
Abstract: Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
[58] From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
Van-Truong Le
Main category: cs.CL
TL;DR: A dual-aspect evaluation of four LLMs on Vietnamese legal text reveals a readability-accuracy trade-off and identifies controlled legal reasoning, not summarization, as the core challenge.
Details
Motivation: The complexity of Vietnamese legal texts hinders public access to justice, and surface-level metrics cannot assess LLM simplification capabilities.
Method: GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1 are benchmarked on Accuracy, Readability, and Consistency, with a large-scale error analysis of 60 complex legal articles using an expert-validated error typology.
Result: Grok-1 excels at Readability and Consistency but loses fine-grained legal Accuracy; Claude 3 Opus's high Accuracy masks subtle reasoning errors; Incorrect Example and Misinterpretation are the most prevalent failures.
Conclusion: The primary challenge for current LLMs in legal applications is controlled, accurate legal reasoning rather than summarization.
Abstract: The complexity of Vietnam’s legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the “why” behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints “Incorrect Example” and “Misinterpretation” as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.
[59] No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus
Hitesh Mehta, Arjit Saxena, Garima Chhikara, Rohit Kumar
Main category: cs.CL
TL;DR: Across 22,500 prompt-response pairs in English, Hindi, and Spanish over five models, prompt politeness measurably shifts response quality (up to roughly 11%), but the effects are language- and model-dependent rather than universal.
Details
Motivation: How LLM response quality varies with prompt (im)politeness, grounded in Brown and Levinson's Politeness Theory and Culpeper's Impoliteness Framework, has not been quantified.
Method: Five models, three languages, three interaction histories, and five politeness levels are evaluated on an eight-factor framework covering coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability.
Result: Polite prompts raise average quality by up to roughly 11% and impolite tones lower it, but the best tone differs by language; Llama is the most tone-sensitive (11.5% range) and GPT the most robust.
Conclusion: Politeness is a quantifiable computational variable; the human-validated PLUM corpus of 1,500 prompts is released for reproducibility.
Abstract: This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensitive (11.5% range), whereas GPT is more robust to adversarial tone. These results indicate that politeness is a quantifiable computational variable that affects LLM behaviour, though its impact is language- and model-dependent rather than universal. To support reproducibility and future work, we additionally release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts across three languages and five politeness categories, and provide a formal supplementary analysis of six falsifiable hypotheses derived from politeness theory, empirically assessed against the dataset.
[60] LaMSUM: Amplifying Voices Against Harassment through LLM Guided Extractive Summarization of User Incident Reports
Garima Chhikara, Anurag Sharma, V. Gurucharan, Kripabandhu Ghosh, Abhijnan Chakraborty
Main category: cs.CL
Details
Abstract: Citizen reporting platforms help the public and authorities stay informed about sexual harassment incidents. However, the high volume of data shared on these platforms makes reviewing each individual case challenging. Therefore, a summarization algorithm capable of processing and understanding various code-mixed languages is essential. In recent years, Large Language Models (LLMs) have shown exceptional performance in NLP tasks, including summarization. LLMs inherently produce abstractive summaries by paraphrasing the original text, while the generation of extractive summaries - selecting specific subsets from the original text - through LLMs remains largely unexplored. Moreover, LLMs have a limited context window size, restricting the amount of data that can be processed at once. We tackle these challenges by introducing LaMSUM, a novel multi-level framework combining summarization with different voting methods to generate extractive summaries for large collections of incident reports using LLMs. Extensive evaluation using four popular LLMs (Llama, Mistral, Claude and GPT-4o) demonstrates that LaMSUM outperforms state-of-the-art extractive summarization methods. Overall, this work represents one of the first attempts to achieve extractive summarization through LLMs, and is likely to support stakeholders by offering a comprehensive overview and enabling them to develop effective policies to minimize incidents of unwarranted harassment.
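The abstract leaves LaMSUM's voting methods unspecified; the simplest instance is a plurality vote over repeated LLM selections of sentence indices. The sketch below is a generic illustration under that assumption, not the paper's exact multi-level scheme:

```python
from collections import Counter

def vote_extractive(candidate_selections, k):
    """Aggregate several candidate extractive summaries (lists of sentence
    indices, e.g. from repeated LLM calls) by plurality voting, keeping the
    k most frequently selected sentences."""
    tally = Counter(i for sel in candidate_selections for i in sel)
    # Break ties by lower sentence index for determinism.
    ranked = sorted(tally, key=lambda i: (-tally[i], i))
    return sorted(ranked[:k])

runs = [[0, 2, 5], [2, 5, 7], [1, 2, 5]]
print(vote_extractive(runs, k=2))  # [2, 5]
```

Because each run only returns indices into the original reports, the aggregate remains strictly extractive, which is the property the paper is after.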
[61] WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
Taiwei Shi, Zhuoer Wang, Longqi Yang, Ying-Chun Lin, Zexue He, Mengting Wan, Pei Zhou, Sujay Jauhar, Sihao Chen, Shan Xia, Hongfei Zhang, Jieyu Zhao, Xiaofeng Xu, Xia Song, Jennifer Neville
Main category: cs.CL
Details
Abstract: As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, misalignment with real-world user preferences, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages in-situ user feedback during conversations with LLMs to create preference datasets automatically. Given a corpus of multi-turn user-LLM conversations, WildFeedback identifies and classifies user feedback to LLM responses between conversation turns. The user feedback is then used to create examples of preferred and dispreferred responses according to users’ preferences. Our experiments demonstrate that LLMs fine-tuned on the WildFeedback dataset exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed checklist-guided evaluation. By incorporating in-situ feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users.
[62] Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research
Tianyang Zhong, Zhenyuan Yang, Zhengliang Liu, Ruidong Zhang, Weihang You, Yiheng Liu, Haiyang Sun, Yi Pan, Yiwei Li, Yifan Zhou, Hanqi Jiang, Junhao Chen, Xiang Li, Tianming Liu
Main category: cs.CL
Details
Abstract: Low-resource languages serve as invaluable repositories of human history, embodying cultural evolution and intellectual diversity. Despite their significance, these languages face critical challenges, including data scarcity and technological limitations, which hinder their comprehensive study and preservation. Recent advancements in large language models (LLMs) offer transformative opportunities for addressing these challenges, enabling innovative methodologies in linguistic, historical, and cultural research. This study systematically evaluates the applications of LLMs in low-resource language research, encompassing linguistic variation, historical documentation, cultural expressions, and literary analysis. By analyzing technical frameworks, current methodologies, and ethical considerations, this paper identifies key challenges such as data accessibility, model adaptability, and cultural sensitivity. Given the cultural, historical, and linguistic richness inherent in low-resource languages, this work emphasizes interdisciplinary collaboration and the development of customized models as promising avenues for advancing research in this domain. By underscoring the potential of integrating artificial intelligence with the humanities to preserve and study humanity’s linguistic and cultural heritage, this study fosters global efforts towards safeguarding intellectual diversity.
[63] Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models
Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz
Main category: cs.CL
Details
Abstract: Text-to-image generation models suffer from alignment problems, where generated images fail to accurately capture the objects and relations in the text prompt. Prior work has focused on improving alignment by refining the diffusion process, ignoring the role of the text encoder, which guides the diffusion. In this work, we investigate how semantic information is distributed across token representations in text-to-image prompts, analyzing it at two levels: (1) in-item representation-whether individual tokens represent their lexical item (i.e., a word or expression conveying a single concept), and (2) cross-item interaction-whether information flows between tokens of different lexical items. We use patching techniques to uncover encoding patterns, and find that information is usually concentrated in only one or two of the item's tokens; for example, in the item "San Francisco's Golden Gate Bridge", the token "Gate" sufficiently captures the entire expression while the other tokens could effectively be discarded. Lexical items also tend to remain isolated; for instance, in the prompt "a green dog", the token "dog" encodes no visual information about "green". However, in some cases, items do influence each other's representation, often leading to misinterpretations - e.g., in the prompt "a pool by a table", the token "pool" represents a "pool table" after contextualization. Our findings highlight the critical role of token-level encoding in image generation, and demonstrate that simple interventions at the encoding stage can substantially improve alignment and generation quality.
[64] Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation
Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati
Main category: cs.CL
Details
Abstract: Recent advances in reasoning-focused Large Language Models (LLMs) have introduced Chain-of-Thought (CoT) traces - intermediate reasoning steps generated before a final answer. These traces, as in DeepSeek R1, guide inference and train smaller models. A common but under-examined assumption is that these traces are both semantically correct and interpretable to end-users. While intermediate reasoning steps are believed to improve accuracy, we question whether they are actually valid and understandable. To isolate the effect of trace semantics, we design experiments in Question Answering (QA) using rule-based problem decomposition, creating fine-tuning datasets where each problem is paired with either verifiably correct or incorrect traces, while always providing the correct final answer. Trace correctness is evaluated by checking the accuracy of every reasoning sub-step. To assess interpretability, we fine-tune LLMs on three additional trace types: R1 traces, R1 trace summaries, and post-hoc explanations, and conduct a human study with 100 participants rating each type on a Likert scale. We find: (1) Trace correctness does not reliably predict correct final answers - correct traces led to correct solutions in only 28% of test cases, while incorrect traces did not consistently degrade accuracy. (2) Fine-tuning on verbose R1 traces yielded the best model performance, but users rated them least interpretable (3.39 interpretability, 4.59 cognitive load on a 5-point scale), whereas more interpretable decomposed traces did not achieve comparable accuracy. Together, these findings challenge the assumption in question, suggesting that researchers and practitioners should decouple model supervision objectives from end-user-facing trace design.
[65] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Main category: cs.CL
Details
Abstract: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model’s internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.
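One way to read "distinct UQ techniques conditioned on whether a statement is faithful" is as a law-of-total-probability combination of two factuality estimates. The sketch below is our illustrative reading of that idea, not FRANQ's actual estimator; the function name and probabilities are invented:

```python
def franq_style_score(p_faithful, p_fact_if_faithful, p_fact_if_unfaithful):
    """Total-probability combination: condition the factuality estimate on
    whether the claim is faithful to the retrieved context, so a claim that
    is unsupported by retrieval but well known to the model is not
    automatically flagged as a hallucination."""
    return (p_faithful * p_fact_if_faithful
            + (1 - p_faithful) * p_fact_if_unfaithful)

# A claim weakly supported by retrieval but confidently known to the model:
print(franq_style_score(0.3, 0.95, 0.7))  # 0.775
```

The point of the decomposition is exactly the abstract's critique: factuality and faithfulness get separate estimators instead of being conflated.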
[66] TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
Xiaorui Wu, Xiaofeng Mao, Fei Li, Xin Zhang, Xuanhong Li, Chong Teng, Donghong Ji, Zhuang Li
Main category: cs.CL
Details
Abstract: Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
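Of TRIDENT's three coverage dimensions, lexical diversity is the easiest to make concrete: distinct-n (unique n-grams over total n-grams) is one standard way to score it. A minimal sketch, with the caveat that the paper may use a different measure and these example prompts are invented:

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across a set of
    texts; a common proxy for lexical diversity of a prompt collection."""
    total, unique = 0, set()
    for t in texts:
        toks = t.lower().split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

prompts = ["how to pick a lock", "how to pick a safe", "ways to bypass a lock"]
print(round(distinct_n(prompts, n=2), 3))  # 0.667
```

A score near 1.0 means almost no repeated phrasing across the dataset; near 0.0 means the prompts are lexically redundant, the failure mode the abstract attributes to existing alignment sets.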
[67] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
Morris Alper, Moran Yanuka, Raja Giryes, Gašper Beguš
Main category: cs.CL
Details
Abstract: Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages – phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs’ metalinguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We construct a novel, scalable evaluation framework for this task, evaluating metrics measuring consistency and typological diversity. Automatic and manual evaluations demonstrate ConlangCrafter’s ability to produce coherent and varied conlangs without human linguistic expertise.
[68] Is this chart lying to me? Automating the detection of misleading visualizations
Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Main category: cs.CL
Details
Abstract: Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also create Misviz-synth, a synthetic dataset of 57,665 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
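To give a flavor of the rule-based baselines: one canonical misleader, a truncated y-axis on a bar chart, can be flagged with a one-line check on extracted axis metadata. The 5% threshold and function name here are invented for illustration, not taken from Misviz:

```python
def flags_truncated_y_axis(y_axis_start, values):
    """A bar chart whose y-axis starts well above zero exaggerates the
    relative differences between bars; flag when the axis origin exceeds
    5% of the largest plotted value (threshold is an arbitrary example)."""
    return y_axis_start > 0.05 * max(values)

print(flags_truncated_y_axis(50, [52, 55, 58]))  # True: axis starts at 50
print(flags_truncated_y_axis(0, [52, 55, 58]))   # False: axis starts at zero
```

Each of the 12 misleader types would get its own such predicate over the chart's recovered axes and data; the hard part, per the abstract, is recovering that metadata from a pixel image in the first place.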
[69] Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
Jun Seo Kim, Hyemi Kim, Woo Joo Oh, Hongjin Cho, Hochul Lee, Hye Hyeon Kim
Main category: cs.CL
Details
Abstract: Cognitive distortions have been closely linked to mental health disorders, yet their automatic detection remains challenging due to contextual ambiguity, co-occurrence, and semantic overlap. We propose a novel framework that combines Large Language Models (LLMs) with a Multiple-Instance Learning (MIL) architecture to enhance interpretability and expression-level reasoning. Each utterance is decomposed into Emotion, Logic, and Behavior (ELB) components, which are processed by LLMs to infer multiple distortion instances, each with a predicted type, expression, and model-assigned salience score. These instances are integrated via a Multi-View Gated Attention mechanism for final classification. Experiments on Korean (KoACD) and English (Therapist QA) datasets demonstrate that incorporating ELB and LLM-inferred salience scores improves classification performance, especially for distortions with high interpretive ambiguity. Our results suggest a psychologically grounded and generalizable approach for fine-grained reasoning in mental health NLP. The dataset and implementation details are publicly accessible.
[70] OjaKV: Context-Aware Online Low-Rank KV Cache Compression
Yuxuan Zhu, David H. Yang, Mohammad Mohammadi Amiri, Keerthiram Murugesan, Tejaswini Pedapati, Pin-Yu Chen
Main category: cs.CL
Details
Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model’s weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja’s algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
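Oja's rule, the online-PCA step at the heart of OjaKV, is easy to illustrate in a toy 2-D setting: each incoming vector nudges the basis toward the stream's dominant direction. This is a didactic sketch, not the paper's KV-cache implementation; the learning rate and synthetic data are made up:

```python
import random

def oja_update(w, x, lr=0.01):
    """One step of Oja's rule: w <- w + lr * y * (x - y * w), where y = w.x.
    Explicit renormalization keeps w a unit vector."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    norm = sum(wi * wi for wi in w) ** 0.5
    return [wi / norm for wi in w]

random.seed(0)
# Stream of 2-D "key" vectors whose variance is dominated by the first axis.
w = [2 ** -0.5, 2 ** -0.5]  # start halfway between the two axes
for _ in range(2000):
    x = [random.gauss(0, 3.0), random.gauss(0, 0.3)]
    w = oja_update(w, x)

print(abs(w[0]))  # close to 1.0: w has aligned with the dominant direction
```

In OjaKV this adaptation runs over the keys and values themselves, so the low-rank projection tracks distribution shifts in the context instead of being frozen offline.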
[71] Collaboration of Fusion and Independence: Hypercomplex-driven Robust Multi-Modal Knowledge Graph Completion
Zhiqiang Liu, Yichi Zhang, Mengshu Sun, Lei Liang, Wen Zhang
Main category: cs.CL
Details
Abstract: Multi-modal knowledge graph completion (MMKGC) aims to discover missing facts in multi-modal knowledge graphs (MMKGs) by leveraging both structural relationships and diverse modality information of entities. Existing MMKGC methods follow two multi-modal paradigms: fusion-based and ensemble-based. Fusion-based methods employ fixed fusion strategies, which inevitably leads to the loss of modality-specific information and a lack of flexibility to adapt to varying modality relevance across contexts. In contrast, ensemble-based methods retain modality independence through dedicated sub-models but struggle to capture the nuanced, context-dependent semantic interplay between modalities. To overcome these dual limitations, we propose a novel MMKGC method M-Hyper, which achieves the coexistence and collaboration of fused and independent modality representations. Our method integrates the strengths of both paradigms, enabling effective cross-modal interactions while maintaining modality-specific information. Inspired by "quaternion" algebra, we utilize its four orthogonal bases to represent multiple independent modalities and employ the Hamilton product to efficiently model pair-wise interactions among them. Specifically, we introduce a Fine-grained Entity Representation Factorization (FERF) module and a Robust Relation-aware Modality Fusion (R2MF) module to obtain robust representations for three independent modalities and one fused modality. The resulting four modality representations are then mapped to the four orthogonal bases of a biquaternion (a hypercomplex extension of quaternion) for comprehensive modality interaction. Extensive experiments indicate its state-of-the-art performance, robustness, and computational efficiency.
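The Hamilton product the abstract relies on is ordinary quaternion multiplication; a minimal sketch of the operation itself (not M-Hyper's model, where the four components would be learned modality embeddings):

```python
def hamilton(q, p):
    """Hamilton product of quaternions q = (w, x, y, z) and p = (w, x, y, z).
    Every output component mixes all four input components, which is why it
    is used to model pairwise interactions among the four bases."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )

i = (0, 1, 0, 0)
j = (0, 0, 1, 0)
print(hamilton(i, j))  # (0, 0, 0, 1): i * j = k
print(hamilton(j, i))  # (0, 0, 0, -1): j * i = -k, so the product is non-commutative
```

The non-commutativity shown above is what lets the product encode directed, asymmetric interactions between modality pairs.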
[72] RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs’ Contextual Sensitivity
Jisu Shin, Hoyun Song, Juhyun Oh, Changgeon Ko, Eunsu Kim, Chani Jung, Alice Oh
Main category: cs.CL
Details
Abstract: People often encounter role conflicts – social dilemmas where the expectations of multiple roles clash and cannot be simultaneously fulfilled. As large language models (LLMs) increasingly navigate these social dynamics, a critical research question emerges. When faced with such dilemmas, do LLMs prioritize dynamic contextual cues or the learned preferences? To address this, we introduce RoleConflictBench, a novel benchmark designed to measure the contextual sensitivity of LLMs in role conflict scenarios. To enable objective evaluation within this subjective domain, we employ situational urgency as a constraint for decision-making. We construct the dataset through a three-stage pipeline that generates over 13,000 realistic scenarios across 65 roles in five social domains by systematically varying the urgency of competing situations. This controlled setup enables us to quantitatively measure contextual sensitivity, determining whether model decisions align with the situational contexts or are overridden by the learned role preferences. Our analysis of 10 LLMs reveals that models substantially deviate from this objective baseline. Instead of responding to dynamic contextual cues, their decisions are predominantly governed by the preferences toward specific social roles.
[73] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards
Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Pinjia He
Main category: cs.CL
Details
Abstract: In this paper, we observe that current models are susceptible to reward hacking, leading to a substantial overestimation of a model’s reasoning ability. This is evidenced by a high incidence of false positives-solutions that reach the correct answer through an unsound process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps-abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest that these Miracle Steps are linked to answer-recall shortcuts, including memorization from pretraining, where the model accesses the correct answer independently of its reasoning chain. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The RRM explicitly penalizes logical flaws and encourages rigorous deduction. When integrated into an RL pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building accurate and reliable models.
[74] Do LLMs Really Know What They Don’t Know? Internal States Mainly Reflect Knowledge Recall Rather Than Truthfulness
Chi Seng Cheang, Hou Pong Chan, Wenxuan Zhang, Yang Deng
Main category: cs.CL
Details
Abstract: Recent work suggests that LLMs “know what they don’t know”, positing that hallucinated and factually correct outputs arise from distinct internal processes and can therefore be distinguished using internal signals. However, hallucinations have multifaceted causes: beyond simple knowledge gaps, they can emerge from training incentives that encourage models to exploit statistical shortcuts or spurious associations learned during pretraining. In this paper, we argue that when LLMs rely on such learned associations to produce hallucinations, their internal processes are mechanistically similar to those of factual recall, as both stem from strong statistical correlations encoded in the model’s parameters. To verify this, we propose a novel taxonomy categorizing hallucinations into Unassociated Hallucinations (UHs), where outputs lack parametric grounding, and Associated Hallucinations (AHs), which are driven by spurious associations. Through mechanistic analysis, we compare their computational processes and hidden-state geometries with factually correct outputs. Our results show that hidden states primarily reflect whether the model is recalling parametric knowledge rather than the truthfulness of the output itself. Consequently, AHs exhibit hidden-state geometries that largely overlap with factual outputs, rendering standard detection methods ineffective. In contrast, UHs exhibit distinctive, clustered representations that facilitate reliable detection.
[75] A Linguistics-Aware LLM Watermarking via Syntactic Predictability
Shinwoo Park, Hyejin Park, Hyeseon An, Yo-Sub Han
Main category: cs.CL
Details
Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages-analytic English, isolating Chinese, and agglutinative Korean-we show that STELA surpasses prior methods in detection robustness. Our code is available at https://github.com/Shinwoo-Park/stela_watermark.
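The POS-n-gram signal STELA builds on can be sketched as follows: estimate the conditional entropy of the next POS tag from bigram counts, then scale the watermark signal up where entropy is high (many grammatical continuations) and down where grammar leaves little choice. The tag stream and the scaling rule in the closing comment are illustrative assumptions, not the paper's exact formulation:

```python
import math
from collections import Counter

def pos_bigram_entropy(pos_tags):
    """Per-context Shannon entropy H(next POS | current POS), estimated from
    a tagged corpus. Higher entropy means more linguistic freedom at that
    context, hence more room for a watermark bias."""
    pair = Counter(zip(pos_tags, pos_tags[1:]))
    ctx = Counter(pos_tags[:-1])
    ent = {}
    for c in ctx:
        probs = [pair[(a, n)] / ctx[c] for (a, n) in pair if a == c]
        ent[c] = -sum(p * math.log2(p) for p in probs)
    return ent

tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADJ", "NOUN", "VERB", "DET", "ADJ"]
ent = pos_bigram_entropy(tags)
# After DET, both NOUN and ADJ occur -> nonzero entropy; after VERB only DET
# occurs -> zero entropy. A STELA-style rule would then scale the watermark
# bias, e.g. delta = base_delta * ent[tag] / max(ent.values()).
print(round(ent["DET"], 4))
print(ent["VERB"])
```

Because the detector only needs these corpus-level POS statistics, not the generating model's logits, the scheme stays publicly verifiable, which is the abstract's central claim.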
[76] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows
Ye Yuan, Mohammad Amin Shabani, Siqi Liu
Main category: cs.CL
Details
Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization. Our code is available at https://github.com/BorealisAI/FACTS.
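The offline-template idea (pair an SQL query with a rendering template, reuse it across tables sharing a schema, and show the LLM only the schema) can be sketched with stdlib pieces: sqlite3 for execution and string.Template standing in for Jinja2. The query, template, and table here are invented examples, not FACTS artifacts:

```python
import sqlite3
from string import Template

# Offline artifact pair generated once per (schema, query) and then reused.
# Only the schema ever needs to reach an LLM; the rows stay local.
SQL = ("SELECT region, SUM(sales) AS total FROM sales "
       "GROUP BY region ORDER BY total DESC LIMIT 1")
TEMPLATE = Template("Top region by sales: $region with $total units.")

def summarize(conn):
    """Render the offline template against whatever same-schema table the
    connection holds; executable SQL keeps the numbers accurate."""
    region, total = conn.execute(SQL).fetchone()
    return TEMPLATE.substitute(region=region, total=total)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120), ("south", 80), ("north", 30)])
print(summarize(conn))  # Top region by sales: north with 150 units.
```

Swapping in a different `sales` table with the same schema reuses both artifacts unchanged, which is where the speed and privacy claims come from.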
[77] Wisdom is Knowing What not to Say: Hallucination-Free LLMs Unlearning via Attention Shifting
Chenchen Tan, Youyang Qu, Xinghao Li, Hui Zhang, Shujie Cui, Cunjian Chen, Longxiang Gao
Main category: cs.CL
Details
Abstract: The increase in computing power and the necessity of AI-assisted decision-making boost the growing application of large language models (LLMs). Along with this, the potential retention of sensitive data of LLMs has spurred increasing research into machine unlearning. However, existing unlearning approaches face a critical dilemma: Aggressive unlearning compromises model utility, while conservative strategies preserve utility but risk hallucinated responses. This significantly limits LLMs’ reliability in knowledge-intensive applications. To address this, we introduce a novel Attention-Shifting (AS) framework for selective unlearning. AS is driven by two design objectives: (1) context-preserving suppression that attenuates attention to fact-bearing tokens without disrupting LLMs’ linguistic structure; and (2) hallucination-resistant response shaping that discourages fabricated completions when queried about unlearning content. AS realizes these objectives through two attention-level interventions, which are importance-aware suppression applied to the unlearning set to reduce reliance on memorized knowledge and attention-guided retention enhancement that reinforces attention toward semantically essential tokens in the retained dataset to mitigate unintended degradation. These two components are jointly optimized via a dual-loss objective, which forms a soft boundary that localizes unlearning while preserving unrelated knowledge under representation superposition. Experimental results show that AS improves performance preservation over the state-of-the-art unlearning methods, achieving up to 15% higher accuracy on the ToFU benchmark and 10% on the TDEC benchmark, while maintaining competitive hallucination-free unlearning effectiveness. Compared to existing methods, AS demonstrates a superior balance between unlearning effectiveness, generalization, and response reliability.
[78] IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
Jieyong Kim, Maryam Amirizaniani, Soojin Yoon, Dongha Lee
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2510.23536 was rate-limited (HTTP 429)
[79] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Hunzalah Hassan Bhatti, Firoj Alam
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2510.24328 was rate-limited (HTTP 429)
[80] Revisiting the Uniform Information Density Hypothesis in LLM Reasoning
Minju Gwak, Guijin Son, Jaehyung Kim
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2510.06953 was rate-limited (HTTP 429)
[81] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation
Renfei Dang, Peng Hu, Zhejian Lai, Changjiang Gao, Min Zhang, Shujian Huang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2511.02626 was rate-limited (HTTP 429)
[82] EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2510.13220 was rate-limited (HTTP 429)
[83] Reading Between the Lines: The One-Sided Conversation Problem
Victoria Ebert, Rishabh Singh, Tuochao Chen, Noah A. Smith, Shyamnath Gollakota
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2511.03056 was rate-limited (HTTP 429)
[84] TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG
Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2512.07515 was rate-limited (HTTP 429)
[85] VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Nguyen Tien Dong, Minh-Anh Nguyen, Thanh Dat Hoang, Nguyen Tuan Ngoc, Dao Xuan Quang Minh, Phan Phi Hai, Nguyen Thi Ngoc Anh, Dang Van Tu, Binh Vu
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2512.14554 was rate-limited (HTTP 429)
[86] Losses that Cook: Topological Optimal Transport for Structured Recipe Generation
Mattia Ottoborgo, Daniele Rege Cambrin, Paolo Garza
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.02531 was rate-limited (HTTP 429)
[87] Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Yihong Liu, Raoyuan Zhao, Hinrich Schütze, Michael A. Hedderich
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.02996 was rate-limited (HTTP 429)
[88] RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.03699 was rate-limited (HTTP 429)
[89] Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.15143 was rate-limited (HTTP 429)
[90] Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
Jakob Schuster, Vagrant Gautam, Katja Markert
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.03746 was rate-limited (HTTP 429)
[91] SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space
Sen Fang, Yalin Feng, Chunyu Sui, Hongbin Zhong, Yanxin Zhang, Hongwei Yi, Hezhen Hu, Dimitris N. Metaxas
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2504.16315 was rate-limited (HTTP 429)
[92] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation
Dongqi Liu, Hang Ding, Qiming Feng, Xurong Xie, Zhucun Xue, Chengjie Wang, Jian Li, Jiangning Zhang, Yabiao Wang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.04377 was rate-limited (HTTP 429)
[93] EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.05808 was rate-limited (HTTP 429)
[94] CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning
Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.05858 was rate-limited (HTTP 429)
[95] HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang, Jun Gao, Shuai Huang, Yueping Kang, Yuanli Gou, Hongwei Feng, Yanghua Xiao
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.10198 was rate-limited (HTTP 429)
[96] Reward Modeling for Scientific Writing Evaluation
Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.11374 was rate-limited (HTTP 429)
[97] COMPOSITE-Stem
Kyle Waters, Lucas Nuzzi, Tadhg Looram, Alessandro Tomasiello, Ariel Ghislain Kemogne Kamdoum, Bikun Li, Damien Sileo, Egor Kretov, Francesco Fournier-Facio, Georgios Soloupis, Haile Kassahun, Hew Wolff, Jiaqi Cai, Lianghui Li, Marc Roth, Mohinder Naiya, Naixu Guo, Qicheng Tang, Richard Wheeler, Samuele Sala, Serguei Popov, Steven Dillmann, Yuqi Li
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.09836 was rate-limited (HTTP 429)
[98] FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.01566 was rate-limited (HTTP 429)
[99] ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning
Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang, Yao Chen, Weichong Yin, Yu Sun, Hua Wu, Tingwen Liu
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.09953 was rate-limited (HTTP 429)
[100] Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia, Aye Hninn Khine, Frederikus Hudi, David Anugraha, Alham Fikri Aji, Romrawin Chumpu, Viet-Thanh Pham, Minghan Wang, Mohamed Fazli Imam, Ruochen Zhang, Joseph Marvin Imperial, Khumaisa Nur’aini, Do Xuan Long, Musa Izzanardi Wijanarko, Joel Ruben Antony Moniz, Patrick Amadeus Irawan, Hanif Muhammad Zhafran, Isaiah Flores, Salsabila Zahirah Pranida, Jun Kevin, Jostin Jerico Rosal, Patricia Nicole Monderin, Kun Kerdthaisong, Ahmad Mustafid, My Chiffon Nguyen, Natchapon Jongwiriyanurak, Siva Worajitwannakul, Haochen Li, Adrian Xuan Wei Lim, Bin Wang, Muhammad Ravi Shulthan Habibi, Lynnette Hui Xian Ng, Mithil Bangera, Yeshil Bangera, Priyaranjan Pattnayak, Dun Li Chan, Sherissa Caren Djuniwar, Cho Chan Myei Oo, Hee Ming Shan
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.11490 was rate-limited (HTTP 429)
[101] Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review
Qian Ruan, Iryna Gurevych
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.11173 was rate-limited (HTTP 429)
[102] Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil
Sukumar Kishanthan, Kumar Thushalika, Buddhi Jayasekara, Asela Hevapathige
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.14517 was rate-limited (HTTP 429)
[103] MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam, Lin Li, Jianing Qiu
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.21950 was rate-limited (HTTP 429)
[104] Automatic Combination of Sample Selection Strategies for Few-Shot Learning
Branislav Pecher, Ivan Srba, Maria Bielikova, Joaquin Vanschoren
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2402.03038 was rate-limited (HTTP 429)
[105] Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.03332 was rate-limited (HTTP 429)
[106] ConFu: Contemplate the Future for Better Speculative Sampling
Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.08899 was rate-limited (HTTP 429)
[107] Deep Learning Based Amharic Chatbot for FAQs in Universities
Goitom Ybrah Hailu, Hadush Hailu, Shishay Welay
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2402.01720 was rate-limited (HTTP 429)
[108] Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation
Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.13683 was rate-limited (HTTP 429)
[109] FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users
Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2502.19312 was rate-limited (HTTP 429)
[110] CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
Roy Uziel, Omer Belhasin, Itay Levy, Akhiad Bercovich, Ran El-Yaniv, Ran Zilberstein, Michael Elad
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.20210 was rate-limited (HTTP 429)
[111] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2601.05201 was rate-limited (HTTP 429)
[112] Measuring the Semantic Structure and Evolution of Conspiracy Theories
Manisha Keim, Sarmad Chandio, Osama Khalid, Rishab Nithyanand
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.26062 was rate-limited (HTTP 429)
[113] Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue
Junan Hu, Shudan Guo, Wenqi Liu, Jianhua Yin, Yinwei Wei
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.05552 was rate-limited (HTTP 429)
[114] STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu, Chao Gao
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.07100 was rate-limited (HTTP 429)
[115] ChemAmp: Amplified Chemistry Tools via Composable Agents
Zhucong Li, Powei Chang, Jin Xiao, Zhijian Zhou, Qianyu He, Jiaqing Liang, Fenglei Cao, Xu Yinghui, Yuan Qi
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2505.21569 was rate-limited (HTTP 429)
[116] When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
Ruotao Xu, Yixin Ji, Yu Luo, Jinpeng Li, Dong Li, Peifeng Li, Juntao Li, Min Zhang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.08281 was rate-limited (HTTP 429)
[117] Evaluating Memory Capability in Continuous Lifelog Scenario
Jianjie Zheng, Zhichen Liu, Zhanyu Shen, Jingxiang Qu, Guanhua Chen, Yile Wang, Yang Xu, Yang Liu, Sijie Cheng
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.11182 was rate-limited (HTTP 429)
[118] A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.11582 was rate-limited (HTTP 429)
[119] Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2506.08125 was rate-limited (HTTP 429)
[120] Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Wael Hafez, Amir Nazeri
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.13061 was rate-limited (HTTP 429)
[121] Beyond Static Personas: Situational Personality Steering for Large Language Models
Zesheng Wei, Mengxiang Li, Zilei Wang, Yang Deng
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.13846 was rate-limited (HTTP 429)
[122] Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Bryan Sanchez
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.14174 was rate-limited (HTTP 429)
[123] CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling
Karthik Singaravadivelan, Anant Gupta, Zekun Wang, Christopher J. MacLellan
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.14489 was rate-limited (HTTP 429)
[124] Designing Synthetic Discussion Generation Systems: A Case Study for Online Facilitation
Dimitris Tsirmpas, Ion Androutsopoulos, John Pavlopoulos
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2503.16505 was rate-limited (HTTP 429)
[125] Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Ai Jian, Kejiang Chen, Xing Hu
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2510.10959 was rate-limited (HTTP 429)
[126] Power to the Clients: Federated Learning in a Dictatorship Setting
Mohammadsajad Alipour, Mohammad Mohammadi Amiri
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2510.22149 was rate-limited (HTTP 429)
[127] OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.11698 was rate-limited (HTTP 429)
[128] AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows
Ramtin Babaeipour, François Charest, Madison Wright
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2602.00052 was rate-limited (HTTP 429)
[129] Spectral Tempering for Embedding Compression in Dense Passage Retrieval
Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2603.19339 was rate-limited (HTTP 429)
[130] Olmo Hybrid: From Theory to Practice and Back
William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal
Main category: cs.CL
TL;DR: unavailable (summary processing failed)
Abstract: unavailable; the arXiv API request for 2604.03444 was rate-limited (HTTP 429)
cs.CV
[131] Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
Keon Kim, Krish Chelikavada
Main category: cs.CV
TL;DR: unavailable (summary processing failed)
Abstract: Multi-step zoom-in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model’s step-2 prediction and the crop center. Unlike log-probabilities or token-level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step-1 spatial error under idealized conditions (perfect step-2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = -0.14, p < 10^{-6} for KV-Ground-8B; rho = -0.11, p = 0.0003 for Qwen3.5-27B). The correlation is small but consistent across models, application categories, and operating systems. As a proof-of-concept, we use zoom consistency to route between a specialist and generalist model, capturing 16.5% of the oracle headroom between them (+0.8%, McNemar p = 0.19). Code is available at https://github.com/omxyz/zoom-consistency-routing.
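The zoom-consistency signal and the routing proof-of-concept described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the Euclidean distance, and the fixed threshold are choices made here, not details taken from the paper.

```python
import math

def zoom_consistency(step2_pred, crop_center):
    """Geometric confidence signal: distance between the step-2 prediction and
    the center of the crop produced by step 1. A small distance means the
    zoomed-in model agrees with the step-1 localization."""
    dx = step2_pred[0] - crop_center[0]
    dy = step2_pred[1] - crop_center[1]
    return math.hypot(dx, dy)

def route(step2_pred, crop_center, threshold, specialist_out, generalist_out):
    """Proof-of-concept router: keep the specialist's answer when its zoom
    consistency is high (distance below threshold), else fall back to the
    generalist. Threshold and direction are illustrative assumptions."""
    if zoom_consistency(step2_pred, crop_center) <= threshold:
        return specialist_out
    return generalist_out
```

Because both quantities live in a shared pixel coordinate space, the same routine applies unchanged to architecturally different VLMs, which is the calibration-free property the abstract highlights.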
[132] Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
Baiang Li, Wenhao Chai, Felix Heide
Main category: cs.CV
Abstract: Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead investigate distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns it off once the student reaches and surpasses teacher-level performance. For ImageNet and CIFAR classification, this strategy reaches target thresholds much earlier, with up to 4.8 times speedup measured by epochs. We confirm that the method generalizes to other tasks and report 1.7 times epoch speedup for object detection on the COCO dataset, and 2.5 times earlier target-FID crossing for diffusion generation on the CIFAR-10 dataset, measured in steps. These findings validate our method as a universal speedup mechanism for visual learning.
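The recipe described here (freeze a weak teacher, distill only in early training, switch off once the student surpasses the teacher) reduces to a simple loss schedule. A minimal sketch with a hypothetical `total_loss` helper and an assumed surpass-the-teacher criterion; the paper's exact cutoff rule may differ:

```python
def total_loss(task_loss, distill_loss, student_acc, teacher_acc,
               epoch, warmup_epochs, alpha=0.5):
    """Combine the task loss with distillation from a frozen weak
    teacher, but only during early training and only while the
    student has not yet surpassed the teacher."""
    use_kd = epoch < warmup_epochs and student_acc <= teacher_acc
    return task_loss + alpha * distill_loss if use_kd else task_loss

# Early on, the weak teacher still contributes to the objective;
# once the student overtakes it, training falls back to the task loss.
early = total_loss(1.0, 0.4, student_acc=0.3, teacher_acc=0.6,
                   epoch=2, warmup_epochs=10)
late = total_loss(1.0, 0.4, student_acc=0.7, teacher_acc=0.6,
                  epoch=2, warmup_epochs=10)
print(early, late)
```

The key design choice is that distillation here is a training accelerant, not a compression tool, so the teacher's ceiling never caps the student's final accuracy.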
[133] (1D) Ordered Tokens Enable Efficient Test-Time Search
Zhitong Gao, Parham Rezaei, Ali Cy, Mingqiao Ye, Nataša Jovanović, Jesse Allardice, Afshin Dehghan, Amir Zamir, Roman Bachmann, Oğuzhan Fatih Kar
Main category: cs.CV
Abstract: Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
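The simplest of the search algorithms studied, best-of-N, draws several candidate generations and keeps the one the verifier prefers. A toy sketch with stand-in generator and verifier functions (not the paper's models):

```python
def best_of_n(generate, verify, n):
    """Generic best-of-N test-time search: draw n candidate
    generations and return the one the verifier scores highest."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=verify)

# Toy stand-ins: "generations" are integers and the verifier prefers
# values close to a target of 7.
target = 7
pick = best_of_n(generate=lambda i: i * 2,          # 0, 2, 4, 6, 8
                 verify=lambda x: -abs(x - target),
                 n=5)
print(pick)  # 6 (ties broken by first occurrence in max)
```

The abstract's argument is that this loop only steers generation well when the verifier can score *intermediate* states reliably, which is why coarse-to-fine 1D token orders, whose prefixes are already semantically meaningful, benefit more than raster-order 2D grids.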
[134] SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
Keisuke Gomi, Keiji Yanai
Main category: cs.CV
Abstract: Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.
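The prompt-template idea, serializing a recipe's title, ingredients, and cooking instructions into one text input for the unified encoder, can be illustrated as follows. The template wording here is hypothetical and not taken from the paper:

```python
def recipe_prompt(title, ingredients, instructions):
    """Illustrative prompt template for embedding a structured recipe
    with a single multimodal encoder. Serializing the three recipe
    components into one string lets the MLLM embed complete or
    partial recipes with the same interface."""
    parts = ["Title: " + title,
             "Ingredients: " + "; ".join(ingredients),
             "Instructions: " + " ".join(instructions)]
    return "\n".join(parts)

print(recipe_prompt("Miso Soup",
                    ["miso paste", "tofu", "dashi"],
                    ["Heat dashi.", "Whisk in miso.", "Add tofu."]))
```

The component-aware augmentation in the abstract would then amount to sometimes dropping one of the three parts during training so the embedding stays robust to incomplete recipes.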
[135] Frequency-Aware Flow Matching for High-Quality Image Generation
Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Main category: cs.CV
Abstract: Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch’s output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at https://github.com/OliverRensu/FreqFlow.
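The low/high-frequency decomposition that motivates FreqFlow can be illustrated with a hard radial mask in the Fourier domain. This is an illustrative stand-in: the paper's frequency branch is learned, not a fixed filter, and the cutoff below is arbitrary:

```python
import numpy as np

def split_frequencies(img, cutoff=0.25):
    """Split a 2-D array into low- and high-frequency components
    using a hard radial mask in the (shifted) Fourier domain.
    The high-frequency part is defined as the residual, so the
    split is exactly complementary by construction."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    mask = radius <= cutoff                     # keep only low frequencies
    low = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    high = img - low
    return low, high

img = np.random.default_rng(0).standard_normal((32, 32))
low, high = split_frequencies(img)
print(np.allclose(low + high, img))  # True: the split is complementary
```

In the paper's framing, the `low` component corresponds to the global structure generated early in the reverse process, and `high` to the fine details that emerge later.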
[136] UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation
Kyle Lucke, Zuzanna Krajewska-Travar, Shoukun Sun, Lu Cai, John D. Stempien, Min Xian
Main category: cs.CV
Abstract: Tristructural isotropic (TRISO)-coated particle fuels undergo dimensional changes and chemical reactions during high-temperature neutron irradiation. Post-irradiation materialography helps understand processes that impact fuel performance, such as coating integrity and fission product retention. Conventionally, experts manually evaluate features in thousands of cross sections of sub-mm-sized samples, which is tedious and subjective. In this work, we propose UA-Net, a deep learning framework that segments five characteristic regions of TRISO fuel micrographs and generates an uncertainty map for predictions. The model uses a multi-stage pretraining strategy, starting with general image representations learned from ImageNet, followed by fine-tuning on TRISO micrographs from various irradiation experiments and AGR-5/6/7 particle cross sections. A meta-model for uncertainty prediction is integrated to identify small defects in TRISO images. UA-Net was evaluated on a test set of 102 images, achieving mean Intersection over Union (mIoU) and mean Precision (mP) of 95.5% and 97.3%, respectively. The meta-model achieved a specificity of 91.8% and sensitivity of 93.5%, demonstrating strong performance in detecting misclassifications. The model was also applied to new TRISO images for qualitative evaluation, showing high accuracy in extracting layer regions.
[137] CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification
Hexin Dong, Yi Lin, Pengyu Zhou, Fengnian Zhao, Alan Clint Legasto, Juno Cho, Dohui Kim, Justin Namuk Kim, Mingeon Kim, Sunwoo Kwak, Gabriel Moyà-Alcover, Ky Trung Nguyen, Thanh-Huy Nguyen, Ha-Hieu Pham, Huy-Hieu Pham, Huy Le Pham, Nikhileswara Rao Sulake, Aina Tur-Serrano, Ruichi Zhang, Ang Zu, Adam E. Flanders, Zhiyong Lu, Ronald M. Summers, Mingquan Lin, Hao Chen, Yuzhe Yang, George Shih, Yifan Peng
Main category: cs.CV
Abstract: Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper summarizes the overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.
[138] CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder
Duy-Phuong Dao, Muhammad Taqiyuddin, Jahae Kim, Sang-Heon Lee, Hye-Won Jung, Jaehoo Choi, Hyung-Jeong Yang
Main category: cs.CV
Abstract: Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high quality brain magnetic resonance imaging scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB, Controllable Longitudinal brain Image generation via state space based latent diffusion model, an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages Mamba, a state space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.
[139] Breakout-picker: Reducing false positives in deep learning-based borehole breakout characterization from acoustic image logs
Guangyu Wang, Xiaodong Ma, Xinming Wu
Main category: cs.CV
Abstract: Borehole breakouts are stress-induced spalling on the borehole wall, which are identifiable in acoustic image logs as paired zones with near-symmetry azimuths, low acoustic amplitudes, and increased borehole radius. Accurate breakout characterization is crucial for in-situ stress analysis. In recent years, deep learning has been introduced to automate the time-consuming and labor-intensive breakout picking process. However, existing approaches often suffer from misclassification of non-breakout features, leading to high false positive rates. To address this limitation, this study develops a deep learning framework, termed Breakout-picker, with a specific focus on reducing false positives in automatic breakout characterization. Breakout-picker reduces false positives through two strategies. First, the training of Breakout-picker incorporates negative samples of non-breakout features, including natural fractures, keyseats, and logging artifacts. They share similar characteristics with breakouts, such as low acoustic amplitude or locally enlarged borehole radius. These negative training samples enable Breakout-picker to better discriminate true breakouts from similar non-breakout features. Second, candidate breakouts identified by Breakout-picker are further validated by azimuthal symmetry criteria, whereby detections that do not exhibit the near-symmetry characteristics of breakout azimuth are excluded. The performance of Breakout-picker is evaluated using three acoustic image log datasets from different regions. The results demonstrate that Breakout-picker outperforms other automatic methods with higher accuracy and substantially lower false positive rates. By reducing false positives, Breakout-picker enhances the reliability of automatic breakout characterization from acoustic image logs, which in turn benefits in-situ stress analysis based on borehole breakouts.
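The azimuthal symmetry criterion (breakout pairs should sit roughly 180 degrees apart on the borehole wall) reduces to a circular-distance check. A minimal sketch; the tolerance value is illustrative, not taken from the paper:

```python
def is_symmetric_pair(az1, az2, tol=15.0):
    """Accept a candidate breakout pair only if the azimuthal
    separation of the two zones is within `tol` degrees of 180.
    Azimuths are in degrees and wrap around the borehole."""
    sep = abs(az1 - az2) % 360.0
    sep = min(sep, 360.0 - sep)       # shortest circular separation, [0, 180]
    return abs(sep - 180.0) <= tol

print(is_symmetric_pair(10.0, 192.0))   # 182 deg apart -> True
print(is_symmetric_pair(10.0, 120.0))   # 110 deg apart -> False
```

A filter like this is what lets the pipeline discard single-sided detections such as keyseats, which lack a diametrically opposed partner.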
[140] AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution
Yiwei Zhao, Yi Zheng, Huapeng Su, Jieyu Lin, Stefano Ambrogio, Cijo Jose, Michaël Ramamonjisoa, Patrick Labatut, Barbara De Salvo, Chiao Liu, Phillip B. Gibbons, Ziyun Li
Main category: cs.CV
Abstract: Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to 7.9% in acc@1 on IN1K and 5.2% mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.
[141] Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification
Shuang Li, Jiaxu Leng, Changjiang Kuang, Mingpi Tan, Yu Yuan, Xinbo Gao
Main category: cs.CV
Abstract: Video-based Visible-Infrared person Re-Identification (VVI-ReID) is a critical technique for all-day surveillance, where temporal information provides additional cues beyond static images. However, existing approaches rely heavily on fully supervised learning with expensive cross-modality annotations, limiting scalability. To address this issue, we investigate Unsupervised Learning for VVI-ReID (USL-VVI-ReID), which learns identity-discriminative representations directly from unlabeled video tracklets. Directly extending image-based USL-VI-ReID methods to this setting with generic pretrained encoders leads to suboptimal performance. Such encoders suffer from weak identity discrimination and strong modality bias, resulting in severe intra-modality identity confusion and pronounced clustering granularity imbalance between visible and infrared modalities. These issues jointly degrade pseudo-label reliability and hinder effective cross-modality alignment. To address these challenges, we propose a Causal Bootstrapped Alignment (CBA) framework that explicitly exploits inherent video priors. First, we introduce Causal Intervention Warm-up (CIW), which performs sequence-level causal interventions by leveraging temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics, yielding cleaner representations for unsupervised clustering. Second, we propose Prototype-Guided Uncertainty Refinement (PGUR), which employs a coarse-to-fine alignment strategy to resolve cross-modality granularity mismatch, reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks demonstrate that CBA significantly outperforms existing USL-VI-ReID methods when extended to the USL-VVI-ReID setting.
[142] SPLIT: Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography
Markus Haltmeier, Lukas Neumann, Nadja Gruber, Gyeongha Hwang
Main category: cs.CV
Abstract: Machine learning has achieved impressive performance in tomographic reconstruction, but supervised training requires paired measurements and ground-truth images that are often unavailable. This has motivated self-supervised approaches, which have primarily addressed denoising and, more recently, linear inverse problems. We address nonlinear inverse problems and introduce SPLIT (Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography), a self-supervised machine-learning framework for reconstructing images from nonlinear, incomplete, and noisy projection data without any samples of ground-truth images. SPLIT enforces cross-partition consistency and measurement-domain fidelity while exploiting complementary information across multiple partitions. Our main theoretical result shows that, under mild conditions, the proposed self-supervised objective is equivalent to its supervised counterpart in expectation. We regularize training with an automatic stopping rule that halts optimization when a no-reference image-quality surrogate saturates. As a concrete application, we derive SPLIT variants for multispectral computed tomography. Experiments on sparse-view acquisitions demonstrate high reconstruction quality and robustness to noise, surpassing classical iterative reconstruction and recent self-supervised baselines.
[143] Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline
Bingyu Li, Tao Huo, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Main category: cs.CV
Abstract: Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image–mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.
[144] From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark
Chen Zhao, Yunzhe Xu, Zhizhou Chen, Enxuan Gu, Kai Zhang, Xiaoming Liu, Jian Yang, Ying Tai
Main category: cs.CV
Abstract: Ultra-high-definition (UHD) image restoration poses unique challenges due to the high spatial resolution, diverse content, and fine-grained structures present in UHD images. To address these issues, we introduce a progressive spectral decomposition for the restoration process, decomposing it into three stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Based on this formulation, we propose a novel framework, ERR, which integrates three cooperative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). The ZFE incorporates global priors to learn holistic mappings, the LFR reconstructs the main content by focusing on coarse-scale information, and the HFR adopts our proposed frequency-windowed Kolmogorov-Arnold Network (FW-KAN) to recover fine textures and intricate details for high-fidelity restoration. To further advance research in UHD image restoration, we also construct a large-scale, high-quality benchmark dataset, LSUHDIR, comprising 82,126 UHD images with diverse scenes and rich content. Our proposed methods demonstrate superior performance across a range of UHD image restoration tasks, and extensive ablation studies confirm the contribution and necessity of each module. Project page: https://github.com/NJU-PCALab/ERR.
[145] CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment
Yan Zhang, Xiong Zhao
Main category: cs.CV
Abstract: Markerless 3D movement analysis from monocular video enables accessible biomechanical assessment in clinical and sports settings. However, most research-grade pipelines rely on GPU acceleration, limiting deployment on consumer-grade hardware and in low-resource environments. In this work, we optimize a monocular 3D biomechanics pipeline derived from the MonocularBiomechanics framework for efficient CPU-only execution through profiling-driven system optimization, including model initialization restructuring, elimination of disk I/O serialization, and improved CPU parallelization. Experiments on a consumer workstation (AMD Ryzen 7 9700X CPU) show a 2.47x increase in processing throughput and a 59.6% reduction in total runtime, with initialization latency reduced by 4.6x. Despite these changes, biomechanical outputs remain highly consistent with the baseline implementation (mean joint-angle deviation 0.35$^\circ$, $r=0.998$). These results demonstrate that research-grade vision-based biomechanics pipelines can be deployed on commodity CPU hardware for scalable movement assessment.
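As a quick sanity check, the two headline numbers in this abstract are mutually consistent: a 2.47x throughput increase implies a runtime of about 1/2.47 ≈ 40.5% of baseline, i.e. roughly a 59.5% reduction, matching the reported 59.6% up to rounding:

```python
throughput_gain = 2.47
runtime_fraction = 1 / throughput_gain      # ~0.405 of baseline runtime
reduction = (1 - runtime_fraction) * 100    # ~59.5% runtime reduction
print(round(reduction, 1))  # 59.5, consistent with the reported 59.6%
```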
[146] PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji, Liujuan Cao, Rongrong Ji
Main category: cs.CV
Abstract: Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
[147] HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning
Eunju Lee, MiHyeon Kim, JuneHyoung Kwon, Yoonji Lee, JiHyun Kim, Soojin Jang, YoungBin Kim
Main category: cs.CV
Abstract: Pretrained Vision-Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties (directional alignment and covariance-aware magnitude), yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention-adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.
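One plausible reading of the hybrid scoring rule (cosine similarity for directional alignment, Mahalanobis distance for covariance-aware magnitude) is a weighted combination of the two. The fusion below is a hypothetical sketch, not HyCal's exact formula:

```python
import numpy as np

def hybrid_score(x, prototype, cov_inv, alpha=0.5):
    """Score an embedding against a class prototype by mixing cosine
    similarity (direction) with a negated Mahalanobis distance
    (covariance-aware magnitude). Higher is a better match.
    `alpha` trades off the two terms and is illustrative."""
    cos = x @ prototype / (np.linalg.norm(x) * np.linalg.norm(prototype))
    diff = x - prototype
    maha = np.sqrt(diff @ cov_inv @ diff)
    return alpha * cos - (1 - alpha) * maha

proto = np.array([1.0, 0.0])
cov_inv = np.eye(2)       # identity covariance for the toy example
near = hybrid_score(np.array([0.9, 0.1]), proto, cov_inv)
far = hybrid_score(np.array([-1.0, 1.0]), proto, cov_inv)
print(near > far)  # True: embeddings near the prototype score higher
```

Because both terms are computed on frozen embeddings against per-class statistics, a rule of this shape needs no gradient updates, which is consistent with the training-free claim.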
[148] Self-Supervised Angular Deblurring in Photoacoustic Reconstruction via Noisier2Inverse
Markus Haltmeier, Nadja Gruber, Gyeongha Hwang
Main category: cs.CV
Abstract: Photoacoustic tomography (PAT) is an emerging imaging modality that combines the complementary strengths of optical contrast and ultrasonic resolution. A central task is image reconstruction, where measured acoustic signals are used to recover the initial pressure distribution. For ideal point-like or line-like detectors, several efficient and fast reconstruction algorithms exist, including Fourier methods, filtered backprojection, and time reversal. However, when applied to data acquired with finite-size detectors, these methods yield systematically blurred images. Although sharper images can be obtained by compensating for finite-detector effects, supervised learning approaches typically require ground-truth images that may not be available in practice. We propose a self-supervised reconstruction method based on Noisier2Inverse that addresses finite-size detector effects without requiring ground-truth data. Our approach operates directly on noisy measurements and learns to recover high-quality PAT images in a ground-truth-free manner. Its key components are: (i) PAT-specific modeling that recasts the problem as angular deblurring; (ii) a Noisier2Inverse formulation in the polar domain that leverages the known angular point-spread function; and (iii) a novel, statistically grounded early-stopping rule. In experiments, the proposed method consistently outperforms alternative approaches that do not use supervised data and achieves performance close to supervised benchmarks, while remaining practical for real acquisitions with finite-size detectors.
[149] P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models
Geunyoung Jung, Soohong Kim, Kyungwoo Song, Jiyoung Jung
Main category: cs.CV
TL;DR: P$^3$T tunes pre-trained 3D vision-language models with instance-aware point-level prompts and learnable text prompts plus a prototypical loss, matching full fine-tuning while generalizing better.
Details
Motivation: Full fine-tuning of 3D VLMs is computationally and storage expensive, and existing prompt tuning tends to overfit, compromising generalization.
Method: A Point Prompter generates instance-aware point-level prompts for the input point cloud and a Text Prompter learns input text prompts; a prototypical loss reduces intra-category variance to tighten embedding-space alignment.
Result: Matches or outperforms full fine-tuning in classification and few-shot learning, and stays robust under data shift in the cross-dataset setting.
Conclusion: Input-level prompting enables parameter-efficient, generalizable adaptation of 3D VLMs.
Abstract: With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P$^3$T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P$^3$T consists of two components: 1) \textit{Point Prompter}, which generates instance-aware point-level prompts for the input point cloud, and 2) \textit{Text Prompter}, which employs learnable prompts into the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P$^3$T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at \textcolor{violet}{https://github.com/gyjung975/P3T}.
[150] LP$^{2}$DH: A Locality-Preserving Pixel-Difference Hashing Framework for Dynamic Texture Recognition
Ruxin Ding, Jianfeng Ren, Heng Yu, Jiawei Li, Xudong Jiang
Main category: cs.CV
TL;DR: A locality-preserving hashing framework compresses full-neighbourhood spatiotemporal pixel differences into compact binary codes, setting the state of the art on three dynamic texture benchmarks.
Details
Motivation: STLBP descriptors are extremely high-dimensional, and the common three-orthogonal-planes workaround sacrifices inter-plane correlation.
Method: Hash Pixel-Difference Vectors from the full spatiotemporal neighbourhood into binary codes with a locality-preserving embedding, jointly optimize the hashing matrix and codes via curvilinear search on the Stiefel manifold, then encode the codes into a codeword histogram via dictionary learning.
Result: 99.80% on UCLA (vs. DT-GoogleNet's 98.93%), 98.52% on DynTex++ (vs. HoGF$^{3D}$'s 97.63%), and 96.19% on YUPENN (vs. STS's 95.00%).
Conclusion: Jointly encoded, locality-preserved binary codes yield compact yet highly discriminative dynamic texture features.
Abstract: Spatiotemporal Local Binary Pattern (STLBP) is a widely used dynamic texture descriptor, but it suffers from extremely high dimensionality. To tackle this, STLBP features are often extracted on three orthogonal planes, which sacrifice inter-plane correlation. In this work, we propose a Locality-Preserving Pixel-Difference Hashing (LP$^{2}$DH) framework that jointly encodes pixel differences in the full spatiotemporal neighbourhood. LP$^{2}$DH transforms Pixel-Difference Vectors (PDVs) into compact binary codes with maximal discriminative power. Furthermore, we incorporate a locality-preserving embedding to maintain the PDVs’ local structure before and after hashing. Then, a curvilinear search strategy is utilized to jointly optimize the hashing matrix and binary codes via gradient descent on the Stiefel manifold. After hashing, dictionary learning is applied to encode the binary vectors into codewords, and the resulting histogram is utilized as the final feature representation. The proposed LP$^{2}$DH achieves state-of-the-art performance on three major dynamic texture recognition benchmarks: 99.80% against DT-GoogleNet’s 98.93% on UCLA, 98.52% against HoGF$^{3D}$’s 97.63% on DynTex++, and 96.19% compared to STS’s 95.00% on YUPENN. The source code is available at: https://github.com/drx770/LP2DH.
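A minimal sketch of the two front-end steps — extracting pixel-difference vectors (PDVs) and binarizing them with a projection — is below. For brevity it uses a 2D spatial neighbourhood and a random projection in place of the full spatiotemporal neighbourhood and the Stiefel-optimized hashing matrix the paper learns.

```python
import numpy as np

def pixel_difference_vectors(frame):
    """Toy PDV extraction on one 2D frame: for each interior pixel, collect
    differences to its 8 spatial neighbours. (LP2DH uses the full
    spatiotemporal neighbourhood; this 2D version is illustrative.)"""
    H, W = frame.shape
    offsets = [(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)]
    pdvs = []
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            pdvs.append([frame[i+di, j+dj] - frame[i, j] for di, dj in offsets])
    return np.array(pdvs, dtype=float)

def hash_pdvs(pdvs, W):
    """Binarize PDVs by the sign of a linear projection; a random W stands in
    for the hashing matrix the paper optimizes on the Stiefel manifold."""
    return (pdvs @ W > 0).astype(np.uint8)
```

The resulting binary codes would then feed the dictionary-learning and histogram stages the abstract describes.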
[151] APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition
Geunyoung Jung, Soohong Kim, Inseok Kong, Jiyoung Jung
Main category: cs.CV
TL;DR: APC is a lightweight input-level purification module that counter-perturbs adversarial point clouds, achieving state-of-the-art defense that transfers to unseen models without retraining.
Details
Motivation: Existing defenses for 3D point cloud recognition trade robustness against transferability.
Method: Learn instance-specific per-point counter-perturbations from clean-adversarial pairs, enforcing geometric consistency in data space and semantic consistency in feature space, with hybrid training over multiple attack types.
Result: State-of-the-art defense on two 3D recognition benchmarks and superior cross-model transferability, with negligible time and parameter overhead from a single forward pass.
Conclusion: Input-level counterattack purification achieves robustness and transferability simultaneously.
Abstract: The advent of deep neural networks has led to remarkable progress in 3D point cloud recognition, but they remain vulnerable to adversarial attacks. Although various defense methods have been studied, they suffer from a trade-off between robustness and transferability. We propose Adversarial Point Counterattack (APC) to achieve both simultaneously. APC is a lightweight input-level purification module that generates instance-specific counter-perturbations for each point, effectively neutralizing attacks. Leveraging clean-adversarial pairs, APC enforces geometric consistency in data space and semantic consistency in feature space. To improve generalizability across diverse attacks, we adopt a hybrid training strategy using adversarial point clouds from multiple attack types. Since APC operates purely on input point clouds, it directly transfers to unseen models and defends against attacks targeting them without retraining. At inference, a single APC forward pass provides purified point clouds with negligible time and parameter overhead. Extensive experiments on two 3D recognition benchmarks demonstrate that the APC achieves state-of-the-art defense performance. Furthermore, cross-model evaluations validate its superior transferability. The code is available at https://github.com/gyjung975/APC.
[152] SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification
Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li, Tianxiang Cui
Main category: cs.CV
TL;DR: SSMamba, a self-supervised hybrid state space model, outperforms 11 pathological foundation models on ROI classification without relying on large external pretraining data.
Details
Motivation: ROI-level foundation models suffer from cross-magnification domain shift, costly and imprecise ViT-based local-global modeling, and insufficient sensitivity to subtle diagnostic cues.
Method: Combine Mamba Masked Image Modeling, a Directional Multi-scale module, and a Local Perception Residual module in a two-stage pipeline: SSL pretraining on target ROI datasets followed by supervised fine-tuning.
Result: Outperforms 11 SOTA pathological foundation models on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets.
Conclusion: Task-specific architectural design beats generic large-scale pretraining for pathological image analysis.
Abstract: Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.
[153] NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition
Junguang Yao, Wenye Liu, Stjepan Picek, Yue Zheng
Main category: cs.CV
TL;DR: NeuroLip recognizes speakers from event-camera lip motion, generalizing across unseen viewpoints and lighting, and introduces the 50-subject DVSpeaker dataset.
Details
Motivation: Lip-motion biometrics are behaviorally stable across environments, but frame-based cameras miss fine lip dynamics due to motion blur and low dynamic range, and cross-scene generalization is untested.
Method: Temporal-aware voxel encoding with adaptive event weighting, a structure-aware spatial enhancer, and polarity consistency regularization, evaluated under a strict train-on-one-condition, test-cross-scene protocol.
Result: Near-perfect matched-scene accuracy, over 71% on unseen viewpoints and nearly 76% in low light, outperforming representative methods by at least 8.54%.
Conclusion: Event-based sensing plus tailored spatiotemporal learning enables robust cross-scene visual speaker recognition.
Abstract: Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal-aware Voxel Encoding module with adaptive event weighting, 2) Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.
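The fixed-weight baseline underlying temporal-aware voxel encoding can be sketched as follows: events (x, y, t, polarity) are accumulated into temporal bins with bilinear weighting over normalized time. NeuroLip's adaptive event weighting is not reproduced here; this is the common event-camera voxel-grid encoding.

```python
import numpy as np

def events_to_voxel(events, H, W, bins):
    """Accumulate events (x, y, t, polarity) into a (bins, H, W) voxel grid
    with bilinear weighting over normalized time. A standard fixed scheme;
    the paper additionally learns adaptive per-event weights."""
    grid = np.zeros((bins, H, W))
    if len(events) == 0:
        return grid
    t = np.array([e[2] for e in events], dtype=float)
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (bins - 1)
    for (x, y, _, p), tb in zip(events, t):
        lo = int(np.floor(tb))
        w_hi = tb - lo                     # bilinear split between two bins
        grid[lo, y, x] += p * (1 - w_hi)
        if lo + 1 < bins:
            grid[lo + 1, y, x] += p * w_hi
    return grid
```

Keeping signed polarity in the grid preserves the motion-direction cues that the paper's polarity consistency regularization targets.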
[154] Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images
Mathumetha Palani, Kavya Puthumana, Ayantika Das, Ganapathy Krishnamurthi
Main category: cs.CV
TL;DR: An unsupervised diffusion autoencoder trained only on clean table-top fundus images restores artifact-laden handheld acquisitions, raising diagnostic accuracy to 81.17% on unseen data.
Details
Motivation: Handheld fundus cameras introduce flash reflections, exposure variation, and motion blur; most restoration models need paired supervision or predefined artifact structures, which unstructured handheld degradations lack.
Method: Integrate a context encoder with the diffusion denoising process to learn semantically meaningful representations, training exclusively on high-quality table-top images and restoring handheld images at inference.
Result: Quantitative and qualitative gains; diagnostic accuracy reaches 81.17% on an unseen dataset across multiple artifact conditions.
Conclusion: Unsupervised diffusion-based restoration makes handheld fundus imaging more clinically reliable.
Abstract: The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and, at inference, restores artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations, showing that diagnostic accuracy increases to 81.17% on an unseen dataset across multiple artifact conditions.
[155] MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis
Sicheng Chen, Chad Wong, Tianyi Zhang, Enhui Chai, Zeyu Liu, Fei Xia
Main category: cs.CV
TL;DR: MambaBack fuses Gated-CNN local feature extraction with BiMamba2 global context over Hilbert-ordered tile sequences, beating seven SOTA MIL methods on five WSI datasets with low inference memory.
Details
Motivation: Mamba-based MIL breaks 2D spatial locality when flattening tiles into 1D sequences, models fine-grained local cellular structure sub-optimally, and peaks memory during inference on edge devices.
Method: Hilbert-curve sampling preserves 2D tile locality in 1D sequences; a hierarchical design pairs a MambaOut-style 1D Gated CNN block for local cellular features with a BiMamba2 block for global context; asymmetric chunking allows parallel training and chunking-streaming inference.
Result: Outperforms seven state-of-the-art methods on five datasets while minimizing peak memory for deployment.
Conclusion: Harmonizing Mamba and MambaOut strengths matches WSI analysis's dual need for local detail and global context.
Abstract: Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba’s SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial locality of tiles within 1D sequences, enhancing the model’s spatial perception. Second, we design a hierarchical structure comprising a 1D Gated CNN block based on MambaOut to capture local cellular features, and a BiMamba2 block to aggregate global context, jointly enhancing multi-scale representation. Finally, we implement an asymmetric chunking design, allowing parallel processing during training and chunking-streaming accumulation during inference, minimizing peak memory usage for deployment. Experimental results on five datasets demonstrate that MambaBack outperforms seven state-of-the-art methods. Source code and datasets are publicly available.
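Hilbert sampling rests on a standard coordinate-to-curve mapping: sorting tiles by Hilbert index keeps 2D-adjacent tiles close in the 1D sequence. Below is the classic xy-to-distance algorithm, written independently of the paper's implementation:

```python
def hilbert_index(order, x, y):
    """Map a 2D tile coordinate to its position along a Hilbert curve whose
    grid side is 2**order. Nearby tiles in 2D stay nearby in the resulting
    1D sequence -- the locality property Hilbert sampling exploits."""
    d = 0
    s = 1 << (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                # rotate the quadrant as the curve recurses
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(tiles, order):
    """Sort (x, y) tile coordinates along the Hilbert curve."""
    return sorted(tiles, key=lambda t: hilbert_index(order, t[0], t[1]))
```

Unlike row-major raster flattening, consecutive positions along this ordering are always spatially adjacent tiles, which is why it preserves locality for the downstream sequence model.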
[156] Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories
Delfina Sol Martinez Pandiani, Valentina Presutti
Main category: cs.CV
TL;DR: A survey that pins down what "high-level" visual understanding means in CV and reviews abstract-concept image classification, its challenges, and the promise of hybrid AI.
Details
Motivation: CV increasingly targets "high-level" visual sensemaking, yet the nature of these tasks remains unclear and tacit.
Method: A multidisciplinary analysis that clusters high-level semantics (commonsense, emotional, aesthetic, inductive interpretative), categorizes the associated CV tasks, and examines how abstract concepts such as values and ideologies are handled.
Result: Identifies persistent challenges, including the limited efficacy of massive datasets and the importance of supplementary information and mid-level features.
Conclusion: Hybrid AI systems are increasingly relevant for abstract-concept classification, and the survey lays groundwork for future research.
Abstract: The field of Computer Vision (CV) is increasingly shifting towards "high-level" visual sensemaking tasks, yet the exact nature of these tasks remains unclear and tacit. This survey paper addresses this ambiguity by systematically reviewing research on high-level visual understanding, focusing particularly on Abstract Concepts (ACs) in automatic image classification. Our survey contributes in three main ways: Firstly, it clarifies the tacit understanding of high-level semantics in CV through a multidisciplinary analysis, and categorization into distinct clusters, including commonsense, emotional, aesthetic, and inductive interpretative semantics. Secondly, it identifies and categorizes computer vision tasks associated with high-level visual sensemaking, offering insights into the diverse research areas within this domain. Lastly, it examines how abstract concepts such as values and ideologies are handled in CV, revealing challenges and opportunities in AC-based image classification. Notably, our survey of AC image classification tasks highlights persistent challenges, such as the limited efficacy of massive datasets and the importance of integrating supplementary information and mid-level features. We emphasize the growing relevance of hybrid AI systems in addressing the multifaceted nature of AC image classification tasks. Overall, this survey enhances our understanding of high-level visual reasoning in CV and lays the groundwork for future research endeavors.
[157] Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
Siyuan Wang, Hanchen Gao, Guangming Zhu, Jiang Lu, Yiyue Ma, Tianci Wu, Jincai Huang, Liang Zhang
Main category: cs.CV
TL;DR: STBIR fuses sketch contours with textual color and texture cues for fine-grained image retrieval, contributing a new benchmark dataset and outperforming the state of the art.
Details
Motivation: Sketches capture structural contours but lack color and texture; text provides those cues but omits spatial contours, making the modalities complementary.
Method: A curriculum-learning-driven robustness enhancement module for variable-quality queries, a category-knowledge-based feature space optimization module, and a multi-stage cross-modal feature alignment mechanism.
Result: Significantly outperforms state-of-the-art methods on the curated fine-grained STBIR benchmark.
Conclusion: Synergizing sketch and text queries substantially improves fine-grained image retrieval.
Abstract: Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. While hand-drawn sketches capture complex structural contours, they lack color and texture, which text effectively provides despite omitting spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By synergizing the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum learning driven robustness enhancement module is proposed to enhance the model’s robustness when handling queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module, thereby significantly boosting the model’s representational power. Finally, we design a multi-stage cross-modal feature alignment mechanism to effectively mitigate the challenges of cross-modal feature alignment. Furthermore, we curate the fine-grained STBIR benchmark dataset to rigorously validate the efficacy of our proposed framework and to provide data support as a reference for subsequent related research. Extensive experiments demonstrate that the proposed STBIR framework significantly outperforms state-of-the-art methods.
[158] RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
Yichen Xu, Yuanhang Liu, Chuhan Wang, Zihan Zhao, Jinghan Luo, Jianzhe Ma, Wenxuan Wang, Qin Jin
Main category: cs.CV
TL;DR: RefereeBench (11 sports, 925 videos, 6,475 QA pairs) shows even the best MLLMs reach only about 60% accuracy as sports referees, failing mainly at rule application and temporal grounding.
Details
Motivation: MLLMs excel at generic video understanding, but their ability to support specialized, rule-grounded decision-making such as officiating is insufficiently explored.
Method: A fully human-annotated benchmark covering five officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding.
Result: Doubao-Seed-1.8 and Gemini-3-Pro score only around 60%; the strongest open-source model, Qwen3-VL, reaches 47%; models can spot incidents and entities but struggle with rule application and temporal grounding, and over-call fouls on normal clips.
Conclusion: Current MLLMs are far from reliable referees; future models need deeper domain knowledge and multimodal understanding for trustworthy AI-assisted officiating.
Abstract: While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.
[159] Concept-wise Attention for Fine-grained Concept Bottleneck Models
Minghong Zhong, Guoshuai Zou, Kanghao Chen, Dexia Chen, Ruixuan Wang
Main category: cs.CV
TL;DR: CoAt-CBM adds concept-wise visual attention and a concept contrastive loss to CLIP-based concept bottleneck models, yielding fine-grained image-concept alignment with high interpretability.
Details
Motivation: Existing CBMs inherit pre-training biases (granularity misalignment, reliance on structural priors), and BCE fine-tuning treats concepts independently, ignoring their mutual exclusivity.
Method: Learnable concept-wise visual queries adaptively extract fine-grained concept-wise visual embeddings that produce a concept score vector; a concept contrastive optimization models the relative importance of concept scores.
Result: Consistently outperforms state-of-the-art methods in extensive experiments.
Conclusion: Fine-grained concept-wise attention plus contrastive concept optimization gives concept predictions that faithfully reflect image content.
Abstract: Recently impressive performance has been achieved in Concept Bottleneck Models (CBM) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e. CLIP). However, there exist two key limitations in concept modeling. Existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Moreover, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts, leading to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. Then, a novel concept contrastive optimization guides the model to handle the relative importance of the concept scores, enabling concept predictions to faithfully reflect the image content and improved alignment. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The codes will be available upon acceptance.
[160] PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding
Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui
Main category: cs.CV
TL;DR: PLAF pairs pixel-wise language-aligned feature extraction with a redundancy-reducing storage and querying scheme for efficient open-vocabulary 3D scene understanding.
Details
Motivation: Existing representations are either language-aligned or spatially precise, but rarely both, and densely lifting pixel semantics to 3D creates heavy redundancy that makes storage and querying inefficient at scale.
Method: A pixel-wise language-aligned feature extraction framework for dense, accurate 2D semantic alignment, plus an efficient semantic storage and querying scheme spanning both 2D and 3D domains.
Result: Experiments show accurate and efficient open-vocabulary 3D scene understanding.
Conclusion: Dense language-aligned 2D semantics with redundancy-aware lifting provide a strong foundation for 3D scene understanding.
Abstract: Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present \emph{PLAF}, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that \emph{PLAF} provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The codes are publicly available at https://github.com/RockWenJJ/PLAF.
[161] SegMix: Shuffle-based Feedback Learning for Semantic Segmentation of Pathology Images
Zhiling Yan, Sicheng Chen, Tianyi Zhang, Nan Ying, Yanli Lei, Guanglei Zhang
Main category: cs.CV
TL;DR: A shuffle-based feedback learning scheme produces higher-quality CAM pseudo-masks for weakly supervised pathology segmentation, beating the state of the art on three datasets.
Details
Motivation: Pixel-level pathology annotation is prohibitively expensive; CAM-based weak supervision from image-level labels activates only small regions, insufficient for pseudo masking.
Method: Patch-level shuffling of pathology images, with the shuffle strategy adaptively adjusted based on feedback from previous learning, inspired by curriculum learning.
Result: Outperforms state-of-the-art methods on three different datasets.
Conclusion: Feedback-driven shuffling exploits pathology-specific image characteristics to strengthen image-level weak supervision.
Abstract: Segmentation is a critical task in computational pathology, as it identifies areas affected by disease or abnormal growth and is essential for diagnosis and treatment. However, acquiring high-quality pixel-level supervised segmentation data requires significant workload demands from experienced pathologists, limiting the application of deep learning. To overcome this challenge, relaxing the label conditions to image-level classification labels allows for more data to be used and more scenarios to be enabled. One approach is to leverage Class Activation Map (CAM) to generate pseudo pixel-level annotations for semantic segmentation with only image-level labels. However, this method fails to thoroughly explore the essential characteristics of pathology images, thus identifying only small areas that are insufficient for pseudo masking. In this paper, we propose a novel shuffle-based feedback learning method inspired by curriculum learning to generate higher-quality pseudo-semantic segmentation masks. Specifically, we perform patch-level shuffle of pathology images, with the model adaptively adjusting the shuffle strategy based on feedback from previous learning. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods on three different datasets.
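The core shuffle operation — permuting non-overlapping patches of a pathology image — can be sketched in a few lines. The feedback loop that adapts the shuffle strategy during training is omitted here; this only shows the basic augmentation.

```python
import numpy as np

def patch_shuffle(img, patch, rng):
    """Split a 2D image into non-overlapping patch x patch tiles and permute
    them -- the basic operation behind the shuffle-based scheme. The adaptive,
    feedback-driven choice of shuffle strategy is not reproduced."""
    H, W = img.shape[:2]
    assert H % patch == 0 and W % patch == 0
    ph, pw = H // patch, W // patch
    patches = [img[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
               for i in range(ph) for j in range(pw)]
    order = rng.permutation(len(patches))       # random permutation of tiles
    out = np.empty_like(img)
    for k, idx in enumerate(order):
        i, j = divmod(k, pw)
        out[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = patches[idx]
    return out
```

Because pathology texture is largely translation-invariant within a tissue type, shuffled images keep local diagnostic cues while breaking global layout, which pushes CAMs to activate beyond a few small regions.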
[162] Fed3D: Federated 3D Object Detection
Suyan Dai, Chenxi Liu, Fazeng Li, Peican Lin
Main category: cs.CV
TL;DR: Fed3D is the first federated 3D object detection framework, pairing a local-global class-aware loss with a federated 3D prompt module to handle 3D data heterogeneity at low communication cost.
Details
Motivation: Multi-robot 3D perception raises severe privacy concerns, and conventional federated learning struggles on 3D object detection due to data heterogeneity and limited communication bandwidth.
Method: A local-global class-aware loss balances the gradient back-propagation rate of different 3D categories from local and global aspects; a federated 3D prompt module learns and communicates only prompts with few learnable parameters per round.
Result: Significantly outperforms state-of-the-art algorithms with lower communication cost under limited local training data.
Conclusion: Privacy-preserving distributed 3D detection becomes practical with heterogeneity-aware losses and prompt-based communication.
Abstract: 3D object detection models trained on a single server play an important role in autonomous driving, robotic manipulation, and augmented reality scenarios. However, most existing methods face severe privacy concerns when deployed on a multi-robot perception network exploring large-scale 3D scenes. Meanwhile, it is highly challenging to apply conventional federated learning methods to 3D object detection, due to 3D data heterogeneity and limited communication bandwidth. In this paper, we make the first attempt to propose a novel Federated 3D object detection framework (i.e., Fed3D) that enables distributed learning for 3D object detection with privacy preservation. Specifically, irregular 3D inputs within each local robot and varying category distributions across robots cause local heterogeneity and global heterogeneity, respectively. We therefore propose a local-global class-aware loss for the 3D data heterogeneity issue, which balances the gradient back-propagation rate of different 3D categories from local and global aspects. To reduce the communication cost of each round, we develop a federated 3D prompt module that learns and communicates only prompts with few learnable parameters. In the end, extensive experiments on federated 3D object detection show that our Fed3D model significantly outperforms state-of-the-art algorithms with lower communication cost given limited local training data.
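One plausible reading of a local-global class-aware re-weighting is an inverse-frequency blend over local and global class counts, so that rare categories (at either level) get larger gradient weight. The exact balancing rule in Fed3D may differ; treat the square-root inverse-frequency blend below as an assumption-laden sketch.

```python
import numpy as np

def class_aware_weights(local_counts, global_counts, eps=1e-6):
    """Illustrative per-class loss weights blending local and global class
    frequencies: classes that are rare locally or globally are up-weighted so
    their gradients are not drowned out. The specific formula is an assumption."""
    local = np.asarray(local_counts, dtype=float) + eps
    glob = np.asarray(global_counts, dtype=float) + eps
    w = 1.0 / np.sqrt(local / local.sum()) + 1.0 / np.sqrt(glob / glob.sum())
    return w / w.mean()    # normalize so the average weight is 1
```

Such weights would multiply each category's detection loss term on the client, keeping the overall loss scale unchanged via the mean normalization.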
[163] Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao, Yongxin Chen
Main category: cs.CV
TL;DR: SGMRI-VQA is a 41,307-pair benchmark for multi-frame, spatially grounded VQA on volumetric MRI; fine-tuning Qwen3-VL-8B with bounding-box supervision beats strong zero-shot baselines.
Details
Motivation: Medical VLMs predict without transparent reasoning or spatial evidence, and existing benchmarks use isolated 2D images, ignoring volumetric findings that span multiple frames or only a few slices.
Method: Build QA pairs from expert radiologist annotations in fastMRI+ (brain and knee) with clinician-aligned chain-of-thought traces and frame-indexed bounding boxes, organized hierarchically across detection, localization, counting/classification, and captioning.
Result: Across 10 benchmarked VLMs, supervised fine-tuning of Qwen3-VL-8B with box supervision consistently improves grounding over strong zero-shot baselines.
Conclusion: Targeted spatial supervision is an effective path toward grounded clinical reasoning.
Abstract: Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.
[164] Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
Chengxin Liu, Wonseok Choi, Chenshuang Zhang, Tae-Hyun Oh
Main category: cs.CV
TL;DR: Restricting text-token attention to important visual tokens, identified by their activation dynamics across decoding stages, improves VLM accuracy at inference time without training.
Details
Motivation: VLMs often attend to the correct image region yet answer incorrectly; text tokens distribute too much attention to irrelevant visual tokens, corrupting the information flow.
Method: A token-dynamics criterion marks visual tokens with distinct activation patterns across decoding stages as important, and decoding associates text tokens only with those tokens, eliminating interference from irrelevant regions.
Result: Significant gains for representative open-source VLMs on visual question answering, grounding and counting, OCR, and object hallucination benchmarks.
Conclusion: Inference-time information-flow modulation aligns what VLMs see with what they perceive.
Abstract: Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: https://cxliu0.github.io/AIF/.
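The token-dynamics criterion can be illustrated with a toy: rank visual tokens by how much their activations vary across decoding stages and keep the top-k. The real method modulates attention inside a VLM during decoding; this numpy stand-in only shows the selection rule, with the variability measure (standard deviation over stages) as an assumption.

```python
import numpy as np

def important_visual_tokens(activations, keep):
    """Rank visual tokens by how much their activation changes across decoding
    stages -- tokens with distinct dynamics are treated as important, per the
    paper's observation -- and return the indices of the top-`keep` tokens.
    `activations` has shape (stages, tokens)."""
    dynamics = activations.std(axis=0)           # per-token variation over stages
    return np.argsort(dynamics)[::-1][:keep]
```

In the full method, attention from text tokens would then be restricted to the selected indices, suppressing irrelevant visual regions during answer generation.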
[165] Continual Hand-Eye Calibration for Open-world Robotic Manipulation
Fazeng Li, Gan Sun, Chenxi Liu, Yao He, Wei Cong, Yang Cong
Main category: cs.CV
TL;DR: A continual hand-eye calibration framework combines spatial-aware replay with structure-preserving dual distillation to prevent catastrophic forgetting across open-world manipulation scenes.
Details
Motivation: Deep calibration models forget past scenes when adapting to unseen open-world data, and simple rehearsal-based continual learning does not fix this.
Method: A Spatial-Aware Replay Strategy (SARS) builds a geometrically uniform replay buffer covering each scene's pose space with maximally informative viewpoints; Structure-Preserving Dual Distillation (SPDD) decomposes localization knowledge into coarse scene layout and fine pose precision and distills them separately during adaptation.
Result: Strong resistance to scene forgetting on multiple public datasets, maintaining accuracy on past scenes while preserving adaptation to new ones.
Conclusion: Spatially informed replay plus structured distillation lets robots continuously accumulate multi-scene calibration capability.
Abstract: Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data amid open-world scene changes, while a simple rehearsal-based continual learning strategy cannot fully mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework, enabling robots to adapt to sequentially encountered open-world manipulation scenes through a spatial-aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. When a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show strong resistance to scene forgetting, maintaining accuracy on past scenes while preserving adaptation to new scenes, confirming the effectiveness of the framework.
[166] Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi, Kaiwei Wang, Lin Wang
Main category: cs.CV
TL;DR: ESE, the first egocentric screen-view movie emotion benchmark (224 trailers, 28,667 key-frames), exposes a severe domain gap for cinematic-trained models that training on ESE largely closes.
Details
Motivation: Embodied agents watch movies through egocentric screen views, with viewpoint distortion, scale variation, illumination changes, and environmental interference, yet emotion research uses only native cinematic footage.
Method: Controlled egocentric screen-view recordings of movie trailers with confidence-aware multi-label annotation by multiple raters, plus a multimodal long-context emotion reasoning framework combining temporal visual evidence, narrative summaries, compressed historical context, and audio cues.
Result: Cinematic-trained models drop from 27.99 to 16.69 Macro-F1 on egocentric observations; training on ESE substantially improves robustness, and the framework is competitive with strong closed-source multimodal models.
Conclusion: Domain-specific data and long-context multimodal reasoning are key to realistic emotion understanding for embodied companions.
Abstract: Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
[167] SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification
Alexander Musiat, Nikolas Ebert, Oliver Wasenmüller
Main category: cs.CV
Abstract: Hyperspectral imaging enables fine-grained recognition of materials by capturing rich spectral signatures, but learning robust classifiers is challenging due to high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts. Beyond earth observation, labeled HSI data is often scarce and imbalanced, motivating compact models for generic hyperspectral classification across diverse acquisition regimes. We propose the lightweight Spectral-Spatial Fusion Transformer (SSFT), which factorizes representation learning into spectral and spatial pathways and integrates them via cross-attention to capture complementary wavelength-dependent and structural information. We evaluate our SSFT on the challenging HSI-Benchmark, a heterogeneous multi-dataset benchmark covering earth observation, fruit condition assessment, and fine-grained material recognition. SSFT achieves state-of-the-art overall performance, ranking first while using less than 2% of the parameters of the previous leading method. We further evaluate transfer to the substantially larger SpectralEarth benchmark under the official protocol, where SSFT remains competitive despite its compact size. Ablation studies show that both spectral and spatial pathways are crucial, with spatial modeling contributing most, and that SSFT remains robust without data augmentation.
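As a rough illustration of the cross-attention fusion the abstract describes, the NumPy sketch below lets spectral tokens attend to spatial tokens via standard scaled dot-product attention. The token counts, dimensions, residual fusion, and helper names are illustrative assumptions, not SSFT's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query token mixes the value
    tokens according to its similarity to the key tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
spectral_tokens = rng.standard_normal((16, 32))   # e.g., 16 band groups, dim 32
spatial_tokens = rng.standard_normal((9, 32))     # e.g., a 3x3 patch grid
# spectral pathway queries the spatial pathway, then fuses residually
fused = spectral_tokens + cross_attention(spectral_tokens, spatial_tokens, spatial_tokens)
```

Each spectral token thus picks up structural context from the spatial pathway while keeping its own wavelength-dependent representation.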
[168] Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration
Jun Li, Lizhi Xiong, Ziqiang Li, Weiwei Jiang, Zhangjie Fu, Yong Li, Guo-Sen Xie
Main category: cs.CV
Abstract: Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a Text-Image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation. Our code is available at https://github.com/OpenAscent-L/TICoE.git
[169] Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment
Liwen Yu, Chi Liu, Xiaotong Han, Congcong Zhu, Minghao Wang, Sheng Shen
Main category: cs.CV
Abstract: Automated Aesthetic Quality Assessment (AQA) treats images primarily as static pixel vectors, aligning predictions with human-rating scores largely through semantic perception. However, this paradigm diverges from human aesthetic cognition, which arises from dynamic visual exploration shaped by scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention. We introduce AestheticNet, a novel cognitive-inspired AQA paradigm that integrates human-like visual cognition and semantic perception with a two-pathway architecture. The visual attention pathway, implemented as a gaze-aligned visual encoder (GAVE) pre-trained offline on eye-tracking data using resource-efficient contrast gaze alignment, models attention from the human visual system. This pathway augments the semantic pathway, which uses a fixed semantic encoder such as CLIP, through cross-attention fusion. Visual attention provides a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all of which are determinants of aesthetic perception beyond semantics. Experiments validated by hypothesis testing show a consistent improvement over the semantics-only baselines, and demonstrate the gaze module as a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA. Our code is available at https://github.com/keepgallop/AestheticNet.
[170] Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection
Irem Ulku, Erdem Akagündüz, Ömer Özgür Tanrıöver
Main category: cs.CV
Abstract: Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.
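A minimal sketch of the structured-latent idea, assuming each modality's latent is simply split into a shared slice and a modality-specific slice and routed by the availability mask. The split point, the averaging rule, and the function name are illustrative assumptions, not the CBC-SLP architecture.

```python
import numpy as np

def structured_projection(latents, mask, d_shared):
    """Split each modality latent into a shared slice and a specific slice:
    shared parts are averaged over AVAILABLE modalities only, and the
    specific part of a dropped modality is zeroed out."""
    shared_parts = latents[:, :d_shared]
    specific_parts = latents[:, d_shared:]
    w = mask / mask.sum()                      # weights over available modalities
    shared = (w[:, None] * shared_parts).sum(axis=0)
    specific = specific_parts * mask[:, None]  # drop missing modalities' specifics
    return shared, specific

rng = np.random.default_rng(0)
latents = rng.standard_normal((3, 64))   # 3 modalities, 64-dim latents each
mask = np.array([1.0, 0.0, 1.0])         # modality 1 unavailable (e.g., sensor failure)
shared, specific = structured_projection(latents, mask, d_shared=16)
```

The decoder can then consume the shared component plus whichever specific components survive the dropout mask, which is the intuition behind preserving complementary information under missing modalities.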
[171] AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon, Gihwi Kim, Hoon Jin Jung, Jaegul Choo
Main category: cs.CV
Abstract: Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one’s head is seamlessly integrated with another’s body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
[172] Splats in Splats++: Robust and Generalizable 3D Gaussian Splatting Steganography
Yijia Guo, Wenkai Huang, Tong Hu, Gaolei Li, Yang Li, Yuxin Hong, Liwen Hu, Xitong Ling, Jianhua Li, Shengbo Chen, Tiejun Huang, Lei Ma
Main category: cs.CV
Abstract: 3D Gaussian Splatting (3DGS) has recently redefined the paradigm of 3D reconstruction, striking an unprecedented balance between visual fidelity and computational efficiency. As its adoption proliferates, safeguarding the copyright of explicit 3DGS assets has become paramount. However, existing invisible message embedding frameworks struggle to reconcile secure and high-capacity data embedding with intrinsic asset utility, often disrupting the native rendering pipeline or exhibiting vulnerability to structural perturbations. In this work, we present \textbf{\textit{Splats in Splats++}}, a unified and pipeline-agnostic steganography framework that seamlessly embeds high-capacity 3D/4D content directly within the native 3DGS representation. Grounded in a principled analysis of the frequency distribution of Spherical Harmonics (SH), we propose an importance-graded SH coefficient encryption scheme that achieves imperceptible embedding without compromising the original expressive power. To fundamentally resolve the geometric ambiguities that lead to message leakage, we introduce a \textbf{Hash-Grid Guided Opacity Mapping} mechanism. Coupled with a novel \textbf{Gradient-Gated Opacity Consistency Loss}, our formulation enforces a stringent spatial-attribute coupling between the original and hidden scenes, effectively projecting the discrete attribute mapping into a continuous, attack-resilient latent manifold. Extensive experiments demonstrate that our method substantially outperforms existing approaches, achieving up to \textbf{6.28 dB} higher message fidelity, \textbf{3$\times$} faster rendering, and exceptional robustness against aggressive 3D-targeted structural attacks (e.g., GSPure). Furthermore, our framework exhibits remarkable versatility, generalizing seamlessly to 2D image embedding, 4D dynamic scene steganography, and diverse downstream tasks.
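The abstract's importance-graded SH encryption is not spelled out; a generic stand-in for hiding bits in numeric coefficients is quantization-parity embedding, sketched below in NumPy. The step size, the coefficient selection, and both function names are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

def embed_bits(coeffs, bits, step=1e-3):
    """Force the parity of each coefficient's quantization index to match
    the message bit; the resulting perturbation is at most 1.5 * step."""
    out = coeffs.copy()
    for i, b in enumerate(bits):
        q = int(np.round(coeffs[i] / step))
        if q % 2 != b:       # flip parity by nudging one grid step
            q += 1
        out[i] = q * step
    return out

def extract_bits(coeffs, n, step=1e-3):
    """Recover the message from the parity of the quantization indices."""
    return [int(np.round(c / step)) % 2 for c in coeffs[:n]]

rng = np.random.default_rng(0)
sh = rng.standard_normal(16) * 0.1   # stand-in for high-order SH coefficients
msg = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_bits(sh, msg)
recovered = extract_bits(stego, len(msg))
```

For a small step size the perturbation stays well below typical coefficient magnitudes, which is the generic sense in which such embeddings are "imperceptible".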
[173] UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
Lifan Jiang, Tianrun Wu, Yuhang Pei, Chenyang Wang, Boxi Wu, Deng Cai
Main category: cs.CV
Abstract: The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying multimodal large language models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.
[174] CLOTH-HUGS: Cloth Aware Human Gaussian Splatting
Sadia Mubashshira, Nazanin Amini, Kevin Desai
Main category: cs.CV
Abstract: We present Cloth-HUGS, a Gaussian Splatting based neural rendering framework for photorealistic clothed human reconstruction that explicitly disentangles body and clothing. Unlike prior methods that absorb clothing into a single body representation and struggle with loose garments and complex deformations, Cloth-HUGS represents the performer using separate Gaussian layers for body and cloth within a shared canonical space. The canonical volume jointly encodes body, cloth, and scene primitives and is deformed through SMPL-driven articulation with learned linear blend skinning weights. To improve cloth realism, we initialize cloth Gaussians from mesh topology and apply physics-inspired constraints, including simulation-consistency, ARAP regularization, and mask supervision. We further introduce a depth-aware multi-pass rendering strategy for robust body-cloth-scene compositing, enabling real-time rendering at over 60 FPS. Experiments on multiple benchmarks show that Cloth-HUGS improves perceptual quality and geometric fidelity over state-of-the-art baselines, reducing LPIPS by up to 28% while producing temporally coherent cloth dynamics.
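The "SMPL-driven articulation with learned linear blend skinning weights" mentioned above rests on the standard LBS formula, where each canonical vertex is deformed by a weighted blend of per-joint rigid transforms. A self-contained NumPy sketch with toy vertices and transforms (not the paper's model or weights):

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Deform canonical vertices by a per-vertex weighted blend of
    per-joint rigid transforms (4x4 homogeneous matrices)."""
    v_h = np.hstack([vertices, np.ones((len(vertices), 1))])    # homogeneous coords
    blended = np.einsum('vj,jab->vab', weights, transforms)     # one 4x4 per vertex
    deformed_h = np.einsum('vab,vb->va', blended, v_h)
    return deformed_h[:, :3]

verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0],        # vertex 0 follows joint 0 only
                    [0.5, 0.5]])       # vertex 1 blends both joints equally
t0 = np.eye(4)                         # joint 0 stays put
t1 = np.eye(4); t1[:3, 3] = [0.0, 1.0, 0.0]   # joint 1 translates up by 1
deformed = linear_blend_skinning(verts, weights, np.stack([t0, t1]))
```

The blended vertex moves halfway with joint 1, which is exactly the smooth interpolation between bones that skinning weights provide; in Cloth-HUGS those weights are learned rather than fixed.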
[175] PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking
Meng Lv, Yapeng Li, Hang Su, Juhua Liu, Bo Du
Main category: cs.CV
Abstract: Intelligent fetal ultrasound (US) interpretation is crucial for prenatal diagnosis, but high annotation costs and operator-induced variance make unsupervised pre-training a highly promising paradigm. However, existing pre-training methods largely ignore US-specific characteristics – severe data redundancy, fan-shaped locality, and polar coordinate beamforming – limiting their effectiveness in downstream tasks. To address this, we propose PolarMAE, a novel and efficient pre-training framework tailored for US images. Specifically, to mitigate continuous scanning redundancy, we introduce a Progressive Visual-Semantic Screening (PVSS) that adaptively extracts high-value samples, significantly boosting pre-training efficiency. Furthermore, we design an Acoustic-Bounded Region Constraint (ABRC) to accommodate US locality, forcing the model to focus strictly on valid acoustic regions rather than invalid dark backgrounds. Finally, leveraging the beamforming prior and local details, we propose a Polar-Texture Collaborative Masking (PTCM), enabling the model to capture underlying radial imaging patterns and critical tissue structures. Extensive experiments across diverse datasets and downstream interpretation tasks demonstrate that our method achieves state-of-the-art performance with strong pre-training scalability and efficiency.
[176] AeroDeshadow: Physics-Guided Shadow Synthesis and Penumbra-Aware Deshadowing for Aerospace Imagery
Wei Lu, Zi-Yang Bo, Fei-Fei Sang, Yi Liu, Xue Yang, Si-Bao Chen
Main category: cs.CV
Abstract: Shadows are prevalent in high-resolution aerospace imagery (ASI). They often cause spectral distortion and information loss, which degrade downstream interpretation tasks. While deep learning methods have advanced natural-image shadow removal, their direct application to ASI faces two primary challenges. First, strictly paired training data are severely lacking. Second, homogeneous shadow assumptions fail to handle the broad penumbra transition zones inherent in aerospace scenes. To address these issues, we propose AeroDeshadow, a unified two-stage framework integrating physics-guided shadow synthesis and penumbra-aware restoration. In the first stage, a Physics-aware Degradation Shadow Synthesis Network (PDSS-Net) explicitly models illumination decay and spatial attenuation. This process constructs AeroDS-Syn, a large-scale paired dataset featuring soft boundary transitions. Constrained by this physical formulation, a Penumbra-aware Cascaded DeShadowing Network (PCDS-Net) then decouples the input into umbra and penumbra components. By restoring these regions progressively, PCDS-Net alleviates boundary artifacts and over-correction. Trained solely on the synthetic AeroDS-Syn, the network generalizes to real-world ASI without requiring paired real annotations. Experimental results indicate that AeroDeshadow achieves state-of-the-art quantitative accuracy and visual fidelity across synthetic and real-world datasets. The datasets and code will be made publicly available at: https://github.com/AeroVILab-AHU/AeroDeshadow.
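The abstract's "illumination decay and spatial attenuation" can be caricatured as darkening a blurred shadow mask, so the blurred border of the mask plays the role of the penumbra transition zone. The sketch below is a toy stand-in, not PDSS-Net; the attenuation factor and box-blur width are arbitrary assumptions.

```python
import numpy as np

def soft_shadow(image, mask, attenuation=0.4, blur=5):
    """Darken the masked region; blurring the binary mask turns the hard
    umbra boundary into a smooth penumbra transition zone."""
    kernel = np.ones(blur) / blur
    soft = mask.astype(float)
    soft = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode='same'), 1, soft)
    soft = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode='same'), 0, soft)
    gain = 1.0 - (1.0 - attenuation) * soft   # 1.0 when lit, `attenuation` in full umbra
    return image * gain, soft

img = np.ones((16, 16))                       # flat gray scene
mask = np.zeros((16, 16)); mask[4:12, 4:12] = 1.0
shadowed, penumbra = soft_shadow(img, mask)
```

Pixels deep inside the mask reach full attenuation, pixels far outside are untouched, and pixels near the boundary take intermediate values, which is the soft-boundary property the AeroDS-Syn data is built to have.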
[177] Efficient Video Diffusion Models: Advancements and Challenges
Shitong Shao, Lichen Bai, Pengfei Wan, James Kwok, Zeke Xie
Main category: cs.CV
Abstract: Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we respectively analyze algorithmic trends of these four paradigms and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.
[178] Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions
Bo Zhao, Kairui Guo, Runnan Du, Haiyang Sun, Pengshan Wang, Huan Yang, Kun Gai, Yixin Cao, Wei Ji
Main category: cs.CV
Abstract: Instruction guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by better matching editing tasks to the effective operating regime of existing models.
[179] Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
Haato Watanabe, Nobuyuki Umetani
Main category: cs.CV
Abstract: Recent years have witnessed the rapid emergence of 3D Gaussian splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition. To overcome this limitation, we propose neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron that models a wide range of color variations within a single primitive. To further control the number of primitives, we introduce a frequency-aware densification strategy that selects mismatched primitives for pruning and cloning based on frequency energy. Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NeRF360 and High-Frequency datasets (e.g., checkered patterns), supported by comprehensive ablation studies.
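The Gabor idea behind the method is classical: a Gaussian envelope modulated by a sinusoid, so a single primitive can carry an oscillating color profile where a plain Gaussian carries only one hump. A 1-D NumPy illustration with arbitrary parameters (the paper models the variation with an MLP, not a fixed cosine):

```python
import numpy as np

def gabor_1d(x, sigma=1.0, freq=2.0, phase=0.0):
    """A Gabor atom: Gaussian envelope times a cosine carrier."""
    envelope = np.exp(-x**2 / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * freq * x + phase)
    return envelope * carrier

x = np.linspace(-3.0, 3.0, 601)
plain_gaussian = np.exp(-x**2 / 2.0)   # a single-signed bump: one flat "color"
gabor = gabor_1d(x)                    # oscillates many times within one primitive
```

A plain Gaussian never changes sign, so a sharp color transition needs a second primitive; the Gabor atom changes sign repeatedly inside one support, which is why far fewer primitives can suffice for checkered-style patterns.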
[180] SENSE: Stereo OpEN Vocabulary SEmantic Segmentation
Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaétan Bahl
Main category: cs.CV
Abstract: Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.
[181] From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance
Jinhao Shen, Haoqian Du, Xulu Zhang, Xiao-Yong Wei, Qing Li
Main category: cs.CV
Abstract: Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. The adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, thereby improving the localization of editable and preservable regions. Temporally, we present an Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.
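The "directional entropic interactions" are not defined in the abstract, but the underlying quantity, the Shannon entropy of an attention distribution, is standard. A minimal NumPy sketch of that quantity alone, unrelated to CoEdit's specific manipulation rules:

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of each row (query) of an attention matrix:
    low entropy = sharply focused attention, high entropy = diffuse."""
    p = np.clip(attn, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

n = 8
uniform = np.full((1, n), 1.0 / n)               # maximally diffuse attention
peaked = np.zeros((1, n)); peaked[0, 0] = 1.0    # attends to a single key
h_uniform = attention_entropy(uniform)[0]        # = log(n)
h_peaked = attention_entropy(peaked)[0]          # ~ 0
```

Entropy of this kind is a natural handle for deciding whether an attention map is localizing an editable region (focused) or spreading over preservable background (diffuse).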
[182] MMGait: Towards Multi-Modal Gait Recognition
Chenye Wang, Qingyuan Cai, Saihui Hou, Aoqi Li, Yongzhen Huang
Main category: cs.CV
Abstract: Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at https://github.com/BNU-IVC/MMGait.
[183] IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE
Rikuto Otsuka, Yuho Shoji, Yuka Ogino, Takahiro Toizumi, Atsushi Ito
Main category: cs.CV
Abstract: This paper proposes image-adaptive contrast limited adaptive histogram equalization (IA-CLAHE). Conventional CLAHE is widely used to boost the performance of various computer vision tasks and to improve visual quality for human perception in practical industrial applications. CLAHE applies contrast limited histogram equalization to each local region to enhance local contrast. However, CLAHE often leads to over-enhancement, because the contrast-limiting parameter, the clip limit, is fixed regardless of the histogram distribution of each local region. Our IA-CLAHE addresses this limitation by adaptively estimating tile-wise clip limits from the input image. To achieve this, we train a lightweight clip-limit estimator with a differentiable extension of CLAHE, enabling end-to-end optimization. Unlike prior learning-based CLAHE methods, IA-CLAHE does not require pre-searched ground-truth clip limits or task-specific datasets, because it learns to map input image histograms toward a domain-invariant uniform distribution, enabling zero-shot generalization across diverse conditions. Experimental results show that IA-CLAHE consistently improves recognition performance, while simultaneously enhancing visual quality for human perception, without requiring any task-specific training data.
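The clip limit that IA-CLAHE adapts enters CLAHE through histogram clipping with excess redistribution. The NumPy sketch below shows that core step in a single pass; note that real CLAHE implementations typically redistribute iteratively, since redistribution can push bins back above the limit, so this is a simplification for illustration.

```python
import numpy as np

def clip_histogram(hist, clip_limit):
    """Clip each bin at `clip_limit` and spread the excess uniformly,
    bounding the slope of the equalization mapping (single-pass version)."""
    excess = np.maximum(hist - clip_limit, 0.0).sum()
    clipped = np.minimum(hist, clip_limit)
    return clipped + excess / len(hist)

hist = np.array([50.0, 5.0, 3.0, 2.0])           # spiky tile histogram
low_clip = clip_histogram(hist, clip_limit=10)   # strong contrast limiting
high_clip = clip_histogram(hist, clip_limit=60)  # limit above all bins: no-op
```

A low clip limit flattens the spike and thus limits how steep the local equalization mapping can become, which is exactly the over-enhancement control that IA-CLAHE tunes per tile.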
[184] Ranking XAI Methods for Head and Neck Cancer Outcome Prediction
Baoqiang Ma, Djennifer K. Madzia-Madzou, Rosa C. J. Kraaijveld, Jin Ouyang
Main category: cs.CV
Abstract: For head and neck cancer (HNC) patients, prognostic outcome prediction can support personalized treatment strategy selection. Improving prediction performance of HNC outcomes has been extensively explored by using advanced artificial intelligence (AI) techniques on PET/CT data. However, the interpretability of AI remains a critical obstacle for its clinical adoption. Unlike previous HNC studies that empirically selected explainable AI (XAI) techniques, we are the first to comprehensively evaluate and rank 13 XAI methods across 24 metrics, covering faithfulness, robustness, complexity and plausibility. Experimental results on the multi-center HECKTOR challenge dataset show large variations across evaluation aspects among different XAI methods, with Integrated Gradients (IG) and DeepLIFT (DL) consistently obtaining high rankings for faithfulness, complexity and plausibility. This work highlights the importance of comprehensive XAI method evaluation and can be extended to other medical imaging tasks.
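Integrated Gradients, one of the top-ranked methods here, averages gradients along a straight path from a baseline to the input and scales by the input-baseline difference. A self-contained NumPy sketch on a toy quadratic model with an analytic gradient (not the paper's PET/CT setting):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Average the gradient along the straight path from `baseline` to `x`
    (midpoint rule), scaled elementwise by (x - baseline)."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.mean([grad_f(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

f = lambda v: (v**2).sum()       # toy model
grad_f = lambda v: 2.0 * v       # its analytic gradient
x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)
```

A convenient sanity check is the completeness axiom: the attributions sum to f(x) - f(baseline), which is one reason IG scores well on faithfulness-style metrics.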
[185] Elucidating the SNR-t Bias of Diffusion Probabilistic Models
Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu, Kun Zhan
Main category: cs.CV
Abstract: Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at https://github.com/AMAP-ML/DCW.
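The training-time coupling between SNR and timestep that this abstract says breaks down at inference can be made concrete for a standard linear-beta DDPM schedule. This is a generic sketch; the schedule constants are the common DDPM defaults, not values taken from this paper:

```python
import numpy as np

def snr_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """SNR(t) for a standard linear-beta DDPM noise schedule.

    With x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    SNR(t) = alpha_bar_t / (1 - alpha_bar_t): during training, every
    timestep t is strictly coupled to one signal-to-noise ratio.
    """
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return alpha_bar / (1.0 - alpha_bar)
```

The SNR-t bias described above is the situation where, at inference, the denoising sample's actual SNR no longer matches this curve at its nominal timestep.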
[186] Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs
Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian, Tanuja Ganu
Main category: cs.CV
Abstract: Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce “Mind’s Eye”, a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel “A-R-T” taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
[187] Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu
Main category: cs.CV
Abstract: Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
[188] TableSeq: Unified Generation of Structure, Content, and Layout
Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Main category: cs.CV
Abstract: We present TableSeq, an image-only, end-to-end framework for joint table structure recognition, content recognition, and cell localization. The model formulates these tasks as a single sequence-generation problem: one decoder produces an interleaved stream of HTML tags, cell text, and discretized coordinate tokens, thereby aligning logical structure, textual content, and cell geometry within a unified autoregressive sequence. This design avoids external OCR, auxiliary decoders, and complex multi-stage post-processing. TableSeq combines a lightweight high-resolution FCN-H16 encoder with a minimal structure-prior head and a single-layer transformer encoder, yielding a compact architecture that remains effective on challenging layouts. Across standard benchmarks, TableSeq achieves competitive or state-of-the-art results while preserving architectural simplicity. It reaches 95.23 TEDS / 96.83 S-TEDS on PubTabNet, 97.45 TEDS / 98.69 S-TEDS on FinTabNet, and 99.79 / 99.54 / 99.66 precision / recall / F1 on SciTSR under the CAR protocol, while remaining competitive on PubTables-1M under GriTS. Beyond TSR/TCR, the same sequence interface generalizes to index-based table querying without task-specific heads, achieving the best IRDR score and competitive ICDR/ICR performance. We also study multi-token prediction for faster blockwise decoding and show that it reduces inference latency with only limited accuracy degradation. Overall, TableSeq provides a practical and reproducible single-stream baseline for unified table recognition, and the source code will be made publicly available at https://github.com/hamdilaziz/TableSeq.
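The kind of interleaved serialization the abstract describes can be sketched for a single table row. This is a hypothetical illustration only: the token format, coordinate bin count, and function names below are my own assumptions, not TableSeq's actual vocabulary:

```python
def serialize_table(cells, n_bins=1000, img_w=800, img_h=600):
    """Serialize table cells into one interleaved token stream.

    Each cell contributes its HTML tag, its bounding box discretized into
    coordinate tokens <0>..<n_bins-1>, and its text, so that structure,
    content, and geometry live in a single autoregressive sequence.
    """
    def quant(v, size):
        # map a pixel coordinate to a discrete coordinate-token bin
        return min(int(v / size * n_bins), n_bins - 1)

    tokens = ["<table>", "<tr>"]
    for text, (x0, y0, x1, y1) in cells:
        tokens += ["<td>",
                   f"<{quant(x0, img_w)}>", f"<{quant(y0, img_h)}>",
                   f"<{quant(x1, img_w)}>", f"<{quant(y1, img_h)}>",
                   text, "</td>"]
    tokens += ["</tr>", "</table>"]
    return tokens
```

A single decoder trained on such streams predicts tags, text, and coordinates with one output head, which is what lets the approach drop external OCR and auxiliary decoders.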
[189] The Amazing Stability of Flow Matching
Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim
Main category: cs.CV
Abstract: The success of deep generative models in generating high-quality and diverse samples is often attributed to particular architectures and large training datasets. In this paper, we investigate the impact of these factors on the quality and diversity of samples generated by \emph{flow-matching} models. Surprisingly, in our experiments on the CelebA-HQ dataset, flow matching remains stable even when pruning 50% of the dataset. That is, the quality and diversity of generated samples are preserved. Moreover, pruning impacts the latent representation only slightly, that is, samples generated by models trained on the full and pruned dataset map to visually similar outputs for a given seed. We observe similar stability when changing the architecture or training configuration, such that the latent representation is maintained under these changes as well. Our results quantify just how strong this stability can be in practice, and help explain the reliability of flow-matching models under various perturbations.
[190] Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model
Enas E. Ahmed, Salah A. Aly, Mayar Moner
Main category: cs.CV
Abstract: Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using Hue-channel and Otsu-thresholding techniques to preprocess the images prior to classification. Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest validation and test accuracy, both reaching 99.3%.
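Otsu thresholding, used here to preprocess the blood-cell images, is a standard technique: pick the gray level that maximizes the between-class variance of the resulting binary split. A minimal NumPy sketch (the helper name is mine, not from the paper):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method over an 8-bit grayscale image.

    Returns the level t maximizing between-class variance, where class 0
    holds pixels with value <= t and class 1 holds the rest.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                     # class-0 probability up to each level
    mu = np.cumsum(p * np.arange(256))       # cumulative mean
    mu_t = mu[-1]                            # global mean
    denom = omega * (1.0 - omega)
    num = (mu_t * omega - mu) ** 2
    sigma_b = np.where(denom > 0, num / np.maximum(denom, 1e-12), 0.0)
    return int(np.argmax(sigma_b))
```

On stained blood smears this kind of global threshold cleanly separates dark nuclei from background, which is why it is a common segmentation preprocessing step.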
[191] DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics
Jieming Yu, Qiuxiao Feng, Zhuohan Wang, Xiaochen Ma
Main category: cs.CV
Abstract: With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.
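The LoRA recipe this baseline relies on, a frozen weight plus a trained low-rank update, can be sketched generically. Rank, scaling, and zero-initialization of B below follow the common LoRA convention; they are not values reported by this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen weight W plus a low-rank update: y = x @ (W + (alpha/r) * A @ B).

    Only A and B (r * (d_in + d_out) parameters) are trained, which is how a
    few million trainable parameters can adapt a large frozen backbone.
    """
    def __init__(self, W, r=8, alpha=16):
        d_in, d_out = W.shape
        self.W = W                                   # frozen pre-trained weight
        self.A = rng.normal(0, 0.02, (d_in, r))      # trainable down-projection
        self.B = np.zeros((r, d_out))                # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W + self.scale * (x @ self.A) @ self.B
```

Because B starts at zero, the adapted layer reproduces the frozen layer exactly at initialization, so training only gradually perturbs the pre-trained representation. That property matches the abstract's observation that forensic cues in pre-trained features are "better preserved than overwritten".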
[192] Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance
Hamed Ouattara, Pierre Duthon, Pascal Houssam Salmane, Frédéric Bernardin, Omar Ait Aider
Main category: cs.CV
Abstract: One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination, or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data), without degrading the semantic performance (F1=80% on ImageNet-1K) of the Content branch, and improves the preservation of critical appearance
[193] DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
Laziz Hamdi, Amine Tamasna, Thierry Paquet
Main category: cs.CV
Abstract: Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision–language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at https://github.com/hamdilaziz/DenTab.
[194] Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation
Federico Nocentini, Kwanggyoon Seo, Qingju Liu, Claudio Ferrari, Stefano Berretti, David Ferman, Hyeongwoo Kim, Pablo Garrido, Akin Caliskan
Main category: cs.CV
Abstract: Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.
[195] Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset
Yuhai Deng, Huimin She, Wei Shen, Meng Li, Ruoxi Wu, Lunxi Yuan, Xiang Li
Main category: cs.CV
Abstract: Tone style transfer for photo retouching aims to adapt the stylistic tone of the reference image to a given content image. However, the lack of high-quality large-scale triplet datasets with stylized ground truth forces existing methods to rely on self-supervised or proxy objectives, which limits model capability. To mitigate this gap, we design a data construction pipeline to build TST100K, a large-scale dataset of 100,000 content-reference-stylized triplets. At the core of this pipeline, we train a tone style scorer to ensure strict stylistic consistency for each triplet. In addition, existing methods typically extract content and reference features independently and then fuse them in a decoder, which may cause semantic loss and lead to inappropriate color transfer and degraded visual aesthetics. Instead, we propose ICTone, a diffusion-based framework that performs tone transfer in an in-context manner by jointly conditioning on both images, leveraging the semantic priors of generative models for semantic-aware transfer. Reward feedback learning using the tone style scorer is further incorporated to improve stylistic fidelity and visual quality. Experiments demonstrate the effectiveness of TST100K, and ICTone achieves state-of-the-art performance on both quantitative metrics and human evaluations.
[196] Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen
Main category: cs.CV
Abstract: Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats, guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.
[197] From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts
Michał Romaszewski, Dominik Kopeć, Michał Cholewa, Katarzyna Kołodziej, Przemysław Głomb, Jan Niedzielko, Jakub Charyton, Justyna Wylazłowska, Anna Jarocińska
Main category: cs.CV
Abstract: Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy with differences no larger than 15%.
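One plausible way to fold a cohabitation matrix into pseudo-labelling (an illustrative assumption, not the paper's actual formulation) is to reweight a node's class probabilities by how likely each candidate species is to cohabit with the current labels of its canopy-graph neighbours:

```python
import numpy as np

def reweight_pseudo_labels(probs, neighbor_labels, cohab, eps=1e-8):
    """Adjust per-node class probabilities with a cohabitation prior.

    probs: (n_classes,) classifier softmax for one canopy-graph node.
    neighbor_labels: current hard labels of the node's graph neighbours.
    cohab: (n_classes, n_classes) matrix, cohab[i, j] = likelihood that
           species i and j occur together (LLM-derived, per the abstract).
    """
    prior = np.ones_like(probs)
    for lbl in neighbor_labels:
        prior *= cohab[:, lbl] + eps   # downweight ecologically unlikely pairs
    adjusted = probs * prior
    return adjusted / adjusted.sum()
```

Under this sketch, a species that never cohabits with a node's neighbours is driven toward zero probability before the pseudo-label is assigned, which is the general effect the abstract attributes to its priors.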
[198] VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu
Main category: cs.CV
Abstract: As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.
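Ordinal regression over quality levels, as used by VEFX-Reward's per-dimension scores, is commonly decoded by counting passed thresholds. A minimal sketch of that standard decode, not of VEFX-Reward's actual head:

```python
import numpy as np

def ordinal_decode(logits):
    """Decode an ordinal-regression head into a discrete quality score.

    logits: (K-1,) raw outputs, one per cumulative threshold "score > k".
    The predicted score is the number of thresholds whose sigmoid
    probability exceeds 0.5, giving a value in 0..K-1.
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    return int((p > 0.5).sum())
```

Unlike plain softmax classification, this formulation respects the ordering of quality levels: confusing level 3 with level 4 costs less than confusing it with level 0.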
[199] Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions
Yue Jiang, Mingyu Yang, Liuyuxin Yang, Yang Xu, Bingxin Yun, Yuhe Zhang
Main category: cs.CV
Abstract: Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.
[200] SWNet: A Cross-Spectral Network for Camouflaged Weed Detection
Henry O. Velesaca, Luigi Miranda, Angel D. Sappa
Main category: cs.CV
Abstract: This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex crop canopies. The code is available on GitHub: https://cod-espol.github.io/SWNet/
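A gated bimodal fusion step of the kind the abstract names can be sketched generically: a learned sigmoid gate decides, per feature, how much to trust the visible stream versus the near-infrared stream. Shapes and the gating form below are the common pattern, and the module's real design may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(f_rgb, f_nir, Wg, bg):
    """Gate-weighted blend of visible and near-infrared feature vectors.

    f_rgb, f_nir: (d,) feature vectors from the two streams.
    Wg: (2d, d) gate weights; bg: (d,) gate bias (both learned).
    """
    gate = sigmoid(np.concatenate([f_rgb, f_nir]) @ Wg + bg)  # per-feature, in (0, 1)
    return gate * f_rgb + (1.0 - gate) * f_nir
```

When visible-range features are uninformative, as with homochromatic camouflage, the gate can be driven toward zero so the NIR stream dominates, which is the behaviour the architecture exploits.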
[201] neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
Toby Perrett, Matthew Bouchard, William McCarthy
Main category: cs.CV
Abstract: We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers, interacting directly with CAD models in CAD software, while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: https://autodeskailab.github.io/neuralCAD-Edit
[202] Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement
Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Jasmin Lampert, Phillipp Fanta-Jende, Radu Timofte, Marco Koerner
Main category: cs.CV
Abstract: We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.
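A contraction constraint of the kind described, non-increasing reconstruction error across cascade stages, can be written as a hinge penalty. This is a sketch under the assumption that per-stage errors are available during training; the paper's exact objective may differ:

```python
import numpy as np

def contraction_loss(errors, margin=0.0):
    """Hinge penalty encouraging non-increasing error across a cascade.

    errors: per-stage reconstruction errors [e_1, ..., e_S].
    Each term is positive only when a later stage does worse than the
    one before it (by more than -margin), so a well-behaved cascade
    incurs zero penalty.
    """
    errors = np.asarray(errors, dtype=float)
    return np.maximum(errors[1:] - errors[:-1] + margin, 0.0).sum()
```

Added to the usual reconstruction losses, such a term penalizes only the stage transitions that regress, leaving strictly improving cascades untouched.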
[203] Saturation-Aware Space-Variant Blind Image Deblurring
Muhammad Z. Alam, Larry Stetsiuk, Arooba Zeshan
Main category: cs.CV
Abstract: This paper presents a novel saturation-aware space-variant blind image deblurring framework designed to address challenges posed by saturated pixels when deblurring under high-dynamic-range and low-light conditions. The proposed approach effectively segments the image based on blur intensity and proximity to saturation, leveraging a pre-estimated Light Spread Function to mitigate stray-light effects. By accurately estimating the true radiance of saturated regions using the dark channel prior, our method enhances the deblurring process without introducing artifacts like ringing. Experimental evaluations on both synthetic and real-world datasets demonstrate that the framework improves deblurring outcomes across various scenarios, showcasing superior performance compared to state-of-the-art saturation-aware and general-purpose methods. This adaptability highlights the framework's potential for integration with existing and emerging blind image deblurring techniques.
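The dark channel prior the method builds on is a standard operation: a per-pixel minimum over the color channels and a local spatial patch; low values indicate content free of haze, glare, or saturation. A straightforward NumPy sketch (a direct, unoptimized loop for clarity):

```python
import numpy as np

def dark_channel(img, patch=3):
    """Dark channel of an image: min over channels, then min over a patch.

    img: (H, W, 3) float array with values in [0, 1].
    Returns an (H, W) map; bright values flag likely saturated regions.
    """
    mins = img.min(axis=2)                  # channel-wise minimum per pixel
    H, W = mins.shape
    pad = patch // 2
    padded = np.pad(mins, pad, mode="edge")
    out = np.empty_like(mins)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

In the saturation-aware setting above, regions whose dark channel stays high even over a patch are candidates for saturated pixels whose true radiance must be re-estimated rather than trusted.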
[204] AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection
Hao Wang, Beichen Zhang, Yanpei Gong, Shaoyi Fang, Zhaobo Qi, Yuanrong Xu, Xinyan Liu, Weigang Zhang
Main category: cs.CV
Abstract: As new forgery types continually emerge, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with the stable semantic anchors. The Adaptive Decision Harmonizer then harmonizes the classifiers by preserving the angular relationships of the semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.
[205] GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos
Deepak Kumar, Abhishek Pratap Singh, Puneet Kumar, Xiaobai Li, Balasubramanian Raman
Main category: cs.CV
Abstract: Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. Progress on in-the-wild scenarios has been limited by the scarcity of large-scale datasets annotated with multimodal and contextual information, and by the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio, and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present the Context-Aware Group Affect Recognition Network (CAGNet) for multimodal, context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, comparable to state-of-the-art performance. The dataset and code are available at github.com/deepakkumar-iitr/GAViD.
[206] Dental Panoramic Radiograph Analysis Using YOLO26 From Tooth Detection to Disease Diagnosis
Khawaja Azfar Asif, Rafaqat Alam Khan
Main category: cs.CV
Abstract: Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposure. However, manual interpretation is time-consuming and prone to errors, especially in high-volume clinical settings. This creates a pressing need for efficient automated solutions. This study presents the first application of YOLOv26 for automated tooth detection, FDI-based numbering, and dental disease segmentation in panoramic radiographs. The DENTEX dataset was preprocessed using Roboflow for format conversion and augmentation, yielding 1,082 images for tooth enumeration and 1,040 images for disease segmentation across four pathology classes. Five YOLOv26-seg variants were trained on Google Colab using transfer learning at a resolution of 800x800. Results demonstrate that the YOLOv26m-seg model achieved the best performance for tooth enumeration, with a precision of 0.976, recall of 0.970, and box mAP50 of 0.976. It outperformed the YOLOv8x baseline by 4.9% in precision and 3.3% in mAP50, while also enabling high-quality mask-level segmentation (mask mAP50 = 0.970). For disease segmentation, the YOLOv26l-seg model attained a box mAP50 of 0.591 and a mask mAP50 of 0.547. Impacted teeth showed the highest per-class average precision (0.943), indicating that visual distinctiveness influences detection performance more than annotation quantity. Overall, these findings demonstrate that YOLOv26-based models offer a robust and accurate framework for automated dental image analysis, with strong potential to enhance diagnostic efficiency and consistency in clinical practice.
[207] A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection
Van-Truong Le, Le-Khanh Nguyen, Trong-Doanh Nguyen
Main category: cs.CV
Abstract: Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.
[208] CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting
Nishq Poorav Desai, Ali Etemad, Michael Greenspan
Main category: cs.CV
Abstract: Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and comprehension of both the local and global patterns encapsulated in a video, spatially and temporally. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture specifically catered to effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method sets a new state-of-the-art on three commonly used public datasets, surpassing prior works by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
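Trend/seasonality disentanglement of a temporal signal is commonly built on moving-average series decomposition (as popularized by Autoformer-style blocks). A minimal numpy sketch under that assumption; CollideNet's actual mechanism may differ:

```python
import numpy as np

def decompose(series, kernel=5):
    """Split a 1-D signal into a trend component (moving average) and a
    seasonal component (residual). By construction trend + seasonal
    reconstructs the input exactly."""
    pad = kernel // 2
    padded = np.pad(series, pad, mode='edge')         # replicate endpoints
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode='valid')
    seasonal = series - trend
    return trend, seasonal
```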
[209] Find, Fix, Reason: Context Repair for Video Reasoning
Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen
Main category: cs.CV
Abstract: Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model’s knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model’s capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions) from the original video while the question remains unchanged. The student answers again with the added context, and training proceeds with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at https://github.com/JethroJames/FFR.git.
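The group-normalized advantages at the heart of GRPO can be sketched directly; this shows only the standard normalization, not the paper's Robust Improvement Reward:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward by the mean
    and standard deviation of its own group, so the policy update reflects
    relative quality within the group rather than absolute reward scale."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Rollouts above the group mean get positive advantages, those below get negative ones, and the advantages always sum to zero within a group.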
[210] Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization
Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas
Main category: cs.CV
Abstract: Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.
[211] Information Router for Mitigating Modality Dominance in Vision-Language Models
Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib
Main category: cs.CV
Abstract: Vision Language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model’s attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model’s attention does not resolve the underlying lack of information. In this paper, we propose MoIR, a Multi-modal Information Router: an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently yields more balanced modality contributions and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.
[212] Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement
Tejeswar Pokuri, Shivarth Rai
Main category: cs.CV
Abstract: Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scattering in water. While learning-based methods like CNNs and Transformers have shown promise, they face critical limitations: CNNs struggle to model the long-range dependencies needed for non-uniform degradation, and Transformers incur quadratic computational complexity, making them inefficient for high-resolution images. To address these challenges, we propose Hero-Mamba, a novel Mamba-based network that achieves efficient dual-domain learning for underwater image enhancement. Our approach uniquely processes information from both the spatial domain (RGB image) and the spectral domain (FFT components) in parallel. This dual-domain input allows the network to decouple degradation factors, separating color/brightness information from texture/noise. The core of our network utilizes Mamba-based SS2D blocks to capture global receptive fields and long-range dependencies with linear complexity, overcoming the limitations of both CNNs and Transformers. Furthermore, we introduce a ColorFusion block, guided by a background light prior, to restore color information with high fidelity. Extensive experiments on the LSUI and UIEB benchmark datasets demonstrate that Hero-Mamba outperforms state-of-the-art methods. Notably, our model achieves a PSNR of 25.802 and an SSIM of 0.913 on LSUI, validating its superior performance and generalization capabilities.
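A dual-domain input that pairs the spatial RGB image with its spectral (FFT) components can be assembled by concatenating per-channel amplitude and phase maps. A minimal numpy sketch, assuming simple channel-wise concatenation; Hero-Mamba's actual parallel-branch design may differ:

```python
import numpy as np

def dual_domain_input(rgb):
    """Build a dual-domain tensor: spatial RGB channels concatenated with
    per-channel FFT amplitude and phase, letting a network see both
    color/brightness structure and texture/noise spectra side by side."""
    spec = np.fft.fft2(rgb, axes=(0, 1))         # 2-D FFT per color channel
    amp = np.log1p(np.abs(spec))                 # compress dynamic range
    phase = np.angle(spec)
    return np.concatenate([rgb, amp, phase], axis=2)   # H x W x 9
```

Because amplitude and phase jointly determine the spectrum, the original image is fully recoverable from the spectral channels alone.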
[213] Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan
Shivarth Rai, Tejeswar Pokuri
Main category: cs.CV
Abstract: Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, and LPIPS: 0.1104), delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.
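Physics-based haze synthesis pipelines typically rely on the atmospheric scattering model I = J·t + A·(1 − t) with transmission t = exp(−β·depth). A minimal numpy sketch under that assumption; the parameter names and defaults are illustrative, not the AnimalHaze3k pipeline:

```python
import numpy as np

def synthesize_haze(clear, depth, beta=1.0, airlight=0.9):
    """Atmospheric scattering model: blend the clear image J toward the
    airlight A according to per-pixel transmission t = exp(-beta * depth).
    `clear` is an H x W x 3 image in [0, 1]; `depth` is an H x W map."""
    t = np.exp(-beta * depth)[..., None]   # per-pixel transmission, broadcast
    return clear * t + airlight * (1.0 - t)
```

At zero depth the output equals the clear image; as depth grows, every pixel converges to the airlight color.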
[214] FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu, Yule Wang, Jieqi Shi, Jing Huo
Main category: cs.CV
Abstract: UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.
[215] Repurposing 3D Generative Model for Autoregressive Layout Generation
Haoran Feng, Yifan Niu, Zehuan Huang, Yang-Tian Sun, Chunchao Guo, Yuxin Peng, Lu Sheng
Main category: cs.CV
Abstract: We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at https://github.com/fenghora/LaviGen.
[216] EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond
Meiqi Cao, Xiangbo Shu, Jiachao Zhang, Rui Yan, Zechao Li, Jinhui Tang
Main category: cs.CV
Abstract: Not available (arXiv:2411.18328).
[217] Comparison Study: Glacier Calving Front Delineation in Synthetic Aperture Radar Images With Deep Learning
Nora Gourmelon, Konrad Heidler, Erik Loebel, Daniel Cheng, Julian Klink, Anda Dong, Fei Wu, Noah Maul, Moritz Koch, Marcel Dreier, Dakota Pyles, Thorsten Seehaus, Matthias Braun, Andreas Maier, Vincent Christlein
Main category: cs.CV
Abstract: Not available (arXiv:2501.05281).
[218] When Cultures Meet: Multicultural Text-to-Image Generation
Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat
Main category: cs.CV
Abstract: Not available (arXiv:2502.15972).
[219] Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration
Jian Liu, Wei Sun, Kai Zeng, Jin Zheng, Hui Yang, Hossein Rahmani, Ajmal Mian, Lin Wang
Main category: cs.CV
Abstract: Not available (arXiv:2503.05578).
[220] From Limited Labels to Open Domains: An Efficient Learning Method for Drone-view Geo-Localization
Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Jiawei Lang, Guoqi Li
Main category: cs.CV
Abstract: Not available (arXiv:2503.07520).
[221] MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
Anshul Kumar, Gagan Raj Gupta, Manish Rai, Apu Chakraborty, Ashutosh Modi, Abdelaali Chaoub, Soumajit Pramanik, Moyank Giri, Yashwanth Holla, Sunny Kumar, M. V. Kiran Sooraj
Main category: cs.CV
Abstract: Not available (arXiv:2511.13131).
[222] An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval
Min Cao, Yuxin Lu, Ziyin Zeng, Dong Yi, Jinqiao Wang, Mang Ye
Main category: cs.CV
Abstract: Not available (arXiv:2503.22171).
[223] PILOT: A Promptable Interleaved Layout-aware OCR Transformer
Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet
Main category: cs.CV
Abstract: Not available (arXiv:2504.03621).
[224] Art3D: Training-Free 3D Generation from Flat-Colored Illustration
Xiaoyan Cong, Jiayi Shen, Zekun Li, Rao Fu, Tao Lu, Srinath Sridhar
Main category: cs.CV
Abstract: Not available (arXiv:2504.10466).
[225] DyTact: Capturing Dynamic Contacts in Hand-Object Manipulation
Xiaoyan Cong, Angela Xing, Chandradeep Pokhariya, Rao Fu, Srinath Sridhar
Main category: cs.CV
Abstract: Not available (arXiv:2506.03103).
[226] DVP-MVS++: Synergize Depth-Normal-Edge and Harmonized Visibility Prior for Multi-View Stereo
Zhenlong Yuan, Dapeng Zhang, Zehao Li, Chengxuan Qian, Jianing Chen, Yinda Chen, Kehua Chen, Tianlu Mao, Zhaoxin Li, Hao Jiang, Zhaoqi Wang
Main category: cs.CV
Abstract: Not available (arXiv:2506.13215).
[227] GenHSI: Controllable Generation of Human-Scene Interaction Videos
Zekun Li, Rui Zhou, Rahul Sajnani, Xiaoyan Cong, Daniel Ritchie, Srinath Sridhar
Main category: cs.CV
Abstract: Not available (arXiv:2506.19840).
[228] ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation
Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani
Main category: cs.CV
Abstract: Not available (arXiv:2508.10635).
[229] Two-Stage Framework for Efficient UAV-Based Wildfire Video Analysis with Adaptive Compression and Fire Source Detection
Yanbing Bai, Rui-Yang Ju, Lemeng Zhao, Junjie Hu, Jianchao Bi, Erick Mas, Shunichi Koshimura
Main category: cs.CV
Abstract: Not available (arXiv:2508.16739).
[230] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
Paul F. R. Wilson, Matteo Ronchetti, Rüdiger Göbl, Viktoria Markova, Sebastian Rosenzweig, Raphael Prevost, Parvin Mousavi, Oliver Zettinig
Main category: cs.CV
Abstract: Not available (arXiv:2509.09530).
[231] EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
Main category: cs.CV
Abstract: Not available (arXiv:2509.14977).
[232] Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
Main category: cs.CV
Abstract: Not available (arXiv:2510.08480).
[233] Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models
Guo Li, Weihong Chen, Yongfu Fan
Main category: cs.CV
Abstract: Not available (arXiv:2510.21783).
[234] Proper Body Landmark Subset Enables More Accurate and 5X Faster Recognition of Isolated Signs in LIBRAS
Daniele L. V. dos Santos, Thiago B. Pereira, Carlos Eduardo G. R. Alves, Richard J. M. G. Tello, Francisco de A. Boldt, Thiago M. Paixão
Main category: cs.CV
Abstract: Not available (arXiv:2510.24887).
[235] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound
Hussain Alasmawi, Numan Saeed, Mohammad Yaqub
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2512.22278 returned HTTP 429).
[236] DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2512.23421 returned HTTP 429).
[237] MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction
Wenjie Luo, Chuanhu Deng, Chaorong Li, Rongyao Deng, Qiang Yang
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2601.03633 returned HTTP 429).
[238] VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck
Feiran Zhang, Yixin Wu, Zhenghua Wang, Xiaohua Wang, Changze Lv, Xuanjing Huang, Xiaoqing Zheng
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2601.05547 returned HTTP 429).
[239] VeRVE: Versatile Retrieval for Videos via Unified Embeddings
Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2601.12193 returned HTTP 429).
[240] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization
Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2602.00114 returned HTTP 429).
[241] SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, Danny T. M. Chan, Ming Feng, Wai S. Poon, Hongliang Ren, Dong Yi, Nassir Navab, Gaofeng Meng, Jiebo Luo, Hongbin Liu, Zhen Lei
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2602.05638 returned HTTP 429).
[242] Scalable spatial point process models for forensic footwear analysis
Alokesh Manna, Neil Spencer, Dipak K. Dey
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2602.07006 returned HTTP 429).
[243] LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2602.12370 returned HTTP 429).
[244] A Single Image and Multimodality Is All You Need for Novel View Synthesis
Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2602.17909 returned HTTP 429).
[245] CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
Wenhao Guo, Zhaoran Zhao, Peng Lu, Sheng Li, Qian Qiao, DeRui Li
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2602.22159 returned HTTP 429).
[246] Differential privacy representation geometry for medical image analysis
Soroosh Tayebi Arasteh, Marziyeh Mohammadi, Sven Nebelung, Daniel Truhn
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.01098 returned HTTP 429).
[247] HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
Yichen Liu, Donghao Zhou, Jie Wang, Xin Gao, Guisheng Liu, Jiatong Li, Quanwei Zhang, Qiang Lyu, Lanqing Guo, Shilei Wen, Weiqiang Wang, Pheng-Ann Heng
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.02210 returned HTTP 429).
[248] Social-JEPA: Emergent Geometric Isomorphism
Haoran Zhang, Youjin Wang, Yi Duan, Rong Fu, Dianyu Zhao, Sicheng Fan, Shuaishuai Cao, Wentao Guo, Xiao Zhou
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.02263 returned HTTP 429).
[249] Training Flow Matching: The Role of Weighting and Parameterization
Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.06454 returned HTTP 429).
[250] Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence
Seunghwan Bang, Hwanjun Song
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.13091 returned HTTP 429).
[251] Cross-modal learning for plankton recognition
Joona Kareinen, Veikka Immonen, Tuomas Eerola, Lumi Haraguchi, Lasse Lensu, Kaisa Kraft, Sanna Suikkanen, Heikki Kälviäinen
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.16427 returned HTTP 429).
[252] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Chengyin Hu, Xuemeng Sun, Jiaju Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2603.27759 returned HTTP 429).
[253] InstructTable: Improving Table Structure Recognition Through Instructions
Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai zhou, Pengfei Yan
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.02880 returned HTTP 429).
[254] HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
Md Aminur Hossain, Ayush V. Patel, Siddhant Gole, Sanjay K. Singh, Biplab Banerjee
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.06715 returned HTTP 429).
[255] Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.07786 returned HTTP 429).
[256] Neural Distribution Prior for LiDAR Out-of-Distribution Detection
Zizhao Li, Zhengkang Xiang, Jiayang Ao, Feng Liu, Joseph West, Kourosh Khoshelham
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.09232 returned HTTP 429).
[257] VAGNet: Vision-based Accident Anticipation with Global Features
Vipooshan Vipulananthan, Charith D. Chitraranjan
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.09305 returned HTTP 429).
[258] ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents
Dongjie Huo, Haoyun Liu, Guoqing Liu, Dekang Qi, Zhiming Sun, Maoguo Gao, Jianxin He, Yandan Yang, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.10096 returned HTTP 429).
[259] BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
Aaditya Baranwal, Vishal Yadav, Abhishek Rajora
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.10528 returned HTTP 429).
[260] ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
Xucheng Wang, Xiaoman Zhang, Sung Eun Kim, Ankit Pal, Pranav Rajpurkar
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.10916 returned HTTP 429).
[261] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
Main category: cs.CV
Abstract: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
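The abstract does not specify how Unified Channel-wise Conditioning is implemented; one common reading of channel-wise injection, offered purely as an illustrative sketch (the function name and the plain nested-list tensor layout are invented here, not taken from the paper), is to stack condition feature maps onto the denoising latent along the channel axis before the first projection:

```python
def channelwise_inject(latent, conditions):
    """Stack condition feature maps onto a latent along the channel axis.

    latent: list of channel maps (each a list of floats at one resolution);
    conditions: a list of condition tensors (e.g. reference image, pose),
    each a list of channel maps at the same resolution as the latent.
    """
    width = len(latent[0])
    out = [channel[:] for channel in latent]  # copy the latent's channels
    for cond in conditions:
        for channel in cond:
            # Channel-wise concatenation only works at matching resolution.
            assert len(channel) == width, "condition must match latent resolution"
            out.append(channel[:])
    return out

# A 4-channel latent plus a 3-channel image condition and a 1-channel
# pose condition yields an 8-channel input for the first projection layer.
latent = [[0.0] * 8 for _ in range(4)]
image_cond = [[1.0] * 8 for _ in range(3)]
pose_cond = [[2.0] * 8]
stacked = channelwise_inject(latent, [image_cond, pose_cond])
```

In a real diffusion backbone the stacked result would feed a convolution or linear projection whose input width equals the total channel count; that layer is omitted here.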
[262] Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, SeungHwan Kim, Honglak Lee, Bohyung Han
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.13508 returned HTTP 429).
[263] VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection
Hui Han, Shunli Wang, Yandan Zhao, Taiping Yao, Shouhong Ding
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.13660 returned HTTP 429).
[264] SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning
Xue Wu, Shengting Cao, Shenglin Li, Jiaqi Gong
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.14373 returned HTTP 429).
[265] FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
Sabab Ishraq, Aarushi Aarushi, Juncai Jiang, Chen Chen
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.14388 returned HTTP 429).
[266] Towards Design Compositing
Abhinav Mahajan, Abhikhya Tripathy, Sudeeksha Reddy Pala, Vaibhav Methi, K J Joseph, Balaji Vasan Srinivasan
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.14605 returned HTTP 429).
[267] Hybrid Latents: Geometry-Appearance-Aware Surfel Splatting
Neel Kelkar, Simon Niedermayr, Klaus Engel, Rüdiger Westermann
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.14928 returned HTTP 429).
[268] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Zhiwu Lu
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.14967 returned HTTP 429).
[269] StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
Xuanyi Liu, Chunan Yu, Deyi Ji, Qi Zhu, Lingyun Sun, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.15237 returned HTTP 429).
[270] GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
Roni Itkin, Noam Issachar, Yehonatan Keypur, Xingyu Chen, Anpei Chen, Sagie Benaim
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.15284 returned HTTP 429).
[271] Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
Yiyang Jiang, Li Zhang, Xiao-Yong Wei, Li Qing
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.15301 returned HTTP 429).
[272] TokenLight: Precise Lighting Control in Images using Attribute Tokens
Sumit Chaturvedi, Yannick Hold-Geoffroy, Mengwei Ren, Jingyuan Liu, He Zhang, Yiqun Mei, Julie Dorsey, Zhixin Shu
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2604.15310 returned HTTP 429).
[273] AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, Shuo Li
Main category: cs.CV
Abstract: Not available (the arXiv API request for 2509.01944 returned HTTP 429).
cs.AI
[274] DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI
Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman, Chi-Ping Day, Chuan Wu, Mark A Knepper, Antolin Serrano Farias, Jordina Rincon-Torroella, Hasan Slika, Betty Tyler, Ryan Huu-Tuan Nguyen, Asmita Indurkar, Mélanie Hébert, Shubo Tian, Lauren He, Noor Naffakh, Aseem Aseem, Nicholas Wan, Emily Y Chew, Tiarnan D L Keenan, Zhiyong Lu
Main category: cs.AI
Abstract: Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med’s conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.
[275] DASB – Discrete Audio and Speech Benchmark
Pooneh Mousavi, Jarod Duret, Darius Petermann, Artem Ploujnikov, Luca Della Libera, Anastasia Kuznetsova, Cem Subakan, Mirco Ravanelli
Main category: cs.AI
Abstract: Not available (the arXiv API request for 2406.14294 returned HTTP 429).
[276] GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
Shivendra Agrawal, Bradley Hayes
Main category: cs.AI
Abstract: Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system’s capacity for universal design.
[277] Bureaucratic Silences: What the Canadian AI Register Reveals, Omits, and Obscures
Dipto Das, Christelle Tessono, Syed Ishtiaque Ahmed, Shion Guha
Main category: cs.AI
Abstract: In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register’s complete dataset of 409 systems using the Algorithmic Decision-Making Adapted for the Public Sector (ADMAPS) framework, combining quantitative mapping with deductive qualitative coding. Our findings reveal a sharp divergence between the rhetoric of “sovereign AI” and the reality of bureaucratic practice: while 86% of systems are deployed internally for efficiency, the Register systematically obscures the human discretion, training, and uncertainty management required to operate them. By privileging technical descriptions over sociotechnical context, the Register constructs an ontology of AI as “reliable tooling” rather than “contestable decision-making.” We conclude that without a shift in design, such transparency artifacts risk automating accountability into a performative compliance exercise, offering visibility without contestability.
[278] LACE: Lattice Attention for Cross-thread Exploration
Yang Li, Zirui Zhang, Yang Liu, Chengzhi Mao
Main category: cs.AI
Abstract: Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.
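As an illustrative sketch only (LACE repurposes the model's own attention layers, and the abstract gives no equations, so the vector representation and function below are assumptions), the cross-thread interaction can be pictured as scaled dot-product attention in which each thread's state attends over the states of all concurrent threads, including itself:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_thread_attention(states):
    """Mix each thread's state with its siblings' states.

    states: one feature vector per reasoning thread. Each thread scores
    every thread (itself included) by scaled dot product, then takes the
    softmax-weighted average, so insights in one path leak into the others.
    """
    d = len(states[0])
    mixed = []
    for q in states:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in states]
        weights = softmax(scores)
        mixed.append([sum(w * k[i] for w, k in zip(weights, states))
                      for i in range(d)])
    return mixed

# Two orthogonal thread states: after mixing, each thread still leans
# toward its own state but now carries a share of the other's.
mixed = cross_thread_attention([[1.0, 0.0], [0.0, 1.0]])
```

This is only the interaction primitive; the paper's contribution also includes the synthetic data pipeline that teaches models to use such communication, which no code sketch can capture.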
[279] Preregistered Belief Revision Contracts
Saad Alqithami
Main category: cs.AI
Abstract: Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non-fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. In this paper, (a) we prove that under evidential contracts with conservative fallback, social-only rounds cannot increase confidence and cannot generate purely conformity-driven wrong-but-sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token-invariant contracts, (d) we prove that enforced trajectories depend only on token-exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness-liveness trade-offs.
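The acceptance rule stated in the abstract — a non-fallback step is admitted only if it cites a preregistered trigger and supplies a nonempty witness set of externally validated evidence tokens — can be sketched as a small router-side check (the field names are illustrative assumptions, not the paper's API):

```python
def accept_revision(step, contract, validated_tokens):
    """Hypothetical enforcement check for a PBRC-style router:
    (a) fallback steps are always admissible;
    (b) any other step must cite a trigger preregistered in the
        contract AND carry a nonempty witness set drawn from
        externally validated evidence tokens."""
    if step.get("fallback"):
        return True
    trigger_ok = step.get("trigger") in contract["triggers"]
    witnesses = set(step.get("witnesses", []))
    witness_ok = bool(witnesses) and witnesses <= validated_tokens
    return trigger_ok and witness_ok
```

Under this check a "social-only" step (no trigger, no witnesses) is rejected outright, which is the mechanism by which conformity-driven confidence increases are blocked.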
[280] Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Jacob Dang, Brian Y. Xie, Omar G. Younis
Main category: cs.AI
Abstract: Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student’s deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student’s chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.
[281] Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
Chenyi Huang, Haoting Zhang, Jingxu Xu, Zeyu Zheng, Yunduan Lin
Main category: cs.AI
Abstract: Agent skills are structured collections of instructions, tools, and supporting resources that help large language model (LLM) agents perform particular classes of tasks. Empirical evidence shows that the design of skills can materially affect agent task performance, yet systematically optimizing skills remains challenging. Since a skill comprises instructions, tools, and supporting resources in a structured way, optimizing it requires jointly determining both the structure of these components and the content each component contains. This gives rise to a complex decision space with strong interdependence across structure and components. We therefore represent these two coupled decisions as skill structure and component content, and formulate skill optimization as a bilevel optimization problem. We propose a bilevel optimization framework in which an outer loop employs Monte Carlo Tree Search to determine the skill structure, while an inner loop refines the component content within the structure selected by the outer loop. In both loops, we employ LLMs to assist the optimization procedure. We evaluate the proposed framework on an open-source Operations Research Question Answering dataset, and the experimental results suggest that the bilevel optimization framework improves the performance of the agents with the optimized skill.
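The outer/inner loop described in the abstract can be caricatured as a bandit-style search over structures wrapping a content-refinement step; this toy sketch (all function names and the UCB1 selection rule are assumptions standing in for the paper's MCTS and LLM-assisted refinement) shows the data flow, not the actual method:

```python
import math
import random

def ucb_select(node_stats, c=1.4):
    """UCB1 over structure candidates; unvisited nodes score infinity."""
    total = sum(n for n, _ in node_stats.values()) or 1
    def score(key):
        n, w = node_stats[key]
        if n == 0:
            return float("inf")
        return w / n + c * math.sqrt(math.log(total) / n)
    return max(node_stats, key=score)

def bilevel_skill_search(structures, refine_content, evaluate, iters=100, seed=0):
    """Toy bilevel loop: the outer search picks a skill *structure*,
    the inner call refines the *content* for that structure, and the
    evaluated reward is backed up to the outer search's statistics."""
    rng = random.Random(seed)
    stats = {s: [0, 0.0] for s in structures}  # visits, total reward
    best, best_r = None, -float("inf")
    for _ in range(iters):
        s = ucb_select(stats)
        content = refine_content(s, rng)   # inner-loop refinement (stub)
        r = evaluate(s, content)           # agent task performance (stub)
        stats[s][0] += 1
        stats[s][1] += r
        if r > best_r:
            best, best_r = (s, content), r
    return best, best_r
```

In the paper the outer selection is a full MCTS over structured skill components and both loops invoke LLMs; the stubs above only illustrate the coupling between the two decision levels.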
[282] Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
Main category: cs.AI
Abstract: As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge – extracting reusable knowledge from interaction traces – yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5–20× for episodic memory, 50–500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level – none supports adaptive cross-level compression, a gap we term the “missing diagonal”. We further show that specialization alone is insufficient – both communities independently solve shared sub-problems without exchanging solutions – that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full-spectrum agent learning systems.
[283] The World Leaks the Future: Harness Evolution for Future Prediction Agents
Chuyang Wei, Maohang Gao, Zhixin Han, Kefei Chen, Yu Zhuang, Haoxiang Guan, Yanzhi Zhang, Yilin Cheng, Jiyan He, Huanhuan Chen, Jian Li, Yu Shi, Yitong Duan, Shuxin Zheng
Main category: cs.AI
Abstract: Many consequential decisions must be made before the relevant outcome is known. Such problems are commonly framed as “future prediction”, where an LLM agent must form a prediction for an unresolved question using only the public information available at the prediction time. The setting is difficult because public evidence evolves while useful supervision arrives only after the question is resolved, so most existing approaches still improve mainly from final outcomes. Yet final outcomes are too coarse to guide earlier factor tracking, evidence gathering and interpretation, or uncertainty handling. When the same unresolved question is revisited over time, temporal contrasts between earlier and later predictions can expose omissions in the earlier prediction process; we call this signal “internal feedback”. We introduce Milkyway, a self-evolving agent system that keeps the base model fixed and instead updates a persistent “future prediction harness” for factor tracking, evidence gathering and interpretation, and uncertainty handling. Across repeated predictions on the same unresolved question, Milkyway extracts internal feedback and writes reusable guidance back into the harness, so later predictions on that question can improve before the outcome is known. After the question is resolved, the final outcome provides a “retrospective check” before the updated harness is carried forward to subsequent questions. On FutureX and FutureWorld, Milkyway achieves the best overall score among the compared methods, improving FutureX from 44.07 to 60.90 and FutureWorld from 62.22 to 77.96.
[284] LLM Reasoning Is Latent, Not the Chain of Thought
Wenshuo Wang
Main category: cs.AI
Abstract: This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.
[285] Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
Haoyu Bian, Chaoning Zhang, Jiaquan Zhang, Xingyao Li, Yuanfang Guo, Wei Dong, Yang Yang
Main category: cs.AI
Abstract: LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.
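The allocation policy the abstract describes — lower predicted weight, larger repeated-sampling quota — admits a simple inverse-weight sketch; the exact rule and the names below are assumptions, not the paper's formulation:

```python
def weak_agent(weights):
    """The agent with the lowest predicted performance weight is
    identified as the weak link."""
    return min(weights, key=weights.get)

def allocate_budgets(weights, total_budget):
    """Illustrative inverse-weight allocation: distribute a
    repeated-sampling budget in proportion to 1/weight, so the
    predicted weak link receives the most extra reasoning attempts.
    Every agent keeps at least one attempt."""
    inv = {a: 1.0 / w for a, w in weights.items()}
    z = sum(inv.values())
    return {a: max(1, round(total_budget * v / z)) for a, v in inv.items()}
```

For example, with predicted weights {planner: 0.9, solver: 0.3, critic: 0.6} and a budget of 12 samples, the solver (the weak link) receives the largest quota.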
[286] Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants
Sankalp Gilda, Shlok Gilda
Main category: cs.AI
Abstract: Large language models exhibit systematic limitations in structured logical reasoning: they conflate hypothesis generation with verification, cannot distinguish conjecture from validated knowledge, and allow weak reasoning steps to propagate unchecked through inference chains. We present a symbolic reasoning scaffold that operationalizes Peirce’s tripartite inference – abduction, deduction, and induction – as an explicit protocol for LLM-assisted reasoning. The framework enforces logical consistency through five algebraic invariants (the Gamma Quintet), the strongest of which – the Weakest Link bound – ensures that no conclusion in a reasoning chain can exceed the reliability of its least-supported premise. This principle, independently grounded as weakest link resolution in possibilistic logic and empirically validated for chain-of-thought reasoning, prevents logical inconsistencies from accumulating across multi-step inference. We verify all invariants through a property-based testing suite of 100 properties and 16 fuzz tests over 10^5+ generated cases, providing a verified reference implementation of the invariants suitable as a foundation for future reasoning benchmarks.
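The Weakest Link bound the abstract names is directly checkable: a conclusion's confidence may not exceed the reliability of its least-supported premise. A minimal sketch of that invariant (function names are illustrative, not the paper's interface):

```python
def chain_confidence(premise_reliabilities):
    """Weakest-link bound: the admissible confidence of a reasoning
    chain is capped by its least-reliable premise."""
    return min(premise_reliabilities)

def check_weakest_link(conclusion_conf, premise_reliabilities):
    """Invariant check: flag any conclusion whose stated confidence
    exceeds the weakest-link cap, i.e. a weak step whose effect
    would otherwise propagate unchecked through the chain."""
    return conclusion_conf <= chain_confidence(premise_reliabilities)
```

This is the kind of property the paper's suite verifies over generated cases: no sequence of composition steps should be able to manufacture confidence above the minimum of its inputs.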
[287] SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski, Kristian Kersting
Main category: cs.AI
Abstract: As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.
[288] KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Ankit Maloo
Main category: cs.AI
Abstract: We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify the structure of a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check; the mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
[289] Stein Variational Black-Box Combinatorial Optimization
Thomas Landais, Olivier Goudet, Adrien Goëffon, Frédéric Saubion, Sylvain Lamprier
Main category: cs.AI
Abstract: Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single region of interest, which may result in premature convergence when facing complex or multimodal objective landscapes. In this work, we incorporate the Stein operator to introduce a repulsive mechanism among particles in the parameter space, thereby encouraging the population to disperse and jointly explore several modes of the fitness landscape. Empirical evaluations across diverse benchmark problems show that the proposed method achieves performance competitive with, and in several cases superior to, leading state-of-the-art approaches, particularly on large-scale instances. These findings highlight the potential of Stein variational gradient descent as a promising direction for addressing large, computationally expensive, discrete black-box optimization problems.
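The repulsive mechanism the abstract attributes to the Stein operator is easiest to see in the standard SVGD update, where each particle receives a kernel-weighted score term (attraction toward high probability) plus a kernel-gradient term (repulsion from neighbours). This is a 1-D continuous toy of the generic update, not the paper's discrete EDA variant; all names are illustrative:

```python
import math

def rbf(x, y, h=1.0):
    """Gaussian (RBF) kernel between two scalar particles."""
    return math.exp(-(x - y) ** 2 / (2 * h * h))

def svgd_step(particles, grad_logp, h=1.0, eps=0.1):
    """One Stein variational gradient descent update.
    drive: kernel-weighted scores pull particles toward probability mass;
    repel: the kernel gradient pushes particles apart, which is what
    keeps the population spread over several modes instead of
    collapsing onto one region of interest."""
    n = len(particles)
    new = []
    for x in particles:
        drive = sum(rbf(xj, x, h) * grad_logp(xj) for xj in particles)
        repel = sum(rbf(xj, x, h) * (x - xj) / (h * h) for xj in particles)
        new.append(x + eps * (drive + repel) / n)
    return new
```

Targeting a standard normal (grad_logp = -x), two nearby particles both drift toward the mode while the repulsive term widens the gap between them.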
[290] Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Chengwu Liu, Yichun Yin, Ye Yuan, Jiaxuan Xie, Botao Li, Siqi Li, Jianhao Shen, Yan Xu, Lifeng Shang, Ming Zhang
Main category: cs.AI
Abstract: Most ATP benchmarks embed the final answer within the formal statement – a convention we call “Easy Mode” – a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting “Hard Mode”: the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode – while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.
[291] Towards Rigorous Explainability by Feature Attribution
Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva
Main category: cs.AI
Abstract: For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.
[292] Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
Ritik Raj, Souvik Kundu, Ishita Vohra, Hong Wang, Tushar Krishna
Main category: cs.AI
Abstract: Agentic AI serving converts monolithic LLM-based inference into autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, with the majority of the external tools responsible for agentic capability either running on, or orchestrated by, the CPU. Toward a deeper understanding of the CPU’s role, this paper characterizes and analyzes the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads that capture the algorithmic diversity. We then perform runtime characterization of the representative workloads, analyzing end-to-end latency and throughput on two different hardware systems to isolate the respective architectural bottlenecks. Based on these insights into the bottlenecks, we present two scheduling optimizations: 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS), for homogeneous and heterogeneous agentic workloads, respectively. Specifically, these methods improve concurrent CPU-GPU utilization while reducing skewed resource allocation for heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB, yielding up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, for heterogeneous open-loop load, MAS reduces the total latency for the minority request type by up to 2.37x/2.49x at the P50/P90 percentiles.
[293] Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval
Hamed Jelodar, Samita Bai, Mohammad Meymani, Parisa Hamedi, Roozbeh Razavi-Far, Ali Ghorbani
Main category: cs.AI
Abstract: Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs. We categorize existing methods based on their purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategies (prompting, augmentation, training, or agent-based use). By mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments, we highlight the strengths, limitations, and best-fit scenarios for each technique. This survey aims to offer researchers a practical guide for selecting the most suitable graph-LLM approach depending on task requirements, data characteristics, and reasoning complexity.
[294] ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams
Qiang Xu, Shengyuan Bai, Yu Wang, He Cao, Leqing Chen, Yuanyuan Liu, Bin Feng, Zijing Liu, Yu Li
Main category: cs.AI
Abstract: Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.
[295] MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane
Main category: cs.AI
Abstract: Metacognition, the ability to monitor and regulate one’s own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.
[296] MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation
Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng
Main category: cs.AI
Abstract: Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic “black-box” systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.
[297] Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models
Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis
Main category: cs.AI
Abstract: Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.
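The abstract names readability as one of the measured properties without giving the exact formula; as a hedged illustration, one standard readability measure (Flesch Reading Ease) can be computed for a generated CQ like this, using a deliberately crude vowel-run syllable heuristic. The example question is invented.

```python
import re

def naive_syllables(word):
    """Rough syllable count: number of vowel runs (a common heuristic;
    it overcounts silent e's, which is acceptable for a sketch)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Hypothetical competency question.
cq = "Which composers wrote works that were first performed in Vienna?"
score = flesch_reading_ease(cq)
```

Higher scores indicate easier text; comparing score distributions across models is one way the generation profiles mentioned above could be made measurable.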
[298] Learning to Reason with Insight for Informal Theorem Proving
Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song
Main category: cs.AI
TL;DR: Identifies lack of insight, i.e., recognising the core techniques a proof requires, as the main bottleneck in informal theorem proving, and addresses it with the DeepInsightTheorem dataset and progressive multi-stage SFT.
Details
Motivation: Informal theorem proving plays to LLMs’ natural-language strengths, but models struggle to recognise the core techniques required to solve complex problems.
Method: DeepInsightTheorem, a hierarchical dataset that pairs final proofs with explicitly extracted core techniques and proof sketches, is combined with a Progressive Multi-Stage SFT strategy that mimics human learning, from basic proof writing to insightful thinking.
Result: Insight-aware generation significantly outperforms baselines on challenging mathematical benchmarks.
Conclusion: Teaching models to identify and apply core techniques substantially improves mathematical reasoning.
Abstract: Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models’ (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose $\mathtt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.
[299] Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing
Thomas Bayer, Alexander Lohr, Sarah Weiß, Bernd Michelberger, Wolfram Höpken
Main category: cs.AI
TL;DR: A knowledge graph storing domain data, ML results, and explanations, combined with selective triplet retrieval and an LLM, yields user-friendly explanations of ML results, validated in a manufacturing environment.
Details
Motivation: Explaining ML results in a transparent and user-friendly manner remains a challenging XAI task, particularly in manufacturing.
Method: Domain-specific data, ML results, and explanations are stored in a KG; a selective retrieval method extracts relevant triplets, which an LLM turns into natural-language explanations. Evaluation covers 33 questions from the XAI Question Bank plus more complex, tailored questions.
Result: Responses performed well on quantitative metrics (accuracy, consistency) and qualitative ones (clarity, usefulness).
Conclusion: Letting LLMs dynamically access a KG improves the explainability of ML results and supports decision-making in real-world manufacturing processes.
Abstract: Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.
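The selective triplet retrieval step described above can be sketched minimally: given a KG as a flat list of (subject, predicate, object) triplets, collect everything reachable from the entities in the user’s question within a hop budget, then hand the triplets to an LLM as context. The entities, triplets, and prompt format below are hypothetical, not the paper’s schema.

```python
# Hypothetical mini-KG linking a machine, an ML prediction, and its explanation.
kg = [
    ("press_17", "hasPrediction", "tool_wear_high"),
    ("press_17", "explainedBy", "shap_feature_vibration"),
    ("shap_feature_vibration", "hasImportance", "0.62"),
    ("press_17", "locatedIn", "hall_B"),
]

def retrieve(kg, entities, hops=1):
    """Collect triplets whose subject or object matches a seed entity,
    expanding the seed set once per hop."""
    seeds = set(entities)
    selected = []
    for _ in range(hops + 1):
        new = [t for t in kg
               if (t[0] in seeds or t[2] in seeds) and t not in selected]
        selected.extend(new)
        seeds |= {t[0] for t in new} | {t[2] for t in new}
    return selected

triplets = retrieve(kg, ["press_17"])
prompt = "Explain the ML result using these facts:\n" + "\n".join(map(str, triplets))
```

The one-hop expansion also pulls in the SHAP importance fact attached to the explanation node, which is the kind of indirect context a flat keyword match would miss.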
[300] ASMR-Bench: Auditing for Sabotage in ML Research
Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny, Vivek Hebbar
Main category: cs.AI
TL;DR: ASMR-Bench provides 9 ML research codebases with sabotaged variants; both frontier LLMs and LLM-assisted human auditors struggle to detect the sabotage (best AUROC 0.77 and top-1 fix rate 42%, by Gemini 3.1 Pro).
Details
Motivation: Misaligned AI systems conducting research autonomously could introduce subtle flaws that produce misleading results while evading detection.
Method: Each sabotage alters implementation details (hyperparameters, training data, or evaluation code) while preserving the paper’s high-level methodology; frontier LLMs and LLM-assisted human auditors are evaluated as detectors, and LLMs are additionally tested as red teamers.
Result: Auditors struggled to reliably detect sabotage; the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%. LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors.
Conclusion: ASMR-Bench is released to support research on monitoring and auditing techniques for AI-conducted research.
Abstract: As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.
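The AUROC metric reported above has a direct probabilistic reading: the chance that an auditor assigns a sabotaged codebase a higher suspicion score than a clean one, with ties counting half. A self-contained sketch (the suspicion scores below are invented, not ASMR-Bench results):

```python
def auroc(pos_scores, neg_scores):
    """AUROC = probability a sabotaged codebase gets a higher suspicion
    score than a clean one (ties count half), via exhaustive pairing."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical auditor suspicion scores (0 to 1) on sabotaged vs clean variants.
sabotaged = [0.9, 0.6, 0.7, 0.4]
clean = [0.3, 0.5, 0.6, 0.2]
print(auroc(sabotaged, clean))
```

An AUROC of 0.77, as the best auditor achieved, means roughly three out of four random sabotaged/clean pairs are ranked correctly, well short of reliable detection.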
[301] Large Language Models for Market Research: A Data-augmentation Approach
Mengxin Wang, Dennis J. Zhang, Heng Zhang
Main category: cs.AI
TL;DR: A statistical data-augmentation framework integrates LLM-generated and human conjoint data into consistent, asymptotically normal estimators, saving 24.9% to 79.8% of data and costs where naive substitution fails.
Details
Motivation: Survey-based conjoint analysis is costly and hard to scale; LLM-generated responses are a promising alternative but introduce bias when naively substituted for human data.
Method: A data-augmentation estimator integrates LLM-generated data with real data, with consistency and asymptotic-normality guarantees and a finite-sample bound on the estimation error.
Result: On COVID-19 vaccine preference data, the framework reduces estimation error and saves 24.9% to 79.8% of data and costs, while naive substitution saves nothing due to the inherent bias of LLM-generated data; a sports-car choice study confirms robustness.
Conclusion: LLM-generated data is not a direct substitute for human responses, but it is a valuable complement within a robust statistical framework.
Abstract: Large Language Models (LLMs) have transformed artificial intelligence by excelling in complex natural language processing tasks. Their ability to generate human-like text has opened new possibilities for market research, particularly in conjoint analysis, where understanding consumer preferences is essential but often resource-intensive. Traditional survey-based methods face limitations in scalability and cost, making LLM-generated data a promising alternative. However, while LLMs have the potential to simulate real consumer behavior, recent studies highlight a significant gap between LLM-generated and human data, with biases introduced when substituting between the two. In this paper, we address this gap by proposing a novel statistical data augmentation approach that efficiently integrates LLM-generated data with real data in conjoint analysis. This results in statistically robust estimators with consistent and asymptotically normal properties, in contrast to naive approaches that simply substitute human data with LLM-generated data, which can exacerbate bias. We further present a finite-sample performance bound on the estimation error. We validate our framework through an empirical study on COVID-19 vaccine preferences, demonstrating its superior ability to reduce estimation error and save data and costs by 24.9% to 79.8%. In contrast, naive approaches fail to save data due to the inherent biases in LLM-generated data compared to human data. Another empirical study on sports car choices validates the robustness of our results. Our findings suggest that while LLM-generated data is not a direct substitute for human responses, it can serve as a valuable complement when used within a robust statistical framework.
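The abstract does not spell out the estimator, so as a hedged illustration only: one generic way to combine a large biased LLM sample with a small human sample is to correct the LLM mean by the human-vs-LLM gap measured on paired responses, in the spirit of prediction-powered inference rather than the authors’ exact method. All numbers below are synthetic.

```python
import random

random.seed(0)
true_mean = 0.40   # population mean preference on a synthetic utility scale
llm_bias = 0.10    # LLM responses systematically over-report (synthetic)

# Small paired sample: human response y_i and an LLM response for the same profile.
n = 50
human = [true_mean + random.gauss(0, 0.2) for _ in range(n)]
llm_paired = [y + llm_bias + random.gauss(0, 0.05) for y in human]

# Large LLM-only sample.
N = 5000
llm_only = [true_mean + llm_bias + random.gauss(0, 0.2) for _ in range(N)]

mean = lambda xs: sum(xs) / len(xs)
# LLM mean, debiased by the gap observed on the paired sample.
augmented = mean(llm_only) + (mean(human) - mean(llm_paired))
naive = mean(llm_only)  # substitutes LLM data for human data; keeps the bias
```

Here `naive` stays off by roughly the full LLM bias no matter how large N grows, while `augmented` converges to the truth, which is the qualitative failure mode of substitution that the paper’s framework is designed to avoid.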
[302] WiseMind: a knowledge-guided multi-agent framework for accurate and empathetic psychiatric diagnosis
Yuqi Wu, Guangya Wan, Jingjing Li, Shengming Zhao, Lingfeng Ma, Tianyi Ye, Ion Pop, Yanbo Zhang, Jie Chen
Main category: cs.AI
TL;DR: WiseMind, a DBT-inspired multi-agent framework pairing a “Reasonable Mind” agent with an “Emotional Mind” agent and steered by a DSM-5 knowledge graph, reaches 85.6% top-1 psychiatric diagnostic accuracy.
Details
Motivation: LLMs often lack the structured clinical reasoning needed for reliable psychiatric diagnosis and the emotionally attuned communication essential for patient trust.
Method: An evidence-based “Reasonable Mind” Agent and an empathetic “Emotional Mind” Agent are guided by a DSM-5-based Structured Knowledge Graph that steers diagnostic inquiries; evaluation uses virtual standard patients, simulated interactions, and real human sessions across three common psychiatric conditions.
Result: 85.6% top-1 diagnostic accuracy across 1206 simulated conversations and 180 real user sessions, surpassing knowledge-enhanced single-agent baselines by 15-54 percentage points and approaching reported ranges for board-certified psychiatrists; expert review confirms the responses are clinically sound and psychologically supportive.
Conclusion: Empathetic, reliable AI agents can feasibly conduct psychiatric assessments under appropriate human oversight.
Abstract: Large Language Models (LLMs) offer promising opportunities to support mental healthcare workflows, yet they often lack the structured clinical reasoning needed for reliable diagnosis and may struggle to provide the emotionally attuned communication essential for patient trust. Here, we introduce WiseMind, a novel multi-agent framework inspired by the theory of Dialectical Behavior Therapy designed to facilitate psychiatric assessment. By integrating a “Reasonable Mind” Agent for evidence-based logic and an “Emotional Mind” Agent for empathetic communication, WiseMind effectively bridges the gap between instrumental accuracy and humanistic care. Our framework utilizes a Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5)-guided Structured Knowledge Graph to steer diagnostic inquiries, significantly reducing hallucinations compared to standard prompting methods. Using a combination of virtual standard patients, simulated interactions, and real human interaction datasets, we evaluate WiseMind across three common psychiatric conditions. WiseMind outperforms state-of-the-art LLM methods in both identifying critical diagnostic nodes and establishing accurate differential diagnoses. Across 1206 simulated conversations and 180 real user sessions, the system achieves 85.6% top-1 diagnostic accuracy, approaching reported diagnostic performance ranges of board-certified psychiatrists and surpassing knowledge-enhanced single-agent baselines by 15-54 percentage points. Expert review by psychiatrists further validates that WiseMind generates responses that are not only clinically sound but also psychologically supportive, demonstrating the feasibility of empathetic, reliable AI agents to conduct psychiatric assessments under appropriate human oversight.
[303] Agentic AI Optimisation (AAIO): what it is, how it works, why it matters, and how to deal with it
Luciano Floridi, Carlotta Buttaboni, Nicolas Gertler, Emmie Hine, Jessica Morley, Claudio Novelli, Tyler Schroder
Main category: cs.AI
TL;DR: Introduces Agentic AI Optimisation (AAIO), an SEO-like methodology for making websites work effectively with autonomous AI agents, and analyses its governance, ethical, legal, and social implications.
Details
Motivation: Agentic AI systems that independently initiate digital interactions require a new optimisation paradigm designed for seamless agent-platform interaction.
Method: Conceptual analysis of the mutual interdependency between website optimisation and agentic AI success, by analogy with SEO, together with an examination of the GELSI dimensions.
Result: AAIO can create a virtuous cycle between optimised platforms and successful agents, but proactive regulatory frameworks are needed to mitigate potential negative impacts.
Conclusion: AAIO is an essential part of digital infrastructure in the era of autonomous agents, and access to its benefits should be equitable and inclusive.
Abstract: The emergence of Agentic Artificial Intelligence (AAI) systems capable of independently initiating digital interactions necessitates a new optimisation paradigm designed explicitly for seamless agent-platform interactions. This article introduces Agentic AI Optimisation (AAIO) as an essential methodology for ensuring effective integration between websites and agentic AI systems. Like how Search Engine Optimisation (SEO) has shaped digital content discoverability, AAIO can define interactions between autonomous AI agents and online platforms. By examining the mutual interdependency between website optimisation and agentic AI success, the article highlights the virtuous cycle that AAIO can create. It further explores the governance, ethical, legal, and social implications (GELSI) of AAIO, emphasising the necessity of proactive regulatory frameworks to mitigate potential negative impacts. The article concludes by affirming AAIO’s essential role as part of a fundamental digital infrastructure in the era of autonomous digital agents, advocating for equitable and inclusive access to its benefits.
[304] AI Agents and Hard Choices
Kangyu Wang
Main category: cs.AI
TL;DR: Argues that optimiser-based AI agents face an Identification Problem (they cannot detect incommensurable options) and a Resolution Problem (they lack the autonomy to genuinely resolve hard choices).
Details
Motivation: Hard choices, where options are incommensurable because multiple objectives are pursued simultaneously, challenge the fundamental design of AI agents as optimisers.
Method: A technologically engaged philosophical analysis of Multi-Objective Optimisation (MOO) agents, identifying the blockage, untrustworthiness, and unreliability problems, assessing mitigations such as Human-in-the-Loop, and conceptually exploring an ensemble alternative.
Result: MOO-based agents are structurally unable to identify incommensurability, and even if identification were solved, they could only pick arbitrarily rather than resolve hard choices without self-modifying their objectives.
Conclusion: Granting AI the autonomy to resolve hard choices involves opaque normative trade-offs that deserve scrutiny.
Abstract: Can AI agents deal with hard choices – cases where options are incommensurable because multiple objectives are pursued simultaneously? Adopting a technologically engaged approach distinct from existing philosophical literature, I submit that the fundamental design of current AI agents as optimisers creates two limitations: the Identification Problem and the Resolution Problem. First, I demonstrate that agents relying on Multi-Objective Optimisation (MOO) are structurally unable to identify incommensurability. This inability generates three specific alignment problems: the blockage problem, the untrustworthiness problem, and the unreliability problem. I argue that standard mitigations, such as Human-in-the-Loop, are insufficient for many decision environments. As a constructive alternative, I conceptually explore an ensemble solution. Second, I argue that even if the Identification Problem is solved, AI agents face the Resolution Problem: they lack the autonomy to resolve hard choices rather than arbitrarily picking through self-modification of objectives. I conclude by examining the opaque normative trade-offs involved in granting AI this level of autonomy.
[305] Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning
Jun Rao, Xuebo Liu, Hexuan Deng, Zepeng Lin, Zixiong Yu, Jiansheng Wei, Xiaojun Meng, Min Zhang
Main category: cs.AI
TL;DR: SAI-DPO dynamically samples training data using knowledge semantic alignment and self-aware difficulty derived from pass rates, outperforming static baselines by up to nearly 6 points with significantly less data.
Details
Motivation: Static, externally defined data-selection metrics cannot adapt to a model’s evolving capabilities during training, limiting the efficiency of SFT and RL.
Method: SAI-DPO iteratively recalibrates the training distribution with two metrics: Knowledge Semantic Alignment, targeting domain weaknesses, and Self-Aware Difficulty, derived from pass rates and reasoning-path characteristics relative to the model’s current state.
Result: Across eight benchmarks (including AIME24 and AMC23), SAI-DPO outperforms static baselines by up to nearly 6 points.
Conclusion: Aligning data selection with a model’s current competence yields state-of-the-art training efficiency.
Abstract: In mathematical reasoning, data selection strategies predominantly rely on static, externally defined metrics, which fail to adapt to the evolving capabilities of models during training. This misalignment limits the efficiency of Supervised Fine-Tuning and Reinforcement Learning. To bridge this gap, we introduce SAI-DPO (Self-Aware Iterative Data Persistent Optimization), a dynamic sampling framework that aligns training data with the model’s intrinsic competence. SAI-DPO operationalizes two novel metrics: Knowledge Semantic Alignment for targeting domain weaknesses, and Self-Aware Difficulty, derived from pass rates and reasoning path characteristics, to gauge instance complexity relative to the model’s current state. By iteratively recalibrating the data distribution based on real-time feedback, SAI-DPO dynamically aligns training samples with the model’s evolving competence, ensuring the data remains strictly relevant to the model’s current capability level. Extensive experiments on eight benchmarks (including AIME24 and AMC23) demonstrate that SAI-DPO outperforms static baselines by up to nearly 6 points, achieving state-of-the-art efficiency with significantly less data.
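The abstract says difficulty is derived from pass rates but does not give the formula; a minimal sketch of pass-rate-driven sampling, under the common assumption that problems near the model’s ~50% frontier are most informative, might look like this. The Gaussian weighting, bandwidth, and pass rates are all hypothetical.

```python
import math
import random

def difficulty_weight(pass_rate):
    """Favor problems near the model's competence frontier: a problem the
    model solves ~50% of the time gets maximal weight; mastered (rate ~1)
    or hopeless (rate ~0) problems get low weight. Bandwidth 0.2 is an
    arbitrary illustrative choice."""
    return math.exp(-((pass_rate - 0.5) ** 2) / (2 * 0.2 ** 2))

def resample(problems, pass_rates, k, seed=0):
    """Draw the next training batch with probability proportional to weight,
    recomputed each round from the current model's pass rates."""
    rng = random.Random(seed)
    weights = [difficulty_weight(p) for p in pass_rates]
    return rng.choices(problems, weights=weights, k=k)

problems = ["p1", "p2", "p3", "p4"]
pass_rates = [0.05, 0.5, 0.55, 0.98]  # measured on the current model (hypothetical)
batch = resample(problems, pass_rates, k=8)
```

Re-measuring pass rates and re-weighting each round is what makes such a scheme "self-aware": the batch composition tracks the model’s evolving competence instead of a fixed difficulty label.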
[306] TabularMath: Understanding Math Reasoning over Tables with Large Language Models
Shi-Yu Tian, Zhi Zhou, Wei Dong, Kun-Yang Yu, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li
Main category: cs.AI
TL;DR: AutoT2T converts math word problems into verified tabular reasoning tasks, producing TabularMath, a benchmark that reveals LLM weaknesses on complex and low-quality tables.
Details
Motivation: Real-world applications demand multi-step numerical reasoning over tables, but evaluation is constrained by hard-to-scale manual table collection and missing coverage of real-world traps such as incomplete or inconsistent information.
Method: AutoT2T, a neuro-symbolic pipeline, controllably transforms math word problems into scalable, verified tabular tasks; TabularMath comprises four subsets with text-based and image-based tables covering table complexity, quality, and representation dimensions.
Result: Table complexity and reasoning difficulty jointly affect performance; low-quality tables pose severe reliability risks; modalities show similar trends, with text-based tables typically easier to reason over.
Conclusion: In-depth analyses of each observation provide guidance for future research on mathematical reasoning over tables.
Abstract: Mathematical reasoning has long been a key benchmark for evaluating large language models. Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks. Building on this pipeline, we develop TabularMath, a benchmark comprising four subsets that include both text-based and image-based tables, covering table complexity, table quality, and table representation dimensions. Our study reveals three key observations: (1) Table complexity and reasoning difficulty impact reasoning performance jointly; (2) Low-quality tables pose severe risks to reliable reasoning in current LLMs; (3) Different table modalities show similar trends, with text-based tables typically being easier for models to reason over. In-depth analyses are conducted for each observation to guide future research.
[307] Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
Saloni Dash, Amélie Reymond, Emma S. Spiro, Aylin Caliskan
Main category: cs.AI
TL;DR: Persona-assigned LLMs show human-like motivated reasoning: up to 9% lower veracity discernment and up to 90% higher identity-congruent evaluation of scientific evidence, largely resistant to prompt-based debiasing.
Details
Motivation: Motivated reasoning undermines collective judgment on issues such as climate change and vaccine safety, and the extent to which LLMs selectively reason toward identity-congruent conclusions is largely unexplored.
Method: 8 personas spanning 4 political and socio-demographic attributes are assigned to 8 open-source and proprietary LLMs, which are tested on misinformation-headline veracity discernment and numeric scientific-evidence evaluation, two tasks drawn from human-subject studies.
Result: Persona-assigned LLMs show up to 9% reduced veracity discernment; political personas are up to 90% more likely to evaluate gun-control evidence correctly when the ground truth is congruent with their induced identity; debiasing prompts are largely ineffective.
Conclusion: Persona-assigned LLMs exhibit hard-to-mitigate, human-like motivated reasoning, raising concerns about exacerbating identity-congruent reasoning in both LLMs and humans.
Abstract: Reasoning in humans is prone to biases due to underlying motivations like identity protection, that undermine rational decision-making and judgment. This \textit{motivated reasoning} at a collective level can be detrimental to society when debating critical issues such as human-driven climate change or vaccine safety, and can further aggravate political polarization. Prior studies have reported that large language models (LLMs) are also susceptible to human-like cognitive biases, however, the extent to which LLMs selectively reason toward identity-congruent conclusions remains largely unexplored. Here, we investigate whether assigning 8 personas across 4 political and socio-demographic attributes induces motivated reasoning in LLMs. Testing 8 LLMs (open source and proprietary) across two reasoning tasks from human-subject studies – veracity discernment of misinformation headlines and evaluation of numeric scientific evidence – we find that persona-assigned LLMs have up to 9% reduced veracity discernment relative to models without personas. Political personas specifically are up to 90% more likely to correctly evaluate scientific evidence on gun control when the ground truth is congruent with their induced political identity. Prompt-based debiasing methods are largely ineffective at mitigating these effects. Taken together, our empirical findings are the first to suggest that persona-assigned LLMs exhibit human-like motivated reasoning that is hard to mitigate through conventional debiasing prompts – raising concerns of exacerbating identity-congruent reasoning in both LLMs and humans.
[308] Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints
Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yinchun Wang
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2507.16727 failed with HTTP 429 (rate limited).
[309] Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions
Ji Huang, Mengfei Li, Shuai Shao
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2510.21977 failed with HTTP 429 (rate limited).
[310] VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2510.27617 failed with HTTP 429 (rate limited).
[311] Cost-Aware Model Orchestration for LLM-based Systems
Daria Smirnova, Hamid Nasiri, Marta Adamska, Zhengxin Yu, Peter Garraghan
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2512.01099 failed with HTTP 429 (rate limited).
[312] AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units
Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, Dongyang Tao, Xiansong Huang, Fan Xu, Feidiao Yang, Yao Lu, Chang-Dong Wang, Yutong Lu, Weicheng Xue, Bin Zhou, Yonghong Tian
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2601.07160 failed with HTTP 429 (rate limited).
[313] The Informational Cost of Agency: A Bounded Measure of Interaction Efficiency for Deployed Reinforcement Learning
Wael Hafez, Cameron Reid, Amit Nazeri
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2603.01283 failed with HTTP 429 (rate limited).
[314] vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
Suhwan Choi, Yunsung Lee, Yubeen Park, Chris Dongjoo Kim, Ranjay Krishna, Dieter Fox, Youngjae Yu
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2603.13966 failed with HTTP 429 (rate limited).
[315] Seed1.8 Model Card: Towards Generalized Real-World Agency
Bytedance Seed
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2603.20633 failed with HTTP 429 (rate limited).
[316] ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence
ARC Prize Foundation
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2603.24621 failed with HTTP 429 (rate limited).
[317] From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?
Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2604.01608 failed with HTTP 429 (rate limited).
[318] The Amazing Agent Race: Strong Tool Users, Weak Navigators
Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
Main category: cs.AI
TL;DR: The Amazing Agent Race (AAR) is a 1,400-instance benchmark of DAG-structured, fork-merge tool-use puzzles over Wikipedia; the best agent framework reaches only 37.2% accuracy, with navigation rather than tool use as the dominant failure mode.
Details
Motivation: Existing tool-use benchmarks are overwhelmingly linear (55 to 100% of instances are simple chains of 2 to 5 steps), hiding weaknesses that only compositional tool chains expose.
Method: 1,400 legs (800 sequential, 600 compositional DAG) are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation; finish-line accuracy, pit-stop visit rate, and roadblock completion rate separately diagnose navigation, tool-use, and arithmetic failures.
Result: The best of three evaluated agent frameworks achieves only 37.2% accuracy; navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and architecture matters as much as scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens).
Conclusion: Agents fail at navigating to the right pages rather than at calling tools, a blind spot invisible to linear benchmarks.
Abstract: Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or “legs”) with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
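A fork-merge DAG leg of the kind described above can be sketched as a dependency graph of tool calls executed in topological order: two independent lookups fork from the start, then merge into one arithmetic step. The cities, population figures, and task names are illustrative, not AAR content.

```python
from graphlib import TopologicalSorter

# Hypothetical lookup tool; figures are illustrative placeholders.
def lookup_population(city):
    return {"Tokyo": 37_400_000, "Delhi": 31_200_000}[city]

# Each task maps to (callable taking prior results, list of predecessor tasks).
tasks = {
    "pop_a": (lambda r: lookup_population("Tokyo"), []),
    "pop_b": (lambda r: lookup_population("Delhi"), []),
    "merge": (lambda r: r["pop_a"] - r["pop_b"], ["pop_a", "pop_b"]),
}

# graphlib expects {node: predecessors}; static_order() yields deps first.
graph = {name: deps for name, (_, deps) in tasks.items()}
results = {}
for name in TopologicalSorter(graph).static_order():
    fn, _ = tasks[name]
    results[name] = fn(results)
print(results["merge"])
```

The point of the structure is that the merge step cannot be answered from any single page: an agent that navigates to only one of the two fork branches fails the leg even if its tool calls are flawless, which is exactly the failure mode linear benchmarks never test.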
[319] Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models
Lei Lin, Jizhao Zhu, Yong Liu, Donghong Sun, Hongbo He, Yihua Du
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2604.12390 failed with HTTP 429 (rate limited).
[320] Mind DeepResearch Technical Report
MindDR Team, Li Auto Inc
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2604.14518 failed with HTTP 429 (rate limited).
[321] Targeted Exploration via Unified Entropy Control for Reinforcement Learning
Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Ge Lan, Yue Wang
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2604.14646 failed with HTTP 429 (rate limited).
[322] COEVO: Co-Evolutionary Framework for Joint Functional Correctness and PPA Optimization in LLM-Based RTL Generation
Heng Ping, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Shukai Duan, Xiaole Zhang, Paul Bogdan
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2604.15001 failed with HTTP 429 (rate limited).
[323] Transformer Neural Processes - Kernel Regression
Daniel Jenson, Jhonathan Navott, Mengyan Zhang, Makkunda Sharma, Elizaveta Semenova, Seth Flaxman
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2411.12502 failed with HTTP 429 (rate limited).
[324] Prices, Bids, Values: One ML-Powered Combinatorial Auction to Rule Them All
Ermis Soumalias, Jakob Heiss, Jakob Weissteiner, Sven Seuken
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2411.09355 failed with HTTP 429 (rate limited).
[325] VeriGraph: Scene Graphs for Execution Verifiable Robot Planning
Daniel Ekpo, Mara Levy, Saksham Suri, Chuong Huynh, Archana Swaminathan, Abhinav Shrivastava
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2411.10446 failed with HTTP 429 (rate limited).
[326] Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning
Xinghao Wu, Jianwei Niu, Xuefeng Liu, Guogang Zhu, Jiayuan Zhang, Shaojie Tang, Wei Chen
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2503.13543 failed with HTTP 429 (rate limited).
[327] A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG
Abdul Basit, Nouhaila Innan, Muhammad Haider Asif, Minghao Shao, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Main category: cs.AI
TL;DR: Not available (summary generation failed).
Abstract: Not available; the arXiv metadata fetch for 2503.02497 failed with HTTP 429 (rate limited).
[328] Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2504.13541 was rate-limited, HTTP 429).
[329] TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
Md Afif Al Mamun, Gias Uddin, Lan Xia, Longyu Zhang
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2508.16860 was rate-limited, HTTP 429).
[330] The threat of analytic flexibility in using large language models to simulate human data
Jamie Cummins
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2509.13397 was rate-limited, HTTP 429).
[331] Bridging the phenotype-target gap for molecular generation via multi-objective reinforcement learning
Haotian Guo, Hui Liu
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2509.21010 was rate-limited, HTTP 429).
[332] WARBERT: A Hierarchical BERT-based Model for Web API Recommendation
Zishuo Xu, Yuhong Gu, Dezhong Yao
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2509.23175 was rate-limited, HTTP 429).
[333] Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, Zhongzhi Li, Zaibin Zhang, Guibin Zhang, Chen Zhang, Zhenfei Yin, Philip Torr, Lei Bai
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2509.25300 was rate-limited, HTTP 429).
[334] AISysRev – LLM-based Tool for Title-abstract Screening
Aleksi Huotala, Miikka Kuutila, Olli-Pekka Turtio, Simo Sipilä, Mika Mäntylä
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2510.06708 was rate-limited, HTTP 429).
[335] When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
Haoran Ou, Kangjie Chen, Xingshuo Han, Gelei Deng, Jie Zhang, Han Qiu, Tianwei Zhang
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2510.09689 was rate-limited, HTTP 429).
[336] DB-FGA-Net: Dual Backbone Frequency Gated Attention Network for Multi-Class Brain Tumor Classification with Grad-CAM Interpretability
Saraf Anzum Shreya, MD. Abu Ismail Siddique, Sharaf Tasnim
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2510.20299 was rate-limited, HTTP 429).
[337] The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination
Chenlong Yin, Zeyang Sha, Shiwen Cui, Changhua Meng, Zechao Li
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2510.22977 was rate-limited, HTTP 429).
[338] PULSE: Privileged Knowledge Transfer from Rich to Deployable Sensors for Embodied Multi-Sensory Learning
Zihan Zhao, Kaushik Pendiyala, Masood Mortazavi, Ning Yan
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2510.24058 was rate-limited, HTTP 429).
[339] Enabling Predictive Maintenance in District Heating Substations: A Labelled Dataset and Fault Detection Evaluation Framework based on Service Data
Cyriana M.A. Roelofs, Edison Guevara Bastidas, Thomas Hugo, Stefan Faulstich, Anna Cadenbach
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2511.14791 was rate-limited, HTTP 429).
[340] Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation
Andrew S. Cassidy, Guillaume Garreau, Jay Sivagnaname, Mike Grassi, Bernard Brezzo, John V. Arthur, Dharmendra S. Modha
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2512.03053 was rate-limited, HTTP 429).
[341] Information-Consistent Language Model Recommendations through Group Relative Policy Optimization
Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2512.12858 was rate-limited, HTTP 429).
[342] AutoFed: Personalized Federated Traffic Prediction via Adaptive Prompt
Zijian Zhao, Yitong Shang, Sen Li
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2512.24625 was rate-limited, HTTP 429).
[343] CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2602.01766 was rate-limited, HTTP 429).
[344] Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2602.05523 was rate-limited, HTTP 429).
[345] KRONE: Scalable LLM-Augmented Log Anomaly Detection via Hierarchical Abstraction
Lei Ma, Jinyang Liu, Tieying Zhang, Peter M. VanNostrand, Dennis M. Hofmann, Lei Cao, Elke A. Rundensteiner, Jianjun Chen
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2602.07303 was rate-limited, HTTP 429).
[346] Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP
Zeynab Anbiaee, Mahdi Rabbani, Mansur Mirani, Gunjan Piya, Igor Opushnyev, Ali Ghorbani, Sajjad Dadkhah
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2602.11327 was rate-limited, HTTP 429).
[347] Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Indranil Halder, Annesya Banerjee, Cengiz Pehlevan
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2603.11331 was rate-limited, HTTP 429).
[348] Puppets or partners? Governing cyborg propaganda in the digital public square
Jonas R. Kunst, Kinga Bierwiaczonek, Meeyoung Cha, Omid V. Ebrahimi, Marc Fawcett-Atkinson, Asbjørn Følstad, Anton Gollwitzer, Nils Köbis, Gary Marcus, Jon Roozenbeek, Daniel Thilo Schroeder, Jay J. Van Bavel, Sander van der Linden, Rory White, Live Leonhardsen Wilhelmsen
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2602.13088 was rate-limited, HTTP 429).
[349] ArrayTac: A Closed-loop Piezoelectric Tactile Platform for Continuously Tunable Rendering of Shape, Stiffness, and Friction
Tianhai Liang, Shiyi Guo, Baiye Cheng, Zhengrong Xue, Han Zhang, Huazhe Xu
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2603.13829 was rate-limited, HTTP 429).
[350] Cognitive Agency Surrender: Defending Epistemic Sovereignty via Scaffolded AI Friction
Kuangzhe Xu, Yu Shen, Longjie Yan, Yinghui Ren
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2603.21735 was rate-limited, HTTP 429).
[351] Neural Computers
Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.06425 was rate-limited, HTTP 429).
[352] MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis
Congying Xu, Hengcheng Zhu, Songqiang Chen, Jiarong Wu, Valerio Terragni, Shing-Chi Cheung
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.10126 was rate-limited, HTTP 429).
[353] The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.10577 was rate-limited, HTTP 429).
[354] Layerwise Dynamics for In-Context Classification in Transformers
Patrick Lutz, Themistoklis Haris, Arjun Chandra, Aditya Gangrade, Venkatesh Saligrama
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.11613 was rate-limited, HTTP 429).
[355] Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM “VaCoAl” for Ultra-High Speed, Ultra-Low Power, and Low Cost>
Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.11665 was rate-limited, HTTP 429).
[356] SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
You Qin, Linqing Wang, Hao Fei, Roger Zimmermann, Liefeng Bo, Qinglin Lu, Chunyu Wang
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.12617 was rate-limited, HTTP 429).
[357] Selectivity and Shape in the Design of Forward-Forward Goodness Functions
Talha Ruzgar Akkus, Suayp Talha Kocabay, Kamer Ali Yuksel, Hassan Sawaf
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.13081 was rate-limited, HTTP 429).
[358] KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Bing Li, Ulf Schlichtmann
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.13226 was rate-limited, HTTP 429).
[359] Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
Sourav Ganguly, Kartik Pandit, Arnob Ghosh
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.14243 was rate-limited, HTTP 429).
[360] Aerial Multi-Functional RIS in Fluid Antennas-Aided Full-Duplex Networks: A Self-Optimized Hybrid Deep Reinforcement Learning Approach
Li-Hsiang Shen, Yu-Quan Zheng
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.14309 was rate-limited, HTTP 429).
[361] Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery
Pushpa Kumar Balan, Aijing Feng
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.14334 was rate-limited, HTTP 429).
[362] Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?
Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett
Main category: cs.AI
Abstract: Unavailable (arXiv API request for 2604.14892 was rate-limited, HTTP 429).
cs.SD
[363] ArtifactNet: Detecting AI-Generated Music via Forensic Residual Physics
Heewon Oh
Main category: cs.SD
Abstract: We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics – extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) extracts codec residuals from magnitude spectrograms, which are then decomposed via HPSS into 7-channel forensic features for classification by a compact CNN (0.4M parameters; 4.0M total). We introduce ArtifactBench, a multi-generator evaluation benchmark comprising 6,183 tracks (4,383 AI from 22 generators and 1,800 real from 6 diverse sources). Each track is tagged with bench_origin for fair zero-shot evaluation. On the unseen test partition (n=2,263), ArtifactNet achieves F1 = 0.9829 with FPR = 1.49%, compared to CLAM (F1 = 0.7576, FPR = 69.26%) and SpecTTTra (F1 = 0.7713, FPR = 19.43%) evaluated under identical conditions with published checkpoints. Codec-aware training (4-way WAV/MP3/AAC/Opus augmentation) further reduces cross-codec probability drift by 83% (Delta = 0.95 -> 0.16), resolving the primary codec-invariance failure mode. These results establish forensic physics – direct extraction of codec-level artifacts – as a more generalizable and parameter-efficient paradigm for AI music detection than representation learning, using 49x fewer parameters than CLAM and 4.8x fewer than SpecTTTra.
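The HPSS decomposition used above can be sketched with the standard median-filtering formulation (Fitzgerald, 2010). This is a generic illustration, not ArtifactNet's actual configuration: the kernel size, soft-mask form, and the toy spectrogram are all assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_masks(S, kernel=17):
    """Median-filtering HPSS on a magnitude spectrogram S of shape
    (freq, time): harmonic energy is smooth along time, percussive
    energy is smooth along frequency."""
    H = median_filter(S, size=(1, kernel))   # smooth across time  -> harmonic
    P = median_filter(S, size=(kernel, 1))   # smooth across freq  -> percussive
    eps = 1e-8
    return S * (H / (H + P + eps)), S * (P / (H + P + eps))  # soft masks

# Toy residual spectrogram: a sustained tone (horizontal line) plus a
# broadband transient (vertical line).
S = np.zeros((64, 64))
S[20, :] = 1.0                               # tone
S[:, 40] = 1.0                               # click
Sh, Sp = hpss_masks(S)
# The tone row dominates the harmonic part; the click column the percussive part.
assert Sh[20, :].sum() > Sp[20, :].sum()
assert Sp[:, 40].sum() > Sh[:, 40].sum()
```

In the paper's pipeline these components are computed on the codec residual extracted by ArtifactUNet, not on the raw signal, and then stacked into the 7-channel forensic features fed to the classifier.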
[364] Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models
Yanda Li, Yuhan Liu, Zirui Song, Yunchao Wei, Martin Takáč, Salem Lahlou
Main category: cs.SD
Abstract: Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a temporal smoothing bias: transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leading to less specific audio-grounded outputs. We propose Temporal Contrastive Decoding (TCD), a training-free decoding method for unified LALMs that mitigates this effect at inference time. TCD constructs a temporally blurred slow-path view by smoothing the input waveform and re-encoding it, then contrasts next-token logits from the original and slow-path views. The contrastive signal is applied as a token-level logit update restricted to a small candidate set. A self-normalized stability score sets the blur window and update scale, and a step-wise gate based on uncertainty and audio reliance activates the update only when needed. Experiments on MMAU and AIR-Bench show consistent improvements on strong unified LALMs. We further conduct ablations and an architectural applicability study to analyze the contributions of key components and how TCD behaves across large audio-language model designs.
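The token-level contrast described in the abstract can be sketched in a few lines of numpy. The candidate-set restriction and the "original minus slow-path" direction follow the abstract, but the specific update rule, alpha, and k below are simplifying assumptions (the paper's gating and self-normalized scaling are omitted).

```python
import numpy as np

def tcd_step(logits_orig, logits_slow, alpha=1.0, k=10):
    """Contrast next-token logits from the original audio against a
    temporally blurred ("slow-path") view, updating only a small
    candidate set chosen by the original model."""
    adjusted = logits_orig.astype(float).copy()
    cand = np.argsort(logits_orig)[-k:]                  # top-k candidates
    adjusted[cand] += alpha * (logits_orig[cand] - logits_slow[cand])
    return int(np.argmax(adjusted))

# token 0: favored by the language prior (survives blurring)
# token 1: supported by a transient acoustic cue (lost when blurred)
orig = np.array([2.0, 1.9, 0.0, 0.0])
slow = np.array([2.0, 0.5, 0.0, 0.0])
assert int(np.argmax(orig)) == 0   # plain greedy decoding picks the prior
assert tcd_step(orig, slow) == 1   # the contrast recovers the audio-grounded token
```

The intuition: tokens whose evidence vanishes under temporal blurring are exactly the transient, audio-grounded ones, so boosting the original-vs-blurred difference counteracts the smoothing bias.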
[365] VoxMind: An End-to-End Agentic Spoken Dialogue System
Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen, Zhiyang Jia, Jingyu Lu, Fan Zhuo, Xueyi Pu, Yangzhuo Li, Zhou Zhao
Main category: cs.SD
Abstract: Unavailable (arXiv API request for 2604.15710 was rate-limited, HTTP 429).
[366] MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
Main category: cs.SD
Abstract: We introduce MMAudioSep, a generative model for video/text-queried sound separation that is founded on a pretrained video-to-audio model. By leveraging knowledge about the relationship between video/text and audio learned through a pretrained audio generative model, we can train the model more efficiently, i.e., the model does not need to be trained from scratch. We evaluate the performance of MMAudioSep by comparing it to existing separation models, including models based on both deterministic and generative approaches, and find it is superior to the baseline models. Furthermore, we demonstrate that even after acquiring functionality for sound separation via fine-tuning, the model retains the ability for original video-to-audio generation. This highlights the potential of foundational sound generation models to be adopted for sound-related downstream tasks. Our code is available at https://github.com/sony/mmaudiosep.
[367] TinyMU: A Compact Audio-Language Model for Music Understanding
Xiquan Li, Aurian Quelennec, Slim Essid
Main category: cs.SD
Abstract: Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation, we show that TinyMU performs strongly in both basic music understanding and complex reasoning. Notably, on the MuChoMusic benchmark, it achieves 82% of SOTA LALM’s performance despite being 35x smaller, highlighting the potential of small MLMs under constrained computational budgets.
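The "lightweight linear projector" mentioned above amounts to a single learned matrix that maps frozen encoder features into the language model's embedding space; the dimensions below are invented for illustration (TinyMU's real sizes are not stated in the abstract).

```python
import numpy as np

rng = np.random.default_rng(0)
audio_dim, lm_dim, n_frames = 768, 512, 50             # illustrative sizes only

W = rng.normal(scale=0.02, size=(audio_dim, lm_dim))   # the linear projector
audio_emb = rng.normal(size=(n_frames, audio_dim))     # frozen encoder output
soft_tokens = audio_emb @ W                            # into LM embedding space

assert soft_tokens.shape == (n_frames, lm_dim)
```

The projected frames are then consumed by the LM as soft prompt tokens alongside the text question, which is why a 229M-parameter model can reuse most of its capacity for language rather than alignment.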
[368] Hierarchical Codec Diffusion for Video-to-Speech Generation
Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen, Zhaoyang Li, Boyuan Cao, Hongming Shan
Main category: cs.SD
Abstract: Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normalization that jointly captures global vocal style through channel-wise normalization and local prosody dynamics through temporal-wise normalization. Extensive experiments demonstrate that HiCoDiT outperforms baselines in fidelity and expressiveness, highlighting the potential of discrete modelling for VTS. The code and speech demo are both available at https://github.com/Jiaxin-Ye/HiCoDiT.
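The coarse-to-fine hierarchy HiCoDiT exploits comes from residual vector quantization itself: each stage quantizes what the previous stages failed to capture. A toy encoder with hand-built codebooks (not a trained codec) makes this concrete.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: stage t quantizes the residual left by stages 1..t-1,
    so early codebooks carry coarse structure and later ones fine detail."""
    residual, recon, codes = x.copy(), np.zeros_like(x), []
    for cb in codebooks:                       # cb: (K, D) codebook array
        i = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(i)
        recon = recon + cb[i]
        residual = residual - cb[i]
    return codes, recon

x = np.array([1.0, 2.0])
cb_coarse = np.array([[0.9, 1.8], [0.0, 0.0]])   # level 1: rough content
cb_fine = np.array([[0.1, 0.2], [-1.0, -1.0]])   # level 2: detail correction
codes, recon = rvq_encode(x, [cb_coarse, cb_fine])
assert codes == [0, 0]
assert np.allclose(recon, x)   # the fine stage corrects the coarse stage's error
```

HiCoDiT's design maps onto this split: its low-level blocks generate the coarse, speaker-aware token levels from lip motion and identity, while high-level blocks generate the residual, prosody-bearing levels from facial expression.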
[369] AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
Sihan Lv, Yechen Jin, Zhen Li, Jintao Chen, Jinshan Zhang, Ying Li, Jianwei Yin, Meng Xi
Main category: cs.SD
Abstract: Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap in publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consistency in unedited regions, we propose Word-level Dynamic Time Warping (WDTW). Extensive experiments demonstrate that AST resolves the controllability-quality trade-off without extra training. Compared to the most temporally consistent prior baseline, AST improves consistency while reducing Word Error Rate by nearly 70%. Moreover, applying AST to a foundation TTS model reduces WDTW by 27%, achieving state-of-the-art speaker preservation and temporal fidelity.
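The proposed WDTW metric builds on plain dynamic time warping. The paper's word-level formulation is not given here, so the sketch below is only the generic DTW recurrence it presumably starts from, with the assumption that a word-level variant would align per-word timestamps or embeddings rather than raw frames.

```python
import math

def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW: minimum cumulative cost of a monotone alignment
    between sequences a and b (stretching either sequence is allowed)."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

assert dtw_cost([1, 2, 3], [1, 2, 2, 3]) == 0   # pure stretching is free
assert dtw_cost([1, 3], [1, 2, 3]) == 1         # a genuine mismatch costs
```

A metric of this shape is insensitive to small timing shifts but penalizes real content drift, which is why it suits measuring temporal consistency in the unedited regions.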
[370] NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
Liumeng Xue, Weizhen Bian, Jiahao Pan, Wenxuan Wang, Yilin Ren, Boyi Kang, Jingbin Hu, Ziyang Ma, Shuai Wang, Xinyuan Qian, Hung-yi Lee, Yike Guo
Main category: cs.SD
Abstract: Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present the Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.
[371] NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages
Marie Maltais, Yejin Jeon, Min Ma, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Maryam Ibrahim Mukhtar, Daud Abolade, Joel Okepefi, Johnson Sewedo, David Ifeoluwa Adelani
Main category: cs.SD
Abstract: Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.
[372] Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding
Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed
Main category: cs.SD
Abstract: Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a “semantic teacher.” To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling objective, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically-aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
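The abstract mentions a representation-level contrastive alignment objective without giving its form. A standard instantiation of such an objective is an InfoNCE-style loss over matched audio/report embedding pairs; the sketch below is our illustration of that family, not AcuLa's actual formulation.

```python
import numpy as np

# InfoNCE-style contrastive alignment sketch (the temperature value and exact
# formulation are assumptions): matched audio/report embedding pairs sit on the
# diagonal of the similarity matrix and are pulled together, while mismatched
# in-batch pairs are pushed apart.

def info_nce(audio_emb, text_emb, temperature=0.1):
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                    # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # diagonal = matched pairs

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))
loss_matched = info_nce(emb, emb)                 # perfectly aligned pairs -> small loss
loss_shuffled = info_nce(emb, np.roll(emb, 1, axis=0))  # mismatched pairs -> large loss
print(loss_matched, loss_shuffled)
```

The loss drops as the encoder maps each recording close to its own report and away from the other reports in the batch, which is the alignment behavior the abstract describes.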
cs.LG
[373] The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason
Yi Liu
Main category: cs.LG
Abstract: We discover that large language models exhibit spectral phase transitions in their hidden activation spaces when engaging in reasoning versus factual recall. Through systematic spectral analysis across 11 models spanning 5 architecture families (Qwen, Pythia, Phi, Llama, DeepSeek-R1), we identify seven core phenomena: (1) Reasoning Spectral Compression: 9/11 models show significantly lower α for reasoning (p < 0.05), with larger effects in stronger models; (2) Instruction Tuning Spectral Reversal: base models show reasoning α < factual α, while instruction-tuned models reverse this relationship; (3) Architecture-Dependent Generation Taxonomy: prompt-to-response shifts partition into expansion, compression, and equilibrium regimes; (4) Spectral Scaling Law: α_reasoning ∝ −0.074 ln N across 4 Qwen base models (R² = 0.46); (5) Token-Level Spectral Cascade: per-token alpha tracking reveals local synchronization that decays exponentially with layer distance and is weaker for reasoning than factual tasks; (6) Reasoning Step Spectral Punctuation: phase-transition signatures align with reasoning step boundaries; and (7) Spectral Correctness Prediction: spectral α alone achieves AUC = 1.000 (Qwen2.5-7B, late layers) and mean AUC = 0.893 across 6 models in predicting correctness before the final answer is generated. Together, these findings establish a comprehensive spectral theory of reasoning in transformers, revealing that the geometry of thought is universal in direction, architecture-specific in dynamics, and predictive of outcome.
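The abstract never defines how the spectral exponent α is computed. A common way to estimate such an exponent, sketched here under our own assumptions (SVD of the centered activation matrix, log-log linear fit over the top-k covariance eigenvalues), is:

```python
import numpy as np

# Illustrative estimator (not the paper's code): the spectral exponent "alpha"
# of a hidden-state matrix as the negative slope of log eigenvalue vs log rank.

def spectral_alpha(hidden, k=50):
    # hidden: (tokens, dim) activation matrix
    s = np.linalg.svd(hidden - hidden.mean(0), compute_uv=False)
    lam = (s ** 2)[:k]                       # top-k covariance eigenvalues (descending)
    ranks = np.arange(1, len(lam) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(lam), 1)
    return -slope                            # alpha > 0 for a decaying spectrum

# Sanity check on synthetic activations with a known power-law spectrum
# lambda_k ~ k^{-1}: column k has std k^{-1/2}, so variances decay as k^{-1}.
rng = np.random.default_rng(0)
scales = np.arange(1, 101) ** -0.5
X = rng.standard_normal((5000, 100)) * scales
print(round(spectral_alpha(X, k=30), 2))  # close to 1.0
```

Under this reading, "spectral compression" (lower α for reasoning) would mean the activation covariance spectrum decays less steeply, i.e. variance spreads across more directions.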
[374] Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures
Abdulmalek Saket
Main category: cs.LG
Abstract: Low-Rank Adaptation (LoRA) has become the dominant parameter-efficient fine-tuning method for large language models, yet standard practice applies LoRA adapters uniformly to all transformer layers regardless of their relevance to the downstream task. We introduce Aletheia, a gradient-guided layer selection method that identifies the most task-relevant layers via a lightweight gradient probe and applies LoRA adapters only to those layers with asymmetric rank allocation. Across 81 experiment rows covering 14 successful models from 8 architecture families (0.5B-72B parameters, including dense and Mixture-of-Experts architectures), with one additional documented failed Pythia/GPT-NeoX attempt in Campaign 2, Aletheia achieves a 15-28% training speedup (mean 23.1%, p < 0.001) with bounded extra forgetting and broadly matched downstream behavior on the evaluated MMLU, GSM8K, and HumanEval benchmark pack. Across the tested families and scales, Campaign 1 shows a 100% per-model speed win rate and Campaign 2 shows broadly preserved downstream behavior within a bounded-degradation framing. Together these results support a practical model-economics claim: intelligent layer selection can make LoRA fine-tuning materially more efficient without introducing major downstream damage on the evaluated set.
[375] M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention
Sanjeev Panta, Rhett M Morvant, Xu Yuan, Li Chen, Nian-Feng Tzeng
Main category: cs.LG
Abstract: Accurate and timely rainfall nowcasting is crucial for disaster mitigation and water resource management. Despite recent advances in deep learning, precipitation prediction remains challenging due to limitations in effectively leveraging diverse multimedia data sources. We introduce M3R, a Meteorology-informed MultiModal attention-based architecture for direct Rainfall prediction that synergistically combines visual NEXRAD radar imagery with numerical Personal Weather Station (PWS) measurements, using a comprehensive pipeline for temporal alignment of heterogeneous meteorological data. Through specialized multimodal attention mechanisms, M3R leverages weather station time series as queries to selectively attend to spatial radar features, enabling focused extraction of precipitation signatures. Experimental results for three spatial areas of 100 km x 100 km centered at NEXRAD radar stations demonstrate that M3R outperforms existing approaches, achieving substantial improvements in accuracy, efficiency, and precipitation detection capabilities. Our work establishes new benchmarks for multimedia-based precipitation nowcasting and provides practical tools for operational weather prediction systems. The source code is available at https://github.com/Sanjeev97/M3Rain
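The key architectural idea, using station time series as attention queries over radar features, can be sketched as plain scaled dot-product cross-attention. All shapes, projection matrices, and dimensions below are illustrative assumptions, not M3R's actual configuration.

```python
import numpy as np

# Cross-modal attention sketch: weather-station time-series embeddings act as
# queries, flattened radar feature-map patches act as keys/values.

def cross_modal_attention(station_q, radar_kv, d_k=32, seed=0):
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((station_q.shape[-1], d_k)) / np.sqrt(station_q.shape[-1])
    Wk = rng.standard_normal((radar_kv.shape[-1], d_k)) / np.sqrt(radar_kv.shape[-1])
    Wv = rng.standard_normal((radar_kv.shape[-1], d_k)) / np.sqrt(radar_kv.shape[-1])
    Q, K, V = station_q @ Wq, radar_kv @ Wk, radar_kv @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over radar patches
    return weights @ V                              # fused per-timestep features

stations = np.random.default_rng(1).standard_normal((12, 8))   # 12 time steps, 8 sensor channels
radar = np.random.default_rng(2).standard_normal((64, 16))     # 64 radar patches, 16-dim each
print(cross_modal_attention(stations, radar).shape)            # (12, 32)
```

Each station timestep thus produces its own attention distribution over the radar patches, which is how point measurements can "focus" the spatial imagery.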
[376] Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Gregory Magarshak
Main category: cs.LG
Abstract: Recent work on KV cache quantization, culminating in TurboQuant, has approached the Shannon entropy limit for per-vector compression of transformer key-value caches. We observe that this limit applies to a strictly weaker problem than the one that actually matters: compressing the KV cache as a sequence. The tokens stored in a KV cache are not arbitrary floating-point data – they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language. We introduce sequential KV compression, a two-layer architecture that exploits this structure. The first layer, probabilistic prefix deduplication, identifies semantically equivalent shared prefixes across sessions using the trie metric d_T(s, s’) = -log_2 P_M(s ^ s’) from Probabilistic Language Tries (PLTs). The second layer, predictive delta coding, stores only the residual of each new KV vector from the model’s own prediction of it, achieving a per-token entropy bound of H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). We prove that at typical language model perplexity – approximately 10-20 for fluent English text – this bound is 3.3-4.3 bits on average per token position, compared to TurboQuant’s 3 bits per vector component (with typical attention heads having 64-128 components). The theoretical compression ratio over TurboQuant is approximately 914,000x at the Shannon limit. Even at 1000x above the entropy floor – a deliberately pessimistic worst-case overhead, two orders of magnitude above the 2-5x typical of practical source coders – the ratio remains approximately 914x over TurboQuant, with compression improving rather than degrading as context length grows. The two layers are orthogonal and compose with existing per-vector quantization methods including TurboQuant.
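The trie metric d_T(s, s') = -log_2 P_M(s ^ s') can be illustrated with a toy stand-in for the language model M. We read "s ^ s'" as the longest common prefix (an assumption), and use a uniform 4-symbol model in place of P_M; both choices are illustrative, not the paper's definitions.

```python
import math

# Toy computation of the trie metric d_T(s, s') = -log2 P_M(s ^ s'),
# with a uniform 4-symbol model standing in for the language model M.

def common_prefix(s, t):
    i = 0
    while i < min(len(s), len(t)) and s[i] == t[i]:
        i += 1
    return s[:i]

def d_T(s, t, vocab_size=4):
    prefix = common_prefix(s, t)
    p = (1.0 / vocab_size) ** len(prefix)   # stand-in for P_M(prefix)
    return -math.log2(p)                    # surprisal of the shared prefix, in bits

# Shared prefix "ab" under a uniform 4-symbol model: P = (1/4)^2, so d_T = 4 bits.
print(d_T("abc", "abd"))  # -> 4.0
```

With a real model, an improbable (highly informative) shared prefix scores more bits than a generic one, which is what makes the metric useful for deciding when two sessions share a prefix worth deduplicating.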
[377] Mapping High-Performance Regions in Battery Scheduling across Data Uncertainty, Battery Design, and Planning Horizons
Jaime de Miguel Rodriguez, Artjom Vargunin, Brigitta Robin Raudne, David Solis Martin, Yaroslava Mykhailenko, Kaarel Oja
Main category: cs.LG
Abstract: This study presents a triadic analysis of energy storage operation under multi-stage model predictive control, investigating the interplay between data characteristics, forecast uncertainty, planning horizon, and battery c-rate. Synthetic datasets are generated to systematically explore variations in data profiles and uncertainty, enabling parametrization and the construction of relationships that map these characteristics to optimal horizon length. Results reveal the presence of an effective horizon, defined as the look-ahead length beyond which additional forecast information provides limited operational benefit. Accounting for this horizon can reduce computational costs while maintaining optimal performance. The study provides optimal horizon lengths across a broad range of combinations of battery types, uncertainty levels, and data profiles, offering practical guidance for industrial storage operation. It also quantifies revenue losses due to forecast uncertainty, showing that errors can impact performance even for fast batteries. Finally, the framework lays the groundwork for future machine learning approaches that map dataset parametrization to optimal horizons, supporting continuous optimization in industrial settings without heavy computation.
[378] Lightweight Geometric Adaptation for Training Physics-Informed Neural Networks
Kang An, Chenhao Si, Shiqian Ma, Ming Yan
Main category: cs.LG
Abstract: Physics-Informed Neural Networks (PINNs) often suffer from slow convergence, training instability, and reduced accuracy on challenging partial differential equations due to the anisotropic and rapidly varying geometry of their loss landscapes. We propose a lightweight curvature-aware optimization framework that augments existing first-order optimizers with an adaptive predictive correction based on secant information. Consecutive gradient differences are used as a cheap proxy for local geometric change, together with a step-normalized secant curvature indicator to control the correction strength. The framework is plug-and-play, computationally efficient, and broadly compatible with existing optimizers, without explicitly forming second-order matrices. Experiments on diverse PDE benchmarks show consistent improvements in convergence speed, training stability, and solution accuracy over standard optimizers and strong baselines, including on the high-dimensional heat equation, Gray–Scott system, Belousov–Zhabotinsky system, and 2D Kuramoto–Sivashinsky system.
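The idea of using consecutive gradient differences as a secant curvature proxy can be sketched as a wrapper around plain gradient descent. The constants, the damping form beta/(1 + kappa), and the correction term itself are our assumptions about how such a scheme might look, not the paper's algorithm.

```python
import numpy as np

# Curvature-aware gradient descent sketch: a predictive correction built from
# the secant gradient difference, damped by a step-normalized curvature proxy.

def curvature_aware_gd(grad_fn, x0, lr=0.03, beta=0.5, steps=300):
    x = np.asarray(x0, dtype=float)
    g_prev, x_prev = None, None
    for _ in range(steps):
        g = grad_fn(x)
        if g_prev is not None:
            step = np.linalg.norm(x - x_prev) + 1e-12
            kappa = np.linalg.norm(g - g_prev) / step          # secant curvature indicator
            correction = beta / (1.0 + kappa) * (g - g_prev)   # damped where curvature is high
        else:
            correction = 0.0
        x_prev, g_prev = x.copy(), g.copy()
        x = x - lr * (g + correction)
    return x

# Anisotropic quadratic 0.5*(x^2 + 25*y^2), a toy stand-in for an ill-conditioned
# PINN loss landscape; its gradient is (x, 25*y).
sol = curvature_aware_gd(lambda z: np.array([z[0], 25.0 * z[1]]), [3.0, 2.0])
print(np.linalg.norm(sol))  # near 0
```

No second-order matrix is ever formed: the only extra state is the previous gradient and iterate, matching the "lightweight, plug-and-play" framing in the abstract.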
[379] Python library supporting Discrete Variational Formulations and training solutions with Collocation-based Robust Variational Physics Informed Neural Networks (DVF-CRVPINN)
Tomasz Służalec, Marcin Łoś, Askold Vilkha, Maciej Paszyński
Main category: cs.LG
Abstract: We explore the possibility of solving Partial Differential Equations (PDEs) using discrete weak formulations. We propose a programming environment for defining a discrete computational domain, introducing discrete functions defined over a set of points, constructing discrete inner products, and introducing discrete weak formulations employing Kronecker delta test functions. Building on this setup, we propose a discrete neural network representation, training the solution function defined over a discrete set of points and employing discrete finite difference derivatives in the automatic differentiation procedures. As a challenging computational model example, we focus on Stokes equations in two dimensions, defined over a discrete set of points. We train the solution using the discrete weak residual and the Adamax algorithm with discrete automatic differentiation of the discrete gradients. In addition to introducing the Python environment, we also provide a rigorous mathematical formulation based on discrete weak formulations, proving the well-posedness and robustness of the loss function. The solution of the discrete weak formulations is based on neural network training employing a robust loss function that is related to the true error. In this way, we have robust control of the numerical error during the training of the neural networks. Besides the Stokes formulation, we also explain the functionality of the proposed library using the Laplace problem formulation.
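With Kronecker delta test functions, the weak residual reduces to the pointwise strong residual at the grid points, so the core loss ingredient can be sketched with finite differences. The example below uses a 1-D Poisson problem (-u'' = f) rather than the paper's Stokes system, and the loss form is our illustration, not the library's API.

```python
import numpy as np

# Collocation-style loss sketch: mean squared finite-difference residual of
# -u'' = f at interior grid points (boundary values assumed fixed by BCs).

def discrete_residual_loss(u, f, h):
    lap = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / h**2   # second-order FD Laplacian
    r = -lap - f[1:-1]                               # pointwise residual at interior nodes
    return float(np.mean(r**2))

n = 101
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
u_exact = np.sin(np.pi * x)                # solves -u'' = pi^2 sin(pi x), u(0)=u(1)=0
f = np.pi**2 * np.sin(np.pi * x)
print(discrete_residual_loss(u_exact, f, h))  # tiny: only FD truncation error remains
```

In the library's setting this residual would be evaluated on the network's predicted grid values and minimized with Adamax, with the loss value tracking the true error as the abstract claims.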
[380] Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
G. Aytug Akarlar
Main category: cs.LG
Abstract: We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
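Activation patching, the paper's main causal tool, is easy to illustrate on a toy stacked-layer "model": cache a mid-layer state from one run and inject it into another. The model, layer count, and patching site below are illustrative, not the paper's Qwen2.5-1.5B setup.

```python
import numpy as np

# Toy activation patching: run a small deterministic model twice, caching the
# layer-2 state from run A and overwriting run B's state at the same layer.

def make_layers(n_layers, dim, seed=0):
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]

def forward(layers, x, patch_layer=None, patch_value=None):
    h = np.asarray(x, dtype=float)
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)
        if i == patch_layer:
            h = patch_value          # overwrite the residual-stream state here
    return h

layers = make_layers(4, 8)
run_a = forward(layers, np.ones(8))
# Cache run A's post-layer-2 activation.
h = np.ones(8)
for W in layers[:3]:
    h = np.tanh(h @ W)
cached = h
# Patch run B (a different input) with run A's layer-2 state:
patched = forward(layers, -np.ones(8), patch_layer=2, patch_value=cached)
print(np.allclose(patched, run_a))  # -> True: downstream layers follow trajectory A
```

The paper's asymmetry result corresponds to hallucinated-state injection redirecting the trajectory far more reliably than factual-state injection, which this toy deterministic setup cannot show, only the mechanics of the intervention.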
[381] Dispatch-Aware Ragged Attention for Pruned Vision Transformers
Saif Mahmoud, Ahmad Almasri
Main category: cs.LG
Abstract: Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet when pruned sequences are executed with state-of-the-art variable-length attention APIs, including FlashAttention-2’s varlen and PyTorch’s NestedTensor SDPA, the wall-clock attention latency doesn’t scale accordingly. We trace this to a dispatch-overhead bottleneck: at the short, post-pruning sequence lengths typical of ViTs (<=197 tokens), actual matrix arithmetic completes in single-digit microseconds while the host-side dispatch path consumes 60-90 us. We present a lightweight, bidirectional Triton attention kernel whose dispatch floor is 40 us, roughly 1.5x lower than FlashAttention-2 varlen, allowing pruning savings to become more visible in wall-clock time. Integrated into a complete pack-attend-unpack pipeline, our system achieves up to 2.24x end-to-end throughput over padded PyTorch SDPA consistently across four pruning algorithms (Threshold-L2, DynamicViT, EViT, ATS), scales across DeiT-T/S/B, and maintains bit-exact classification predictions with <0.007 max absolute logit difference.
[382] The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
Ranjith Chodavarapu, Lei Xu
Main category: cs.LG
Abstract: KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths employ different floating-point accumulation orderings which, due to FP16 non-associativity, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause; cache-ON also yields higher accuracy in 8 of 9 conditions, indicating that the divergence direction is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of magnitude, eliminates token flips, and drops the flip rate to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma’s larger head dimension and sliding window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.
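The causal mechanism the paper identifies, FP16 non-associativity, is easy to demonstrate in isolation: adding the same three numbers in two different orders gives two different half-precision results.

```python
import numpy as np

# Minimal demonstration that FP16 addition is not associative, so different
# accumulation orderings (as in cache-ON vs cache-OFF paths) can diverge.
a = np.float16(10000.0)   # exactly representable; the ulp at this magnitude is 8
b = np.float16(1.0)
c = np.float16(-10000.0)

left = (a + b) + c        # 10000 + 1 rounds back to 10000, then cancels -> 0
right = (a + c) + b       # exact cancellation first, then + 1 -> 1
print(float(left), float(right))  # -> 0.0 1.0
```

In FP32 the same three additions agree in either order, mirroring the paper's FP32 falsification that drives the divergence to zero.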
[383] Histogram-based Parameter-efficient Tuning for Passive and Active Sonar Classification
Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples
Main category: cs.LG
Abstract: Parameter-efficient transfer learning (PETL) methods adapt large artificial neural networks to downstream tasks without fine-tuning the entire model. However, existing additive methods, such as adapters, sometimes struggle to capture distributional shifts in intermediate feature embeddings. We propose a novel histogram-based parameter-efficient tuning (HPT) technique that captures the statistics of the target domain and modulates the embeddings. Experimental results on three downstream passive sonar datasets (ShipsEar, DeepShip, Vessel Type Underwater Acoustic Data (VTUAD)) demonstrate that HPT outperforms conventional adapters. Notably, HPT achieves 91.8% vs. 89.8% accuracy on VTUAD. For active sonar imagery (Watertank, Turntable), HPT is competitive with other PETL methods. Furthermore, HPT yields feature representations closer to those of fully fine-tuned models. Overall, HPT balances parameter savings with performance, provides a distribution-aware alternative to existing adapters, and points to a promising direction for transfer learning in resource-constrained environments. The code is publicly available: https://github.com/Advanced-Vision-and-Learning-Lab/HLAST_DeepShip_ParameterEfficient.
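A differentiable histogram layer is the usual building block behind this kind of distribution-aware tuning. The sketch below soft-assigns feature values to bins with an RBF kernel; the bin centers, width, and the idea of using the resulting occupancy vector to modulate embeddings are illustrative assumptions, not HPT's exact layer.

```python
import numpy as np

# Soft-histogram sketch: each feature value is soft-assigned to a set of bin
# centers via an RBF kernel, yielding a differentiable occupancy vector that
# summarizes the target-domain feature distribution.

def soft_histogram(x, centers, width=0.5):
    # x: (n,) feature values; returns normalized bin occupancy of shape (len(centers),)
    d = x[:, None] - centers[None, :]
    w = np.exp(-(d / width) ** 2)
    w /= w.sum(axis=1, keepdims=True)   # soft assignment per sample
    return w.mean(axis=0)               # average occupancy = soft histogram

centers = np.linspace(-2, 2, 8)
feats = np.random.default_rng(0).standard_normal(1000)
hist = soft_histogram(feats, centers)
print(hist.shape, float(hist.sum()))  # (8,) occupancies summing to 1
```

Because the binning is soft, gradients flow through the centers and width, so the histogram statistics themselves can be learned during parameter-efficient tuning.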
[384] PRL-Bench: A Comprehensive Benchmark Evaluating LLMs’ Capabilities in Frontier Physics Research
Tingjia Miao, Wenkai Jin, Muhua Zhang, Jinxin Tan, Yuelin Hu, Tu Guo, Jiejun Zhang, Yuhan Wang, Wenbo Li, Yinuo Gao, Shuo Chen, Weiqi Jiang, Yayun Hu, Zixing Lei, Xianghe Pang, Zexi Liu, Yuzhi Zhang, Linfeng Zhang, Kun Chen, Wei Wang, Weinan E, Siheng Chen
Main category: cs.LG
Abstract: The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed with comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows without reliance on experiments. Here we introduce PRL-Bench (Physics Research by LLMs), a benchmark designed to systematically map the capability boundaries of LLMs in executing end-to-end physics research. Constructed from 100 curated papers from the latest issues of Physical Review Letters since August 2025 and validated by domain experts, PRL-Bench covers five major theory- and computation-intensive subfields of modern physics: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, including exploration-oriented formulation, long-horizon workflows, and objective verifiability, thereby reconstructing the essential reasoning processes and research workflows of real physics research. Evaluation across frontier models shows that performance remains limited, with the best overall score below 50, revealing a pronounced gap between current LLM capabilities and the demands of real scientific research. PRL-Bench thus serves as a reliable testbed for assessing next-generation AI scientists and for advancing AI systems toward autonomous scientific discovery.
[385] Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
Lute Lillo, Nick Cheney
Main category: cs.LG
Abstract: Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.
[386] StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
Dingzhi Yu, Rui Pan, Yuxing Liu, Tong Zhang
Main category: cs.LG
Abstract: Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite their empirical superiority, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose StoSignSGD, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass prior definitions, proving that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD exhibits robust stability and superior efficiency across diverse large language model (LLM) training regimes. Notably, in low-precision FP8 pretraining, a setting where AdamW fails catastrophically, StoSignSGD remains highly stable and yields a remarkable 1.44x to 2.14x speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to dissect the mechanisms driving its success, we develop a sign conversion framework capable of transforming any general optimizer into its unbiased, sign-based counterpart. Utilizing this framework, we deconstruct the core components of StoSignSGD and present a comprehensive ablation study to empirically validate our algorithmic design choices.
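One classical way to make a sign operator unbiased is to randomize its outcome so that its expectation is proportional to the gradient; this is our illustration of the general idea, and StoSignSGD's exact stochasticity may differ.

```python
import numpy as np

# Unbiased stochastic sign sketch: emit +1 with probability (1 + g_i/M)/2,
# where M = max|g_i|, so that M * E[sign_i] = g_i exactly.

def stochastic_sign(g, rng, n=1):
    M = np.max(np.abs(g))
    p_plus = 0.5 * (1.0 + g / M)                       # in [0, 1] by construction
    s = np.where(rng.random((n,) + g.shape) < p_plus, 1.0, -1.0)
    return s, M

rng = np.random.default_rng(0)
g = np.array([0.4, -0.2, 0.1, -0.5])
s, M = stochastic_sign(g, rng, n=50000)
est = M * s.mean(axis=0)       # empirical mean update, should approach g
print(np.round(est, 2))
```

Each individual update is still a cheap ±1 vector (sign-like communication and memory cost), but averaged over steps the update follows the true gradient, which is the property that repairs convergence.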
[387] Transfer Learning from Foundational Optimization Embeddings to Unsupervised SAT Representations
Koyena Pal, Serdar Kadioglu
Main category: cs.LG
Abstract: Foundational optimization embeddings have recently emerged as powerful pre-trained representations for mixed-integer programming (MIP) problems. These embeddings were shown to enable cross-domain transfer and reduce reliance on solver-generated labels. In this work, we investigate whether such representations generalize beyond optimization to decision problems, focusing on Boolean satisfiability (SAT). We adapt the foundational optimization architecture to SAT by mapping CNF formulas into the same bipartite constraint-variable graph representation used for MIPs. This allows direct reuse of the pre-trained embedding model without architectural changes or supervised fine-tuning. Our results show that these embeddings capture structural regularities in SAT instances and support unsupervised tasks such as instance clustering and distribution identification. We demonstrate, for the first time, that foundational optimization embeddings can transfer to constraint satisfaction domains. Our findings are a step toward a unified representational framework for both optimization and decision problems.
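The CNF-to-bipartite-graph mapping the abstract describes is straightforward to sketch: clause nodes on one side, variable nodes on the other, with edge signs encoding literal polarity. The DIMACS-style input convention and the exact edge attributes below are our assumptions.

```python
# Sketch of mapping a CNF formula to a bipartite constraint-variable graph,
# mirroring the representation used for MIP constraint matrices.

def cnf_to_bipartite(clauses):
    # clauses: list of lists of non-zero ints, DIMACS-style (negative = negated literal)
    variables = sorted({abs(l) for c in clauses for l in c})
    # Edge (clause_index, variable, sign): sign +1 for a positive literal, -1 for negated.
    edges = [(ci, abs(l), 1 if l > 0 else -1)
             for ci, clause in enumerate(clauses) for l in clause]
    return variables, edges

# (x1 or ~x2) and (x2 or x3)
variables, edges = cnf_to_bipartite([[1, -2], [2, 3]])
print(variables)  # -> [1, 2, 3]
print(edges)      # -> [(0, 1, 1), (0, 2, -1), (1, 2, 1), (1, 3, 1)]
```

Because a MIP constraint row also connects one constraint node to its variables with signed coefficients, a pre-trained MIP graph encoder can consume this SAT graph unchanged, which is the architectural reuse the paper exploits.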
[388] Evaluating LLM Simulators as Differentially Private Data Generators
Nassima M. Bouzid, Dehao Yuan, Nam H. Nguyen, Mayana Pereira
Main category: cs.LG
Abstract: LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases: learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.
[389] Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
Yisheng Zhong, Sijia Liu, Zhuangdi Zhu
Main category: cs.LG
Abstract: Large Language Models (LLMs) unlearning is crucial for removing hazardous or privacy-leaking information from the model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods primarily focus on a limited subset of these goals, typically unlearning efficacy and utility preservation while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to unlearning task interference. We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: We standardize training corpora into a unified data representation to reduce the domain gap, and then introduce a bidirectional distillation method that simultaneously elicits desired behavior from a context-instructed teacher while suppressing undesirable behavior in the student model. Theoretical and empirical analyses show that our method aligns domain distributions and converts seemingly irrelevant unlearning tasks into cooperative optimization. Evaluation demonstrates state-of-the-art performance, which enables balanced and reliable unlearning across diverse, challenging requirements.
[390] $π_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachlan Groom, Haroun Habeeb, Hunter Hancock, Karol Hausman, Gashon Hussein, Victor Hwang, Brian Ichter, Connor Jacobsen, Szymon Jakubczak, Rowan Jen, Tim Jones, Gregg Kammerer, Ben Katz, Liyiming Ke, Mairbek Khadikov, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Brendon LeCount, Sergey Levine, Xinyu Li, Adrian Li-Bell, Vladislav Lialin, Zhonglin Liang, Wallace Lim, Yao Lu, Enyu Luo, Vishnu Mano, Nandan Marwaha, Aikys Mongush, Liam Murphy, Suraj Nair, Tyler Patterson, Karl Pertsch, Allen Z. Ren, Gavin Schelske, Charvi Sharma, Baifeng Shi, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Jiaming Tang, Jimmy Tanner, Shalom Tekeste, Marcel Torne, Kyle Vedder, Quan Vuong, Anna Walling, Haohuan Wang, Jason Wang, XuDong Wang, Chris Whalen, Samuel Whitmore, Blake Williams, Charles Xu, Sukwon Yoo, Lili Yu, Wuming Zhang, Zhuoyang Zhang, Ury Zhilinsky
Main category: cs.LG
Abstract: We present a new robotic foundation model, called $π_{0.7}$, that can enable strong out-of-the-box performance in a wide range of scenarios. $π_{0.7}$ can follow diverse language instructions in unseen environments, including multi-stage tasks with various kitchen appliances, provide zero-shot cross-embodiment generalization, for example enabling a robot to fold laundry without having seen the task before, and perform challenging tasks such as operating an espresso machine out of the box at a level of performance that matches much more specialized RL-finetuned models. The main idea behind $π_{0.7}$ is to use diverse context conditioning during training. This conditioning information, contained in the prompt, makes it possible to steer the model precisely to perform many tasks with different strategies. It is conditioned not just on a language command that describes what it should do, but on additional multimodal information that also describes the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables $π_{0.7}$ to use very diverse data, including demonstrations, potentially suboptimal (autonomous) data including failures, and data from non-robot sources. Our experiments evaluate $π_{0.7}$ across numerous tasks with multiple robot platforms, on tasks that require speed and dexterity, language following, and compositional task generalization.
[391] FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
Zixuan Weng, Jinghuai Zhang, Kunlin Cai, Ying Li, Peiran Wang, Yuan Tian
Main category: cs.LG
Abstract: Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigid, one-size-fits-all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages: conditional steering and fine-grained vector synthesis, allowing fine-grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace-guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture-of-Steering-Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query-specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training-efficient manner. Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at https://github.com/YukinoAsuna/FineSteer
[392] ProtoTTA: Prototype-Guided Test-Time Adaptation
Mohammad Mahdi Abootorabi, Parvin Mousavi, Purang Abolmaesumi, Evan Shelhamer
Main category: cs.LG
Abstract: Deep networks that rely on prototypes (interpretable representations that can be related to the model input) have gained significant attention for balancing high accuracy with inherent interpretability, which makes them suitable for critical domains such as healthcare. However, these models are limited by their reliance on training data, which hampers their robustness to distribution shifts. While test-time adaptation (TTA) improves the robustness of deep networks by updating parameters and statistics, the prototypes of interpretable models have not been explored for this purpose. We introduce ProtoTTA, a general framework for prototypical models that leverages intermediate prototype signals rather than relying solely on model outputs. ProtoTTA minimizes the entropy of the prototype-similarity distribution to encourage more confident and prototype-specific activations on shifted data. To maintain stability, we employ geometric filtering to restrict updates to samples with reliable prototype activations, regularized by prototype-importance weights and model-confidence scores. Experiments across four prototypical backbones on four diverse benchmarks spanning fine-grained vision, histopathology, and NLP demonstrate that ProtoTTA improves robustness over standard output entropy minimization while restoring correct semantic focus in prototype activations. We also introduce novel interpretability metrics and a vision-language model (VLM) evaluation framework to explain TTA dynamics, confirming ProtoTTA restores human-aligned semantic focus and correlates reliably with VLM-rated reasoning quality. Code is available at: https://github.com/DeepRCL/ProtoTTA.
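The core quantity, entropy of the prototype-similarity distribution, is easy to write down. The softmax-over-similarities form and temperature below are assumptions for a minimal sketch, not the paper's implementation:

```python
import numpy as np

def prototype_entropy(features, prototypes, tau=1.0):
    """Shannon entropy of the softmax over feature-prototype similarities.
    Lower entropy means a more confident, prototype-specific activation,
    which is what test-time entropy minimization would push toward."""
    sims = features @ prototypes.T / tau
    sims = sims - sims.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

protos = np.eye(3)                          # 3 toy prototypes in 3-d
h_conf = np.array([[3.0, 0.0, 0.0]])        # activates one prototype strongly
h_ambig = np.array([[1.0, 1.0, 1.0]])       # equidistant from all prototypes

H_conf = prototype_entropy(h_conf, protos)[0]
H_ambig = prototype_entropy(h_ambig, protos)[0]
print(H_conf, H_ambig)  # confident sample has the lower entropy
```

The equidistant sample attains the maximum entropy log(3), which is exactly the kind of sample the paper's geometric filtering would exclude from adaptation updates.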
[393] Optimizing Stochastic Gradient Push under Broadcast Communications
Tuan Nguyen, Ting He
Main category: cs.LG
Abstract: We consider the problem of minimizing the convergence time for decentralized federated learning (DFL) in wireless networks under broadcast communications, with focus on mixing matrix design. The mixing matrix is a critical hyperparameter for DFL that simultaneously controls the convergence rate across iterations and the communication demand per iteration, both strongly influencing the convergence time. Although the problem has been studied previously, existing solutions are mostly designed for decentralized parallel stochastic gradient descent (D-PSGD), which requires the mixing matrix to be symmetric and doubly stochastic. These constraints confine the activated communication graph to undirected (i.e., bidirected) graphs, which limits design flexibility. In contrast, we consider mixing matrix design for stochastic gradient push (SGP), which allows asymmetric mixing matrices and hence directed communication graphs. By analyzing how the convergence rate of SGP depends on the mixing matrices, we extract an objective function that explicitly depends on graph-theoretic parameters of the activated communication graph, based on which we develop an efficient design algorithm with performance guarantees. Our evaluations based on real data show that the proposed solution can notably reduce the convergence time compared to the state of the art without compromising the quality of the trained model.
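For context, SGP's tolerance of asymmetric mixing matrices comes from the push-sum mechanism: a column-stochastic (not necessarily symmetric or doubly stochastic) matrix mixes both the values and a set of weights, and their ratio recovers the network average. A minimal sketch on a hypothetical 3-node directed ring:

```python
import numpy as np

# Column-stochastic mixing matrix for a directed 3-node ring
# (columns sum to 1; it is neither symmetric nor row-stochastic,
# which D-PSGD would forbid but SGP allows)
A = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])

x = np.array([1.0, 2.0, 6.0])   # each node's local value
w = np.ones(3)                  # push-sum weights

for _ in range(60):
    x = A @ x                   # mix values along directed edges
    w = A @ w                   # mix weights with the same matrix

ratios = x / w                  # each node's estimate of the global average
print(ratios)                   # all approach 3.0 = mean([1, 2, 6])
```

Values and weights both converge to multiples of the same stationary vector, so the bias introduced by asymmetric mixing cancels in the ratio; that cancellation is what frees the mixing-matrix design to use directed communication graphs.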
[394] Natural gradient descent with momentum
Anthony Nouy, Agustín Somacal
Main category: cs.LG
Abstract: We consider the problem of approximating a function by an element of a nonlinear manifold which admits a differentiable parametrization, typical examples being neural networks with differentiable activation functions or tensor networks. Natural gradient descent (NGD) for the optimization of a loss function can be seen as a preconditioned gradient descent where updates in the parameter space are driven by a functional perspective. In a spirit similar to Newton’s method, an NGD step uses, instead of the Hessian, the Gram matrix of the generating system of the tangent space to the approximation manifold at the current iterate, with respect to a suitable metric. This corresponds to a locally optimal update in function space, following a projected gradient onto the tangent space to the manifold. Still, both gradient and natural gradient descent methods get stuck in local minima. Furthermore, when the model class is a nonlinear manifold or the loss function is not ideally conditioned (e.g., the KL-divergence for density estimation, or a norm of the residual of a partial differential equation in physics informed learning), even the natural gradient might yield non-optimal directions at each step. This work introduces a natural version of classical inertial dynamic methods like Heavy-Ball or Nesterov and shows how it can improve the learning process when working with nonlinear model classes.
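A plain NGD step, before adding the paper's inertial terms, preconditions the gradient with the (pseudo)inverse of the Gram matrix of the tangent-space generators. For a linear-in-parameters toy model under the L2 metric (an illustrative simplification; the paper targets nonlinear manifolds) this looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 0.5 + 0.01 * rng.standard_normal(50)   # noisy line

theta = np.zeros(2)                        # parameters of f(x) = a*x + b

def features(x):
    # Generators of the tangent space: d f / d theta for each parameter
    return np.stack([x, np.ones_like(x)], axis=1)

for _ in range(5):
    J = features(x)                        # Jacobian of model outputs w.r.t. theta
    residual = J @ theta - y
    grad = J.T @ residual                  # Euclidean gradient of 0.5*||f - y||^2
    G = J.T @ J                            # Gram matrix of tangent generators (L2 metric)
    theta = theta - np.linalg.pinv(G) @ grad   # natural gradient step, step size 1

print(theta)                               # close to [2.0, 0.5]
```

For this linear model the step coincides with Gauss-Newton and converges in one iteration; on a genuinely nonlinear manifold the Gram matrix changes at every iterate, which is where ill-conditioning and the paper's momentum correction come in.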
[395] Learning Affine-Equivariant Proximal Operators
Oriel Savir, Zhenghan Fang, Jeremias Sulam
Main category: cs.LG
Abstract: Proximal operators are fundamental across many applications in signal processing and machine learning, including solving ill-posed inverse problems. Recent work has introduced Learned Proximal Networks (LPNs), providing parametric functions that compute exact proximals for data-driven and potentially non-convex regularizers. However, in many settings it is important to include additional structure in these regularizers (and their corresponding proximals), such as shift and scale equivariance. In this work, we show how to obtain learned functions parametrized by neural networks that provably compute exact proximal operators while being equivariant to shifts and scaling, which we dub Affine-Equivariant Learned Proximal Networks (AE-LPNs). We demonstrate our results on synthetic, constructive examples, and then on real data via denoising in out-of-distribution settings. Our equivariant learned proximals enhance robustness to noise distributions and affine shifts far beyond training distributions, improving the practical utility of learned proximal operators.
[396] Predicting Where Steering Vectors Succeed
Jayadev Billa
Main category: cs.LG
Abstract: Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{\mathrm{lin}}$, applies the model’s unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $ρ= +0.86$ to $+0.91$ and layer selection at $ρ= +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.
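The key quantity $A_{\mathrm{lin}}$ just pushes intermediate hidden states through the model's unembedding matrix, with no training. The toy below fakes two "layers" with synthetic hidden states to show the measurement itself; the dimensions, the margin-based score, and the injected concept direction are all illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 8
W_U = rng.standard_normal((d_model, vocab))   # stand-in unembedding matrix

def a_lin(hidden, token_pos, token_neg):
    """Logit-lens accessibility: project intermediate hidden states through
    the unembedding and measure how strongly they separate a concept
    token pair. No probe training involved."""
    logits = hidden @ W_U                     # (n_examples, vocab)
    margins = logits[:, token_pos] - logits[:, token_neg]
    return margins.mean()

# Synthetic hidden states at two "layers": the concept direction for
# token 0 (vs token 1) is injected only at the later layer
direction = W_U[:, 0] - W_U[:, 1]
h_early = rng.standard_normal((100, d_model))
h_late = h_early + 2.0 * direction / np.linalg.norm(direction)

print(a_lin(h_early, 0, 1), a_lin(h_late, 0, 1))  # accessibility rises at the later layer
```

The per-layer profile of such a score is what the paper uses to predict, before any intervention, whether difference-of-means steering at that layer can work.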
[397] Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
Alexander Peysakhovich, William Berman
Main category: cs.LG
Abstract: Consider an auto-regressive model that produces outputs x (e.g., answers to questions, molecules) each of which can be summarized by an attribute vector y (e.g., helpfulness vs. harmlessness, or bio-availability vs. lipophilicity). An arbitrary reward function r(y) encodes tradeoffs between these properties. Typically, tilting the model’s sampling distribution to increase this reward is done at training time via reinforcement learning. However, if the reward function changes, re-alignment requires re-training. In this paper, we show that a reward weighted classifier-free guidance (RCFG) can act as a policy improvement operator in this setting, approximating tilting the sampling distribution by the Q function. We apply RCFG to molecular generation, demonstrating that it can optimize novel reward functions at test time. Finally, we show that using RCFG as a teacher and distilling into the base policy to serve as a warm start significantly speeds up convergence for standard RL.
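The abstract does not reproduce the RCFG operator itself; as background, plain classifier-free guidance on next-token logits interpolates conditional and unconditional logits with a guidance weight, which is the object the reward weighting would modulate. A toy sketch of that interpolation (the reward weighting itself is not shown):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cfg_next_token_probs(logits_uncond, logits_cond, w):
    """Classifier-free guidance on autoregressive next-token logits:
    log p ~ log p_uncond + w * (log p_cond - log p_uncond).
    w = 0 recovers the unconditional model, w = 1 the conditional one,
    and w > 1 sharpens toward the conditioning signal."""
    return softmax(logits_uncond + w * (logits_cond - logits_uncond))

lu = np.array([1.0, 1.0, 1.0])       # toy unconditional logits
lc = np.array([2.0, 1.0, 0.0])       # toy conditioned logits

p0 = cfg_next_token_probs(lu, lc, 0.0)
p1 = cfg_next_token_probs(lu, lc, 1.0)
p2 = cfg_next_token_probs(lu, lc, 2.0)
print(p2)   # probability mass concentrates on token 0 as w grows
```

Because the combination happens at sampling time, changing the reward (and hence the effective weighting) requires no retraining, which is the test-time realignment property the paper exploits.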
[398] PAWN: Piece Value Analysis with Neural Networks
Ethan Tang, Hasan Davulcu, Jia Zou, Zhongju Zhang
Main category: cs.LG
Abstract: Predicting the relative value of any given chess piece in a position remains an open challenge, as a piece’s contribution depends on its spatial relationships with every other piece on the board. We demonstrate that incorporating the state of the full chess board via latent position representations derived using a CNN-based autoencoder significantly improves accuracy for MLP-based piece value prediction architectures. Using a dataset of over 12 million piece-value pairs gathered from Grandmaster-level games, with ground-truth labels generated by Stockfish 17, our enhanced piece value predictor significantly outperforms context-independent MLP-based systems, reducing validation mean absolute error by 16% and predicting relative piece value within approximately 0.65 pawns. More generally, our findings suggest that encoding the full problem state as context provides useful inductive bias for predicting the contribution of any individual component.
[399] Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models
Yunbei Zhang, Shuaicheng Niu, Chengyi Cai, Feng Liu, Jihun Hamm
Main category: cs.LG
Abstract: Test-Time Adaptation (TTA) for black-box models accessible only via APIs remains a largely unexplored challenge. Existing approaches such as post-hoc output refinement offer limited adaptive capacity, while Zeroth-Order Optimization (ZOO) enables input-space adaptation but faces high query costs and optimization challenges in the unsupervised TTA setting. We introduce BETA (Black-box Efficient Test-time Adaptation), a framework that addresses these limitations by employing a lightweight, local white-box steering model to create a tractable gradient pathway. Through a prediction harmonization technique combined with consistency regularization and prompt learning-oriented filtering, BETA enables stable adaptation with no additional API calls and negligible latency beyond standard inference. On ImageNet-C, BETA achieves a +7.1% accuracy gain on ViT-B/16 and +3.4% on CLIP, surpassing strong white-box and gray-box methods including TENT and TPT. On a commercial API, BETA achieves comparable performance to ZOO at 250x lower cost while maintaining real-time inference speed, establishing it as a practical and efficient solution for real-world black-box TTA.
[400] VoodooNet: Achieving Analytic Ground States via High-Dimensional Random Projections
Wladimir Silva
Main category: cs.LG
Abstract: We present VoodooNet, a non-iterative neural architecture that replaces the stochastic gradient descent (SGD) paradigm with a closed-form analytic solution via Galactic Expansion. By projecting input manifolds into a high-dimensional, high-entropy “Galactic” space ($d \gg 784$), we demonstrate that complex features can be untangled without the thermodynamic cost of backpropagation. Utilizing the Moore-Penrose pseudoinverse to solve for the output layer in a single step, VoodooNet achieves a classification accuracy of \textbf{98.10% on MNIST} and \textbf{86.63% on Fashion-MNIST}. Notably, our results on Fashion-MNIST surpass a 10-epoch SGD baseline (84.41%) while reducing the training time by orders of magnitude. We observe a near-logarithmic scaling law between dimensionality and accuracy, suggesting that performance is a function of “Galactic” volume rather than iterative refinement. This “Magic Hat” approach offers a new frontier for real-time Edge AI, where the traditional training phase is bypassed in favor of instantaneous manifold discovery.
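The closed-form solve reads like an extreme-learning-machine-style pipeline: a fixed random projection into high dimension, a nonlinearity, then a single pseudoinverse solve for the output layer. A minimal sketch on XOR (the dimensions and activation are assumptions; the paper's "Galactic" space uses $d \gg 784$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class problem (XOR), not linearly separable in input space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])  # one-hot labels

d = 256                                    # expanded dimensionality, d >> input dim
W = rng.standard_normal((2, d))            # fixed random projection, never trained
b = rng.standard_normal(d)

H = np.tanh(X @ W + b)                     # project into the high-dim space
beta = np.linalg.pinv(H) @ Y               # single closed-form output-layer solve

pred = (H @ beta).argmax(axis=1)
print(pred)  # [0 1 1 0]
```

No gradient steps occur anywhere: the only "training" is one Moore-Penrose pseudoinverse, which is the source of the orders-of-magnitude speedup the abstract reports over SGD baselines.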
[401] Flexible Empowerment at Reasoning with Extended Best-of-N Sampling
Taisuke Kobayashi
Main category: cs.LG
Abstract: This paper proposes a novel method that incorporates empowerment when reasoning actions in reinforcement learning (RL), thereby achieving flexibility in the exploration-exploitation dilemma (EED). In previous methods, empowerment for promoting exploration has been provided as a bonus term to the task-specific reward function as an intrinsically-motivated RL. However, this approach introduces a delay until the policy that accounts for empowerment is learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, a trick devised for fine-tuning recent foundation models at reasoning, so-called best-of-N (BoN) sampling, allows for the implicit acquisition of modified policies without explicitly learning them. It is expected that applying this trick to exploration-promoting terms, such as empowerment, will enable more flexible adjustment of EED. Therefore, this paper investigates BoN sampling for empowerment. Furthermore, to adjust the degree of policy modification in a generalizable manner while maintaining computational cost, this paper proposes a novel BoN sampling method extended by Tsallis statistics. Through toy problems, the proposed method’s capability to balance EED is verified. In addition, it is demonstrated that the proposed method improves RL performance on complex locomotion tasks.
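For reference, vanilla best-of-N sampling, the trick the paper extends, implicitly tilts a base policy toward high-score actions without ever learning the modified policy. The toy score function below is an illustrative stand-in for an empowerment term:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(sample_action, score, n):
    """Vanilla best-of-N: draw n candidates from the base policy and keep
    the highest-scoring one. Larger n tilts the effective policy harder
    toward high-score actions; n = 1 recovers the base policy."""
    candidates = [sample_action() for _ in range(n)]
    return max(candidates, key=score)

# Toy example: base policy is N(0, 1); the score prefers actions near 2.0
sample = lambda: rng.standard_normal()
score = lambda a: -(a - 2.0) ** 2

picks_n1 = [best_of_n(sample, score, 1) for _ in range(2000)]
picks_n16 = [best_of_n(sample, score, 16) for _ in range(2000)]
print(np.mean(picks_n1), np.mean(picks_n16))  # n=1 stays near 0; n=16 shifts toward 2
```

The degree of tilt is controlled only by the integer N, which is the coarse knob the paper's Tsallis-statistics extension aims to generalize while keeping the computational cost comparable.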
[402] Majority Voting for Code Generation
Tim Launer, Jonas Hübotter, Marco Bagatella, Ido Hakimi, Andreas Krause
Main category: cs.LG
Abstract: We investigate Functional Majority Voting (FMV), a method based on functional consensus for code generation with Large Language Models, which identifies a representative solution from multiple generations using their runtime execution signatures on test inputs. We find that FMV is an effective test-time inference strategy, substantially boosting performance on LiveCodeBench without a large compute overhead. Furthermore, we extend the utility of functional consensus and apply it as an aggregation strategy for label-free Test-Time Reinforcement Learning. We demonstrate that this increases pass@1 on holdout tasks, but find no evidence of self-improvement beyond the base model’s performance ceiling.
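Functional majority voting by runtime signature can be sketched in a few lines: group candidate programs by their outputs on shared test inputs and return a member of the largest group. The tuple-of-outputs signature below is an assumption consistent with the abstract's description:

```python
def functional_majority_vote(candidates, test_inputs):
    """Group candidate functions by their execution signature (the tuple
    of outputs on shared test inputs) and return one member of the
    largest group; functionally identical programs vote together."""
    groups = {}
    for fn in candidates:
        sig = tuple(fn(x) for x in test_inputs)
        groups.setdefault(sig, []).append(fn)
    majority_sig = max(groups, key=lambda s: len(groups[s]))
    return groups[majority_sig][0]

# Toy candidates: three functionally identical "square" programs, one buggy
cands = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1, lambda x: x * x]
best = functional_majority_vote(cands, test_inputs=[0, 1, 2, 3])
print(best(5))  # 25
```

The advantage over textual majority voting is visible even here: `x * x` and `x ** 2` differ as strings but share a signature, so they pool their votes against the off-by-one variant.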
[403] PINNACLE: An Open-Source Computational Framework for Classical and Quantum PINNs
Shimon Pisnoy, Hemanth Chandravamsi, Ziv Chen, Aaron Goldgewert, Gal Shaviner, Boris Shragner, Steven H. Frankel
Main category: cs.LG
Abstract: We present PINNACLE, an open-source computational framework for physics-informed neural networks (PINNs) that integrates modern training strategies, multi-GPU acceleration, and hybrid quantum-classical architectures within a unified modular workflow. The framework enables systematic evaluation of PINN performance across benchmark problems including 1D hyperbolic conservation laws, incompressible flows, and electromagnetic wave propagation. It supports a range of architectural and training enhancements, including Fourier feature embeddings, random weight factorization, strict boundary condition enforcement, adaptive loss balancing, curriculum training, and second-order optimization strategies, with extensibility to additional methods. We provide a comprehensive benchmark study quantifying the impact of these methods on convergence, accuracy, and computational cost, and analyze distributed data parallel scaling in terms of runtime and memory efficiency. In addition, we extend the framework to hybrid quantum-classical PINNs and derive a formal estimate for circuit-evaluation complexity under parameter-shift differentiation. Results highlight the sensitivity of PINNs to architectural and training choices, confirm their high computational cost relative to classical solvers, and identify regimes where hybrid quantum models offer improved parameter efficiency. PINNACLE provides a foundation for benchmarking physics-informed learning methods and guiding future developments through quantitative assessment of their trade-offs.
[404] Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf, Zhijing Jin, Kristen Menou
Main category: cs.LG
Abstract: The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/Gudmorning2025/Stargazer and https://gudmorning2025.github.io/Stargazer, respectively.
[405] NK-GAD: Neighbor Knowledge-Enhanced Unsupervised Graph Anomaly Detection
Zehao Wang, Lanjun Wang
Main category: cs.LG
Abstract: Graph anomaly detection aims to identify irregular patterns in graph-structured data. Most unsupervised GNN-based methods rely on the homophily assumption that connected nodes share similar attributes. However, real-world graphs often exhibit attribute-level heterophily, where connected nodes have dissimilar attributes. Our analysis of attribute-level heterophily graphs reveals two phenomena indicating that current approaches are not practical for unsupervised graph anomaly detection: 1) attribute similarities between connected nodes show nearly identical distributions across different connected node pair types, and 2) anomalies cause consistent variation trends between the graph with and without anomalous edges in the low- and high-frequency components of the spectral energy distributions, while the mid-part exhibits more erratic variations. Based on these observations, we propose NK-GAD, a neighbor knowledge-enhanced unsupervised graph anomaly detection framework. NK-GAD integrates a joint encoder capturing both similar and dissimilar neighbor features, a neighbor reconstruction module modeling normal distributions, a center aggregation module refining node features, and dual decoders for reconstructing attributes and structures. Experiments on seven datasets show NK-GAD achieves an average 3.29% AUC improvement.
[406] Faster LLM Inference via Sequential Monte Carlo
Yahya Emara, Mauricio Barba da Costa, Chi-Chih Chang, Cameron Freer, Tim Vieira, Ryan Cotterell, Mohamed S. Abdelfattah
Main category: cs.LG
Abstract: Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free – SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves 2.36x speed-up over speculative decoding and a 5.2x speed-up over autoregressive decoding, while remaining within 3% of the target model’s accuracy on reasoning, instruction-following, and coding benchmarks.
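The core move, replacing token-level rejection with importance-weighted resampling, can be illustrated on a toy vocabulary: weight each draft particle by its target/proposal probability ratio and resample in proportion. This is a generic SMC step under stated assumptions, not the paper's full decoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_resample_step(particles, log_p, log_q, rng):
    """One SMC step: weight each draft particle by target/proposal
    probability and resample in proportion, instead of accept/reject."""
    log_w = log_p(particles) - log_q(particles)
    w = np.exp(log_w - log_w.max())        # stabilize before normalizing
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# Toy vocab of 4 tokens; the draft (proposal) disagrees with the target
q = np.array([0.40, 0.30, 0.20, 0.10])     # draft distribution
p = np.array([0.10, 0.20, 0.30, 0.40])     # target distribution

particles = rng.choice(4, size=20000, p=q)
resampled = smc_resample_step(particles,
                              lambda t: np.log(p[t]),
                              lambda t: np.log(q[t]), rng)
freq = np.bincount(resampled, minlength=4) / len(resampled)
print(freq)   # close to the target distribution p
```

Unlike rejection, no particle is discarded and the block never truncates at a first error; the population as a whole is reshaped toward the target, which is what makes the verification step a fixed-size vectorized operation.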
[407] Hierarchical Active Inference using Successor Representations
Prashant Rangarajan, Rajesh P. N. Rao
Main category: cs.LG
Abstract: Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a variant of the well-known four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. Our results represent, to our knowledge, the first application of learned hierarchical state and action abstractions to active inference in FEP-based theories of brain function.
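The successor representation the hierarchy builds on has a standard TD update, $M(s,\cdot) \leftarrow M(s,\cdot) + \alpha\,(\mathbf{1}_s + \gamma M(s',\cdot) - M(s,\cdot))$; a sketch on a deterministic 3-state chain, checked against the closed form $(I - \gamma P)^{-1}$ (the learning rate and epoch count are illustrative choices):

```python
import numpy as np

def learn_sr(transitions, n_states, alpha=0.5, gamma=0.9, epochs=300):
    """TD learning of the successor representation:
    M[s] <- M[s] + alpha * (onehot(s) + gamma * M[s'] - M[s])."""
    M = np.zeros((n_states, n_states))
    I = np.eye(n_states)
    for _ in range(epochs):
        for s, s_next in transitions:
            M[s] += alpha * (I[s] + gamma * M[s_next] - M[s])
    return M

# Deterministic 3-state chain: 0 -> 1 -> 2 -> 2 (absorbing self-loop)
M = learn_sr([(0, 1), (1, 2), (2, 2)], n_states=3)

# Closed form for a fixed deterministic policy: M = (I - gamma * P)^-1
P = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 1.]])
M_exact = np.linalg.inv(np.eye(3) - 0.9 * P)
print(M)   # matches M_exact up to TD convergence error
```

Row $s$ of $M$ encodes expected discounted future occupancy of every state from $s$; in the paper's hierarchy, structure in these rows is what higher-level abstract states are learned from.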
[408] Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
Jingyuan Li, Xiaoyi Jiang, Fukang Wen, Wei Liu, Renqian Luo, Yi Zhu, Zuoqiang Shi, Pipi Hu
Main category: cs.LG
Abstract: Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix as a single object – via concrete scores, clean-data predictions ($x_0$-parameterization), or denoising distributions – rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. Since a CTMC is fundamentally a Poisson process fully determined by these two quantities, decomposing along this structure is closer to first principles and naturally leads to our formulation. We propose \textbf{Neural CTMC}, which separately parameterizes the reverse process through an \emph{exit rate} (when to jump) and a \emph{jump distribution} (where to jump) using two dedicated network heads. We show that the evidence lower bound (ELBO) differs from a path-space KL divergence between the true and learned reverse processes by a $θ$-independent constant, so that the training objective is fully governed by the exit rate and jump distribution we parameterize. Moreover, this KL factorizes into a Poisson KL for timing and a categorical KL for direction. We further show that the tractable conditional surrogate preserves the gradients and minimizers of the corresponding marginal reverse-process objective under standard regularity assumptions. Our theoretical framework also covers masked and GIDD-style noise schedules. Empirically, while the uniform forward process has been explored in prior work, our model, to the best of our knowledge, is the first pure-uniform method to outperform mask-based methods on the OpenWebText dataset. To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.
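A toy simulation step makes the timing/direction decomposition concrete. This is a generic discretized CTMC sampler with illustrative stand-ins for the two network heads; it is a sketch of the decomposition, not the paper's sampler:

```python
import numpy as np

def ctmc_step(x, t, exit_rate_fn, jump_dist_fn, dt, rng):
    """One discretized CTMC step, split into 'when to jump' (exit rate)
    and 'where to jump' (categorical jump distribution)."""
    rate = exit_rate_fn(x, t)                    # exit rate lambda(x, t) >= 0
    if rng.random() < 1.0 - np.exp(-rate * dt):  # a jump fires in [t, t + dt)
        probs = jump_dist_fn(x, t)               # distribution over target states
        x = int(rng.choice(len(probs), p=probs))
    return x

def jump_dist_fn(x, t):
    probs = np.full(3, 0.5)   # toy 3-state chain: jump uniformly to the others
    probs[x] = 0.0
    return probs

rng = np.random.default_rng(0)
x_frozen = ctmc_step(0, 0.0, lambda x, t: 0.0, jump_dist_fn, 0.1, rng)  # rate 0: never jumps
x_forced = ctmc_step(0, 0.0, lambda x, t: 1e9, jump_dist_fn, 0.1, rng)  # huge rate: always jumps
```

The two heads are cleanly decoupled: zeroing the exit rate freezes the state no matter what the jump distribution says, while the jump distribution alone decides the destination once a jump fires.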
[409] Graph self-supervised learning based on frequency corruption
Haojie Li, Mengjiao Zhang, Guanfeng Liu, Qiang Hu, Yan Wang, Junwei Du
Main category: cs.LG
Abstract: Graph self-supervised learning can reduce the need for labeled graph data and has been widely used in recommendation, social networks, and other web applications. However, existing methods often underuse high-frequency signals and may overfit to specific local patterns, which limits representation quality and generalization. We propose Frequency-Corrupt Based Graph Self-Supervised Learning (FC-GSSL), a method that builds corrupted graphs biased toward high-frequency information by corrupting nodes and edges according to their low-frequency contributions. These corrupted graphs are used as inputs to an autoencoder, while low-frequency and general features are reconstructed as supervision targets, forcing the model to fuse information from multiple frequency bands. We further design multiple sampling strategies and generate diverse corrupted graphs from the intersections and unions of the sampling results. By aligning node representations from these views, the model can discover useful frequency combinations, reduce reliance on specific high-frequency components, and improve robustness. Experiments on 14 datasets across node classification, graph prediction, and transfer learning show that FC-GSSL consistently improves performance and generalization.
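One way to make "low-frequency contribution" concrete is via the normalized graph Laplacian: project node features onto the lowest-frequency eigenvectors and measure the remaining energy. This is my illustration of the idea, not necessarily FC-GSSL's exact scoring rule:

```python
import numpy as np

def low_freq_contribution(A, X, k=2):
    """Score each node by the energy of its features in the k lowest
    graph-frequency components (smallest Laplacian eigenvalues)."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    eigvals, U = np.linalg.eigh(L)                     # eigvals in ascending order
    U_low = U[:, :k]                                   # lowest-frequency basis
    X_low = U_low @ (U_low.T @ X)                      # low-pass filtered features
    return (X_low ** 2).sum(axis=1)                    # per-node low-frequency energy

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], float)   # triangle: a 2-regular graph
X = np.ones((3, 1))                # perfectly smooth (all-low-frequency) signal
scores = low_freq_contribution(A, X, k=1)
```

On this regular graph a constant signal lies entirely in the zero-frequency component, so every node's low-frequency energy equals its full feature energy; corrupting nodes with the highest such scores biases the corrupted graph toward high-frequency information.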
[410] Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning
Xiaoyu Yang, En Yu, Wei Duan, Jie Lu
Main category: cs.LG
Abstract: Reinforcement Fine-Tuning (RFT) has established itself as a critical paradigm for the alignment of Multi-modal Large Language Models (MLLMs) with complex human values and domain-specific requirements. Nevertheless, while current research primarily focuses on mitigating exogenous distribution shifts arising from data-centric factors, the non-stationarity inherent in endogenous reasoning remains largely unexplored. In this work, a critical vulnerability is revealed within MLLMs: they are highly susceptible to endogenous reasoning drift, across both thinking and perception perspectives. It manifests as unpredictable distribution changes that emerge spontaneously during the autoregressive generation process, independent of external environmental perturbations. To address this drift, we first theoretically define endogenous reasoning drift within the RFT of MLLMs as multi-modal concept drift. In this context, this paper proposes Counterfactual Preference Optimization ++ (CPO++), a comprehensive and autonomous framework adapted to multi-modal concept drift. It integrates counterfactual reasoning with domain knowledge to execute controlled perturbations across thinking and perception, employing preference optimization to disentangle spurious correlations. Extensive empirical evaluations across two highly dynamic and safety-critical domains, medical diagnosis and autonomous driving, demonstrate that the proposed framework achieves superior performance in reasoning coherence, decision-making precision, and inherent robustness against extreme interference. The methodology also exhibits exceptional zero-shot cross-domain generalization, providing a principled foundation for reliable multi-modal reasoning in safety-critical applications.
[411] Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
Zehao Wang, Lanjun Wang
Main category: cs.LG
Abstract: Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel attack that injects harmful content into the reasoning steps while leaving the final answers unchanged. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM’s final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM’s safety alignment mechanisms and embed harmful content into its reasoning process. To address these challenges, we propose the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, which integrates a Semantic-based Trigger Selection module and a Psychology-based Instruction Generation module. Specifically, the proposed PRJA automatically selects manipulative reasoning triggers via semantic analysis and leverages psychological theories of obedience to authority and moral disengagement to generate adaptive instructions for enhancing the LRM’s compliance with harmful content generation. Extensive experiments on five question-answering datasets demonstrate that PRJA achieves an average attack success rate of 83.6% against several commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.
[412] Why Colors Make Clustering Harder: Global Integrality Gaps, the Price of Fairness, and Color-Coupled Algorithms in Chromatic Correlation Clustering
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Main category: cs.LG
Abstract: Chromatic Correlation Clustering (CCC) extends Correlation Clustering by assigning semantic colors to edges and requiring each cluster to receive a single color label. Unlike standard CC, whose LP relaxation has integrality gap 2 on complete graphs and admits a 2.06-approximation, the analogous LP for CCC has a strict lower bound of 2.11, and the best known LP-rounding algorithm achieves 2.15. We explain this gap by isolating the source of difficulty: cross-edge chromatic interference. Neutral edges, whose color does not match the candidate cluster color, create an irreducible cost absent from standard CC and force any color-independent rounding scheme to pay an additional mismatch penalty. We make four contributions. First, we prove a Global Integrality Gap Decomposition Theorem showing that the gap of any color-independent CCC rounding algorithm equals the standard CC gap plus an irreducible chromatic penalty Delta(L) > 0. Second, we solve the associated min-max problem and derive the staircase formula Delta(L) = ((L-1)/L) Delta_infinity, where Delta_infinity is approximately 0.0734. In particular, the two-color gap is 2.0967, separating CCC from standard CC already at L = 2. Third, we introduce Color-Coupled Correlation Clustering (C4). Adding the valid global constraint sum_c x_uv^c >= L-1 and a correlated interval-packing rounding scheme makes neutral edges behave like classical negative edges, recovering the optimal 2.06 approximation and bypassing the 2.11 lower bound for the uncoupled LP. Fourth, experiments on extremal instances, real multi-relational networks, and fairness benchmarks validate the theory: empirical LP gaps follow the predicted staircase, and C4 matches the unconstrained approximation ratio under fairness constraints.
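The staircase formula is simple enough to evaluate directly. A sketch using only the constants quoted in the abstract (2.06 for the standard CC ratio, Delta_infinity ≈ 0.0734); the function name is mine:

```python
DELTA_INF = 0.0734   # limiting chromatic penalty Delta_infinity from the abstract
CC_RATIO = 2.06      # standard correlation clustering approximation ratio

def ccc_gap(L):
    """Gap of any color-independent CCC rounding with L colors: the standard
    CC gap plus the staircase penalty Delta(L) = ((L-1)/L) * Delta_infinity."""
    delta = (L - 1) / L * DELTA_INF
    return CC_RATIO + delta
```

This reproduces the abstract's two-color gap of 2.0967 at L = 2 and increases monotonically toward 2.06 + Delta_infinity as L grows, which is why the separation from standard CC already appears with just two colors.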
[413] Collective Kernel EFT for Pre-activation ResNets
Hidetoshi Kawase, Toshihiro Ota
Main category: cs.LG
Abstract: In finite-width deep neural networks, the empirical kernel $G$ evolves stochastically across layers. We develop a collective kernel effective field theory (EFT) for pre-activation ResNets based on a $G$-only closure hierarchy and diagnose its finite validity window. Exploiting the exact conditional Gaussianity of residual increments, we derive an exact stochastic recursion for $G$. Applying Gaussian approximations systematically yields a continuous-depth ODE system for the mean kernel $K_0$, the kernel covariance $V_4$, and the $1/n$ mean correction $K_{1,\mathrm{EFT}}$, which emerges diagrammatically as a one-loop tadpole correction. Numerically, $K_0$ remains accurate at all depths. However, the $V_4$ equation residual accumulates to an $O(1)$ error at finite time, primarily driven by approximation errors in the $G$-only transport term. Furthermore, $K_{1,\mathrm{EFT}}$ fails due to the breakdown of the source closure, which exhibits a systematic mismatch even at initialization. These findings highlight the limitations of $G$-only state-space reduction and suggest extending the state space to incorporate the sigma-kernel.
[414] DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
Xiang Xia, Wuyang Zhang, Jiazheng Liu, Cheng Yan, Yanyong Zhang
Main category: cs.LG
Abstract: Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-free subset of tokens for safe parallel decoding within each block, enabling substantial inference acceleration with negligible quality degradation. DepCap is a plug-and-play method applicable to various DLMs, and compatible with existing KV-cache strategies for block-wise DLM. An information-theoretic analysis further suggests that the cumulative last-block influence on a candidate block is approximately additive across tokens, supporting the proposed block-partitioning criterion. Experimental results show that DepCap achieves favorable speed-quality trade-offs across multiple DLM backbones and reasoning and coding benchmarks, with up to 5.63$\times$ speedup without significant performance degradation.
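The conservative confidence-based parallel decoding that DepCap improves upon can be sketched as follows (the threshold and names are illustrative, and this omits DepCap's own cross-step and token-level conflict signals):

```python
import numpy as np

def confident_tokens(probs, threshold=0.9):
    """Baseline confidence-based parallel decoding for a DLM block: decode in
    this step only the positions whose top-token probability clears the
    threshold; the remaining positions stay masked for later refinement."""
    top = probs.max(axis=-1)            # per-position confidence
    tokens = probs.argmax(axis=-1)      # greedy token at each position
    accept = top >= threshold           # which positions to commit now
    return tokens, accept

probs = np.array([[0.95, 0.03, 0.02],   # confident position: commit
                  [0.40, 0.35, 0.25]])  # uncertain position: defer
tokens, accept = confident_tokens(probs, threshold=0.9)
```

Because acceptance depends only on the current step's local confidences, this baseline cannot anticipate inter-token conflicts, which is the limitation DepCap's conflict-free subset selection targets.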
[415] Multi-objective Reinforcement Learning With Augmented States Requires Rewards After Deployment
Peter Vamplew, Cameron Foale
Main category: cs.LG
Abstract: This research note identifies a previously overlooked distinction between multi-objective reinforcement learning (MORL), and more conventional single-objective reinforcement learning (RL). It has previously been noted that the optimal policy for an MORL agent with a non-linear utility function is required to be conditioned on both the current environmental state and on some measure of the previously accrued reward. This is generally implemented by concatenating the observed state of the environment with the discounted sum of previous rewards to create an augmented state. While augmented states have been widely used in the MORL literature, one implication of their use has not previously been reported – namely that they require the agent to have continued access to the reward signal (or a proxy thereof) after deployment, even if no further learning is required. This note explains why this is the case, and considers the practical repercussions of this requirement.
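The augmented-state construction, and the post-deployment reward access it implies, can be made concrete with a short sketch (function names are mine):

```python
import numpy as np

def augment_state(obs, accrued_reward):
    """Concatenate the environment observation with the discounted sum of
    previously accrued (vector) rewards, as non-linear-utility MORL requires."""
    return np.concatenate([np.asarray(obs, float),
                           np.asarray(accrued_reward, float)])

def step_accrued(accrued, reward, gamma, t):
    """Update the accrued-reward component after receiving `reward` at step t.
    Note: this update consumes the reward signal, so the deployed agent still
    needs access to it (or a proxy) even when no learning is taking place."""
    return accrued + (gamma ** t) * np.asarray(reward, float)

acc = step_accrued(np.zeros(2), [1.0, 0.0], gamma=0.9, t=0)   # first reward
acc = step_accrued(acc, [0.0, 2.0], gamma=0.9, t=1)           # second reward
aug = augment_state([0.5], acc)                               # policy input
```

The policy is conditioned on `aug`, not on the raw observation alone, so every deployed step must run `step_accrued` first; that dependency is exactly the note's observation.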
[416] Zero-Shot Scalable Resilience in UAV Swarms: A Decentralized Imitation Learning Framework with Physics-Informed Graph Interactions
Huan Lin, Lianghui Ding
Main category: cs.LG
Abstract: Large-scale Unmanned Aerial Vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded interaction dynamics, and controlled variance of the terminal success signal. A policy trained on 20-UAV swarms transfers directly to swarms of up to 500 UAVs without fine-tuning, and achieves better performance across reconnection reliability, recovery speed, motion safety, and runtime efficiency than representative baselines.
[417] When Do Early-Exit Networks Generalize? A PAC-Bayesian Theory of Adaptive Depth
Dongxin Guo, Jikun Wu, Siu Ming Yiu
Main category: cs.LG
Abstract: Early-exit neural networks enable adaptive computation by allowing confident predictions to exit at intermediate layers, achieving 2-8$\times$ inference speedup. Despite widespread deployment, their generalization properties lack theoretical understanding – a gap explicitly identified in recent surveys. This paper establishes a unified PAC-Bayesian framework for adaptive-depth networks. (1) Novel Entropy-Based Bounds: We prove the first generalization bounds depending on exit-depth entropy $H(D)$ and expected depth $\mathbb{E}[D]$ rather than maximum depth $K$, with sample complexity $\mathcal{O}((\mathbb{E}[D] \cdot d + H(D))/ε^2)$. (2) Explicit Constructive Constants: Our analysis yields the leading coefficient $\sqrt{2\ln 2} \approx 1.177$ with complete derivation. (3) Provable Early-Exit Advantages: We establish sufficient conditions under which adaptive-depth networks strictly outperform fixed-depth counterparts. (4) Extension to Approximate Label Independence: We relax the label-independence assumption to $ε$-approximate policies, broadening applicability to learned routing. (5) Comprehensive Validation: Experiments across 6 architectures on 7 benchmarks demonstrate tightness ratios of 1.52-3.87$\times$ (all $p < 0.001$) versus $>$100$\times$ for classical bounds. Bound-guided threshold selection matches validation-tuned performance within 0.1-0.3%.
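The quantities entering the entropy-based bound are easy to compute from observed exit behavior. A sketch with an unspecified constant standing in for the hidden factor in the O(·) (the constant and function names are assumptions):

```python
import math
from collections import Counter

def exit_depth_stats(depths):
    """Empirical exit-depth entropy H(D) in bits and expected depth E[D]
    from the observed exit layers of an early-exit network."""
    n = len(depths)
    probs = [c / n for c in Counter(depths).values()]
    H = -sum(p * math.log2(p) for p in probs)
    E = sum(depths) / n
    return H, E

def sample_complexity(H, E, d, eps, C=1.0):
    """Order-level sample complexity O((E[D]*d + H(D)) / eps^2) from the
    abstract; C is an unspecified constant (assumption)."""
    return C * (E * d + H) / eps**2

H, E = exit_depth_stats([1, 1, 2, 2])        # half the inputs exit at layer 1
n_bound = sample_complexity(H, E, d=10, eps=0.1)
```

For this toy exit distribution H(D) = 1 bit and E[D] = 1.5, so the bound scales with the depth actually used rather than the maximum depth K, which is the paper's main point.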
[418] Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension
Dongxin Guo, Jikun Wu, Siu Ming Yiu
Main category: cs.LG
Abstract: Spiking transformers achieve competitive accuracy with conventional transformers while offering $38$-$57\times$ energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, providing explicit spike circuit constructions including a novel lateral inhibition network for softmax normalization with proven $O(1/\sqrt{T})$ convergence. We derive tight spike-count lower bounds via rate-distortion theory: $\varepsilon$-approximation requires $Ω(L_f^2 nd/\varepsilon^2)$ spikes, with rigorous information-theoretic derivation. Our key insight is input-dependent bounds using measured effective dimensions ($d_{\text{eff}}=47$–$89$ for CIFAR/ImageNet), explaining why $T=4$ timesteps suffice despite worst-case $T \geq 10{,}000$ predictions. We provide concrete design rules with calibrated constants ($C=2.3$, 95% CI: $[1.9, 2.7]$). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate predictions with $R^2=0.97$ ($p<0.001$). Our framework provides the first principled foundation for neuromorphic transformer design.
[419] Federated Learning with Quantum Enhanced LSTM for Applications in High Energy Physics
Abhishek Sawaika, Durga Pritam Suggisetti, Udaya Parampalli, Rajkumar Buyya
Main category: cs.LG
Abstract: Learning with large-scale datasets and information-critical applications, such as in High Energy Physics (HEP), demands highly complex, large-scale models that are both robust and accurate. To tackle this issue and cater to the learning requirements, we envision using a federated learning framework with a quantum-enhanced model. Specifically, we design a hybrid quantum-classical long short-term memory model (QLSTM) for local training at distributed nodes. It combines the representative power of quantum models in understanding complex relationships within the feature space, and an LSTM-based model to learn necessary correlations across data points. Given the computing limitations and unprecedented cost of current stand-alone noisy intermediate-scale quantum (NISQ) devices, we propose to use a federated learning setup, where the learning load can be distributed to local servers as per design and data availability. We demonstrate the benefits of such a design on a classification task for the Supersymmetry (SUSY) dataset, which has 5M rows. Our experiments indicate that the performance of this design is not only better than some of the existing work using variational quantum circuit (VQC) based quantum machine learning (QML) techniques, but is also comparable ($Δ\sim \pm 1\%$) to that of classical deep-learning benchmarks. An important observation from this study is that the designed framework has $<$300 parameters and only needs 20K data points to give comparable performance, which also amounts to a 100$\times$ improvement over the compared baseline models. This shows an improved learning capability of the proposed framework with minimal data and resource requirements, due to the joint model with an LSTM-based architecture and a quantum-enhanced VQC.
[420] Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang
Main category: cs.LG
Abstract: Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain “unsafe tickets” responsible for harmful behaviors, and pruning reveals “safety tickets” that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resource-constrained settings.
[421] Fusing Cellular Network Data and Tollbooth Counts for Urban Traffic Flow Estimation
Oluwaleke Yusuf, Shaira Tabassum
Main category: cs.LG
Abstract: Traffic simulations, essential for planning urban transit infrastructure interventions, require vehicle-category-specific origin-destination (OD) data. Existing data sources are imperfect: sparse tollbooth sensors provide accurate vehicle counts by category, while extensive mobility data from cellular network activity captures aggregated crowd movement, but lack modal disaggregation and have systematic biases. This study develops a machine learning framework to correct and disaggregate cellular network data using sparse tollbooth counts as ground truth. The model uses temporal and spatial features to learn the complex relationship between aggregated mobility data and vehicular data. The framework infers destinations from transit routes and implements routing logic to distribute corrected flows between OD pairs. This approach is applied to a bus depot expansion in Trondheim, Norway, generating hourly OD matrices by vehicle length category. The results show how limited but accurate sensor measurements can correct extensive but aggregated mobility data to produce grounded estimates of background vehicular traffic flows. These macro-scale estimates can be refined for micro-scale analysis at desired locations. The framework provides a generalisable approach for generating origin-destination data from cellular network data. This enables downstream tasks, like detailed traffic simulations for infrastructure planning in data-scarce contexts, supporting urban planners in making informed decisions.
[422] Similarity-Based Bike Station Expansion via Hybrid Denoising Autoencoders
Oluwaleke Yusuf, M. Tsaqif Wismadi, Adil Rasheed
Main category: cs.LG
Abstract: Urban bike-sharing systems require strategic station expansion to meet growing demand. Traditional allocation approaches rely on explicit demand modelling that may not capture the urban characteristics distinguishing successful stations. This study addresses the need to exploit patterns from existing stations to inform expansion decisions, particularly in data-constrained environments. We present a data-driven framework leveraging existing stations deemed desirable by operational metrics. A hybrid denoising autoencoder (HDAE) learns compressed latent representations from multi-source grid-level features (socio-demographic, built environment, and transport network), with a supervised classification head regularising the embedding space structure. Expansion candidates are selected via greedy allocation with spatial constraints based on latent-space similarity to existing stations. Evaluation on Trondheim’s bike-sharing network demonstrates that HDAE embeddings yield more spatially coherent clusters and allocation patterns than raw features. Sensitivity analyses across similarity methods and distance metrics confirm robustness. A consensus-based procedure across multiple parametrisations distils 32 high-confidence extension zones where all parametrisations agree. The results demonstrate how representation learning captures complex patterns that raw features miss, enabling evidence-based expansion planning without explicit demand modelling. The consensus procedure strengthens recommendations by requiring agreement across parametrisations, while framework configurability allows planners to incorporate operational knowledge. The methodology generalises to any location-allocation problem where existing desirable instances inform the selection of new candidates.
[423] EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs
David Berghaus
Main category: cs.LG
Abstract: We introduce EVIL (\textbf{EV}olving \textbf{I}nterpretable algorithms with \textbf{L}LMs), an approach that uses LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference. Rather than training neural networks on large datasets, EVIL evolves pure Python/NumPy programs that perform zero-shot, in-context inference across datasets. We apply EVIL to three distinct tasks: next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation. In each case, a single evolved algorithm generalizes across all evaluation datasets without per-dataset training (analogous to an amortized inference model). To the best of our knowledge, this is the first work to show that LLM-guided program evolution can discover a single compact inference function for these dynamical-systems problems. Across the three domains, the discovered algorithms are often competitive with, and even outperform, state-of-the-art deep learning models while being orders of magnitude faster, and remaining fully interpretable.
[424] Convolutionally Low-Rank Models with Modified Quantile Regression for Interval Time Series Forecasting
Miaoxuan Zhu, Yi Yu, Yuyang Li, Wei Li, Guangcan Liu
Main category: cs.LG
Abstract: The quantification of uncertainty in prediction models is crucial for reliable decision-making, yet remains a significant challenge. Interval time series forecasting offers a principled solution to this problem by providing prediction intervals (PIs), which indicate the probability that the true value falls within the predicted range. We consider a recently established point forecasts (PFs) method termed Learning-Based Convolution Nuclear Norm Minimization (LbCNNM), which directly generates multi-step ahead forecasts by leveraging the convolutional low-rankness property derived from training data. While theoretically complete and empirically effective, LbCNNM lacks inherent uncertainty estimation capabilities, a limitation shared by many advanced forecasting methods. To resolve the issue, we modify the well-known Quantile Regression (QR) and integrate it into LbCNNM, resulting in a novel interval forecasting method termed LbCNNM with Modified Quantile Regression (LbCNNM-MQR). In addition, we devise interval calibration techniques to further improve the accuracy of PIs. Extensive experiments on over 100,000 real-world time series demonstrate the superior performance of LbCNNM-MQR.
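Although the paper's modified quantile regression is not spelled out in the abstract, the standard pinball loss it builds on, and the interval-coverage metric used to judge PI calibration, look like this:

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Standard quantile (pinball) loss at level tau; minimizing it makes
    y_hat an estimate of the tau-quantile of y."""
    err = y - y_hat
    return np.mean(np.maximum(tau * err, (tau - 1) * err))

def coverage(y, lower, upper):
    """Empirical coverage of a prediction interval [lower, upper]."""
    return np.mean((y >= lower) & (y <= upper))

y = np.array([1.0, 2.0, 3.0, 4.0])
loss_med = pinball_loss(y, 2.5, tau=0.5)   # tau = 0.5 recovers the median
cov = coverage(y, 1.5, 3.5)                # fraction of points inside the PI
```

Fitting two quantile levels (e.g. tau = 0.05 and 0.95) yields a 90% PI; the asymmetry of the loss is what pushes high-tau predictions above most observations.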
[425] Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting
Chi Liu, Xin Chen, Xu Zhou, Fangbo Tu, Srinivasan Manoharan
Main category: cs.LG
Abstract: Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM’s generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student’s high-dimensional manifold with the optimal structure represented by the teacher. This study bridges the gap between practical recovery frameworks and geometric representation theory, offering new insights into the internal mechanisms of self-distillation.
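Linear CKA, the alignment measure used in the analysis, has a compact closed form; a sketch that also exercises the invariances to orthogonal transformations and scaling that the abstract relies on:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between activation matrices
    X (n x p1) and Y (n x p2), computed over n examples."""
    X = X - X.mean(axis=0)                       # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2   # unnormalized alignment
    return hsic / (np.linalg.norm(X.T @ X, "fro")
                   * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal matrix
cka_self = linear_cka(X, X)
cka_rot = linear_cka(X, X @ Q)      # orthogonal transform of features
cka_scaled = linear_cka(X, 3.0 * X)  # isotropic rescaling
```

All three values equal 1, confirming the invariances that make CKA suitable for comparing student and teacher activation trajectories across layers of different widths.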
[426] ECG-Lens: Benchmarking ML & DL Models on PTB-XL Dataset
Saloni Garg, Ukant Jadia, Amit Sagtani, Kamal Kant Hiran
Main category: cs.LG
Abstract: Automated classification of electrocardiogram (ECG) signals is a useful tool for diagnosing and monitoring cardiovascular diseases. This study compares three traditional machine learning algorithms (Decision Tree Classifier, Random Forest Classifier, and Logistic Regression) and three deep learning models (Simple Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Complex CNN (ECG-Lens)) for the classification of ECG signals from the PTB-XL dataset, which contains 12-lead recordings from normal patients and patients with various cardiac conditions. The DL models were trained on raw ECG signals, allowing them to automatically extract discriminative features. Data augmentation using the Stationary Wavelet Transform (SWT) was applied to enhance model performance, increase the diversity of training samples, and preserve the essential characteristics of the ECG signals. The models were evaluated using multiple metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The ECG-Lens model achieved the highest performance, with 80% classification accuracy and a 90% ROC-AUC. These findings demonstrate that deep learning architectures, particularly complex CNNs, substantially outperform traditional ML methods on raw 12-lead ECG data, and provide a practical benchmark for selecting automated ECG classification models and identifying directions for condition-specific model development.
[427] Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi
Main category: cs.LG
Abstract: Reinforcement learning has become a powerful approach for enhancing large language model reasoning, but faces a fundamental dilemma: training on easy problems can cause overfitting and pass@k degradation, while training on hard problems often results in sparse rewards. Recent question augmentation methods address this by prepending partial solutions as hints. However, uniform hint provision may introduce redundant information while missing critical reasoning bottlenecks, and excessive hints can reduce reasoning diversity, causing pass@k degradation. We propose \textbf{PieceHint}, a hint injection framework that strategically identifies and provides critical reasoning steps during training. By scoring the importance of different reasoning steps, selectively allocating hints based on problem difficulty, and progressively withdrawing scaffolding, PieceHint enables models to transition from guided learning to independent reasoning. Experiments on six mathematical reasoning benchmarks show that our 1.5B model achieves comparable average performance to 32B baselines while preserving pass@k diversity across all $k$ values.
[428] Modern Structure-Aware Simplicial Spatiotemporal Neural Network
Zhaobo Hu, Vincent Gauthier, Mehdi Naima
Main category: cs.LG
Abstract: Spatiotemporal modeling has evolved beyond simple time series analysis to become fundamental in structural time series analysis. While current research extensively employs graph neural networks (GNNs) for spatial feature extraction with notable success, these networks are limited to capturing only pairwise relationships, despite real-world networks containing richer topological relationships. Additionally, GNN-based models face computational challenges that scale with graph complexity, limiting their applicability to large networks. To address these limitations, we present Modern Structure-Aware Simplicial SpatioTemporal neural network (ModernSASST), the first approach to leverage simplicial complex structures for spatiotemporal modeling. Our method employs spatiotemporal random walks on high-dimensional simplicial complexes and integrates parallelizable Temporal Convolutional Networks to capture high-order topological structures while maintaining computational efficiency. Our source code is publicly available on GitHub\footnote{Code is available at: https://github.com/ComplexNetTSP/ST_RUM}.
[429] Reversible Residual Normalization Alleviates Spatio-Temporal Distribution Shift
Zhaobo Hu, Vincent Gauthier, Mehdi Naima
Main category: cs.LG
Abstract: Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially-aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex spatio-temporal relationships in a data-driven manner. The bidirectional nature of our framework allows models to learn in a normalized latent space and recover original distributional properties through inverse transformation, offering a robust and model-agnostic solution for forecasting on dynamic spatio-temporal systems.
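The reversibility idea underlying RRN echoes RevIN-style instance normalization: normalize each series, let the model operate in the normalized space, then invert the transform on the outputs. A minimal sketch of that reversibility only, without RRN's graph-convolutional, spatially aware components (class name and epsilon choice are mine):

```python
import numpy as np

class ReversibleNorm:
    """Instance-style normalization with an exact inverse: a model can learn
    in the normalized latent space and recover the original scale afterwards.
    (RRN additionally embeds graph convolutions in invertible residual
    blocks; this sketch shows only the normalize/invert round trip.)"""

    def normalize(self, x):
        # per-series statistics computed over the time axis (last dim)
        self.mu = x.mean(axis=-1, keepdims=True)
        self.sigma = x.std(axis=-1, keepdims=True) + 1e-8
        return (x - self.mu) / self.sigma

    def denormalize(self, z):
        # exact inverse of normalize()
        return z * self.sigma + self.mu
```

Because the statistics are stored at normalization time, `denormalize` restores the original distributional properties exactly, which is the "bidirectional" property the abstract refers to.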
[430] DPrivBench: Benchmarking LLMs’ Reasoning for Differential Privacy
Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar, Kamalika Chaudhuri, Yu-Xiang Wang, Ruihan Wu
Main category: cs.LG
Abstract: Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.
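One of the "textbook mechanisms" the benchmark's easier instances cover is the Laplace mechanism; a minimal sketch assuming a scalar numeric query with known L1 sensitivity (the function name is illustrative, not from the benchmark):

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Textbook epsilon-DP release of a numeric query result: add Laplace
    noise with scale sensitivity / epsilon. Assumes the L1 sensitivity of
    the query (max change from one record) is known."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```

Verifying that a given function satisfies a stated DP guarantee, as DPrivBench asks, requires reasoning about exactly these scale/sensitivity relationships, which is where the evaluated models reportedly struggle on advanced algorithms.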
[431] QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
Jeremy Qin, Maksym Andriushchenko
Main category: cs.LG
Abstract: Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
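Empirical coverage and interval sharpness, the two quantities the benchmark evaluates, can be computed as follows; `interval_metrics` is an illustrative helper, not QuantSightBench's actual code:

```python
import numpy as np

def interval_metrics(y, lower, upper):
    """Empirical coverage (fraction of realized outcomes inside the PI)
    and sharpness (mean interval width)."""
    inside = (y >= lower) & (y <= upper)
    return inside.mean(), (upper - lower).mean()

# Hypothetical 90%-PIs for four resolved questions.
y = np.array([3.0, 7.0, 5.0, 9.0])
lo = np.array([2.0, 6.5, 5.5, 8.0])
hi = np.array([4.0, 8.0, 6.0, 10.0])
cov, width = interval_metrics(y, lo, hi)  # cov = 0.75: one outcome misses
```

A well-calibrated forecaster targeting 90% intervals should see `cov` near 0.90; the reported 75-79% figures mean the models' intervals are systematically too narrow (overconfident).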
[432] (Weighted) Adaptive Radius Near Neighbor Search: Evaluation for WiFi Fingerprint-based Positioning
Khang Le, Joaquín Torres-Sospedra, Philipp Müller
Main category: cs.LG
Abstract: Fixed Radius Near Neighbor (FRNN) search is an alternative to the widely used k Nearest Neighbors (kNN) search. Unlike kNN, FRNN determines a label or an estimate for a test sample based on all training samples within a predefined distance. While this approach is beneficial in certain scenarios, assuming a fixed maximum distance for all training samples can decrease the accuracy of the FRNN. Therefore, in this paper we propose the Adaptive Radius Near Neighbor (ARNN) and the Weighted ARNN (WARNN), which employ adaptive distances and, in the latter case, weights. All three methods are compared to kNN and twelve of its variants for a regression problem, namely WiFi fingerprinting indoor positioning, using 22 different datasets to provide a comprehensive analysis. While the performances of the tested FRNN and ARNN versions were amongst the worst, three of the four best methods in the test were WARNN versions, indicating that using weights together with adaptive distances achieves performance comparable to or even better than that of kNN variants.
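The abstract does not spell out the exact ARNN decision rule; one plausible instantiation assigns each training sample its own radius and, for the WARNN flavour, adds inverse-distance weights. A sketch under those assumptions (function name, fallback rule, and weighting scheme are mine):

```python
import numpy as np

def arnn_predict(X_train, y_train, x_query, radii, weighted=False):
    """Adaptive-radius near-neighbor regression sketch: training sample i
    contributes to the estimate if the query lies within that sample's own
    radius radii[i]. With weighted=True, closer samples get larger
    inverse-distance weights, loosely mirroring the WARNN variant."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    mask = d <= radii
    if not mask.any():
        # fall back to the single nearest neighbor when no radius covers the query
        return y_train[np.argmin(d)]
    if weighted:
        w = 1.0 / (d[mask] + 1e-12)
        return np.average(y_train[mask], weights=w)
    return y_train[mask].mean()
```

For WiFi fingerprinting, `X_train` would hold RSSI fingerprints and `y_train` the corresponding positions (here scalar for simplicity).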
[433] TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation
Tristan Kirscher, Alexandra Ertl, Klaus Maier-Hein, Xavier Coubez, Philippe Meyer, Sylvain Faisan
Main category: cs.LG
Abstract: Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
[434] Multi-Objective Bayesian Optimization via Adaptive ε-Constraints Decomposition
Yaohong Yang, Sammie Katt, Samuel Kaski
Main category: cs.LG
Abstract: Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing expensive black-box functions with multiple objectives. However, existing MOBO methods often struggle with coverage, scalability with respect to the number of objectives, and integrating constraints and preferences. In this work, we propose \textit{STAGE-BO, Sequential Targeting Adaptive Gap-Filling $\varepsilon$-Constraint Bayesian Optimization}, that explicitly targets under-explored regions of the Pareto front. By analyzing the coverage of the approximate Pareto front, our method identifies the largest geometric gaps. These gaps are then used as constraints, which transforms the problem into a sequence of inequality-constrained subproblems, efficiently solved via constrained expected improvement acquisition. Our approach provides a uniform Pareto coverage without hypervolume computation and naturally applies to constrained and preference-based settings. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines.
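Identifying the largest geometric gap on a Pareto-front approximation, the step STAGE-BO uses to build its ε-constraints, can be sketched in 2-D as follows (the paper's actual gap definition and constraint construction may differ; the helper name is mine):

```python
import numpy as np

def largest_front_gap(front):
    """Given a 2-D Pareto-front approximation (n x 2 array), return the pair
    of neighboring points (sorted along the first objective) separated by
    the largest Euclidean gap, plus the gap size -- the under-explored
    region a gap-filling strategy would target next."""
    pts = front[np.argsort(front[:, 0])]
    gaps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    i = int(np.argmax(gaps))
    return pts[i], pts[i + 1], float(gaps[i])
```

The two endpoint objective values can then serve as inequality constraints for the next constrained expected-improvement subproblem.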
[435] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
Alexandra Dragomir, Ioana Pintilie, Antonio Barbalau, Marius Dragoi, Florin Brad, Cristian Daniel Paduraru, Alexandru Tifrea, Elena Burceanu, Radu Tudor Ionescu
Main category: cs.LG
Abstract: Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.
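The JumpReLU gate JumpLoRA applies to its adapters is a hard thresholded identity. A forward-pass sketch (in JumpLoRA the threshold is presumably learned, with gradients routed around the discontinuity via a straight-through estimator, which this sketch omits):

```python
import numpy as np

def jumprelu(x, theta):
    """JumpReLU gate: pass activations through unchanged where they exceed
    the threshold theta, and output exactly zero elsewhere. The hard
    cut-off induces sparsity, unlike a shifted ReLU (which would subtract
    theta rather than gate)."""
    return np.where(x > theta, x, 0.0)

gated = jumprelu(np.array([-1.0, 0.2, 0.7]), theta=0.5)  # only 0.7 survives
```

Applied to LoRA block activations, such gating zeroes out most coordinates per task, which is how dynamic parameter isolation between tasks arises.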
[436] Evaluating quality in synthetic data generation for large tabular health datasets
Jean-Baptiste Escudié, Benjamin Barnes, Stefan Meisegeier, Klaus Kraywinkel, Fabian Prasser, Nils Körber
Main category: cs.LG
Abstract: There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries’ epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational framework for guiding the selection of synthesizers and remain accessible to all stakeholders involved in releasing synthetic datasets.
[437] Detecting and Suppressing Reward Hacking with Gradient Fingerprints
Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang, Jocelyn Qiaochu Chen, Greg Durrett, Xi Ye
Main category: cs.LG
Abstract: Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models’ internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning pipeline for reasoning tasks reduces reward hacking and improves performance on the true task objective. Our results highlight a promising direction of leveraging gradient level representations for assessing the quality of CoT reasoning traces. Our code is available at: https://github.com/songtao-x/reward_hack.
[438] Impact of Nonlinear Power Amplifier on Massive MIMO: Machine Learning Prediction Under Realistic Radio Channel
Marcin Hoffmann, Paweł Kryszkiewicz
Main category: cs.LG
Abstract: M-MIMO is one of the crucial technologies for increasing spectral and energy efficiency of wireless networks. Most of the current works assume that M-MIMO arrays are equipped with a linear front end. However, ongoing efforts to make wireless networks more energy-efficient push the hardware to the limits, where its nonlinear behavior appears. This is an especially common problem for multicarrier systems, e.g., OFDM used in 4G, 5G, and possibly also in 6G, which is characterized by a high Peak-to-Average Power Ratio. While the impact of a nonlinear Power Amplifier (PA) on an OFDM signal is well characterized, it is a relatively new topic for the M-MIMO OFDM systems. Most of the recent works either neglect nonlinear effects or utilize simplified models proper for Rayleigh or LoS radio channel models. In this paper, we first theoretically characterize the nonlinear distortion in the M-MIMO system under commonly used radio channel models. Then, utilizing 3D-Ray Tracing (3D-RT) software, we demonstrate that these models are not very accurate. Instead, we propose two models: a statistical one and an ML-based one using 3D-RT results. The proposed statistical model utilizes the Generalized Extreme Value (GEV) distribution to model Signal to Distortion Ratio (SDR) for victim users, receiving nonlinear distortion, e.g., as interference from neighboring cells. The proposed ML model aims to predict SDR for a scheduled user (receiving nonlinear distortion along with the desired signal), based on the spatial characteristics of the radio channel and the operation point of each PA feeding the M-MIMO antenna array. The predicted SDR can then be used to perform PA-aware per-user power allocation. The results show about 12% median gain in user throughput achieved by the proposed ML-based power allocation scheme over the state-of-the-art, fixed operating point scheme.
[439] Corner Reflector Array Jamming Discrimination Using Multi-Dimensional Micro-Motion Features with Frequency Agile Radar
Jie Yuan, Lei Wang, Yanhao Wang, Yimin Liu
Main category: cs.LG
Abstract: This paper introduces a robust discrimination method for distinguishing real ship targets from corner-reflector-array jamming with frequency-agile radar. The key idea is to exploit the multidimensional micro-motion signatures that separate rigid ships from non-rigid decoys. From Range-Velocity maps we derive two new hand-crafted descriptors, the mean weighted residual (MWR) and the complementary contrast factor (CCF), and fuse them with deep features learned by a lightweight CNN. An XGBoost classifier then gives the final decision. Extensive simulations show that the hybrid feature set consistently outperforms state-of-the-art alternatives, confirming the superiority of the proposed approach.
[440] AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning
Guransh Singh
Main category: cs.LG
Abstract: Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry, the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training, causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via a stop gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward pass decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
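The per-layer Gram-Schmidt projection can be illustrated on flattened gradient vectors. This mirrors gradient-surgery-style projections and is a sketch under that assumption, not the paper's exact rule (the conflict test and helper name are mine):

```python
import numpy as np

def project_away(g_task, g_anchor):
    """One Gram-Schmidt step: if the task gradient opposes the
    anchor-restoration gradient (negative dot product), remove the
    conflicting component; otherwise leave the task gradient untouched."""
    dot = float(g_task @ g_anchor)
    if dot >= 0.0:
        # no conflict: the task update does not fight the anchor
        return g_task
    return g_task - (dot / float(g_anchor @ g_anchor)) * g_anchor
```

After projection the destructive component along the anchor direction is exactly zero, while the remainder of the task gradient (its "constructive content") is preserved unchanged.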
[441] Prototype-Grounded Concept Models for Verifiable Concept Alignment
Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra
Main category: cs.LG
Abstract: Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human’s intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.
[442] Unveiling Stochasticity: Universal Multi-modal Probabilistic Modeling for Traffic Forecasting
Weijiang Xiong, Robert Fonod, Nikolas Geroliminis
Main category: cs.LG
Abstract: Traffic forecasting is a challenging spatio-temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited considerations on the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log-Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real-world dense urban traffic network is presented to examine the impact of data quality on uncertainty quantification and to show the robustness of our approach under imperfect data conditions. Code available at https://github.com/Weijiang-Xiong/OpenSkyTraffic
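For scalar targets, training a GMM output layer with the NLL loss reduces to the following computation; an illustrative numpy sketch with per-sample mixture parameters (as a network head would emit), not the paper's implementation:

```python
import numpy as np

def gmm_nll(y, weights, means, stds):
    """Negative log-likelihood of scalar targets y (shape (n,)) under a
    per-sample Gaussian mixture whose weights, means, and stds each have
    shape (n, K) for K mixture components."""
    comp = weights * np.exp(-0.5 * ((y[:, None] - means) / stds) ** 2) \
           / (stds * np.sqrt(2.0 * np.pi))
    # sum over components, then average the negative log-density over samples
    return -np.mean(np.log(comp.sum(axis=1) + 1e-12))
```

Because only the output layer and loss change, any deterministic forecaster can be converted this way, which is the "universal" aspect the abstract claims; the mixture also yields the cumulative distributions used in the paper's evaluation procedure.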
[443] The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback
Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Michal Valko, Vianney Perchet
Main category: cs.LG
Abstract: We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being $\Omega(T^{-1/4})$ in contrast to the usual $\Omega(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
[444] Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model
Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric
Main category: cs.LG
Abstract: We study the sample complexity of learning an $\varepsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\varepsilon^2))$ samples to return an $\varepsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when $c_{\min} = 0$, but only under the condition that the optimal policy has a bounded hitting time to the goal state.
[445] SCRIPT: Implementing an Intelligent Tutoring System for Programming in a German University Context
Alina Deriyeva, Jesper Dannath, Benjamin Paassen
Main category: cs.LG
Abstract: Practice and extensive exercises are essential in programming education. Intelligent tutoring systems (ITSs) are a viable option to provide individualized hints and advice to programming students even when human tutors are not available. However, prior ITSs for programming rarely support the Python programming language, mostly focus on introductory programming, and rarely take recent developments in generative models into account. We aim to establish a novel ITS for Python programming that is highly adaptable, serves both as a teaching and research platform, provides interfaces to plug in hint mechanisms (e.g., via large language models), and works inside the particularly challenging regulatory environment of Germany, that is, conforming to the European data protection regulation, the European AI Act, and the ethical framework of the German Research Foundation. In this paper, we present the description of the current state of the ITS along with future development directions, as well as discuss the challenges and opportunities for improving the system.
[446] Univariate Channel Fusion for Multivariate Time Series Classification
Fernando Moro, Vinicius M. A. Souza
Main category: cs.LG
Abstract: Multivariate time series classification (MTSC) plays a crucial role in various domains, including biomedical signal analysis and motion monitoring. However, existing approaches, particularly deep learning models, often require high computational resources, making them unsuitable for real-time applications or deployment on low-cost hardware, such as IoT devices and wearable systems. In this paper, we propose the Univariate Channel Fusion (UCF) method to deal with MTSC efficiently. UCF transforms multivariate time series into a univariate representation through simple channel fusion strategies such as the mean, median, or dynamic time warping barycenter. This transformation enables the use of any classifier originally designed for univariate time series, providing a flexible and computationally lightweight alternative to complex models. We evaluate UCF in five case studies covering diverse application domains, including chemical monitoring, brain-computer interfaces, and human activity analysis. The results demonstrate that UCF often outperforms baseline methods and state-of-the-art algorithms tailored for MTSC, while achieving substantial gains in computational efficiency, being particularly effective in problems with high inter-channel correlation.
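The mean and median fusion strategies are simple per-timestep statistics; a minimal sketch (the DTW-barycenter variant needs a DTW library and is omitted, and the helper name `fuse_channels` is mine):

```python
import numpy as np

def fuse_channels(X, strategy="mean"):
    """Collapse a multivariate series X of shape (channels, length) into a
    single univariate series by a per-timestep statistic across channels."""
    if strategy == "mean":
        return X.mean(axis=0)
    if strategy == "median":
        return np.median(X, axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

series = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0],
                   [2.0, 0.0, 4.0]])
fused = fuse_channels(series)          # -> [2.0, 2.0, 4.0]
```

The fused series can then be fed to any univariate classifier, which is what makes the approach lightweight; as the abstract notes, it works best when the channels are highly correlated, since fusion then discards little information.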
[447] Tabular foundation models for in-context prediction of molecular properties
Karim K. Ben Hicham, Jan G. Rittig, Martin Grohe, Alexander Mitsos
Main category: cs.LG
Abstract: Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning transferable molecular representations; however, they typically involve task-specific fine-tuning, require machine learning expertise, and often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they perform predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime across both standardized pharmaceutical benchmarks and chemical engineering datasets. We evaluate both frozen molecular foundation model representations, as well as classical descriptors and fingerprints. Across the benchmarks, the approach shows excellent predictive performance while reducing computational cost, compared to fine-tuning, with these advantages also transferring to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields up to 100% win rates on 30 MoleculeACE tasks, while compact RDKit2d and Mordred descriptors provide strong descriptor-based alternatives. Molecular representation emerges as a key determinant in TFM performance, with molecular foundation model embeddings and 2D descriptor sets both providing substantial gains over classic molecular fingerprints on many tasks. These results suggest that in-context learning with TFMs provides a highly accurate and cost-efficient alternative for property prediction in practical applications.
[448] Training Time Prediction for Mixed Precision-based Distributed Training
Minchul Kang, Changyong Shin, Jinwoo Jeong, Hyunho Lee, Younghun Go, Gyeongmin Kim, Gyeongsik Yang, Chuck Yoo
Main category: cs.LG
Abstract: Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, leading to training time variations of ~2.4x over its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. According to our experiments, training time prediction without considering precision results in significant prediction errors - reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.
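MAPE, the error metric quoted above, is straightforward to compute. A minimal sketch; the toy job times are invented for illustration:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, the metric the abstract reports."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

# A precision-blind predictor misses badly on jobs whose mixed-precision
# setting makes them ~2.4x faster or slower than assumed (toy numbers):
truth = np.array([10.0, 10.0, 24.0, 24.0])  # actual training times
pred = np.array([10.0, 10.0, 10.0, 10.0])   # precision-blind predictions
print(f"MAPE: {mape(truth, pred):.2f}%")
```

This prints `MAPE: 29.17%` for the toy data, illustrating how ignoring precision inflates the error the abstract measures.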
[449] Synthetic data in cryptocurrencies using generative models
André Saimon S. Sousa, Otto Pires, Frank Acasiete, Oscar M. Granados, Valéria Loureiro da Silva, Hugo Saba
Main category: cs.LG
Abstract: Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especially in the financial context, can lead to privacy risks and access restrictions, affecting institutions, research, and modeling processes. Although not all financial datasets present such limitations, this work proposes the use of deep learning techniques for generating synthetic data applied to cryptocurrency price time series. The approach is based on Conditional Generative Adversarial Networks (CGANs), combining an LSTM-type recurrent generator and an MLP discriminator to produce statistically consistent synthetic data. The experiments consider different crypto-assets and demonstrate that the model is capable of reproducing relevant temporal patterns, preserving market trends and dynamics. The generation of synthetic series through GANs is an efficient alternative for simulating financial data, showing potential for applications such as market behavior analysis and anomaly detection, with lower computational cost compared to more complex generative approaches.
[450] Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
Yide Ran, Jianwen Xie, Minghui Wang, Wenjin Zheng, Denghui Zhang, Chuan Li, Zhaozhuo Xu
Main category: cs.LG
Abstract: Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112$\times$ compared to RapidIn and scales to a 32B-parameter LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training examples for specific predictions, and (2) prospective valuation, scoring candidate data utility zero-shot. We validate RISE on three tasks: Howdy backdoor data detection, Finance-Medical domain separation, and Brain Rot high-quality data selection. In a closed-loop Brain Rot study, continued pretraining on RISE-selected data yields consistent downstream improvements. Overall, RISE provides a practical and scalable primitive for influence analysis and training-data selection in modern large language models.
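CountSketch, the projection RISE applies to its two channels, is a linear random-hash map. A minimal NumPy sketch; hashing via a seeded RNG is an illustrative simplification of proper pairwise-independent hash families:

```python
import numpy as np

def countsketch(x, width, seed=0):
    """Project a d-dimensional vector into `width` buckets: each coordinate
    is hashed to one bucket and accumulated with a random sign."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    buckets = rng.integers(0, width, size=d)  # bucket hash h: [d] -> [width]
    signs = rng.choice([-1.0, 1.0], size=d)   # sign hash s: [d] -> {-1, +1}
    out = np.zeros(width)
    np.add.at(out, buckets, signs * x)        # out[h(i)] += s(i) * x[i]
    return out

x = np.arange(8.0)
y = np.ones(8)
# The map is linear, so sketches can be accumulated and compared without
# ever storing the full high-dimensional vectors:
lhs = countsketch(x + y, width=4)
rhs = countsketch(x, width=4) + countsketch(y, width=4)
print(np.allclose(lhs, rhs))  # True
```

Inner products between sketches approximate inner products between the original vectors in expectation, which is what makes the compressed influence index usable.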
[451] OT on the Map: Quantifying Domain Shifts in Geographic Space
Haoran Zhang, Livia Betti, Konstantin Klemmer, Esther Rolf, David Alvarez-Melis
Main category: cs.LG
Abstract: In computer vision and machine learning for geographic data, out-of-domain generalization is a pervasive challenge, arising from uneven global data coverage and distribution shifts across geographic regions. Though models are frequently trained in one region and deployed in another, there is no principled method for determining when this cross-region adaptation will be successful. A well-defined notion of distance between distributions can effectively quantify how different a new target domain is compared to the domains used for model training, which in turn could support model training and deployment decisions. In this paper, we propose a strategy for computing distances between geospatial domains that leverages geographic information with Optimal Transport methods (GeoSpOT). In our experiments, GeoSpOT distances emerge as effective predictors of cross-domain transfer difficulty. We further demonstrate that embeddings from pretrained location encoders provide information comparable to image/text embeddings, despite relying solely on longitude-latitude pairs as input. This allows users to get an approximation of out-of-domain performance for geospatial models, even when the exact downstream task is unknown, or no task-specific data is available. Building on these findings, we show that GeoSpOT distances can preemptively guide data selection and enable predictive tools to analyze regions where a model is likely to underperform.
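In one dimension, the optimal-transport distance underlying this kind of domain comparison has a closed form over sorted samples. A minimal sketch, a stand-in for the full OT machinery of GeoSpOT, which couples richer (e.g. embedding-space) features across regions:

```python
import numpy as np

def wasserstein_1d(u, v):
    """Empirical 1-D Wasserstein-1 distance between two equally sized
    samples: with sorted samples it reduces to the mean absolute
    difference of order statistics."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    assert u.shape == v.shape, "this shortcut assumes equal sample sizes"
    return float(np.mean(np.abs(u - v)))

# Distance between a source region's feature samples and a shifted target:
src = np.array([0.0, 1.0, 2.0, 3.0])
tgt = src + 0.5
print(wasserstein_1d(src, tgt))  # 0.5
```

A larger distance between source and target domains would, per the abstract, predict harder cross-region transfer.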
[452] Neuro-Symbolic ODE Discovery with Latent Grammar Flow
Karin Yu, Eleni Chatzi, Georgios Kissas
Main category: cs.LG
Abstract: Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and transferability beyond black-box models. We introduce Latent Grammar Flow (LGF), a neuro-symbolic generative framework for discovering ordinary differential equations from data. LGF embeds equations as grammar-based representations into a discrete latent space and forces semantically similar equations to be positioned closer together with a behavioural loss. Then, a discrete flow model guides the sampling process to recursively generate candidate equations that best fit the observed data. Domain knowledge and constraints, such as stability, can be either embedded into the rules or used as conditional predictors.
[453] Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction
Hannah Guan, Soukayna Mouatadid, Paulo Orenstein, Judah Cohen, Haiyu Dong, Zekun Ni, Jeremy Berman, Genevieve Flaspohler, Alex Lu, Jakob Schloer, Joshua Talib, Jonathan A. Weyn, Lester Mackey
Main category: cs.LG
Abstract: Decision-makers rely on weather forecasts to plant crops, manage wildfires, allocate water and energy, and prepare for weather extremes. Today, such forecasts enjoy unprecedented accuracy out to two weeks thanks to steady advances in physics-based dynamical models and data-driven artificial intelligence (AI) models. However, model skill drops precipitously at subseasonal timescales (2 - 6 weeks ahead), due to compounding errors and persistent biases. To counter this degradation, we introduce probabilistic bias correction (PBC), a machine learning framework that substantially reduces systematic error by learning to correct historical probabilistic forecasts. When applied to the leading dynamical and AI models from the European Centre for Medium-Range Weather Forecasts (ECMWF), PBC doubles the subseasonal skill of the AI Forecasting System and improves the skill of the operationally-debiased dynamical model for 91% of pressure, 92% of temperature, and 98% of precipitation targets. We designed PBC for operational deployment, and, in ECMWF’s 2025 real-time forecasting competition, its global forecasts placed first for all weather variables and lead times, outperforming the dynamical models from six operational forecasting centers, an international dynamical multi-model ensemble, ECMWF’s AI Forecasting System, and the forecasting systems of 34 teams worldwide. These probabilistic skill gains translate into more accurate prediction of extreme events and have the potential to improve agricultural planning, energy management, and disaster preparedness in vulnerable communities.
[454] Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin
Main category: cs.LG
Abstract: We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.
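The CKA regularizer mentioned above has a compact linear form. A minimal sketch of linear CKA between two representation matrices; the paper's loss may use a kernelized or differently normalized variant:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representations of shape
    (n_samples, dim_x) and (n_samples, dim_y). Values near 1 mean similar
    pairwise similarity structure, which is what the structural-consistency
    loss pushes the modality and joint embeddings toward."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)

rng = np.random.default_rng(0)
audio = rng.normal(size=(16, 8))       # toy segment-level audio features
print(linear_cka(audio, 2.0 * audio))  # 1.0: CKA is scale-invariant
```

The scale invariance matters here because the abstract's audio and text embeddings live in spaces of very different dimensionality.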
[455] Beyond Distribution Sharpening: The Importance of Task Rewards
Sarthak Mittal, Leo Gagnon, Guillaume Lajoie
Main category: cs.LG
Abstract: Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating task-based reward signal can greatly help achieve robust performance improvements and stable learning.
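The sharpening half of the dichotomy can be made concrete with a toy distribution. A minimal sketch; temperature-based sharpening is one common instantiation, used here purely for illustration:

```python
import numpy as np

def sharpen(p, temperature):
    """Re-normalize p^(1/T); T < 1 concentrates mass on already-likely
    outcomes. Crucially, an outcome the base model assigns zero mass
    stays at exactly zero -- sharpening can only elicit latent
    capability, which is the limitation the abstract analyzes."""
    q = np.asarray(p, dtype=float) ** (1.0 / temperature)
    return q / q.sum()

base = np.array([0.5, 0.3, 0.2, 0.0])  # toy base-model distribution
print(sharpen(base, 0.5))
```

For this toy input the top outcome's mass rises from 0.5 to about 0.66 while the zero-mass outcome remains at zero, unlike task-reward training, which can move mass onto genuinely new behavior.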
[456] FL-MHSM: Spatially-adaptive Fusion and Ensemble Learning for Flood-Landslide Multi-Hazard Susceptibility Mapping at Regional Scale
Aswathi Mundayatt, Jaya Sreevalsan-Nair
Main category: cs.LG
Abstract: Existing multi-hazard susceptibility mapping (MHSM) studies often rely on spatially uniform models, treat hazards independently, and provide limited representation of cross-hazard dependence and uncertainty. To address these limitations, this study proposes a deep learning (DL) workflow for joint flood-landslide multi-hazard susceptibility mapping (FL-MHSM) that combines two-level spatial partitioning, probabilistic Early Fusion (EF), a tree-based Late Fusion (LF) baseline, and a soft-gating Mixture of Experts (MoE) model, with MoE serving as final predictive model. The proposed design preserves spatial heterogeneity through zonal partitions and enables data-parallel large-area prediction using overlapping lattice grids. In Kerala, EF remained competitive with LF, improving flood recall from 0.816 to 0.840 and reducing Brier score from 0.092 to 0.086, while MoE provided strongest performance for flood susceptibility, achieving an AUC-ROC of 0.905, recall of 0.930, and F1-score of 0.722. In Nepal, EF similarly improved flood recall from 0.820 to 0.858 and reduced Brier score from 0.057 to 0.049 relative to LF, while MoE outperformed both EF and LF for landslide susceptibility, achieving an AUC-ROC of 0.914, recall of 0.901, and F1-score of 0.559. GeoDetector analysis of MoE outputs further showed that dominant factors varied more across zones in Kerala, where susceptibility was shaped by different combinations of topographic, land-cover, and drainage-related controls, while Nepal showed a more consistent influence of topographic and glacier-related factors across zones. These findings show that EF and LF provide complementary predictive behavior, and that their spatially adaptive integration through MoE yields robust overall predictive performance for FL-MHSM while supporting interpretable characterization of multi-hazard susceptibility in spatially heterogeneous landscapes.
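The soft-gating MoE combination can be sketched in a few lines. The linear gate and the two-expert (EF/LF) setup below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def soft_gated_moe(features, expert_probs, gate_weights):
    """Blend per-expert susceptibility probabilities with a soft gate
    computed from the input features (one gate-weight row per expert)."""
    gate = softmax(gate_weights @ features)
    return float(gate @ expert_probs)

features = np.array([0.2, -0.1, 0.7])  # toy per-cell covariates
expert_probs = np.array([0.84, 0.62])  # e.g. EF and LF flood scores
gate_weights = np.zeros((2, 3))        # untrained gate -> uniform mix
print(soft_gated_moe(features, expert_probs, gate_weights))
```

With an untrained (all-zero) gate the output is simply the mean of the expert scores; training the gate lets the blend adapt spatially, which is how the MoE exploits the complementary behavior of EF and LF.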
[457] Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design
Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow, John Bradshaw, Patricia Suriana, Chen Cheng, Kangway Chuang
Main category: cs.LG
Abstract: Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery; by combining carefully-designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.
[458] Geometric regularization of autoencoders via observed stochastic dynamics
Sean Hill, Felix X. -F. Ye
Main category: cs.LG
Abstract: Stochastic dynamical systems with slow or metastable behavior evolve, on long time scales, on an unknown low-dimensional manifold in high-dimensional ambient space. Building a reduced simulator from short-burst ambient ensembles is a long-standing problem: local-chart methods like ATLAS suffer from exponential landmark scaling and per-step reprojection, while autoencoder alternatives leave tangent-bundle geometry poorly constrained, and the errors propagate into the learned drift and diffusion. We observe that the ambient covariance $\Lambda$ already encodes coordinate-invariant tangent-space information, its range spanning the tangent bundle. Using this, we construct a tangent-bundle penalty and an inverse-consistency penalty for a three-stage pipeline (chart learning, latent drift, latent diffusion) that learns a single nonlinear chart and the latent SDE. The penalties induce a function-space metric, the $\rho$-metric, strictly weaker than the Sobolev $H^1$ norm yet achieving the same chart-quality generalization rate up to logarithmic factors. For the drift, we derive an encoder-pullback target via Itô's formula on the learned encoder and prove a bias decomposition showing the standard decoder-side formula carries systematic error for any imperfect chart. Under a $W^{2,\infty}$ chart-convergence assumption, chart-level error propagates controllably to weak convergence of the ambient dynamics and to convergence of radial mean first-passage times. Experiments on four surfaces embedded in up to $201$ ambient dimensions reduce radial MFPT error by $50$–$70\%$ under rotation dynamics and achieve the lowest inter-well MFPT error on most surface–transition pairs under metastable Müller–Brown Langevin dynamics, while reducing end-to-end ambient coefficient errors by up to an order of magnitude relative to an unregularized autoencoder.
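The encoder-pullback drift target follows from the standard Itô change of variables; as a sketch (the paper's precise estimator may differ in details): if the ambient state obeys $dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t$ and $\varphi$ is the learned encoder, then the latent process $Z_t = \varphi(X_t)$ has drift components

$$
b^{z}_{i}(x) \;=\; \nabla\varphi_i(x)\cdot b(x) \;+\; \tfrac{1}{2}\,\mathrm{Tr}\!\left(\sigma(x)\sigma(x)^{\top}\,\nabla^{2}\varphi_i(x)\right),
$$

which can be evaluated through the encoder alone; the bias decomposition in the abstract concerns the decoder-side analogue of this target.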
[459] Heterogeneous Sheaf Neural Networks
Luke Braithwaite, Alessio Borgi, Gabriele Onorato, Kristjan Tarantelli, Iulia Duta, Francesco Restuccia, Fabrizio Silvestri, Pietro Liò
Main category: cs.LG
Abstract: Heterogeneous graphs, whose nodes and edges may belong to different types and feature spaces, arise in a wide variety of real-world domains such as biology, chemistry and computer networks. Existing methods typically address this heterogeneity by modifying the model architecture itself, which often results in specialized and parameter-intensive designs. To address this issue, we propose HetSheaf, a framework that models heterogeneous relational data through cellular sheaves, which provide a principled topological framework for encoding type-specific local feature spaces and their interactions directly in the data representation. We also introduce a family of heterogeneous sheaf predictors that learn restriction maps conditioned on node and edge types. To enable graph-level predictions, we further propose SheafPool, a graph pooling mechanism that aggregates node representations in stalk space while remaining invariant to local changes of basis, ultimately enabling stalk-space graph-level representations for the first time. HetSheaf achieves strong predictive performance on standard heterogeneous graph benchmarks, over numerous tasks such as node/graph classification, link prediction and recommendation, while reducing by up to 10x the number of parameters with respect to state-of-the-art baselines.
[460] AutoNFS: Automatic Neural Feature Selection
Witold Wydmański, Marek Śmieja
Main category: cs.LG
Abstract: Not available (arXiv API request for 2503.13304 returned HTTP 429).
[461] Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba
Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, Anandi Dutta, Amir Rafe, Subasish Das
Main category: cs.LG
Abstract: Not available (arXiv API request for 2503.18970 returned HTTP 429).
[462] An Information-Geometric Approach to Artificial Curiosity
Alexander Nedergaard, Pablo A. Morales
Main category: cs.LG
Abstract: Not available (arXiv API request for 2504.06355 returned HTTP 429).
[463] A Tale of Two Learning Algorithms: Multiple Stream Random Walk and Asynchronous Gossip
Peyman Gholami, Hulya Seferoglu
Main category: cs.LG
Abstract: Not available (arXiv API request for 2504.09792 returned HTTP 429).
[464] Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
Main category: cs.LG
Abstract: Not available (arXiv API request for 2504.20966 returned HTTP 429).
[465] PyLO: Towards Accessible Learned Optimizers in PyTorch
Paul Janson, Benjamin Therien, Quentin Anthony, Xiaolong Huang, Abhinav Moudgil, Eugene Belilovsky
Main category: cs.LG
Abstract: Not available (arXiv API request for 2506.10315 returned HTTP 429).
[466] HiPreNets: High-Precision Neural Networks through Progressive Training
Ethan Mulle, Wei Kang, Qi Gong
Main category: cs.LG
Abstract: Not available (arXiv API request for 2506.15064 returned HTTP 429).
[467] Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation
Pedro R. Pires, Gregorio F. Azevedo, Pietro L. Campos, Rafael T. Sereicikas, Tiago A. Almeida
Main category: cs.LG
Abstract: Not available (arXiv API request for 2507.18756 returned HTTP 429).
[468] Self-Aligned Reward: Towards Effective and Efficient Reasoners
Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, Chris Kong
Main category: cs.LG
Abstract: Not available (arXiv API request for 2509.05489 returned HTTP 429).
[469] Examining the Relationship between Scientific Publishing Activity and Hype-Driven Financial Bubbles: A Comparison of the Dot-Com and AI Eras
Aksheytha Chelikavada, Casey C. Bennett
Main category: cs.LG
Abstract: Not available (arXiv API request for 2509.11982 returned HTTP 429).
[470] Online Distributionally Robust LLM Alignment via Regression to Relative Reward
Sharan Sahu, Martin T. Wells
Main category: cs.LG
Abstract: Not available (arXiv API request for 2509.19104 returned HTTP 429).
[471] Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
Jinhui Bai, Andreas Christmann, Lei Shi
Main category: cs.LG
Abstract: Not available (arXiv API request for 2510.04237 returned HTTP 429).
[472] COMPASS: Benchmarking Constrained Optimization in LLM Agents
Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao
Main category: cs.LG
Abstract: Not available (arXiv API request for 2510.07043 returned HTTP 429).
[473] On Optimal Hyperparameters for Differentially Private Deep Transfer Learning
Aki Rehn, Linzh Zhao, Mikko A. Heikkilä, Antti Honkela
Main category: cs.LG
Abstract: Not available (arXiv API request for 2510.20616 returned HTTP 429).
[474] Joint Score-Threshold Optimization for Interpretable Risk Assessment
Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Kimia Ghobadi
Main category: cs.LG
Abstract: Not available (arXiv API request for 2510.21934 returned HTTP 429).
[475] In-Context Distillation with Self-Consistency Cascades: A Simple, Training-Free Way to Reduce LLM Agent Costs
Vishnu Sarukkai, Asanshay Gupta, James Hong, Michaël Gharbi, Kayvon Fatahalian
Main category: cs.LG
Abstract: Not available (arXiv API request for 2512.02543 returned HTTP 429).
[476] Teaching Language Models Mechanistic Explainability Through MechSMILES
Théo A. Neukomm, Zlatko Jončev, Philippe Schwaller
Main category: cs.LG
Abstract: Unavailable (arXiv:2512.05722; summary request returned HTTP 429).
[477] OXtal: An All-Atom Diffusion Model for Organic Crystal Structure Prediction
Emily Jin, Andrei Cristian Nica, Mikhail Galkin, Jarrid Rector-Brooks, Kin Long Kelvin Lee, Santiago Miret, Frances H. Arnold, Michael Bronstein, Avishek Joey Bose, Alexander Tong, Cheng-Hao Liu
Main category: cs.LG
Abstract: Unavailable (arXiv:2512.06987; summary request returned HTTP 429).
[478] Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
Main category: cs.LG
Abstract: Unavailable (arXiv:2512.07173; summary request returned HTTP 429).
[479] Dynamic Tool Dependency Retrieval for Lightweight Function Calling
Bhrij Patel, Davide Belli, Amir Jalalirad, Maximilian Arnold, Aleksandr Ermolov, Bence Major
Main category: cs.LG
Abstract: Unavailable (arXiv:2512.17052; summary request returned HTTP 429).
[480] Unsupervised domain adaptation for radioisotope identification in gamma spectroscopy
Peter Lalor, Ayush Panigrahy, Alex Hagen
Main category: cs.LG
Abstract: Unavailable (arXiv:2603.05719; summary request returned HTTP 429).
[481] Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks
Yuval Ran-Milo
Main category: cs.LG
Abstract: Unavailable (arXiv:2603.11487; summary request returned HTTP 429).
[482] SLE-FNO: Single-Layer Extensions for Task-Agnostic Continual Learning in Fourier Neural Operators
Mahmoud Elhadidy, Roshan M. D’Souza, Amirhossein Arzani
Main category: cs.LG
Abstract: Unavailable (arXiv:2603.20410; summary request returned HTTP 429).
[483] Scalable Maximum Entropy Population Synthesis via Persistent Contrastive Divergence
Mirko Degli Esposti
Main category: cs.LG
Abstract: Unavailable (arXiv:2603.27312; summary request returned HTTP 429).
[484] Plateaus, Optima, and Overfitting in Multi-Layer Perceptrons: A Saddle-Saddle-Attractor Scenario
Alex Alì Maleknia, Yuzuru Sato
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.02393; summary request returned HTTP 429).
[485] Restless Bandits with Individual Penalty Constraints: Near-Optimal Indices and Deep Reinforcement Learning
Nida Zamir, I-Hong Hou
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.04101; summary request returned HTTP 429).
[486] AdaBoost Does Not Always Cycle: A Computer-Assisted Counterexample
Erik Y. Wang
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.07055; summary request returned HTTP 429).
[487] Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis
Haonan Zhu, Adrienne Deganutti, Elad Hirsch, Purvanshi Mehta
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.08809; summary request returned HTTP 429).
[488] Beyond Fixed False Discovery Rates: Post-Hoc Conformal Selection with E-Variables
Meiyi Zhu, Osvaldo Simeone
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.11305; summary request returned HTTP 429).
[489] When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse
Yuncong Liu, Yuan Wan, Zhou Jiang, Yao Lu
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.14333; summary request returned HTTP 429).
[490] Benchmarking Optimizers for MLPs in Tabular Deep Learning
Yury Gorishniy, Ivan Rubachev, Dmitrii Feoktistov, Artem Babenko
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.15297; summary request returned HTTP 429).
[491] Adaptive Spatio-temporal Estimation on the Graph Edges via Line Graph Transformation
Yi Yan, Ercan Engin Kuruoglu
Main category: cs.LG
Abstract: Unavailable (arXiv:2311.00656; summary request returned HTTP 429).
[492] Estimating Joint Interventional Distributions from Marginal Interventional Data
Sergio Hernan Garrido Mejia, Elke Kirschbaum, Armin Kekić, Bernhard Schölkopf, Atalanti Mastakouri
Main category: cs.LG
Abstract: Unavailable (arXiv:2409.01794; summary request returned HTTP 429).
[493] Resource-efficient equivariant quantum convolutional neural networks
Koki Chinzei, Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
Main category: cs.LG
Abstract: Unavailable (arXiv:2410.01252; summary request returned HTTP 429).
[494] Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach
Qin Fang, Lei Shi, Min Xu, Ding-Xuan Zhou
Main category: cs.LG
Abstract: Unavailable (arXiv:2503.07976; summary request returned HTTP 429).
[495] What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context
Zhongyu Ouyang, Qianlong Wen, Chunhui Zhang, Yanfang Ye, Soroush Vosoughi
Main category: cs.LG
Abstract: Unavailable (arXiv:2506.02261; summary request returned HTTP 429).
[496] Sequential Regression Learning with Randomized Algorithms
Dorival Leão, Reiko Aoki, Alberto Ohashi, Teh Led Red
Main category: cs.LG
Abstract: Unavailable (arXiv:2507.03759; summary request returned HTTP 429).
[497] Modeling Parkinson’s Disease Progression Using Longitudinal Voice Biomarkers: A Comparative Study of Statistical and Neural Mixed-Effects Models
Ran Tong, Lanruo Wang, Tong Wang, Wei Yan
Main category: cs.LG
Abstract: Unavailable (arXiv:2507.20058; summary request returned HTTP 429).
[498] Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks
Dionysios Adamopoulos, Anastasia Poulopoulou, Georgios Goumas, Christina Giannoula
Main category: cs.LG
Abstract: Unavailable (arXiv:2511.20834; summary request returned HTTP 429).
[499] The Machine Learning Approach to Moment Closure Relations for Plasma: A Review
Samuel Burles, Enrico Camporeale
Main category: cs.LG
Abstract: Unavailable (arXiv:2511.22486; summary request returned HTTP 429).
[500] Comparing the latent features of universal machine-learning interatomic potentials
Sofiia Chorna, Davide Tisi, Cesare Malosso, Wei Bin How, Michele Ceriotti, Sanggyu Chong
Main category: cs.LG
Abstract: Unavailable (arXiv:2512.05717; summary request returned HTTP 429).
[501] Robustness Verification of Polynomial Neural Networks
Yulia Alexandr, Hao Duan, Guido Montúfar
Main category: cs.LG
Abstract: Unavailable (arXiv:2602.06105; summary request returned HTTP 429).
[502] Solving Inverse Parametrized Problems via Finite Elements and Extreme Learning Networks
Erik Burman, Mats G. Larson, Karl Larsson, Jonatan Vallin
Main category: cs.LG
Abstract: Unavailable (arXiv:2602.14757; summary request returned HTTP 429).
[503] Scalable Posterior Uncertainty for Flexible Density-Based Clustering
Nicola Bariletto, Stephen G. Walker
Main category: cs.LG
Abstract: Unavailable (arXiv:2603.03188; summary request returned HTTP 429).
[504] Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer
Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.09414; summary request returned HTTP 429).
[505] Bias in Surface Electromyography Features across a Demographically Diverse Cohort
Aditi Agrawal, Celine John Philip, Giancarlo K. Sagastume, Marcus A. Battraw, Wilsaan M. Joiner, Jonathon S. Schofield, Lee M. Miller, Richard S. Whittle
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.14460; summary request returned HTTP 429).
[506] Optimal algorithmic complexity of inference in quantum kernel methods
Elies Gil-Fuster, Seongwook Shin, Sofiene Jerbi, Jens Eisert, Maximilian J. Kramer
Main category: cs.LG
Abstract: Unavailable (arXiv:2604.15214; summary request returned HTTP 429).
cs.MA
[507] InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control
Kieran A. Murphy
Main category: cs.MA
Abstract: We propose InfoChess, a symmetric adversarial game that elevates competitive information acquisition to the primary objective. There is no piece capture, removing material incentives that would otherwise confound the role of information. Instead, pieces are used to alter visibility. Players are scored on their probabilistic inference of the opponent’s king location over the duration of the game. To explore the space of strategies for playing InfoChess, we introduce a hierarchy of heuristic agents defined by increasing levels of opponent modeling, and train a reinforcement learning agent that outperforms these baselines. Leveraging the discrete structure of the game, we analyze gameplay through natural information-theoretic characterizations that include belief entropy, oracle cross entropy, and predictive log score under the action-induced observation channel. These measures disentangle epistemic uncertainty, calibration mismatch, and uncertainty induced by adversarial movement. The design of InfoChess renders it a testbed for studying multi-agent inference under partial observability. We release code for the environment and agents, and a public interface to encourage further study.
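The belief entropy and oracle cross entropy used to score InfoChess gameplay are standard information-theoretic quantities; as a minimal sketch (assuming the usual definitions in bits, which may differ in detail from the paper's exact formulations):

```python
import numpy as np

def belief_entropy(p: np.ndarray) -> float:
    """Shannon entropy (bits) of a belief distribution over board squares.

    High entropy = the player is still uncertain where the opponent's
    king is; zero entropy = the belief is a point mass.
    """
    p = p[p > 0]  # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

def oracle_cross_entropy(p: np.ndarray, true_square: int) -> float:
    """Cross entropy of the belief against the oracle's one-hot truth:
    the surprisal (bits) the believer assigns to the king's real square."""
    return float(-np.log2(p[true_square]))

# A uniform belief over an 8x8 board carries log2(64) = 6 bits of uncertainty.
uniform = np.full(64, 1.0 / 64)
print(belief_entropy(uniform))            # 6.0
print(oracle_cross_entropy(uniform, 27))  # 6.0
```

The gap between the two (cross entropy minus entropy) is exactly the kind of calibration mismatch the abstract says these measures disentangle.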
[508] Scalable Algorithms with Provable Optimality Bounds for the Multiple Watchman Route Problem
Srikar Gouru, Ariel Felner, Jiaoyang Li
Main category: cs.MA
Abstract: In this paper, we tackle the Multiple Watchman Route Problem (MWRP), which aims to find a set of paths that M watchmen can follow such that every location on the map can be seen by at least one watchman. First, we propose multiple methods to reduce the state space over which a search needs to be conducted by pruning map areas that are guaranteed to be seen en route to other areas. Next, we introduce MWRP-CP3, an efficient optimal planner that combines these methods with techniques that improve the quality and calculation time of existing heuristics. We present several suboptimal algorithms with bounds on solution quality, including MxWA*, a general variant of weighted A* for makespan problems. We also present anytime variations of our suboptimal algorithms, as well as techniques to improve an existing suboptimal solution by solving multiple decomposed sub-problems. We show that MWRP-CP3 can reduce the search space by more than 95% and runs more than 200x faster than existing optimal algorithms on 2D grid maps. We also show that our suboptimal algorithms solve maps 3x larger than those solvable by MWRP-CP3. See mwrp-cp3.github.io for the open source codebase and video demonstrations.
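MxWA* itself is not specified in the abstract; for orientation, here is a minimal sketch of the standard weighted A* it generalizes, where inflating the heuristic by a factor w > 1 trades a bounded loss of optimality (cost at most w times optimal) for fewer expansions (the graph and heuristic here are illustrative placeholders):

```python
import heapq

def weighted_astar(start, goal, neighbors, h, w=1.5):
    """Weighted A*: expand nodes by f = g + w * h(n).

    `neighbors(n)` yields (successor, edge_cost) pairs; `h` must be
    admissible for the w-suboptimality bound to hold.
    Returns (path_cost, path) or (inf, []) if the goal is unreachable.
    """
    open_heap = [(w * h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while open_heap:
        _, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale heap entry superseded by a cheaper one
        for succ, cost in neighbors(node):
            ng = g + cost
            if ng < best_g.get(succ, float("inf")):
                best_g[succ] = ng
                heapq.heappush(open_heap, (ng + w * h(succ), ng, succ, path + [succ]))
    return float("inf"), []

# Tiny illustrative graph: A -> B -> C is cheaper than A -> C directly.
graph = {"A": [("B", 1.0), ("C", 4.0)], "B": [("C", 1.0)], "C": []}
cost, path = weighted_astar("A", "C", lambda n: graph[n], h=lambda n: 0.0)
print(cost, path)  # 2.0 ['A', 'B', 'C']
```

For a makespan objective such as MWRP's, the scalar g would be replaced by the maximum over the watchmen's individual path costs, which is the adaptation the paper's MxWA* addresses.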
[509] AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis
Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu, Shuo Ren, Yiru Wang, Rongliang Fu, Tinghuan Chen, Tsung-Yi Ho
Main category: cs.MA
Abstract: Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a highly complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Because of this complexity, both world-class astronomical organizations such as NASA and expert enthusiasts devote a great deal of time and effort to it: the stages of the astronomical imaging process have intricate underlying correlations that significantly influence one another, making quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experiment results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.
[510] Veritas-RPM: Provenance-Guided Multi-Agent False Positive Suppression for Remote Patient Monitoring
Aswini Misro, Vikash Sharma, Shreyank N Gowda
Main category: cs.MA
Abstract: We present Veritas-RPM, a provenance-guided multi-agent architecture comprising five processing layers: VeritasAgent (ground-truth assembly), SentinelLayer (anomaly detection), DirectorAgent (specialist routing), six domain Specialist Agents, and MetaSentinelAgent (conflict resolution and final decision). We construct a 98-case synthetic taxonomy of false-positive scenarios derived from documented RPM patterns. Synthetic patient epochs (n = 530) were generated directly from taxonomy parameters and processed through the pipeline. Ground-truth labels are known for all cases. Performance is reported as True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR).
[511] LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading
Chengwei Lou, Zekai Jin, Wei Tang, Guangfei Geng, Jin Yang, Lu Zhang
Main category: cs.MA
Abstract: Real-time peer-to-peer (P2P) electricity markets dynamically adapt to fluctuations in renewable energy and variations in demand, maximizing economic benefits through instantaneous price responses while enhancing grid flexibility. However, scaling expert guidance for massive personalized prosumers poses critical challenges, including diverse decision-making demands and a lack of customized modeling frameworks. This paper proposes an integrated large language model-multi-agent reinforcement learning (LLM-MARL) framework for real-time P2P energy trading to address challenges such as the limited technical capability of prosumers, the lack of expert experience, and security issues of distribution networks. LLMs are introduced as experts to generate personalized strategies, guiding MARL under the centralized training with decentralized execution (CTDE) paradigm through imitation. To handle the scalability issues inherent in large-scale P2P networks, a differential attention-based critic network is introduced to efficiently extract key interaction features and enhance convergence. Experimental results demonstrate that LLM-generated strategies effectively substitute human experts. The proposed imitative expert MARL algorithms achieve significantly lower economic costs and voltage violation rates on test sets compared to baseline algorithms, while maintaining robust stability. This paper provides an effective solution for the real-time decision-making of the P2P electricity market by bridging expert knowledge with agent learning.
cs.MM
[512] MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection
Yeganeh Abdollahinejad, Ahmad Mousavi, Naeemul Hassan, Kai Shu, Nathalie Japkowicz, Shahriar Khosravi, Amir Karami
Main category: cs.MM
Abstract: The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content alone, but also from semantic inconsistencies between modalities and their evolution over time. Existing multimodal misinformation detection methods typically model cross-modal interactions statically and often show limited robustness across heterogeneous datasets, domains, and narrative settings. To address these challenges, we propose MOMENTA, a unified framework for multimodal misinformation detection that captures modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. MOMENTA employs modality-specific mixture-of-experts modules to model diverse misinformation patterns, bidirectional co-attention to align textual and visual representations in a shared semantic space, and a discrepancy-aware branch to explicitly capture semantic disagreement between modalities. To model narrative evolution, we introduce an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping time windows, enabling the framework to capture both short-term fluctuations and longer-term trends in misinformation propagation. In addition, domain-adversarial learning and a prototype memory bank improve domain invariance and stabilize representation learning across datasets. The model is trained using a multi-objective optimization strategy that jointly enforces classification performance, cross-modal alignment, contrastive learning, temporal consistency, and domain robustness. Experiments on Fakeddit, MMCoVaR, Weibo, and XFacta show that MOMENTA achieves strong, consistent results across accuracy, F1-score, AUC, and MCC, highlighting its effectiveness for multimodal misinformation detection.
[513] Subjective and Objective Quality-of-Experience Evaluation Study for Live Video Streaming
Zehao Zhu, Wei Sun, Jun Jia, Wei Wu, Sibin Deng, Kai Li, Ying Chen, Xiongkuo Min, Jia Wang, Guangtao Zhai
Main category: cs.MM
Abstract: In recent years, live video streaming has gained widespread popularity across various social media platforms. Quality of experience (QoE), which reflects end-users’ satisfaction and overall experience, plays a critical role for media service providers in optimizing large-scale live compression and transmission strategies to achieve a perceptually optimal rate-distortion trade-off. Although many QoE metrics for video-on-demand (VoD) have been proposed, there remain significant challenges in developing QoE metrics for live video streaming. To bridge this gap, we conduct a comprehensive study of subjective and objective QoE evaluations for live video streaming. For the subjective QoE study, we introduce the first live video streaming QoE dataset, TaoLive QoE, which consists of 42 source videos collected from real live broadcasts and 1,155 corresponding distorted ones degraded by a variety of streaming distortions, including conventional streaming distortions such as compression and stalling, as well as live streaming-specific distortions like frame skipping and variable frame rate. Subsequently, a human study was conducted to derive subjective QoE scores for videos in the TaoLive QoE dataset. For the objective QoE study, we benchmark existing QoE models on the TaoLive QoE dataset as well as publicly available QoE datasets for VoD scenarios, highlighting that current models struggle to accurately assess video QoE, particularly for live content. Hence, we propose an end-to-end QoE evaluation model, Tao-QoE, which integrates multi-scale semantic features and optical flow-based motion features to predict a retrospective QoE score, eliminating reliance on statistical quality of service (QoS) features.
[514] Concept Drift Guided LayerNorm Tuning for Efficient Multimodal Metaphor Identification
Wenhao Qian, Zhenzhen Hu, Zijie Song, Jia Li
Main category: cs.MM
Abstract: Metaphorical imagination, the ability to connect seemingly unrelated concepts, is fundamental to human cognition and communication. While understanding linguistic metaphors has advanced significantly, grasping multimodal metaphors, such as those found in internet memes, presents unique challenges due to their unconventional expressions and implied meanings. Existing methods for multimodal metaphor identification often struggle to bridge the gap between literal and figurative interpretations. Additionally, generative approaches that utilize large language models or text-to-image models, while promising, suffer from high computational costs. This paper introduces Concept Drift Guided LayerNorm Tuning (CDGLT), a novel and training-efficient framework for multimodal metaphor identification. CDGLT incorporates two key innovations: (1) Concept Drift, a mechanism that leverages Spherical Linear Interpolation (SLERP) of cross-modal embeddings from a CLIP encoder to generate a new, divergent concept embedding. This drifted concept helps to narrow the gap between literal features and the figurative task. (2) A prompt construction strategy that adapts the method of feature extraction and fusion using pre-trained language models for the multimodal metaphor identification task. CDGLT achieves state-of-the-art performance on the MET-Meme benchmark while significantly reducing training costs compared to existing generative methods. Ablation studies demonstrate the effectiveness of both Concept Drift and our adapted LN Tuning approach. Our method represents a significant step towards efficient and accurate multimodal metaphor understanding. The code is available at https://github.com/Qianvenh/CDGLT.
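The SLERP operation at the heart of Concept Drift interpolates along the great circle between two embeddings on the unit hypersphere; a minimal sketch of standard SLERP (the paper's exact variant over CLIP embeddings may differ):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two embedding vectors.

    t=0 returns the direction of `a`, t=1 the direction of `b`;
    intermediate t traces the great-circle arc between them.
    """
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    # Angle between the unit vectors, clipped for numerical safety.
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if omega < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1.0 - t) * a_n + t * b_n
    return (np.sin((1.0 - t) * omega) * a_n + np.sin(t * omega) * b_n) / np.sin(omega)

# Midpoint of two orthogonal unit vectors stays on the unit sphere,
# unlike plain linear interpolation, which would shrink its norm.
mid = slerp(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5)
print(mid)  # [0.70710678 0.70710678]
```

Keeping interpolants on the hypersphere matters for CLIP-style embeddings, whose similarity is cosine-based, which is presumably why the authors prefer SLERP over linear mixing.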
[515] MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
Huanran Hu, Zihui Ren, Dingyi Yang, Liangyu Chen, Qixiang Gao, Tiezheng Ge, Qin Jin
Main category: cs.MM
Abstract: Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark. Models trained on MCSC-Bench achieve SOTA performance, with an 8B model surpassing Gemini-2.5-Pro, and generalize to out-of-domain scenarios. Downstream video generation guided by the generated scripts further validates the practical value of MCSC. The datasets will be made publicly available soon.
eess.AS
[516] XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection
Kwok-Ho Ng, Tingting Song, Yongdong Wu, Zhihua Xia
Main category: eess.AS
Abstract: Not available (the arXiv fetch for 2601.02944 failed with HTTP 503).
[517] Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models
Pengbo Lyu, Xiangyu Zhao, Chengwei Liu, Haoyin Yan, Xiaotao Liang, Hongyu Wang, Shaofei Xue
Main category: eess.AS
Abstract: Not available (the arXiv fetch for 2604.09371 failed with HTTP 429).
eess.IV
[518] Portable Medical Imaging in Modern Healthcare: Fundamentals, AI-Based Taxonomy, Image Quality, and Open Challenges
Yassine Habchi, Hamza Kheddar, Muhammad Ali Qureshi, Mohamed Seghier, Azeddine Beghdadi
Main category: eess.IV
Abstract: Portable medical imaging (PMI) has emerged as an important solution for point-of-care diagnosis in emergency, rural, and resource-limited settings where conventional imaging infrastructure is not readily available. Modalities such as portable computed tomography, portable magnetic resonance imaging, portable ultrasound, and wireless capsule endoscopy improve access to timely diagnosis, but they remain highly vulnerable to image-quality degradation caused by motion artifacts, environmental interference, hardware limitations, and unstable acquisition conditions. This review provides a systematic and quality-centered synthesis of recent advances in PMI. It introduces a taxonomy of AI-based PMI methods spanning machine learning, deep learning, transfer learning, and Transformer-based approaches, and examines their roles in image enhancement, reconstruction, quality assessment, detection, and classification. The review also analyzes PMI devices, sensing pipelines, modality-specific distortions, evaluation metrics, and publicly available datasets. In contrast to existing surveys that are mainly modality-driven or application-focused, this work emphasizes the relationship between image quality, AI robustness, and clinical usability in portable settings. Finally, it identifies current research gaps and outlines future directions toward reliable, interpretable, and clinically deployable PMI systems.
[519] RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference
Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu, Rongjun Ge, Yang Chen, Yuting He
Main category: eess.IV
Abstract: Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.
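RelativeFlow builds on flow matching, which trains a network to regress a velocity field along an interpolation path between two samples. A generic (not paper-specific) sketch of how one conditional flow-matching training pair is formed on a linear path; the 8-dimensional vectors stand in for flattened images and are purely illustrative:

```python
import numpy as np

def flow_matching_pair(x_source, x_target, t):
    """One conditional flow-matching training pair on a linear path.

    Returns the interpolated point x_t and the target velocity that a
    network v_theta(x_t, t) would be regressed onto with an MSE loss.
    """
    x_t = (1.0 - t) * x_source + t * x_target
    velocity = x_target - x_source       # constant along a linear path
    return x_t, velocity

x0 = np.random.randn(8)   # noisier input (flattened, illustrative)
x1 = np.random.randn(8)   # noisy reference
x_t, v = flow_matching_pair(x0, x1, t=0.3)
```

The paper's contribution is to make such source/target pairs *relative* (noisier-to-noisy rather than noise-to-clean) and to compose the resulting flows; this sketch only shows the underlying flow-matching primitive.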
[520] CTSCAN: Evaluation Leakage in Chest CT Segmentation and a Reproducible Patient-Disjoint Benchmark
Anton Ivchenko
Main category: eess.IV
Abstract: Reported chest CT segmentation performance can be strongly inflated when train and test partitions mix slices from the same study. We present CTSCAN, a reproducible multi-source chest CT benchmark and research stack designed to measure what survives under patient-disjoint evaluation. The current four-class artifact aggregates 89 cases from PleThora, MedSeg SIRM, and LongCIU, and we show that the original slice-PNG workflow induces near-complete case reuse across train, validation, and test. Using the playground environment, we run a multi-seed protocol sweep with the same FPN plus EfficientNet-B0 control configuration under slice-mixed and case-disjoint evaluation. Across 3 seeds and 12 epochs per seed, the slice-mixed protocol reaches 0.6665 foreground Dice and 0.5031 foreground IoU, whereas the case-disjoint protocol reaches 0.2066 Dice and 0.1181 IoU. Removing patient reuse therefore reduces foreground Dice by 0.4599 absolute (69.00% relative) and foreground IoU by 0.3850 absolute (76.52% relative). CTSCAN packages the corrected benchmark with deterministic split manifests, explicit weak-supervision controls, a scripted multi-seed protocol sweep, and reproducible figure generation, providing a reusable basis for patient-disjoint chest CT evaluation.
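The leakage described above arises when slices from one study land in both train and test partitions; the fix is to split over patients first, then expand back to slices. A minimal sketch of such a patient-disjoint split; the case IDs, slice counts, and 20% test fraction are hypothetical, not the paper's configuration:

```python
import random
from collections import defaultdict

def patient_disjoint_split(slices, test_frac=0.2, seed=0):
    """Split (patient_id, slice_id) records so no patient spans partitions."""
    by_patient = defaultdict(list)
    for pid, sid in slices:
        by_patient[pid].append(sid)
    patients = sorted(by_patient)
    rng = random.Random(seed)
    rng.shuffle(patients)                     # deterministic given seed
    n_test = max(1, int(len(patients) * test_frac))
    test_patients = set(patients[:n_test])
    train = [(p, s) for p, s in slices if p not in test_patients]
    test = [(p, s) for p, s in slices if p in test_patients]
    return train, test

# Hypothetical: 5 patients with 4 slices each
records = [(f"case{p}", f"slice{s}") for p in range(5) for s in range(4)]
train, test = patient_disjoint_split(records)
train_pids = {p for p, _ in train}
test_pids = {p for p, _ in test}
assert train_pids.isdisjoint(test_pids)       # no patient-level leakage
```

A slice-mixed split, by contrast, would shuffle `records` directly, letting near-duplicate adjacent slices of the same study appear on both sides — which is exactly the inflation the benchmark quantifies.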
[521] Topology-Driven Fusion of nnU-Net and MedNeXt for Accurate Brain Tumor Segmentation on Sub-Saharan Africa Dataset
Prabin Bohara, Pralhad Kumar Shrestha, Arpan Rai, Usha Poudel Lamgade, Confidence Raymond, Dong Zhang, Aondona Lorumbu, Craig Jones, Mahesh Shakya, Bishesh Khanal, Pratibha Kulung
Main category: eess.IV
Abstract: Accurate automatic brain tumor segmentation in Low and Middle-Income (LMIC) countries is challenging due to the lack of defined national imaging protocols, diverse imaging data, extensive use of low-field Magnetic Resonance Imaging (MRI) scanners, and limited health-care resources. As part of the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge, we applied topology refinement to state-of-the-art segmentation models such as nnU-Net, MedNeXt, and a combination of both. Since the BraTS-Africa dataset has low MRI image quality, we incorporated the BraTS 2025 challenge data of pre-treatment adult glioma (Task 1) to pre-train the segmentation model and used it to fine-tune on the BraTS-Africa dataset. We added an extra topology refinement module to address deformations in predictions arising from topological errors. With the introduction of this module, we achieved improved Normalized Surface Distance (NSD) scores of 0.810, 0.829, and 0.895 on Surrounding Non-Enhancing FLAIR Hyperintensity (SNFH), Non-Enhancing Tumor Core (NETC), and Enhancing Tumor (ET), respectively.
[522] Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration
Baramee Sukumal, Aueaphum Aueawatthanaphisut
Main category: eess.IV
Abstract: Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.
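The weighted decision-level integration described above can be sketched as a convex combination of the per-modality class probability vectors. A minimal sketch, assuming five output classes as listed in the abstract; the 0.6/0.4 weighting and the example probabilities are illustrative assumptions, not the paper's learned values:

```python
import numpy as np

CLASSES = ["adenocarcinoma", "squamous", "large_cell", "small_cell", "normal"]

def fuse_predictions(p_ct, p_hist, w_ct=0.6):
    """Weighted decision-level fusion of two modality probability vectors."""
    p_ct = np.asarray(p_ct, dtype=float)
    p_hist = np.asarray(p_hist, dtype=float)
    fused = w_ct * p_ct + (1.0 - w_ct) * p_hist
    fused /= fused.sum()          # renormalize to a valid distribution
    return fused

p_ct = np.array([0.5, 0.2, 0.1, 0.1, 0.1])    # radiology (CT) branch
p_hist = np.array([0.3, 0.4, 0.1, 0.1, 0.1])  # histopathology (H&E) branch
fused = fuse_predictions(p_ct, p_hist)
predicted = CLASSES[int(np.argmax(fused))]
```

Fusing at the decision level (rather than concatenating features) keeps each branch independently trainable and lets the weight encode relative trust in each modality.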
[523] Deep Learning-Enabled Modality Transfer Between Independent Microscopes for High-Throughput Imaging
Dominik Panek, Carina Rząca, Maksymilian Szczypior, Joanna Sorysz, Krzysztof Misztal, Zbigniew Baster, Zenon Rajfur
Main category: eess.IV
Abstract: High-throughput biological imaging is often constrained by a trade-off between acquisition speed and image quality. Fast imaging modalities, such as wide-field fluorescence microscopy, enable large-scale data acquisition but suffer from reduced contrast and resolution, whereas high-resolution techniques, including confocal microscopy or single-molecule localization microscopy-based super-resolution techniques, provide superior image quality at the cost of throughput and instrument time. Here, we present a deep learning-based approach for modality transfer across independent microscopes, enabling the transformation of low-quality images acquired on fast systems into high-quality representations comparable to those obtained using advanced imaging platforms. To achieve this, we employ a generative adversarial network (GAN)-based model trained on paired datasets acquired on physically separate wide-field and confocal microscopes, demonstrating that image quality can be reliably transferred between independent instruments. Quantitative evaluation shows substantial improvement in structural similarity and signal fidelity, with median SSIM and PSNR of 0.94 and 31.87, respectively, compared to 0.83 and 21.48 for the original wide-field images. These results indicate that key structural features can be recovered with high accuracy. Importantly, this approach enables a workflow in which high-throughput imaging can be performed on fast, accessible microscopy systems while preserving the ability to computationally recover high-quality structural information. High-resolution microscopy can then be reserved for targeted validation, reducing acquisition time and improving overall experimental efficiency. Together, our results establish deep learning-enabled modality transfer as a practical strategy for bridging independent microscopy systems and supporting scalable, high-content imaging workflows.
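The PSNR figures quoted above (31.87 dB after transfer vs 21.48 dB for raw wide-field images) follow the standard peak signal-to-noise ratio definition. A minimal sketch for images scaled to [0, 1]; the synthetic image pair and noise level are illustrative, not data from the study:

```python
import numpy as np

def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, data_range]."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")       # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# Illustrative: a clean image corrupted by mild Gaussian noise
rng = np.random.default_rng(0)
clean = rng.random((64, 64))
noisy = np.clip(clean + rng.normal(0, 0.05, clean.shape), 0, 1)
score = psnr(clean, noisy)
```

Higher PSNR means the test image deviates less from the reference, which is how the roughly 10 dB improvement reported above quantifies the recovered signal fidelity.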
[524] Intelligent Healthcare Imaging Platform: A VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation
Samer Al-Hamadani
Main category: eess.IV
Abstract: The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving an average deviation of 80 pixels. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.