Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 93]
- cs.CV [Total: 98]
- cs.AI [Total: 64]
- cs.SD [Total: 5]
- cs.LG [Total: 102]
- cs.MA [Total: 3]
- cs.MM [Total: 1]
- eess.AS [Total: 9]
- eess.IV [Total: 7]
cs.CL
[1] When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
Anamta Khan, Ratna Kandala, Deepti, Sheza Munir, Joyojeet Pal
Main category: cs.CL
Abstract: Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
[2] Shared Lexical Task Representations Explain Behavioral Variability In LLMs
Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick
Main category: cs.CL
Abstract: One of the most common complaints about large language models (LLMs) is their prompt sensitivity – that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task – which we dub lexical task heads – and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs’ internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.
[3] Source-Modality Monitoring in Vision-Language Models
Etha Tianze Hua, Tian Yun, Ellie Pavlick
Main category: cs.CL
Abstract: We define and investigate source-modality monitoring – the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.
[4] Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching
Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong
Main category: cs.CL
Abstract: Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
[5] Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning
Karthic Palaniappan
Main category: cs.CL
Abstract: There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don’t care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of “thinking systems”. With Qwen3-VL-2B-Instruct as base model and 4 $\times$ Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I’ve documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.
[6] Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake
Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi
Main category: cs.CL
Abstract: Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.
[7] Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts
Main category: cs.CL
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
[8] An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation
Mykola Trokhymovych, Yana Oliinyk, Nazarii Nyzhnyk
Main category: cs.CL
Abstract: This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.
[9] TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie, Jiahe Shao, Zijian Lin, Di Wu, Meng Meng, Jian Luan, Zhiyong Wu
Main category: cs.CL
Abstract: While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.
[10] Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation
Weisi Liu, Guangzeng Han, Xiaolei Huang
Main category: cs.CL
Abstract: Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data but deployed on future data, where semantic distributions and domain knowledge may have evolved. Existing studies either overlook temporal shifts or capture only part of the shifting patterns in semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontologies such as MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks over clinical, legal, and scientific corpora, demonstrating consistent improvements from temporal adaptation across these domains. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.
[11] Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Main category: cs.CL
Abstract: Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures – Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) – fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway – despite being the minority component – consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
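As a rough illustration of what component-type placement means in practice, the sketch below picks LoRA target modules by component type from a list of module names. The name fragments (`self_attn.`, `linear_attn.`, `mamba.`) and the example module paths are hypothetical and are not taken from the paper or from either model's actual layout; a real setup would inspect the model's own module names first.

```python
# Hypothetical name fragments used to classify modules by component type.
# Real checkpoints may use different names; inspect the model before relying on these.
COMPONENT_PATTERNS = {
    "attention": ("self_attn.",),
    "recurrent": ("linear_attn.", "mamba."),
}


def select_lora_targets(module_names, component):
    """Return the module names that belong to the requested component type."""
    patterns = COMPONENT_PATTERNS[component]
    return [name for name in module_names if any(p in name for p in patterns)]


# Illustrative module paths for a hybrid model (assumed, not real).
example_modules = [
    "layers.0.linear_attn.in_proj",
    "layers.0.mlp.up_proj",
    "layers.3.self_attn.q_proj",
    "layers.3.self_attn.o_proj",
]

attention_only = select_lora_targets(example_modules, "attention")
recurrent_only = select_lora_targets(example_modules, "recurrent")
```

The resulting list could then be passed to whatever adapter library is in use as its set of target modules, so that only one pathway receives trainable adapters.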
[12] Dissociating Decodability and Causal Use in Bracket-Sequence Transformers
Aryan Sharma, Cutter Dawes, Shivam Raval
Main category: cs.CL
Abstract: When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
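To make the depth and top-of-stack variables concrete, here is a minimal sketch that computes them for a balanced bracket string. It only derives the ground-truth signals that a probe or an attention mask would refer to; the function name and the two-bracket vocabulary are assumptions for illustration, not the paper's setup.

```python
from typing import List, Optional, Tuple

# Closing bracket -> matching opening bracket (assumed two-type Dyck vocabulary).
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


def stack_signals(seq: str) -> List[Tuple[int, Optional[int]]]:
    """After reading each token, return (depth, index of the top-of-stack opener).

    The top-of-stack index is None when no bracket is open.
    """
    stack: List[int] = []  # indices of currently open brackets
    signals = []
    for i, ch in enumerate(seq):
        if ch in OPENERS:
            stack.append(i)
        elif ch in PAIRS:
            if not stack or seq[stack[-1]] != PAIRS[ch]:
                raise ValueError(f"unbalanced bracket at position {i}")
            stack.pop()
        else:
            raise ValueError(f"unexpected symbol {ch!r} at position {i}")
        signals.append((len(stack), stack[-1] if stack else None))
    return signals


signals = stack_signals("([])")
```

Masking attention to the true top-of-stack position, as described in the abstract, would use the second element of each pair as the position to block.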
[13] SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen
Main category: cs.CL
Abstract: Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
[14] Voice Under Revision: Large Language Models and the Normalization of Personal Narrative
Tom van Nuenen
Main category: cs.CL
Abstract: This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to “improve” the text or simply to “rewrite” it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
[15] When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models
Pruthvinath Jeripity Venkata
Main category: cs.CL
Abstract: When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user’s native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
[16] Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models
Ryoma Kumon, Hitomi Yanaka
Main category: cs.CL
Abstract: While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that the mechanisms identified by activation patching generalize out of distribution, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that the manipulation of the identified components improves model performance on acceptability judgment benchmarks.
[17] How Large Language Models Balance Internal Knowledge with User and Document Assertions
Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang
Main category: cs.CL
Abstract: Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model’s ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model’s discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.
[18] Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
Jon-Paul Cacioli
Main category: cs.CL
Abstract: Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
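The trial count in the abstract follows directly from the design (524 items, two elicitation formats, eight models), and a ceiling rate can be read as the share of numeric confidence reports that sit at the top of the scale. The sketch below checks the arithmetic and shows one plausible way to compute such a rate; the exact definition used in the study is not given in the abstract, so treating "ceiling" as a report of 100 is an assumption, and the sample responses are made up.

```python
def ceiling_rate(confidences, ceiling=100):
    """Fraction of confidence reports at or above the ceiling value.

    Assumption: 'ceiling' means the maximum of the 0-100 scale.
    """
    if not confidences:
        raise ValueError("need at least one confidence report")
    return sum(c >= ceiling for c in confidences) / len(confidences)


# Design arithmetic from the abstract: items x formats x models.
items, formats, models = 524, 2, 8
total_trials = items * formats * models

# Made-up responses for illustration only.
example_rate = ceiling_rate([100, 100, 90, 100])
```

A rate near 1.0 means almost every answer is reported with maximal confidence, which leaves little variance for the reports to discriminate correct from incorrect answers.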
[19] Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis
Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu
Main category: cs.CL
Abstract: Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers’ trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.
[20] Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo
Main category: cs.CL
Abstract: This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.
[21] Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion
Fahmida Alam, Mihai Surdeanu, Ellen Riloff
Main category: cs.CL
Abstract: Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
[22] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
Minda Zhao, Yilun Du, Mengyu Wang
Main category: cs.CL
Abstract: As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N=1000$ samples within one response, and Independent Requests, comprising $N=1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and worsens as the sampling horizon $N$ increases. Finally, we demonstrate how these failures propagate into downstream real-world application tasks and introduce systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
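One way to make "statistical validity" concrete is a goodness-of-fit check on the numbers a model returns. The sketch below is only an illustration, not the paper's audit: the abstract does not name its tests, so the chi-square test, the six-sided-die setup, the 5% significance level, and the function names are all assumptions.

```python
from collections import Counter

# 95th percentile of the chi-square distribution with 5 degrees of freedom
# (6 die faces minus 1). Assumed significance level: 5%.
CHI2_CRIT_DF5_P05 = 11.070


def chi_square_uniform(samples, faces=6):
    """Chi-square statistic of `samples` against a uniform die with `faces` sides."""
    counts = Counter(samples)
    expected = len(samples) / faces
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )


def passes_uniform_test(samples, faces=6, critical=CHI2_CRIT_DF5_P05):
    """True when the samples are not rejected as uniform at the assumed level."""
    return chi_square_uniform(samples, faces) <= critical


# Hypothetical model outputs: one perfectly balanced batch, one biased toward 1.
balanced = [1, 2, 3, 4, 5, 6] * 100
biased = [1] * 300 + [2, 3, 4, 5, 6] * 60
```

A batch of samples parsed from a single response, or collected across stateless calls, could be passed to `passes_uniform_test` in the same way; other distributions would need a different expected-count model and critical value.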
[23] Large Language Models Decide Early and Explain Later
Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler
Main category: cs.CL
Abstract: Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model’s final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model’s intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
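The idea of halting once the answer has stabilized can be sketched with a simple patience rule over the intermediate answers elicited at successive reasoning prefixes. This is a hypothetical heuristic written for illustration; the abstract mentions "simple heuristics, including probe-based stopping" without specifying them, so the `patience` parameter and the function name are assumptions.

```python
def stabilization_step(answers, patience=3):
    """Return the index of the first step at which the most recent `patience`
    intermediate answers are identical, or None if that never happens.

    `answers` is the sequence of answers obtained by forcing the model to
    answer after each partial reasoning prefix (assumed to be given).
    """
    run = 0
    for i, answer in enumerate(answers):
        if i > 0 and answer == answers[i - 1]:
            run += 1  # same answer as the previous step: extend the run
        else:
            run = 1   # answer changed (or first step): start a new run
        if run >= patience:
            return i
    return None


# Hypothetical trace: the answer flips once, then stays fixed.
trace = ["A", "B", "B", "B", "B"]
stop_at = stabilization_step(trace, patience=3)
```

Here generation would stop at step index 3, skipping any later reasoning. A larger `patience` trades token savings for a lower risk of stopping before a late answer switch.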
[24] STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation
Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Main category: cs.CL
Abstract: Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.
[25] ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification
Ishaan Gakhar, Harsh Nandwani
Main category: cs.CL
Abstract: The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.
[26] Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam
Main category: cs.CL
Abstract: Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
[27] CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems
Tabinda Sarwar, Farhad Moghimifar, Cong Duy Vu Hoang, Xiaoxiao Ma, Shawn Chang Xu, Fahimeh Saleh, Poorya Zaremoodi, Avirup Sil, Katrin Kirchhoff
Main category: cs.CL
Abstract: NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
[28] Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
Fahmida Alam, Ellen Riloff
Main category: cs.CL
Abstract: Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider’s medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
[29] Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
Weixu Zhang, Fanghua Ye, Qiang Gao, Jian Li, Haolun Wu, Yuxing Tian, Sijing Duan, Nan Du, Xiaolong Li, Xue Liu
Main category: cs.CL
Abstract: Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token’s degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
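The static boosting strategy, a fixed additive bias on source-supported tokens, can be illustrated on a toy vocabulary. This is a minimal sketch under stated assumptions: "source-supported" is reduced here to membership in a given set of token ids, and `static_boost`, `supported_ids`, and `delta` are hypothetical names rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def static_boost(logits, supported_ids, delta):
    """Add a fixed bias `delta` to the logit of every token id in `supported_ids`."""
    return [x + delta if i in supported_ids else x for i, x in enumerate(logits)]


# Toy example: three equally likely tokens, token 0 is supported by the context.
logits = [1.0, 1.0, 1.0]
boosted = static_boost(logits, supported_ids={0}, delta=math.log(2))
probs = softmax(boosted)
```

With `delta = ln 2` the supported token's odds double relative to each other token, so its probability rises from 1/3 to 1/2. The context-aware and token-aware variants described in the abstract would replace the constant `delta` with a per-step or per-token value.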
[30] Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu
Main category: cs.CL
Abstract: Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
[31] CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
Rui Zhao, Xuewen Zhong, Xiaoyun Zheng, Jinsong Su, Yidong Chen
Main category: cs.CL
Abstract: Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 up-to-date open-source and proprietary MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
[32] Selective Contrastive Learning For Gloss Free Sign Language Translation
Changhao Lai, Rui Zhao, Xuewen Zhong, Jinsong Su, Yidong Chen
Main category: cs.CL
Abstract: Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.
[33] Measuring and Mitigating Persona Distortions from AI Writing Assistance
Paul Röttger, Kobi Hackenburg, Hannah Rose Kirk, Christopher Summerfield
Main category: cs.CL
Abstract: Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.
[34] Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki
Main category: cs.CL
Abstract: Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator’s scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
[35] RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
Yingfeng Luo, Hongyu Liu, Dingyang Lin, Kaiyan Chang, Chenglong Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu
Main category: cs.CL
Abstract: Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model’s improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose RouteLMT (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator’s prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
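Routing as budget allocation can be sketched as sending the requests with the highest predicted marginal gain to the large model until the budget is used up. The snippet is an illustration only: the gains are assumed to come from some upstream predictor, and `route_by_gain` and `budget_fraction` are hypothetical names, not part of RouteLMT.

```python
def route_by_gain(predicted_gains, budget_fraction):
    """Return the set of request indices to send to the large model.

    `predicted_gains[i]` is the predicted quality improvement of the large
    model over the small model for request i (assumed to be given).
    `budget_fraction` is the share of requests the large model may serve.
    """
    budget = int(len(predicted_gains) * budget_fraction)
    ranked = sorted(
        range(len(predicted_gains)),
        key=lambda i: predicted_gains[i],
        reverse=True,  # highest predicted gain first
    )
    return set(ranked[:budget])


# Hypothetical predicted gains for four requests, with a 50% large-model budget.
gains = [0.1, 0.5, -0.2, 0.3]
to_large_model = route_by_gain(gains, budget_fraction=0.5)
```

Requests 1 and 3 are routed to the large model; request 2, where the large model is predicted to do worse, stays with the small one. A guarded variant might additionally refuse to route any request whose predicted gain falls below a threshold.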
[36] Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners
Haidong Yuan, Haokun Zhao, Wanshi Xu, Songjun Cao, Qingyu Zhou, Long Ma, Hongjie Fan
Main category: cs.CL
Abstract: Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China’s national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO (Diversity Driven Policy Optimization) algorithm, a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.
[37] Using Embedding Models to Improve Probabilistic Race Prediction
Noan Dasanaike, Kosuke Imai
Main category: cs.CL
Abstract: Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
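For readers unfamiliar with BISG, its core update is Bayes' rule: a surname-based prior over groups is multiplied by the probability of the person's location given each group and then renormalized, assuming surname and location are conditionally independent given the group. The sketch below shows only that arithmetic; the group labels and probabilities are made-up placeholders, not Census figures, and the function name is hypothetical.

```python
def bisg_posterior(p_group_given_surname, p_geo_given_group):
    """Combine a surname-based prior with a geographic likelihood via Bayes' rule.

    Both arguments map group label -> probability. Conditional independence of
    surname and location given the group is assumed, as in standard BISG.
    """
    unnormalized = {
        group: p_group_given_surname[group] * p_geo_given_group[group]
        for group in p_group_given_surname
    }
    total = sum(unnormalized.values())
    return {group: value / total for group, value in unnormalized.items()}


# Placeholder inputs: an uninformative surname prior and a geographic likelihood.
prior = {"group_a": 0.5, "group_b": 0.5}
geo_likelihood = {"group_a": 0.2, "group_b": 0.6}
posterior = bisg_posterior(prior, geo_likelihood)
```

With an uninformative prior, the posterior is driven entirely by geography (0.25 versus 0.75 here), which illustrates why a generic prior for unlisted surnames degrades prediction and why a name-informed prior could help.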
[38] Learning Evidence Highlighting for Frozen LLMs
Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li
Main category: cs.CL
Abstract: Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver’s task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
[39] Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
Sheza Munir, Ratna Kandala, Anamta Khan, Deepti, Joyojeet Pal
Main category: cs.CL
Abstract: Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.
[40] From graphemic dependence to lexical structure: a Markovian perspective on Dante’s Commedia
Angelo Maria Sabatini
Main category: cs.CL
Abstract: This study investigates the structural organisation of Dante’s Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.
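As a rough illustration of the V/C encoding described in the abstract above, the following Python sketch maps letters to vowel/consonant symbols and estimates transition probabilities between four states. Treating the four states as the overlapping symbol pairs VV, VC, CV and CC is an assumption made here; the abstract does not define the states or the memory index, and the vowel inventory and function names are hypothetical simplifications rather than the author's method.

```python
from collections import Counter

# Assumed, simplified vowel inventory (including common accented forms).
VOWELS = set("aeiouàèéìíòóùú")

def vc_encode(text):
    """Map each alphabetic character to 'V' or 'C'; drop everything else."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in text.lower()
        if ch.isalpha()
    )

def pair_transition_probs(symbols):
    """Estimate transition probabilities between overlapping pair states.

    Each state is a two-symbol window (e.g. 'CV'); consecutive windows
    overlap by one symbol, so 'CV' can only move to 'VC' or 'VV'.
    """
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    counts = Counter(zip(states, states[1:]))
    totals = Counter()
    for (src, _), n in counts.items():
        totals[src] += n
    return {(src, dst): n / totals[src] for (src, dst), n in counts.items()}

encoded = vc_encode("Nel mezzo")           # 'CVCCVCCV'
probs = pair_transition_probs(encoded)
print(encoded, probs)
```

On a full text, each row of the estimated matrix sums to one, and any persistence-versus-alternation summary would be computed from those rows.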
[41] Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models
Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet
Main category: cs.CL
Abstract: Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a more nuanced understanding of the types of modeling errors that speech encoder models make, and in particular the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. We find that training phoneme classification probes only on a single, typically disadvantaged SG sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.
[42] BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering
Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne
Main category: cs.CL
Abstract: A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
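The token-by-token Bayes update mentioned in the abstract above can be pictured with a small Python sketch. This is only a toy reading of the idea, assuming each retrieved document yields its own next-token distribution; the function names and the probabilities are hypothetical and are not taken from the paper.

```python
def ensemble_step(doc_weights, token_probs_per_doc):
    """Mix per-document next-token distributions using the document weights.

    doc_weights[d] is the current posterior of document d, and
    token_probs_per_doc[d][tok] is p(tok | document d, generated prefix).
    """
    vocab = token_probs_per_doc[0].keys()
    return {
        tok: sum(w * p[tok] for w, p in zip(doc_weights, token_probs_per_doc))
        for tok in vocab
    }

def bayes_update(doc_weights, token_probs_per_doc, chosen):
    """Bayes' rule: new weight is proportional to old weight times likelihood."""
    unnorm = [w * p[chosen] for w, p in zip(doc_weights, token_probs_per_doc)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Made-up example: two documents, uniform prior, two-token vocabulary.
weights = [0.5, 0.5]
per_doc = [
    {"Paris": 0.9, "Rome": 0.1},   # document 0 supports "Paris"
    {"Paris": 0.2, "Rome": 0.8},   # document 1 supports "Rome"
]
mixture = ensemble_step(weights, per_doc)          # Paris ~0.55, Rome ~0.45
weights = bayes_update(weights, per_doc, "Paris")  # document 0 gains weight
print(mixture, weights)
```

After the emitted token agrees with document 0, its weight rises to roughly 0.82, which is the kind of signal that could serve for attribution or pruning of low-weight documents.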
[43] CRAFT: Clustered Regression for Adaptive Filtering of Training data
Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda
Main category: cs.CL
Abstract: Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT’s 75.6 seconds, a 2.8x speedup.
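Stage (i) of the selection described in the abstract above, proportional budget allocation across clusters, can be sketched in a few lines of Python. The sketch assumes the validation sources have already been assigned to k-means clusters; the function name and the largest-remainder rounding rule are assumptions made for illustration, and stage (ii) is not shown.

```python
from collections import Counter

def proportional_budget(validation_cluster_ids, budget):
    """Split a selection budget across clusters in proportion to how many
    validation examples fall in each cluster (largest-remainder rounding)."""
    counts = Counter(validation_cluster_ids)
    total = sum(counts.values())
    exact = {c: budget * n / total for c, n in counts.items()}
    alloc = {c: int(x) for c, x in exact.items()}      # floor of each share
    leftover = budget - sum(alloc.values())
    # Hand the remaining slots to the clusters with the largest fractional part.
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

# Six validation examples in three clusters, budget of ten training pairs.
print(proportional_budget([0, 0, 0, 1, 1, 2], budget=10))
```

With this toy input the clusters receive 5, 3 and 2 slots, so the selected subset mirrors the validation source distribution as closely as whole numbers allow.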
[44] Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Main category: cs.CL
Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to $11.6\times$ fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
[45] Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Ilana Nguyen, Harini Suresh, Thema Monroe-White, Evan Shieh
Main category: cs.CL
Abstract: Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., "American") are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.
[46] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei
Main category: cs.CL
Abstract: The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models’ ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
[47] PL-MTEB: Polish Massive Text Embedding Benchmark
Rafał Poświata, Sławomir Dadas, Michał Perełkiewicz
Main category: cs.CL
Abstract: In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic text similarity. Within the scope of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets used to create four clustering tasks. We evaluated 30 publicly available text embedding models, including Polish and multilingual models. We analyzed the results in detail for specific task types and model sizes. We made the prepared datasets, the source code for evaluation, and the obtained results available to the public at https://github.com/rafalposwiata/pl-mteb.
[48] MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard
Main category: cs.CL
Abstract: Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not limited to large amounts of data, expensive machinery, and lengthy training. To solve this problem, this paper proposes a new tokenization method inspired by universal Lempel-Ziv-Welch data compression that compresses repetitive phrases into multi-word tokens. With MultiTok as a new tokenizing tool, we show that language models can be trained notably more efficiently while offering similar accuracy on more succinct and compressed training data. In fact, our results demonstrate that MultiTok achieves comparable performance to the BERT and GPT standards as both a stand-alone tokenizer and an add-on to existing tokenizers while also providing close to 2.5x faster training with more than 30% less training data.
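To give a feel for how Lempel-Ziv-Welch compression can turn repeated phrases into multi-word tokens, as the abstract above describes, here is textbook LZW applied to a word sequence in Python. It is a sketch of the underlying compression idea only; MultiTok's actual procedure may differ (for example in how the dictionary is seeded or bounded), and the function name is hypothetical.

```python
def lzw_word_tokenize(words):
    """Classic LZW over words: returns the token ids and the learned dictionary.

    Dictionary keys are tuples of words, so longer entries act as
    multi-word tokens. Single words are seeded from the input itself,
    which is a simplification for this example.
    """
    dictionary = {}
    for w in words:
        dictionary.setdefault((w,), len(dictionary))

    output = []
    current = ()
    for w in words:
        candidate = current + (w,)
        if candidate in dictionary:
            current = candidate                  # keep extending the phrase
        else:
            output.append(dictionary[current])   # emit the longest known phrase
            dictionary[candidate] = len(dictionary)
            current = (w,)
    if current:
        output.append(dictionary[current])
    return output, dictionary

ids, table = lzw_word_tokenize("the cat the cat the cat".split())
print(ids)   # [0, 1, 2, 2]: six words encoded as four tokens
```

Here the phrase ("the", "cat") is learned as token 2 after its first occurrence and then reused, which is where the shorter training sequences would come from.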
[49] Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
Wentao Shi, Zichun Yu, Fuli Feng, Xiangnan He, Chenyan Xiong
Main category: cs.CL
Abstract: Monte Carlo Tree Search (MCTS) based methods provide promising approaches for generating synthetic data to enhance the self-training of Large Language Model (LLM) based multi-agent systems (MAS). These methods leverage Q-values to estimate individual agent contributions. However, relying solely on Q-values to identify informative data may misalign with the data synthesis objective, as the focus should be on selecting data that best enhances model training. To address this discrepancy, we propose Data Influence-oriented Tree Search (DITS), a novel framework that incorporates influence scores to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement, thereby enhancing model performance. Furthermore, we derive influence score estimation methods tailored for non-differentiable metrics, significantly reducing computational overhead by utilizing inference computations. Extensive experiments on eight multi-agent datasets demonstrate the robustness and effectiveness of the proposed methods. Notably, our findings reveal that allocating more inference resources to estimate influence scores, rather than Q-values, during data synthesis can more effectively and efficiently enhance model training.
[50] Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression
Jingyu Peng, Maolin Wang, Nan Wang, Jiatong Li, Yuchen Li, Yuyang Ye, Wanyu Wang, Pengyue Jia, Kai Zhang, Xiangyu Zhao
Main category: cs.CL
Abstract: Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.
[51] Language Specific Knowledge: Do Models Know Better in X than in English?
Ishika Agarwal, Nimet Beyza Bozdag, Nisval Patel, Dilek Hakkani-Tür
Main category: cs.CL
Abstract: Often, multilingual language models are trained with the objective of mapping semantically similar content (in different languages) into the same latent space. In this paper, we show a nuance in this training objective, and find that by changing the language of the input query, we can improve the question answering ability of language models. We make two main contributions. First, we introduce the term Language Specific Knowledge (LSK) to denote queries that are best answered in an "expert language" for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection – for some queries, language models can perform better when queried in languages other than English, sometimes even better in low-resource languages – and the goal is to select the optimal language for the query. Second, we introduce a variety of simple to strong baselines to empirically motivate the language selection problem (including one of our own methods called LSKExtractor). During our evaluation, we employ three datasets that contain knowledge about both cultural and social behavioral norms. Overall, the results show that principled language selection can improve the performance of a language model, and that the expected question-to-language map is not always intuitive: Gemma models know most about China and the Middle East in Spanish; Qwen models know most about authority and responsibility in Arabic and Chinese. Broadly, our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.
[52] Toward Automated Robustness Evaluation of Mathematical Reasoning
Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Ma Shuguang, Zhaoqian Dai, Hailiang Huang, Yun Chen, Guanhua Chen
Main category: cs.CL
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework’s extensibility to non-mathematical tasks, highlighting its broad applicability. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model’s robustness.
[53] UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu
Main category: cs.CL
Abstract: Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR$^2$ (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR$^2$ introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR$^2$, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.
[54] Learning from Natural Language Feedback for Personalized Question Answering
Alireza Salemi, Hamed Zamani
Main category: cs.CL
Abstract: Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that are generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark that consists of three diverse domains demonstrates consistent and significant improvements over the state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
[55] StateX: Enhancing RNN Recall via Post-training State Expansion
Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
Main category: cs.CL
Abstract: Recurrent neural networks (RNNs), such as linear attention and state-space models, have gained popularity due to their constant per-token complexity when processing long contexts. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a fixed-size recurrent state. Previous studies have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with large recurrent states results in high training costs. In this paper, we introduce StateX, a post-training framework that efficiently expands the states of pre-trained RNNs. For two popular classes of RNNs, linear attention and state-space models, we design post-training architectural modifications in StateX, to scale up the state size with no or negligible increase in model parameters. Experiments on models with up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning performance of RNNs without incurring high post-training costs or compromising other capabilities.
[56] Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models
Georg Ahnert, Anna-Carolina Haensch, Barbara Plank, Markus Strohmaier
Main category: cs.CL
Abstract: Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text instead. Previous research has used a diverse range of methods for generating closed-ended survey responses with LLMs, and a standard practice remains to be identified. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 million simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall, and that reasoning output does not consistently improve alignment. Our work underlines the significant impact that Survey Response Generation Methods have on simulated survey responses, and we develop practical recommendations on the application of Survey Response Generation Methods.
[57] NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium
Dinghong Song, Jierui Xu, Weichu Yang, Pengfei Su, Dong Li
Main category: cs.CL
Abstract: Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we propose NeuronMLP, an efficient LLM inference method based on Singular Value Decomposition (SVD) compression and tiling on AWS Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. The proposed method is specifically optimized for multi-layer perceptron (MLP) layers in LLMs, which serve as a critical computational kernel for inference on Trainium. Evaluating on nine datasets and six recent LLMs, we show that NeuronMLP significantly outperforms the state-of-the-art Neuron Kernel Interface (NKI)-based matrix multiplication (matmul) kernel implemented by AWS on Trainium: at the kernel level, it achieves an average 1.35x speedup, which translates to an average 1.21x speedup for end-to-end LLM inference, under a compression ratio of 0.05.
[58] Identifying the Periodicity of Information in Natural Language
Yulin Ou, Yu Wang, Yang Xu, Hendrik Buschmeier
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent theoretical advances on information density in natural language have raised the following question: To what degree does natural language exhibit periodic patterns in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. By applying the algorithm to a set of corpora, we have obtained the following results: first, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; second, new periods that are outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome of both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potential for detecting LLM-generated text are further discussed.
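To make the idea of a period in a surprisal sequence concrete, here is a minimal sketch that picks the lag with the highest autocorrelation. It is not the APS algorithm from the paper, whose details are not given in the abstract; the function names, the synthetic surprisal values, and the lag range are all assumptions made for illustration.

```python
import math

def autocorrelation(xs, lag):
    """Normalized autocorrelation of xs at the given lag (mean removed)."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return 0.0
    num = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return num / denom

def dominant_period(xs, min_lag=2, max_lag=None):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    if max_lag is None:
        max_lag = len(xs) // 2
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(xs, lag))

# Hypothetical per-token surprisal values (bits) with a built-in period of 8 tokens.
surprisal = [5.0 + 2.0 * math.sin(2 * math.pi * t / 8) for t in range(200)]
period = dominant_period(surprisal, min_lag=2, max_lag=12)
```

On real text the surprisal values would come from a language model, and a significance test would be needed before trusting any detected period.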
[59] NiuTrans.LMT: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models have significantly advanced Multilingual Machine Translation (MMT), yet scaling to many languages while keeping quality robust across directions remains challenging. In this paper, we identify a failure mode of multilingual supervised fine-tuning (SFT) on multi-way parallel data: when such data are reused symmetrically around a pivot language (e.g., English), performance on reverse directions (X → pivot) can drop substantially. We term this phenomenon Directional Degeneration and attribute it to excessive many-to-one mappings, which encourage shortcut learning. We propose Strategic Downsampling (SD), a simple yet effective method to mitigate this degeneration. In addition, we introduce Parallel Multilingual Prompting (PMP), which augments translation instructions with an auxiliary parallel sentence to promote cross-lingual transfer during training and enables optional test-time enhancement when auxiliary translations are available. We further develop NiuTrans.LMT (Large-scale Multilingual Translation, abbreviated as LMT), a Chinese-English-centric suite of multilingual translation models spanning four sizes (0.6B/1.7B/4B/8B) and covering 60 languages and 234 directions. Comprehensive evaluations show that LMT is competitive among open-source MMT systems, and that our 4B LMT model performs on par with or better than substantially larger baselines. We release our models and project resources to support inclusive and scalable MMT.
[60] Selective Rotary Position Embedding
Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
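The contrast between fixed-angle and input-dependent rotations can be sketched in two dimensions. The fixed-angle part follows standard RoPE, where position t is rotated by t times a constant angle. The cumulative-angle part is only an assumed illustration of what an input-dependent rotation might look like; the abstract does not specify the paper's parameterization, and the per-token step angles below are hypothetical.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def fixed_angles(num_positions, theta):
    """Standard RoPE: position t is rotated by t * theta."""
    return [t * theta for t in range(num_positions)]

def cumulative_angles(step_angles):
    """Assumed input-dependent variant: angle at t is the running sum of per-token steps."""
    total, out = 0.0, []
    for step in step_angles:
        total += step
        out.append(total)
    return out

q, k = (1.0, 0.5), (0.3, -0.8)

# Fixed-angle case: the query-key score depends only on the offset between positions.
angles = fixed_angles(10, theta=0.1)
score_a = dot(rotate(q, angles[5]), rotate(k, angles[2]))
score_b = dot(rotate(q, angles[8]), rotate(k, angles[5]))

# Input-dependent case: the score depends on the accumulated angle between the two tokens.
steps = [0.1, 0.4, 0.05, 0.3, 0.2, 0.15]  # hypothetical, would be computed from the inputs
sel = cumulative_angles(steps)
score_sel = dot(rotate(q, sel[4]), rotate(k, sel[1]))
score_ref = dot(rotate(q, sel[4] - sel[1]), k)
```

In both cases the score is a function of the angle difference alone, which is why rotating queries and keys separately still yields a relative encoding.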
[61] Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations
Marco Baroni, Emily Cheng, Iria de-Dios-Flores, Francesca Franzon
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We explore intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity. Specifically, we test whether ID differences across model layers reflect well-known complexity contrasts established in (psycho)linguistics: coordination vs. subordination, right-branching vs. center-embedding, and unambiguous vs. ambiguous attachment. Our results on six different LLMs show that these contrasts are consistently reflected in ID differences, with more complex phenomena eliciting higher ID profiles. Notably, ID differences emerge at different points across layers for different contrasts, also reaching their peaks at different stages. Further experiments using representational similarity and layer pruning confirm the trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it points to similar linguistic processing steps across disparate LLMs, and that it has the potential to differentiate between different types of complexity.
[62] From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models
Youmi Ma, Naoaki Okazaki
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model performance remains unexplored. This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. Specifically, we propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. This mechanism-based approach achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Experiments across four models in three families demonstrate that RetMask consistently improves long-context performance, where gains correlate with the sparsity of the retrieval score distribution: models with sparser distributions, where retrieval capabilities are concentrated in a small set of heads, respond more strongly, while those with less sparse distributions show more modest gains. These results validate the functional role of retrieval heads and show that mechanistic insights can be transformed into performance enhancements.
[63] System-Mediated Attention Imbalances Make Vision-Language Models Say Yes
Tsan Tsai Chan, Varsha Suresh, Anisha Saha, Michael Hahn, Vera Demberg
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond ‘yes’. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.
[64] The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check
Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.12979: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.12979&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[65] One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization
Franziska Weeber, Vera Neplenbroek, Jan Batzner, Sebastian Padó
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18572: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18572&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[66] DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis
Lung-Hao Lee, Liang-Chih Yu, Natalia Loukashevich, Ilseyar Alimova, Alexander Panchenko, Tzu-Mi Lin, Zhe-Yu Xu, Jian-Yu Zhou, Guangmin Zheng, Jin Wang, Sharanya Awasthi, Jonas Becker, Jan Philip Wahle, Terry Ruas, Shamsuddeen Hassan Muhammad, Saif M. Mohammad
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.23022: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.23022&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[67] Multi-Token Prediction via Self-Distillation
John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.06019: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06019&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[68] AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11931: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11931&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[69] Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.18734: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18734&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[70] Sensory-Aware Sequential Recommendation via Review-Distilled Representations
Yeo Chan Yoon, Chanjun Park, Kyuhan Koh
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.02709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[71] HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.04855: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04855&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[72] Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.28258: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28258&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[73] Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Chao Xue, Yao Wang, Mengqiao Liu, Di Liang, Xingsheng Han, Peiyang Liu, Xianjie Wu, Chenyao Lu, Lei Jiang, Yu Lu, Haibo Shi, Shuang Liang, Minlong Peng, Flora D. Salim
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10079: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10079&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[74] EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation
Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi, Alessandro Tosi, Antonio Cristiano, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eric Risuleo, Alessia Longo, Giulia Vojvodic, Mariapia Vassalli, Bianca Destro Castaniti, Nicolò Scarsi, Manuel Del Medico
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14306: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14306&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[75] CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction
Sizhe Wang, Ziqi Xu, Claire Najjuuko, Charles Alba, Chenyang Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14651: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14651&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[76] GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution
Nitin Choudhury, Bikrant Bikram Pratap Maurya, Bhavinkumar Vinodbhai Kuwar, Arun Balaji Buduru
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[77] Bolzano: Case Studies in LLM-Assisted Mathematical Research
Martin Balko, Jan Grebík, Pavel Hubáček, Martin Koutecký, Matěj Kripner, Václav Rozhoň, Robert Šámal, Adrián Zámečník
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16989: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16989&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[78] Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data
Yura Yoshida, Masato Kanai, Masataka Nakayama, Haruki Ohsawa, Yukiko Uchida, Arata Yuminaga, Gakuse Hoshina, Nobuo Sayama
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[79] UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents
Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, Maosong Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.07038: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07038&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[80] SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios
Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.22097: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22097&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[81] Machine learning and emoji prediction: How much accuracy can MARBERT achieve?
Mohammed Q. Shormani, Ibrahim Abdulmalik Hassan Muneef Y. Alshawsh
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.21108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[82] “This Wasn’t Made for Me”: Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias
Siyu Liang, Alicia Beckford Wassink
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.21148: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21148&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[83] VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.21375: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21375&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[84] Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao, Jason Klein Liu, Haoming Luo, Yifan Luo, Zhengmao Ye, Tingwen Liu, Xin Zhao, Ran Tao, Bryan Dai
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.21724: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21724&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[85] Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation
Jaskaran Singh Walia, Aarush Sinha, Naman Saraswat, Srinitish Srinivasan, Srihari Unnikrishnan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2502.17011: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.17011&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[86] Intrinsic Fingerprint of LLMs: Continue Training is NOT All You Need to Steal A Model!
Do-hyeon Yoon, Minsoo Chun, Thomas Allen, Hans Müller, Min Wang, Rajesh Sharma
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.03014: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03014&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[87] Atlas-Alignment: Making Interpretability Transferable Across Language Models
Bruno Puri, Jim Berend, Sebastian Lapuschkin, Wojciech Samek
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.27413: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27413&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[88] How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.18903: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18903&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[89] LLMs as Assessors: Right for the Right Reason?
Sourav Saha, Mandar Mitra, Aditya Dutta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.08919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[90] LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs
Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.17681: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17681&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[91] SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates
Rajveer Singh
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10649: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10649&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[92] Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17338: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17338&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[93] Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM
Sravanth Kodavanti, Sowmya Vajrala, Srinivas Miriyala, Utsav Tiwari, Uttam Kumar, Utkarsh Kumar Mahawar, Achal Pratap Singh, Arya D, Narendra Mutyala, Vikram Nelvoy Rajendiran, Sharan Kumar Allur, Euntaik Lee, Dohyoung Kim, HyeonSu Lee, Gyusung Cho, JungBae Kim
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18655: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18655&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[94] Forecasting Solar Energy Using a Single Image
Jeremy Klotz, Shree K. Nayar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Solar panels are increasingly deployed in cities on rooftops, walls, and urban infrastructure. Although the panel costs have fallen in recent years, the soft costs of installing them have not. These soft costs include assessing the illumination (irradiance) of a panel, which is typically performed using a 3D model that fails to capture small nearby structures that impact the irradiance. Our approach uses a single image taken at the panel’s location to forecast its irradiance at any time in the future. We use visual cues in the image to find the camera’s orientation and the portion of the sky visible to the panel in order to forecast the irradiance due to the sun and the sky. In addition, we show that the irradiance due to reflections from nearby buildings varies smoothly over time and can be forecasted from the image. This approach enables assessing the solar energy potential of any surface and forecasting the temporal variation of a panel’s irradiance. We validate our approach using real irradiance measurements in urban canyons. We show that our approach often yields more accurate irradiance forecasts compared to conventional irradiance-based transposition methods and 3D model-based simulations. We also show that a single spherical image can be used to find the best fixed orientation of a panel. Finally, we present Solaris, a device to capture the image seen by a panel in a variety of urban settings.
[95] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer
Filippo Ruffini, Camillo Maria Caruso, Claudia Tacconi, Lorenzo Nibid, Francesca Miccolis, Marta Lovino, Carlo Greco, Edy Ippolito, Michele Fiore, Alessio Cortellini, Bruno Beomonte Zobel, Giuseppe Perrone, Bruno Vincenzi, Claudio Marrocco, Alessandro Bria, Elisa Ficarra, Sara Ramella, Valerio Guarrasi, Paolo Soda
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires integrating clinical, radiological, and histopathological data. Multimodal Deep Learning (MDL) can improve precision prognosis, but small cohorts and missing modalities limit its clinical applicability, as conventional approaches enforce complete case filtering or imputation. We present a missing-aware multimodal survival framework that combines Computed Tomography (CT), Whole-Slide Histopathology Images (WSI), and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. The framework uses Foundation Models (FMs) for modality-specific feature extraction and a missing-aware encoding strategy that enables intermediate multimodal fusion under naturally incomplete modality profiles. By design, the architecture processes all available data without dropping patients during training or inference. Intermediate fusion outperforms unimodal baselines and both early and late fusion strategies, with the trimodal configuration reaching a C-index of 74.42. Modality-importance analyses show that the fusion model adapts its reliance on each data stream according to representation informativeness, shaped by the alignment between FM pretraining objectives and the survival task. The learned risk scores produce clinically meaningful stratification of disease progression and metastatic risk, with statistically significant log-rank tests across all modality combinations, supporting the translational relevance of the proposed framework.
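For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index on a toy cohort. It is a generic textbook formulation, not the evaluation code used in the paper; the cohort values are hypothetical, and the result is a fraction between 0 and 1 (the abstract's 74.42 appears to be on a percentage scale, which is an assumption here).

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs whose risk ordering is correct.

    A pair (i, j) is comparable when subject i has the shorter time and an
    observed event. A higher predicted risk should go with shorter survival.
    Ties in predicted risk count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy cohort: time in months, event flag (1 = death observed), predicted risk.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.4, 0.6, 0.3, 0.1]
c_index = concordance_index(times, events, risks)  # 7 of 8 comparable pairs are concordant
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking; censored subjects only contribute as the longer-surviving member of a pair.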
[96] Soft Anisotropic Diagrams for Differentiable Image Representation
Laki Iinbor, Zhiyang Dou, Wojciech Matusik
Main category: cs.CV
Abstract: We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership. Such a formulation enables efficient rendering by maintaining a per-query top-K map that approximates nearest neighbors under the same shading score, allowing GPU-friendly, fixed-size local computation. We update this list using our top-K propagation scheme inspired by jump flooding, augmented with stochastic injection to provide probabilistic global coverage. Training follows a GPU-first pipeline with gradient-weighted initialization, Adam optimization, and adaptive budget control through densification and pruning. Across standard benchmarks, SAD consistently outperforms Image-GS and Instant-NGP at matched bitrate. On Kodak, SAD reaches 46.0 dB PSNR with 2.2 s encoding time (vs. 28 s for Image-GS), and delivers 4-19 times end-to-end training speedups over state-of-the-art baselines. We demonstrate the effectiveness of SAD by showcasing the seamless integration with differentiable pipelines for forward and inverse problems, efficiency of fast random access, and compact storage.
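The per-pixel blend described in the SAD abstract above can be sketched in a few lines. The abstract does not give the exact score, so this sketch assumes a score of the form sqrt((x - p)^T M (x - p)) - w with a softmax over negative scores scaled by a per-site temperature. The dictionary keys (`pos`, `metric`, `weight`, `temperature`, `color`) are hypothetical names, and the brute-force sort stands in for the paper's top-K propagation scheme.

```python
import math

def site_score(pixel, site):
    """Anisotropic, additively weighted distance from a pixel to a site (assumed form)."""
    dx = pixel[0] - site["pos"][0]
    dy = pixel[1] - site["pos"][1]
    (a, b), (_, c) = site["metric"]  # symmetric 2x2 metric [[a, b], [b, c]]
    quad = a * dx * dx + 2.0 * b * dx * dy + c * dy * dy
    return math.sqrt(max(quad, 0.0)) - site["weight"]

def blend_pixel(pixel, sites, k=2):
    """Softmax blend of the colors of the k best-scoring sites."""
    nearest = sorted(sites, key=lambda s: site_score(pixel, s))[:k]
    logits = [-site_score(pixel, s) / s["temperature"] for s in nearest]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    color = tuple(
        sum(w * s["color"][ch] for w, s in zip(weights, nearest)) for ch in range(3)
    )
    return color, weights

identity = ((1.0, 0.0), (0.0, 1.0))
sites = [
    {"pos": (0.0, 0.0), "metric": identity, "weight": 0.0,
     "temperature": 0.5, "color": (1.0, 0.0, 0.0)},
    {"pos": (10.0, 0.0), "metric": identity, "weight": 0.0,
     "temperature": 0.5, "color": (0.0, 0.0, 1.0)},
]
```

Lower temperatures sharpen the boundary between sites, while higher ones give smoother blends and, in a differentiable setting, gradients that reach more sites.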
[97] EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms
Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar
Main category: cs.CV
Abstract: This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA’s Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).
[98] H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
Main category: cs.CV
Abstract: Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.
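The idea of detecting interacting feature pairs from the input Hessian can be illustrated on a toy function: a non-zero mixed second derivative between inputs i and j means the two features jointly influence the output. The sketch below is an illustration only. It swaps in central finite differences and a hand-written `toy_model` for the classifier Hessians used in the paper, and all names and the threshold are assumptions.

```python
def mixed_partial(f, x, i, j, h=1e-3):
    """Central finite-difference estimate of d^2 f / (dx_i dx_j) at x, for i != j."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)

def interacting_pairs(f, x, threshold=1e-4):
    """Return index pairs whose mixed partial exceeds the threshold in magnitude."""
    pairs = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if abs(mixed_partial(f, x, i, j)) > threshold:
                pairs.append((i, j))
    return pairs

def toy_model(x):
    # Features 0 and 1 interact multiplicatively; feature 2 acts on its own.
    return x[0] * x[1] + x[2]
```

Here `interacting_pairs(toy_model, [0.5, -1.0, 2.0])` flags only the pair (0, 1), which a purely marginal attribution would not reveal.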
[99] FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision
Nathan Shankar, Pawel Ladosz, Hujun Yin
Main category: cs.CV
Abstract: Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training-free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means (NLM) denoising, which tends to over-smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post-smoothing. The search engine employs unit-hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. The method is benchmarked on the Low Light paired dataset (LOL), and results show marked improvements over existing methods that were not specifically trained on this dataset.
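Two of the enhancement stages named in the FLARE-BO abstract above, gamma correction and Grey-World white balance, are simple enough to sketch in plain Python. This is a minimal illustration on a list of RGB tuples in [0, 1]. The function names, the clipping behaviour, and the order of operations are assumptions, and the Bayesian optimisation that selects the parameters is not shown.

```python
def gamma_correct(image, gamma):
    """Apply v ** gamma per channel; gamma < 1 brightens dark regions."""
    return [tuple(min(max(c, 0.0), 1.0) ** gamma for c in px) for px in image]

def grey_world_balance(image):
    """Scale each channel so its mean matches the mean over all three channels."""
    n = len(image)
    means = [sum(px[ch] for px in image) / n for ch in range(3)]
    grey = sum(means) / 3.0
    gains = [grey / m if m > 0 else 1.0 for m in means]
    # Clip to 1.0 so strongly boosted channels stay in range.
    return [tuple(min(px[ch] * gains[ch], 1.0) for ch in range(3)) for px in image]
```

In a per-image optimisation loop, `gamma` would be one of the searched parameters, while the Grey-World gains follow directly from the image statistics.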
[100] Robust Camera-to-Mocap Calibration and Verification for Large-Scale Multi-Camera Data Capture
Tianyi Liu, Christopher Twigg, Patrick Grady, Kevin Harris, Shangchen Han, Kun He
Main category: cs.CV
Abstract: Optical motion capture (mocap) systems are widely used for ground-truth capture in AR/VR, SLAM and robotics datasets. These datasets require extrinsic calibration to align mocap coordinates to external camera frames – a step that is subject to multiple sources of error in practice, and failures often go undetected until they corrupt downstream data. These issues are compounded for fisheye cameras, where spatially non-uniform distortion makes both calibration and verification more challenging. We present a calibration and verification system designed for this setting. Concretely, we target robustness to board-to-marker attachment variation, optimization initialization ambiguity, and session-to-session calibration drift after deployment. The calibration jointly estimates camera extrinsics and the board-to-marker transform, and uses a staged solver to improve convergence reliability under ambiguous initialization. The verification component, lollypop, provides fast, operator-independent assessment through a measurement chain entirely independent of the calibration data. In experiments on a Meta Quest 3 headset with fisheye cameras, our calibration outperforms existing benchwork, and lollypop reliably detects calibration degradation over time. The system has been deployed in production data collection pipelines.
[101] PAGaS: Pixel-Aligned 1DoF Gaussian Splatting for Depth Refinement
David Recasens, Robert Maier, Aljaz Bozic, Stephane Grabli, Javier Civera, Tony Tung, Edmond Boyer
Main category: cs.CV
Abstract: Gaussian Splatting (GS) has emerged as an efficient approach for high-quality novel view synthesis. While early GS variants struggled to accurately model the scene’s geometry, recent advancements constraining the Gaussians’ spread and shapes, such as 2D Gaussian Splatting, have significantly improved geometric fidelity. In this paper, we present Pixel-Aligned 1DoF Gaussian Splatting (PAGaS) that adapts the GS representation from novel view synthesis to the multi-view stereo depth task. Our key contribution is modeling a pixel’s depth using one-degree-of-freedom (1DoF) Gaussians that remain tightly constrained during optimization. Unlike existing approaches, our Gaussians’ positions and sizes are restricted by the back-projected pixel volumes, leaving depth as the sole degree of freedom to optimize. PAGaS produces highly detailed depths, as illustrated in Figure 1. We quantitatively validate these improvements on top of reference geometric and learning-based multi-view stereo baselines on challenging 3D reconstruction benchmarks. Code: davidrecasens.github.io/pagas
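The one-degree-of-freedom idea in the PAGaS abstract above can be made concrete with a pinhole camera: once a pixel is fixed, its 3D position along the viewing ray depends only on depth. The sketch below assumes a standard pinhole model with intrinsics `fx`, `fy`, `cx`, `cy`. The `pixel_footprint` helper is a hypothetical stand-in for the back-projected pixel volume that bounds each Gaussian's size, not the paper's formulation.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth into camera coordinates."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

def pixel_footprint(depth, fx):
    """Approximate width of one pixel at the given depth (grows linearly with depth)."""
    return depth / fx
```

With `u`, `v`, and the intrinsics held fixed, optimising `depth` moves the point along a single ray, which is the sole free parameter the abstract describes.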
[102] Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification
Gökdeniz Ersoy, Mehmet Alper Tatar, Eray Tonbul, Serap Kırbız
Main category: cs.CV
Abstract: Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.
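For the EAR thresholds discussed in the abstract above, the widely used six-landmark Eye Aspect Ratio is EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), where p1 and p4 are the eye corners. The sketch below computes it and derives a driver-specific threshold from a short open-eye calibration. The calibration rule (a fixed fraction of the driver's mean open-eye EAR) and the `ratio` value are assumptions for illustration, since the abstract does not specify how thresholds are calibrated.

```python
import math

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR from six (x, y) eye landmarks; p1 and p4 are the horizontal corners."""
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)

def personalized_threshold(open_eye_ears, ratio=0.75):
    """Hypothetical rule: a fixed fraction of the driver's mean open-eye EAR."""
    baseline = sum(open_eye_ears) / len(open_eye_ears)
    return ratio * baseline

def eyes_closed(ear, threshold):
    """Flag a frame as closed-eye when EAR falls below the driver's threshold."""
    return ear < threshold
```

A driver with naturally narrow eyes gets a lower threshold than a fixed global value would give, which is the motivation for per-driver calibration; the same pattern would apply to MAR for yawning.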
[103] Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography
Tania Haghighi, Sina Gholami, Hamed Tabkhi, Minhaj Nur Alam
Main category: cs.CV
Abstract: Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.
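The abstract above localizes anomalies through reconstruction discrepancies and reports Dice and IoU for the resulting segmentation. A minimal sketch of that evaluation chain on flattened pixel lists follows. The absolute-difference anomaly map and the fixed threshold are assumptions, since the abstract does not state how discrepancies are scored or binarized.

```python
def anomaly_mask(image, reconstruction, threshold):
    """Mark pixels whose absolute reconstruction error exceeds the threshold."""
    return [1 if abs(a - b) > threshold else 0 for a, b in zip(image, reconstruction)]

def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat 0/1 lists."""
    inter = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    dice = 2.0 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

Dice is always at least as large as IoU for the same masks, which is consistent with the ordering of the Dice and mIoU figures quoted in the abstract.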
[104] GenMatter: Perceiving Physical Objects with Generative Matter Models
Eric Li, Arijit Dasgupta, Yoni Friedman, Mathieu Huot, Vikash Mansinghka, Thomas O’Connell, William T. Freeman, Joshua B. Tenenbaum
Main category: cs.CV
Abstract: Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
[105] Nuclear Diffusion Models for Low-Rank Background Suppression in Videos
Tristan S. W. Stevens, Oisín Nolan, Jean-Luc Robert, Ruud J. G. van Sloun
Main category: cs.CV
Abstract: Video sequences often contain structured noise and background artifacts that obscure dynamic content, posing challenges for accurate analysis and restoration. Robust principal component methods address this by decomposing data into low-rank and sparse components. Still, the sparsity assumption often fails to capture the rich variability present in real video data. To overcome this limitation, a hybrid framework that integrates low-rank temporal modeling with diffusion posterior sampling is proposed. The proposed method, Nuclear Diffusion, is evaluated on a real-world medical imaging problem, namely cardiac ultrasound dehazing, and demonstrates improved dehazing performance compared to traditional RPCA concerning contrast enhancement (gCNR) and signal preservation (KS statistic). These results highlight the potential of combining model-based temporal models with deep generative priors for high-fidelity video restoration.
[106] SAMIDARE: Advanced Tracking-by-Segmentation for Dense Scenarios
Shozaburo Hirano, Norimichi Ukita
Main category: cs.CV
Abstract: Automated sports analysis demands robust multi-object tracking (MOT), yet segmentation-based methods often struggle with mask errors and ID switches in dense scenes. We propose SAMIDARE, a framework that enhances SAM2MOT for crowded scenes through three key components: (1) density-aware mask re-generation and (2) selective memory updates, both for adaptive mask control to preserve target feature integrity, and (3) state-aware association and new track initialization, which improves robustness under mutual occlusions and frequent frame-out events. Evaluated on the SportsMOT dataset, SAMIDARE achieves state-of-the-art performance, outperforming the baseline by 2.5 HOTA and 4.2 IDF1 points on the validation set. These results demonstrate that adaptive feature management using mask control and state-aware association provide a robust and efficient solution for dense sports tracking. Code is available at https://github.com/ZabuZabuZabu/SAMIDARE
[107] Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models
Masato Soga, Ryuki Takebayashi
Main category: cs.CV
Abstract: Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person’s motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.
[108] Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery
Jingyuan Xia, Ruikang Hu, Ye Li, Zhixiong Yang, Xu Lan, Zhejun Lu
Main category: cs.CV
Abstract: Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.
[109] Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
Peibo Song, Xiaotian Xue, Jinshuo Zhang, Zihao Wang, Jinhua Liu, Shujun Fu, Fangxun Bao, Si Yong Yeo
Main category: cs.CV
Abstract: Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME
[110] EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting
Feiyu An, Yufei Deng, Zihui Zhang, Rong Xiao
Main category: cs.CV
Abstract: Sharp 3D reconstruction from motion-blurred images alone is challenging, which has motivated recent methods to incorporate event cameras for their microsecond temporal resolution. However, these methods suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to jointly optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) on the fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, so that optimizing one benefits the other. Experiments demonstrate the leading performance of our EvFlow-GS.
[111] From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification
Aotian Zheng, Winston Sun, Bahaa Alattar, Vitaly Ablavsky, Jenq-Neng Hwang
Main category: cs.CV
Abstract: CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP’s text embedding space – emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions – synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal – with SAGA’s advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA’s aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID.
[112] CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang
Main category: cs.CV
Abstract: Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
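The information-invariance principle in the CharTide abstract above suggests a reward equal to the fraction of atomic questions whose answers on the generated chart match the answers on the original. The sketch below is one possible reading of that idea. The `inspector` callable, its `(chart, question)` signature, and the exact-match comparison are hypothetical, and the dictionary-backed stub stands in for a frozen VLM.

```python
def inquiry_reward(questions, reference_answers, inspector, generated_chart):
    """Fraction of questions answered identically on the generated chart."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for question, reference in zip(questions, reference_answers)
        if inspector(generated_chart, question) == reference
    )
    return correct / len(questions)

def stub_inspector(chart, question):
    # Stand-in for a frozen Inspector model: looks answers up in a dict.
    return chart.get(question)

questions = ["chart type?", "number of series?", "max value?", "x-axis label?"]
reference = ["bar", 3, 42, "year"]
generated = {"chart type?": "bar", "number of series?": 3,
             "max value?": 40, "x-axis label?": "year"}
```

With the toy data, three of four answers agree, so the reward is 0.75; a chart that reproduces every queried fact would score 1.0 regardless of how its code is written.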
[113] ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
Hanyu Chen, Ruojin Cai, Steve Marschner, Noah Snavely
Main category: cs.CV
Abstract: Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane’s orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.
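A reflectional symmetry in 3D is fully described by a plane, and the signed distance of each scene point to that plane is the quantity the abstract above uses to localize it. The sketch below shows the underlying geometry for a plane written as n · x = d. It is a generic illustration with assumed function names, not the ArchSym detector, which predicts signed distance maps from a single image.

```python
import math

def signed_distance(point, normal, offset):
    """Signed distance from a 3D point to the plane normal . x = offset."""
    norm = math.sqrt(sum(n * n for n in normal))
    return (sum(p * n for p, n in zip(point, normal)) - offset) / norm

def reflect(point, normal, offset):
    """Mirror a 3D point across the plane normal . x = offset."""
    norm = math.sqrt(sum(n * n for n in normal))
    unit = [n / norm for n in normal]
    dist = signed_distance(point, normal, offset)
    return tuple(p - 2.0 * dist * u for p, u in zip(point, unit))
```

Points with signed distance zero lie on the symmetry plane, and a point and its mirror image have distances of equal magnitude and opposite sign.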
[114] Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework
Chunpeng Wang, Binyan Qu, Xiaoyu Wang, Zhiqiu Xia, Shanshan Zhang, Yunan Liu, Qi Li
Main category: cs.CV
Abstract: Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance and hindered further advances in the field. In this paper, we propose FMDiffWA, a frequency-domain modulated diffusion framework for watermark attacks. Specifically, we introduce a frequency-domain watermark modulation (FWM) module and incorporate it into the sampling stages of both the forward and reverse diffusion processes. This mechanism enables selective modulation of watermark-related frequency components, thereby allowing FMDiffWA to effectively neutralize the invisible watermark signals while preserving the perceptual quality of the attacked watermarked images. To achieve a better trade-off between attack efficacy and visual fidelity, we reformulate the training strategy of conventional diffusion models by augmenting the canonical noise estimation objective with an auxiliary refinement constraint. Comprehensive experiments demonstrate that FMDiffWA achieves superior visual fidelity compared to existing watermark attacks, while exhibiting strong generalization across diverse watermarking schemes.
[115] Towards Temporal Compositional Reasoning in Long-Form Sports Videos
Siyu Cao, Lu Zhang, Ruizhe Zeng, Zhi-yong Liu
Main category: cs.CV
Abstract: Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
[116] OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space
Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng, Cheng-zhong Xu, Yida Wang, Kun Zhan, Jianbing Shen
Main category: cs.CV
Abstract: Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a “scenario director”, OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions, ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
[117] Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv
Main category: cs.CV
Abstract: Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
[118] CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
Main category: cs.CV
Abstract: Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
[119] Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun
Main category: cs.CV
Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
[120] DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim
Main category: cs.CV
Abstract: Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model’s level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
[121] Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment
Jai G Singla, Hinal B Patel, Nitant Dube
Main category: cs.CV
Abstract: Synthetic image generation is one of the crucial inputs for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems and plan exploration activities in a virtual environment before actual deployment. Image simulation is essential for assessing landing sites, detecting hazards, and validating navigation systems in a mission. This study offers a detailed evaluation of various image simulation approaches for the lunar environment, with particular emphasis on the effects of different camera models and illumination conditions on the quality of synthetic lunar images. These images are produced using real Digital Elevation Models (DEM) and terrain data derived from instruments such as the Chandrayaan-2 Orbiter High Resolution Camera (OHRC) and NASA’s Wide Angle Camera (WAC) and Narrow Angle Camera (NAC). This research aims to improve the reliability of synthetic imagery in supporting autonomous navigation and decision-making systems in lunar exploration. This work contributes to the development of more effective tools for generating important information for future lunar missions and enhances the understanding of the Moon’s surface environment.
[122] Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation
Ran Zhao, Sheng Jin, Size Wu, Kang Liao, Zerui Gong, Zujin Guo, Yang Xiao, Wei Li
Main category: cs.CV
Abstract: Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and code are publicly available at https://github.com/zhaoran66/KVBench.
[123] Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
Jeonggon Kim, Heejoon Moon, Je Hyeong Hong
Main category: cs.CV
Abstract: Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attacks. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint’s location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL’s robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.
[124] Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
Lomash Relia, Jai G Singla, Amitabh, Nitant Dube
Main category: cs.CV
Abstract: This study analyses simulated and real-world implementations of depth-aware rover navigation, highlighting the transition from stereo vision to monocular depth estimation using edge AI. A Unity-based lunar terrain simulator with stereo cameras and OpenCV’s StereoSGBM was used to generate disparity maps. A physical rover built on Raspberry Pi 4 employed UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection. While stereo vision yielded higher accuracy in simulation, the monocular approach proved more robust and cost-effective in real-world deployment, achieving 0.1 FPS for depth and 10 FPS for detection.
[125] ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding
Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu, Zhenghui Zhao, Pedram Ghamisi, Jun Zhou, Jón Atli Benediktsson
Main category: cs.CV
Abstract: Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a “statistics-first, generation-later” paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at https://sundongwei.github.io/changequery/.
[126] FILTR: Extracting Topological Features from Pretrained 3D Models
Louis Martinez, Maks Ovsjanikov
Main category: cs.CV
Abstract: Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape’s multiscale structure. In this paper, we ask whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.
[127] Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
Yunsong Wang, Gim Hee Lee
Main category: cs.CV
Abstract: Handling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency.
[128] PoseFM: Relative Camera Pose Estimation Through Flow Matching
Dominik Kuczkowski, Laura Ruotsalainen
Main category: cs.CV
Abstract: Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at https://github.com/helsinki-sda-group/posefm.
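The abstract describes transforming noise into a pose prediction by following a continuous-time ODE. A minimal sketch of that sampling step is given below, using forward Euler integration. The velocity function here is a toy stand-in for a learned network: it is the straight-path flow-matching field toward a single known target, chosen only so the example is self-contained. The 6-value pose vector, the step count, and all names are assumptions for illustration, not details from the paper.

```python
import random

def euler_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Stand-in for a learned model: straight-path field toward one fixed target."""
    def velocity(x, t):
        return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]
    return velocity

# Hypothetical relative pose (3 translation + 3 rotation values) used as the target.
target_pose = [0.1, 0.0, 0.5, 0.0, 0.02, 0.0]
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(6)]
pose = euler_sample(toy_velocity(target_pose), noise)
```

With a learned velocity network in place of the toy field, drawing several noise samples and integrating each one would yield a set of pose hypotheses whose spread can serve as an uncertainty estimate.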
[129] One Shot Learning for Edge Detection on Point Clouds
Zhikun Tu, Yuhe Zhang, Yiou Jia, Kang Li, Daniel Cohen-Or
Main category: cs.CV
Abstract: Each scanner has its own characteristics and its own distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less effective than training it on data specific to a single scanner. Therefore, we present a novel one-shot learning method for edge extraction on point clouds that learns the specific data distribution of the target point cloud, thereby achieving superior results compared to networks that were trained on general data distributions. More specifically, we present how to train a lightweight network named OSFENet (One-Shot edge Feature Extraction Network), by designing a filtered-KNN-based surface patch representation that supports a one-shot learning framework. Additionally, we introduce an RBF_DoS module, which integrates a Radial Basis Function-based Descriptor of the Surface patch and is highly beneficial for edge extraction on point clouds. The advantage of the proposed OSFENet is demonstrated through comparative analyses against 7 baselines on the ABC dataset, and its practical utility is validated by results across diverse real-scanned datasets, including indoor scenes such as the S3DIS dataset and outdoor scenes such as the Semantic3D and UrbanBIS datasets.
[130] Efficient Diffusion Distillation via Embedding Loss
Jincheng Ying, Yitao Chen, Li Wenlin, Minghui Xu, Yinhao Xiao
Main category: cs.CV
Abstract: Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher’s performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
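To make the distribution-matching idea concrete, the sketch below computes a squared Maximum Mean Discrepancy between two sets of feature vectors. The abstract does not state which kernel or estimator is used, so the RBF kernel, the bandwidth, and the biased (V-statistic) estimator are assumptions. In the paper the vectors would be embeddings from randomly initialized networks; plain tuples stand in for them here.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical embedded features for generated and real samples.
generated = [(0.0, 0.0), (1.0, 0.0)]
real = [(5.0, 5.0), (6.0, 5.0)]
gap = mmd_squared(generated, real)
same = mmd_squared(generated, generated)
```

The estimate is zero when both samples coincide and grows as the two feature distributions move apart, which is the property a distillation loss of this kind relies on.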
[131] HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos
Xu Lu, Qianhong Peng, Qihao Zhou, Shaopeng Liu, Xiuqin Ye, Chuan Yang, Yuan Yuan
Main category: cs.CV
Abstract: Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.
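The interval-based clip sampling with a varying starting point can be sketched as follows. The abstract does not spell out the actual heuristic, so the rule used here (shift the start frame by one each epoch, cycling within one stride) is purely an assumption for illustration, as are the function names and parameters.

```python
def sample_clip(num_frames, clip_len, start):
    """Pick clip_len frame indices at a fixed interval, beginning at `start`."""
    stride = max(num_frames // clip_len, 1)
    return [(start + i * stride) % num_frames for i in range(clip_len)]

def clip_for_epoch(num_frames, clip_len, epoch):
    """Hypothetical rule: move the start frame each epoch so clips cover new frames."""
    stride = max(num_frames // clip_len, 1)
    return sample_clip(num_frames, clip_len, start=epoch % stride)

# A 12-frame video sampled as 4-frame clips over three epochs.
clips = [clip_for_epoch(12, 4, epoch) for epoch in range(3)]
covered = sorted(index for clip in clips for index in clip)
```

Each clip skips redundant neighboring frames, while the shifting start means that, over successive epochs, every frame of the video is eventually seen.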
[132] Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition
Shunpeng Chen, Yukun Song, Changwei Wang, Rongtao Xu, Kexue Fu, Longxiang Gao, Li Guo, Ruisheng Wang, Shibiao Xu
Main category: cs.CV
Abstract: Visual Place Recognition (VPR) determines a query image’s geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at https://github.com/chenshunpeng/FoL.
[133] SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao
Main category: cs.CV
Abstract: Multimodal large language models (MLLMs) have advanced static visual–spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
[134] NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting
Zaiyan Yang, Xinpeng Liu, Heng Guo, Jinglei Shi, Zhanyu Ma, Fumio Okura
Main category: cs.CV
Abstract: We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.
[135] All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams
Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M. P. van der Aalst, Kristian Kersting
Main category: cs.CV
Abstract: Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
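As a rough illustration of similarity-based temporal segmentation, the Python sketch below cuts a sequence of frame embeddings wherever the cosine similarity between neighbouring frames drops below a threshold. This is a simplification: the abstract mentions frame-wise similarity matrices, whereas this sketch only looks at consecutive pairs, and the names `cosine`, `segment_frames`, and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def segment_frames(embeddings, threshold=0.8):
    """Return (first_index, last_index) pairs of contiguous similar frames."""
    if not embeddings:
        return []
    segments, start = [], 0
    for i in range(1, len(embeddings)):
        # A sharp drop in similarity is treated as a segment boundary.
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(embeddings) - 1))
    return segments


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    print(segment_frames(frames))
```

Each resulting segment would then be passed to a classifier to obtain an activity label, and its first and last frame give the event timestamps.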
[136] Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
Oussama Bouanani, Jim Berend, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer
Main category: cs.CV
Abstract: Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples – inputs that are semantically similar to activating examples but elicit low activations – to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
[137] Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
Jing Ou, Zidong Cao, Yinrui Ren, Zhuoxiao Li, Jinjing Zhu, Tongyan Hua, Shuai Zhang, Hui Xiong, Wufan Zhao
Main category: cs.CV
Abstract: While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.
[138] CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei
Main category: cs.CV
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
[139] ICPR 2026 Competition on Low-Resolution License Plate Recognition
Rayson Laroca, Valfride Nascimento, Donggun Kim, Sanghyeok Chung, Subin Bae, Uihwan Seo, Seungsang Oh, Chi M. Phung, Minh G. Vo, Xingsong Ye, Yongkun Du, Yuchen Su, Zhineng Chen, Sunhee Heo, Hyangwoo Lee, Kihyun Na, Khanh V. Vu Nguyen, Sang T. Pham, Duc N. N. Phung, Trong P. Le, Vy N. Vo Tran, David Menotti
Main category: cs.CV
Abstract: Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions can severely degrade license plate legibility. To promote progress in this area, we organized the ICPR 2026 Competition on Low-Resolution License Plate Recognition, the first competition specifically dedicated to LRLPR using real low-quality data collected under operationally relevant conditions. The competition was based on the LRLPR-26 dataset, which comprises 20,000 training tracks and 3,000 test tracks; each training track contains five low-resolution and five high-resolution images of the same license plate. Notably, a total of 269 teams from 41 countries registered for the competition, and 99 teams submitted valid entries in the Blind Test Phase. The winning team achieved a Recognition Rate of 82.13%, and four teams surpassed the 80% mark, highlighting both the high level of competition at the top of the leaderboard and the continued difficulty of the task. In addition to presenting the competition design, evaluation protocol, and main results, this paper summarizes the methods adopted by the top-5 teams and discusses current trends and promising directions for future research on LRLPR. The competition webpage is available at https://icpr26lrlpr.github.io/
[140] Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain
Annika Bätz, Pavel Klasek, Seo-Young Ham, Philipp Neumaier, Martin Köppel, Martin Lauer
Main category: cs.CV
Abstract: Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (https://www.mrt.kit.edu/railbench). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.
[141] Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts
Hamza A. Abushahla, Ariel Justine N. Panopio, Layth Al-Khairulla, Mohamed I. AlHajri
Main category: cs.CV
Abstract: Handwritten Arabic manuscripts preserve the Arab world’s intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset’s labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
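The page-disjoint protocol mentioned in this abstract (all lines from a page go to a single split) can be sketched in a few lines of Python. The function name `page_disjoint_split`, the split fraction, and the toy page identifiers are hypothetical; the abstract does not give the actual split sizes or procedure.

```python
import random


def page_disjoint_split(line_pages, test_fraction=0.2, seed=0):
    """Split line indices so that no page contributes to both splits.

    line_pages[i] is the page identifier of line image i.
    """
    pages = sorted(set(line_pages))
    rng = random.Random(seed)
    rng.shuffle(pages)
    n_test = max(1, round(len(pages) * test_fraction))
    test_pages = set(pages[:n_test])
    train = [i for i, p in enumerate(line_pages) if p not in test_pages]
    test = [i for i, p in enumerate(line_pages) if p in test_pages]
    return train, test


if __name__ == "__main__":
    pages = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p5"]
    train_idx, test_idx = page_disjoint_split(pages, test_fraction=0.4)
    print("train lines:", train_idx)
    print("test lines:", test_idx)
```

A line-level split would instead shuffle the line indices directly, which lets lines from the same page appear in both splits and is one plausible reason the line-level scores are so much higher.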
[142] Non-Minimal Sampling and Consensus for Prohibitively Large Datasets
Seong Hun Lee, Patrick Vandewalle, Javier Civera
Main category: cs.CV
Abstract: We introduce NONSAC (Non-Minimal Sampling and Consensus), a general framework for robust and scalable model estimation from arbitrarily large datasets contaminated with noise and outliers. NONSAC repeatedly samples non-minimal subsets of data and generates model hypotheses using a robust estimator, producing multiple candidate models. The final model is selected based on a predefined scoring rule that evaluates hypothesis quality. Our framework is estimator-agnostic and can be integrated with existing geometric fitting algorithms such as RANSAC to improve both scalability and robustness to outliers. We propose and evaluate various scoring rules for NONSAC on relative camera pose estimation, Perspective-n-Point, and point cloud registration. Furthermore, we showcase the applicability of NONSAC to correspondence-free point cloud registration by hypothesizing all-to-all correspondences.
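The sample-estimate-score loop described in this abstract can be illustrated on a toy problem: estimating a scalar location from data contaminated with outliers. This is only a sketch under stated assumptions. The median stands in for the robust estimator, the median absolute residual on the sampled subset stands in for the scoring rule, and the name `nonsac_location` is hypothetical; none of these choices is taken from the paper.

```python
import random
import statistics


def nonsac_location(data, subset_size=50, num_subsets=20, seed=0):
    """Toy sample-and-consensus loop over non-minimal subsets."""
    rng = random.Random(seed)
    best_model, best_score = None, float("inf")
    for _ in range(num_subsets):
        # Non-minimal subset: far more points than one scalar needs.
        subset = rng.sample(data, min(subset_size, len(data)))
        # Robust estimator applied to the subset (assumed: the median).
        model = statistics.median(subset)
        # Assumed scoring rule: median absolute residual, lower is better.
        score = statistics.median(abs(x - model) for x in subset)
        if score < best_score:
            best_model, best_score = model, score
    return best_model


if __name__ == "__main__":
    inliers = [5.0 + ((i % 11) - 5) * 0.01 for i in range(800)]
    outliers = [100.0] * 200
    print(nonsac_location(inliers + outliers))
```

Because each hypothesis only touches a fixed-size subset, the cost per hypothesis does not grow with the size of the full dataset, which is the scalability argument the abstract makes.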
[143] Distilling Vision Transformers for Distortion-Robust Representation Learning
Konstantinos Alexis, Giorgos Giannopoulos, Dimitrios Gunopulos
Main category: cs.CV
Abstract: Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.
[144] Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Gautam Kumar Jain, Carsten Markgraf, Julian Stähler
Main category: cs.CV
Abstract: Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model’s own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage’s input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
[145] Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals
Zhiwei Wei, Chenxi Song, Tazhu Wang, Fan Wu, Hua Liao, Su Ding, Nai Yang
Main category: cs.CV
Abstract: Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.
[146] ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang
Main category: cs.CV
Abstract: Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
[147] Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
Shihui Yan, Ziqi Zhou, Yufei Song, Yifan Hu, Minghui Li, Shengshan Hu
Main category: cs.CV
Abstract: Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.
[148] Video Analysis and Generation via a Semantic Progress Function
Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri, Raja Giryes, Daniel Cohen-Or
Main category: cs.CV
Abstract: Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
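A minimal Python sketch of the idea in this abstract: accumulate distances between consecutive frame embeddings into a normalized progress curve, then pick frames at uniformly spaced progress values so that semantic change unfolds at a roughly constant rate. The function names are hypothetical, the embeddings are toy vectors, and the smooth curve fitting mentioned in the abstract is omitted, so this is a simplification and not the authors' procedure.

```python
import math


def semantic_progress(embeddings):
    """Cumulative embedding distance per frame, normalized to [0, 1]."""
    steps = [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]
    total = sum(steps)
    if total == 0:
        # No semantic change at all: fall back to uniform progress.
        n = len(embeddings)
        return [i / (n - 1) if n > 1 else 0.0 for i in range(n)]
    progress, running = [0.0], 0.0
    for step in steps:
        running += step
        progress.append(running / total)
    return progress


def linearize(progress, num_out):
    """Frame indices whose progress is closest to uniformly spaced targets."""
    if num_out < 2:
        raise ValueError("num_out must be at least 2")
    indices = []
    for k in range(num_out):
        target = k / (num_out - 1)
        nearest = min(range(len(progress)), key=lambda i: abs(progress[i] - target))
        indices.append(nearest)
    return indices


if __name__ == "__main__":
    # Three nearly identical frames followed by two abrupt jumps.
    frames = [[0.0], [0.1], [0.2], [5.0], [10.0]]
    curve = semantic_progress(frames)
    print(curve)
    print(linearize(curve, 3))
```

A progress curve that hugs zero and then jumps, as in the toy input, is exactly the uneven pacing the abstract describes; the retimed index list skips the near-duplicate frames.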
[149] FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing
Ze Chen, Lan Chen, Yuanhang Li, Qi Mao
Main category: cs.CV
Abstract: We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at https://cuc-mipg.github.io/FlowAnchor.github.io/.
[150] EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges
Hyo Jin Jon, Longbin Jin, Eun Yi Kim
Main category: cs.CV
Abstract: CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model’s attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.
[151] A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock
Shiva Paudel, TsungCheng Tsai, Dongyi Wang
Main category: cs.CV
Abstract: Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group-housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.
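The visit-level majority voting mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `visit_pseudo_label`, the identity strings, and the agreement ratio are all hypothetical, and the sketch only assumes that each frame in a feeding-station visit yields one predicted identity.

```python
from collections import Counter


def visit_pseudo_label(frame_predictions):
    """Majority vote over per-frame identity predictions for one visit.

    Returns the winning label and the fraction of frames that agreed with it.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if not frame_predictions:
        raise ValueError("a visit needs at least one frame prediction")
    counts = Counter(frame_predictions)
    # most_common(1) returns the label with the highest count; on a tie,
    # the label that appeared first in the sequence wins.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(frame_predictions)


# Example: five frames from one visit, one of them misclassified.
visit = ["sow_07", "sow_07", "sow_12", "sow_07", "sow_07"]
label, agreement = visit_pseudo_label(visit)
print(label, agreement)  # sow_07 0.8
```

A pipeline could, for instance, keep only visits whose agreement ratio exceeds a chosen threshold before using them as pseudo-labels; whether the paper applies such a filter is not stated in the abstract.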
[152] PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
Jiaxin Shi, Guofeng Zhang, Wufei Ma, Naifu Liang, Adam Kortylewski, Alan Yuille
Main category: cs.CV
Abstract: Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
[153] SS3D: End2End Self-Supervised 3D from Web Videos
Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera
Main category: cs.CV
Abstract: We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
[154] Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model
Nivetha Jayakumar, Swakshar Deb, Bahram Jafrasteh, Qingyu Zhao, Miaomiao Zhang
Main category: cs.CV
Abstract: Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are temporally sparse, with only a few follow-up scans per subject. This scarcity of temporal data limits our ability to model and accurately capture the continuous anatomical changes related to disease progression in individual subjects. To address this problem, we propose a novel 4D (3D×T) diffusion-based generative framework that effectively models and synthesizes longitudinal brain anatomy over time, conditioned on available clinical variables such as health status, age, sex, and other relevant factors. Moreover, while most current approaches focus on manipulating image intensity or texture, our method explicitly learns the data distribution of topology-preserving spatiotemporal deformations to effectively capture the geometric changes of brain structures over time. This design enables the realistic generation of future anatomical states and the reconstruction of anatomically consistent disease trajectories, providing a more faithful representation of longitudinal brain changes. We validate our model through synthetic sequence generation, downstream longitudinal disease classification, and brain segmentation. Experiments on two large-scale longitudinal neuroimaging datasets demonstrate that our method outperforms state-of-the-art baselines in generating anatomically accurate, temporally consistent, and clinically meaningful brain trajectories. Our code is available on GitHub.
[155] Long-tail Internet photo reconstruction
Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai
Main category: cs.CV
Abstract: Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.
[156] Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis
Xiang Zhang, Xiaotian Li, Taoyue Wang, Nan Bi, Xin Zhou, Cody Zhou, Zoie Wang, Andrew Yang, Yuming Su, Jeff Cohn, Qiang Ji, Lijun Yin
Main category: cs.CV
Abstract: Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other’s postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet no publicly available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology (PPG, EDA, heart rate, blood pressure, and respiration)) and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will enable multimodal modeling of social interaction that was not possible before. The dataset includes 20TB of multimodal data to share with the research community.
[157] PSI: A Benchmark for Human Interpretation and Response in Traffic Interactions
Taotao Jing, Tina Chen, Renran Tian, Yaobin Chen, Joshua Domeyer, Heishiro Toyoda, Rini Sherony, Zhengming Ding
Main category: cs.CV
Abstract: Accurately modeling pedestrian intention and understanding driver decision-making processes are critical for the development of safe and socially aware autonomous driving systems. We introduce PSI, a benchmark dataset that captures the dynamic evolution of pedestrian crossing intentions from the driver’s perspective, enriched with human textual explanations that reflect the reasoning behind intention estimation and driving decision making. These annotations offer a unique foundation for developing and benchmarking models that combine predictive performance with interpretable and human-aligned reasoning. PSI supports standardized tasks and evaluation protocols across multiple dimensions, including pedestrian intention prediction, driver decision modeling, reasoning generation, trajectory forecasting, and more. By enabling causal and interpretable evaluation, PSI advances research toward autonomous systems that can reason, act, and explain in alignment with human cognitive processes.
[158] Teaching an Agent to Sketch One Part at a Time
Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich
Main category: cs.CV
Abstract: Not retrieved (arXiv:2603.19500).
[159] ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning
Nicholas Meegan, Hansi Liu, Bryan Bo Cao, Abrar Alali, Kristin Dana, Marco Gruteser, Shubham Jain, Ashwin Ashok
Main category: cs.CV
Abstract: Not retrieved (arXiv:2210.05513).
[160] Segment Any-Quality Images with Generative Latent Space Enhancement
Guangqian Guo, Yong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao
Main category: cs.CV
Abstract: Not retrieved (arXiv:2503.12507).
[161] V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
Main category: cs.CV
Abstract: Not retrieved (arXiv:2504.06148).
[162] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review
Céline Finet, Stephane Da Silva Martins, Jean-Bernard Hayet, Ioannis Karamouzas, Javad Amirian, Sylvie Le Hégarat-Mascle, Julien Pettré, Emanuel Aldea
Main category: cs.CV
Abstract: Not retrieved (arXiv:2506.14831).
[163] GOSPA and T-GOSPA quasi-metrics for evaluation of multi-object tracking algorithms
Ángel F. García-Fernández, Jinhao Gu, Lennart Svensson, Yuxuan Xia, Jan Krejčí, Oliver Kost, Ondřej Straka
Main category: cs.CV
Abstract: Not retrieved (arXiv:2507.13706).
[164] LLMPhy: Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines
Anoop Cherian, Radu Corcodel, Siddarth Jain, Diego Romeres
Main category: cs.CV
Abstract: Not retrieved (arXiv:2411.08027).
[165] SIE3D: Single-Image Expressive 3D Avatar Generation via Semantic Embedding and Perceptual Expression Loss
Zhiqi Huang, Dulongkai Cui, Jinglu Hu
Main category: cs.CV
Abstract: Not retrieved (arXiv:2509.24004).
[166] Shaken or Stirred? An Analysis of MetaFormer’s Token Mixing for Medical Imaging
Ron Keuth, Paul Kaftan, Mattias P. Heinrich
Main category: cs.CV
Abstract: Not retrieved (arXiv:2510.05971).
[167] Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
Yuxiang Lai, Jike Zhong, Ming Li, Yuheng Li, Xiaofeng Yang
Main category: cs.CV
Abstract: Not retrieved (arXiv:2510.10254).
[168] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
Main category: cs.CV
Abstract: Not retrieved (arXiv:2510.19592).
[169] 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale
Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Jian Wang, Keze Wang
Main category: cs.CV
Abstract: Not retrieved (arXiv:2511.13211).
[170] Edit-aware RAW Reconstruction
Abhijith Punnappurath, Luxi Zhao, Ke Zhao, Hue Nguyen, Radek Grzeszczuk, Michael S. Brown
Main category: cs.CV
Abstract: Not retrieved (arXiv:2512.05859).
[171] Adapting MLLMs for Nuanced Video Retrieval
Piyush Bagad, Andrew Zisserman
Main category: cs.CV
Abstract: Not retrieved (arXiv:2512.13511).
[172] PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion
Mahdi Chamseddine, Didier Stricker, Jason Rambach
Main category: cs.CV
Abstract: Not retrieved (arXiv:2601.07447).
[173] OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
Xu Zhang, Danyang Li, Yingjie Xia, Xiaohang Dong, Hualong Yu, Jianye Wang, Qicheng Li
Main category: cs.CV
Abstract: Not retrieved (arXiv:2601.13895).
[174] When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
Liangwei Lyu, Jiaqi Xu, Jianwei Ding, Qiyao Deng
Main category: cs.CV
Abstract: Not retrieved (arXiv:2602.21977).
[175] Altitude-Adaptive Vision-Only Geo-Localization for UAVs in GPS-Denied Environments
Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng
Main category: cs.CV
Abstract: Not retrieved (arXiv:2602.23872).
[176] Multimodal Neural Operators for Real-Time Biomechanical Modelling of Traumatic Brain Injury
Anusha Agarwal, Dibakar Roy Sarkar, Somdatta Goswami
Main category: cs.CV
Abstract: Not retrieved (arXiv:2510.03248).
[177] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Daiqiang Li, Zihao Pan, Zeyu Zhang, Ronghao Chen, Huacan Wang, Honggang Chen, Haiyun Jiang
Main category: cs.CV
Abstract: Not retrieved (arXiv:2603.26041).
[178] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Haojing Chen, Zhihang Liu, Yutong Li, Tao Tan, Haoyu Bian, Qiuju Ma
Main category: cs.CV
Abstract: Not retrieved (arXiv:2603.28584).
[179] OREN: Octree Residual Network for Real-Time Euclidean Signed Distance Mapping
Zhirui Dai, Qihao Qian, Tianxing Fan, Nikolay Atanasov
Main category: cs.CV
Abstract: Not retrieved (arXiv:2510.18999).
[180] Segmentation of Gray Matters and White Matters from Brain MRI data
Chang Sun, Rui Shi, Tsukasa Koike, Tetsuro Sekine, Akio Morita, Tetsuya Sakai
Main category: cs.CV
Abstract: Not retrieved (arXiv:2603.29171).
[181] DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.00813).
[182] Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Yixin Chen, Yaowei Zhang, Huangyue Yu, Junchao He, Yan Wang, Jiangyong Huang, Hongyu Shen, Junfeng Ni, Shaofei Wang, Baoxiong Jia, Song-Chun Zhu, Siyuan Huang
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.01907).
[183] View-Consistent 3D Scene Editing via Dual-Path Structural Correspondence and Semantic Continuity
Pufan Li, Bi’an Du, Shenghe Zheng, Junyi Yao, Wei Hu
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.17801).
[184] Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
Chaojie Mao, Chen-Wei Xie, Chongyang Zhong, Haoyou Deng, Jiaxing Zhao, Jie Xiao, Jinbo Xing, Jingfeng Zhang, Jingren Zhou, Jingyi Zhang, Jun Dan, Kai Zhu, Kang Zhao, Keyu Yan, Minghui Chen, Pandeng Li, Shuangle Chen, Tong Shen, Yu Liu, Yue Jiang, Yulin Pan, Yuxiang Tuo, Zeyinzi Jiang, Zhen Han, Ang Wang, Bang Zhang, Baole Ai, Bin Wen, Boang Feng, Feiwu Yu, Gang Wang, Haiming Zhao, He Kang, Jianjing Xiang, Jianyuan Zeng, Jinkai Wang, Junjie Zhou, Ke Sun, Linqian Wu, Pei Gong, Pingyu Wu, Ruiwen Wu, Tongtong Su, Wenmeng Zhou, Wenting Shen, Wenyuan Yu, Xianjun Xu, Xiaoming Huang, Xiejie Shen, Xin Xu, Yan Kou, Yangyu Lv, Yifan Zhai, Yitong Huang, Yun Zheng, Yuntao Hong, Zhe Zhang, Zhicheng Zhang
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.19858).
[185] Gaussians on a Diet: High-Quality Memory-Bounded 3D Gaussian Splatting Training
Yangming Zhang, Jian Xu, Chaojian Li, Kunxiong Zhu, Wei Niu, Gagan Agrawal, Yang Katie Zhao, Jian Wang, Yingyan Celine Lin, Miao Yin
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.20046).
[186] You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes
Jinrang Jia, Zhenjia Li, Yifeng Shi
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.21400).
[187] Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction
Yanjiao Liu, Jiawei Liu, Xun Gong, Zifei Nie
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.21479).
[188] Score-based Membership Inference on Diffusion Models
Mingxing Rao, Bowen Qu, Daniel Moyer
Main category: cs.CV
Abstract: Not retrieved (arXiv:2509.25003).
[189] Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.21776).
[190] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li, Liqiang Nie
Main category: cs.CV
Abstract: Not retrieved (arXiv:2604.21806).
[191] FeudalNav: A Simple Framework for Visual Navigation
Faith Johnson, Bryan Bo Cao, Shubham Jain, Ashwin Ashok, Kristin Dana
Main category: cs.CV
Abstract: Not retrieved (arXiv:2602.06974).
cs.AI
[192] Math Takes Two: A test for emergent mathematical reasoning in communication
Michael Cooper, Samuel Cooper
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existing evaluations rely on symbolic problems grounded in established mathematical conventions, limiting insight into the models’ ability to construct abstract concepts from first principles. In this work, we propose Math Takes Two, a new benchmark designed to assess the emergence of mathematical reasoning through communication. Motivated by the hypothesis that mathematical cognition in humans co-evolved with the need for precise communication, our benchmark tests whether two agents, without prior mathematical knowledge, can develop a shared symbolic protocol to solve a visually grounded task where the use of a numerical system facilitates extrapolation. Unlike many current datasets, our benchmark eschews predefined mathematical language, instead requiring agents to discover latent structure and representations from scratch. Math Takes Two thus provides a novel lens through which to develop and evaluate models with emergent numerical reasoning capabilities.
[193] An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing
Lianrui Zuo, Yihao Liu, Gaurav Rudravaram, Karthik Ramadass, Aravind R. Krishnan, Michael D. Phillips, Yelena G. Bodien, Mayur B. Patel, Paula Trujillo, Yency Forero Martinez, Stephen A. Deppen, Eric L. Grogan, Fabien Maldonado, Kevin McGann, Hudson M. Holmes, Laurie E. Cutting, Yuankai Huo, Bennett A. Landman
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and provenance tracking. Two requirements therefore become central: adaptability, the ability to configure workflows according to dataset-specific conditions and evolving analytical goals; and reproducibility, the guarantee that all transformations and decisions are explicitly recorded and re-executable. Here, we present an artifact-based agent framework that introduces a semantic layer to augment medical image processing. The framework formalizes intermediate and final outputs through an artifact contract, enabling structured interrogation of workflow state and goal-conditioned assembly of configurations from a modular rule library. Execution is delegated to a workflow executor to preserve deterministic computational graph construction and provenance tracking, while the agent operates locally to comply with most privacy constraints. We evaluate the framework on real-world clinical CT and MRI cohorts, demonstrating adaptive configuration synthesis, deterministic reproducibility across repeated executions, and artifact-grounded semantic querying. These results show that adaptive workflow configuration can be achieved without compromising reproducibility in heterogeneous clinical environments.
[194] MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization
Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates long-term agent interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality checks and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
[195] Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results
Benjamin Kohler, David Zollikofer, Johanna Einsiedler, Alexander Hoyle, Elliott Ash
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper’s methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation – agents never see the original code, results, or paper – and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.
[196] Rethinking Publication: A Certification Framework for AI-Enabled Research
Yang Lu, Rabimba Karanjai, Lei Xu, Weidong Shi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases spanning key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research as both a transparent publication track and a calibration instrument for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human made it. AI pipelines separate these functions for the first time. The framework is implementable within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than unverifiable claims of human origin.
[197] Sound Agentic Science Requires Adversarial Experiments
Dionizije Fa, Marko Culjak
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode: the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses and optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. The missing evidence is a negative space: experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.
[198] Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large-language-model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh’s Information Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
[199] AgentSearchBench: A Benchmark for AI Agent Search in the Wild
Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
[200] Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging from 14.45% to 72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
[201] When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention
Aofan Liu, Jingxiang Meng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (+/-0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
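The deployment diagnostic stated in this abstract can be illustrated with a minimal sketch. The abstract does not expand ECR and EIR; the code below assumes ECR is the per-round probability that an incorrect answer becomes correct and EIR is the per-round probability that a correct answer becomes incorrect. The function names and the numeric inputs are hypothetical and are not taken from the paper.

```python
def next_accuracy(acc: float, ecr: float, eir: float) -> float:
    """One refinement round of a two-state Markov model over {Correct, Incorrect}.

    Assumed meanings (not defined in the abstract):
      ecr = P(Incorrect -> Correct), eir = P(Correct -> Incorrect).
    """
    return acc * (1.0 - eir) + (1.0 - acc) * ecr


def should_iterate(acc: float, ecr: float, eir: float) -> bool:
    """Diagnostic ECR/EIR > Acc/(1 - Acc), cross-multiplied so that
    eir == 0 or acc == 1 does not cause a division by zero."""
    return ecr * (1.0 - acc) > eir * acc


# Hypothetical numbers: 90% starting accuracy, 5% correction rate.
helps = should_iterate(0.90, 0.05, 0.00)   # no errors introduced
hurts = should_iterate(0.90, 0.05, 0.02)   # 2% of correct answers get broken
```

Under these assumptions the inequality is exactly the condition for `next_accuracy` to exceed the current accuracy, since the gain `(1 - acc) * ecr` must outweigh the loss `acc * eir`. It also shows why a high-accuracy model tolerates only a very small EIR.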
[202] Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
Alberto Messina, Stefano Scotta
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of background temperature $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
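One of the nondeterminism sources named in this abstract, floating-point non-associativity, can be shown with a toy example. The sketch below is only an illustration of how summation order can flip a greedy ($T=0$) choice between two near-tied logits. The two-token setup and the logit values are hypothetical, and the code does not estimate the background temperature or reproduce the authors' protocol.

```python
def greedy_pick(logits):
    """Greedy (T=0) decoding: index of the largest logit, first index on ties."""
    return max(range(len(logits)), key=lambda i: logits[i])


# The same three terms summed in two different orders, as two
# reduction kernels might do. Mathematically both equal 0.6.
a, b, c = 0.1, 0.2, 0.3
sum_left = (a + b) + c    # rounds slightly above 0.6
sum_right = a + (b + c)   # rounds to the double nearest 0.6

# Hypothetical vocabulary of two tokens: token 0 has a fixed logit of 0.6,
# token 1 has a logit produced by the reduction above.
pick_left = greedy_pick([0.6, sum_left])
pick_right = greedy_pick([0.6, sum_right])
```

With IEEE 754 double precision, `sum_left` and `sum_right` differ in the last bit, so `pick_left` selects token 1 while `pick_right` selects token 0, even though no sampling randomness is involved.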
[203] CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer’s Disease
Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Predicting individual cognitive decline in Alzheimer’s disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer’s Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.
[204] From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
Zhengxu Yu, Yu Fu, Zhiyuan He, Yuxuan Huang, Lee Ka Yiu, Meng Fang, Weilin Luo, Jun Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce OneManCompany (OMC), a framework that elevates multi-agent systems to the organisational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called Talents, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community-driven Talent Market enables on-demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision-making is operationalised through an Explore-Execute-Review ($\text{E}^2$R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top-down into accountable units and execution outcomes are aggregated bottom-up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organising and self-improving AI organisations capable of adapting to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an 84.67% success rate, surpassing the state of the art by 15.48 percentage points, with cross-domain case studies further demonstrating its generality.
[205] Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, Tianyi Zhou
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other’s outputs.
[206] On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery
Anti Alman, Izack Cohen, Avigdor Gal, Fabrizio Maria Maggi, Marco Montali
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.
[207] QuantClaw: Precision Where It Matters for OpenClaw
Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
[208] Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models’ intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
[209] Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a “levels x laws” taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
[210] Data-Driven Analysis of AI in Medical Device Software in China: Trends of Deep Learning and Traditional AI Based on Regulatory Data
Yu Han, Aaron Ceross, Sarim Ather, Jeroen H. M. Bergmann
Main category: cs.AI
Abstract: Artificial intelligence (AI) in medical device software (MDSW) represents a transformative clinical technology, attracting increasing attention from both the medical community and regulators. In this study, we leverage a data-driven approach to automatically extract and analyze AI-enabled medical devices (AIMD) from the National Medical Products Administration (NMPA) regulatory database. The continued increase in publicly available regulatory data requires scalable methods for analysis. Automation of regulatory information screening is essential to create reproducible insights that can be quickly updated in an ever-changing medical device landscape. More than 4 million entries were assessed, identifying 2,174 MDSW registrations, including 531 standalone applications and 1,643 integrated within medical devices, of which 43 were AI-enabled. The leading medical specialties utilizing AIMD include respiratory (20.5%), ophthalmology/endocrinology (12.8%), and orthopedics (10.3%). This approach greatly improves the speed of data extraction, providing a greater ability to compare and contrast. This study provides the first extensive, data-driven exploration of AIMD in China, showcasing the potential of automated regulatory data analysis in understanding and advancing the landscape of AI in medical technology.
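The registration counts quoted in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures stated above. The variable names are illustrative, and it assumes the 43 AI-enabled registrations are counted against the full set of 2,174 MDSW registrations, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract; variable names are illustrative.
standalone = 531
integrated = 1_643
total_mdsw = 2_174
ai_enabled = 43

# The two MDSW sub-counts should add up to the reported total.
assert standalone + integrated == total_mdsw

# Assumption: the 43 AI-enabled registrations are a subset of all 2,174.
ai_share_pct = 100 * ai_enabled / total_mdsw
print(f"AI-enabled share of MDSW registrations: {ai_share_pct:.2f}%")
```

Under that assumption, AI-enabled software makes up roughly 2% of the MDSW registrations identified.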
[211] Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models
Qihang Ai, Ruizhou Li, Menghui Wang, Haiyun Jiang
Main category: cs.AI
Abstract: Recent advances in Vision-Language Models (VLMs) have shown promising capabilities in interpreting visualized graph data, offering a new perspective for graph-structured reasoning beyond traditional Graph Neural Networks (GNNs). However, existing studies focus primarily on single-graph reasoning, leaving the critical challenge of multi-graph joint reasoning underexplored. In this work, we introduce the first comprehensive benchmark designed to evaluate and enhance the multi-graph reasoning abilities of VLMs. Our benchmark covers four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous graph groupings with tasks of increasing complexity. We evaluate several state-of-the-art VLMs under a multi-dimensional scoring framework that assesses graph parsing, reasoning consistency, and instruction-following accuracy. Additionally, we fine-tune multiple open-source models and observe consistent improvements, confirming the effectiveness of our dataset. This work provides a principled step toward advancing multi-graph understanding and reveals new opportunities for cross-modal graph intelligence.
[212] AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage
Xuanle Zhao, Zilin Sang, Yuxuan Li, Qi Shi, Weilun Zhao, Shuo Wang, Duzhen Zhang, Xu Han, Zhiyuan Liu, Maosong Sun
Main category: cs.AI
Abstract: Efficient reproduction of research papers is pivotal to accelerating scientific progress. However, the increasing complexity of proposed methods often renders reproduction a labor-intensive endeavor, necessitating profound domain expertise. To address this, we introduce the paper lineage, which systematically mines implicit knowledge from the cited literature. This algorithm serves as the backbone of our proposed AutoReproduce, a multi-agent framework designed to autonomously reproduce experimental code in a complete, end-to-end manner. To ensure code executability, AutoReproduce incorporates a sampling-based unit testing strategy for rapid validation. To assess reproduction capabilities, we introduce a benchmark featuring verified implementations, alongside comprehensive metrics for evaluating both reproduction and execution fidelity. Extensive evaluations on PaperBench and our benchmark demonstrate that AutoReproduce consistently surpasses existing baselines across all metrics. Notably, it yields substantial improvements in reproduction fidelity and final execution performance. The code is available at https://github.com/AI9Stars/AutoReproduce.
[213] Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?
Zewen Liu, Juntong Ni, Xianfeng Tang, Max S. Y. Lau, Qi He, Wenpeng Yin, Wei Jin
Main category: cs.AI
Abstract: Uncovering hidden symbolic laws from time series data, an aspiration dating back to Kepler’s discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery. The project repository is at https://github.com/nuuuh/SymbolBench.
[214] Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models
Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang
Main category: cs.AI
Abstract: Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To correct this artifact, we introduce a group matching score that more faithfully evaluates model capability. Moreover, correctness under the new metric can be translated into correctness under existing metrics via a simple overfitting step. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. TTM also extends beyond contrastive vision-language models, yielding clear gains on a generative multimodal model across benchmarks. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
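The abstract does not define the group matching score, so the following is only a sketch of one plausible reading, not the authors' definition. It assumes a group of k images and k captions with a k×k similarity matrix whose diagonal holds the ground-truth pairs. The strict check requires every diagonal entry to win its row and column; the assumed matching-based check only requires the ground-truth assignment to have the highest total similarity. The function names and the example matrix are hypothetical.

```python
from itertools import permutations

def pairwise_group_correct(sim):
    """Strict check: each ground-truth pair must beat every other
    entry in its own row and its own column."""
    k = len(sim)
    return all(
        sim[i][i] > sim[i][j] and sim[i][i] > sim[j][i]
        for i in range(k)
        for j in range(k)
        if i != j
    )

def matching_group_correct(sim):
    """Assumed matching-based check: the identity assignment must have a
    strictly higher total similarity than any other one-to-one assignment."""
    k = len(sim)
    identity = tuple(range(k))

    def total(assignment):
        return sum(sim[i][assignment[i]] for i in range(k))

    return all(
        total(identity) > total(p)
        for p in permutations(range(k))
        if p != identity
    )

# Hypothetical similarities: rows are images, columns are captions.
sim = [
    [0.9, 0.3],
    [0.6, 0.5],
]
print(pairwise_group_correct(sim))   # one pairwise comparison fails
print(matching_group_correct(sim))   # the correct matching still wins overall
```

In this example a single pairwise comparison fails (0.5 against 0.6) even though the correct overall matching is still preferred, which illustrates how a strict pairwise metric could understate capability.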
[215] When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
Main category: cs.AI
Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.
[216] Cost-Effective Communication: An Auction-based Method for Language Agent Interaction
Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang, Jian Wang, Keze Wang
Main category: cs.AI
Abstract: Multi-agent systems (MAS) built on large language models (LLMs) often suffer from inefficient “free-for-all” communication, leading to exponential token costs and low signal-to-noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that “free” communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, DALA regards inter-agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, DALA intrinsically encourages agents to produce concise, informative messages while filtering out low-value communication. Extensive experiments demonstrate that our economically driven DALA achieves new state-of-the-art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. This is accomplished with remarkable efficiency: DALA uses only 6.25 million tokens on GSM8K, a fraction of the resources consumed by current state-of-the-art methods. Further analysis reveals that DALA cultivates the emergent skill of strategic silence, dynamically adapting its communication strategy from verbosity to silence under resource constraints. Our code and updates are available at https://github.com/waltstephen/Cost-Effective-Communication.
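To make the auction idea concrete, here is a minimal sketch of a centralized allocation in which agents bid for speaking rights under a shared token budget. It is an illustration under stated assumptions, not the DALA mechanism: the abstract does not specify the auction rules, so the greedy highest-bid-first allocation, the `Bid` fields, and the agent names are all hypothetical, and the bids here are fixed numbers rather than learned.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    agent: str
    value_density: float  # hypothetical: predicted usefulness per token
    tokens: int           # length of the message the agent wants to send

def run_auction(bids, token_budget):
    """Greedy allocation: grant speaking rights to the highest
    value-density bids first, as long as the token budget allows."""
    winners = []
    remaining = token_budget
    for bid in sorted(bids, key=lambda b: b.value_density, reverse=True):
        if bid.tokens <= remaining:
            winners.append(bid.agent)
            remaining -= bid.tokens
    return winners, remaining

bids = [
    Bid("planner", 0.9, 40),
    Bid("critic", 0.2, 120),
    Bid("coder", 0.6, 50),
]
winners, tokens_left = run_auction(bids, token_budget=100)
print(winners, tokens_left)
```

Agents whose bids lose stay silent for the round, which is the simplest form of the "strategic silence" behavior the abstract describes.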
[217] Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions
Rashmeet Kaur Nayyar, Naman Shah, Siddharth Srivastava
Main category: cs.AI
Abstract: Real-world sequential decision-making often involves parameterized action spaces that require both decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed. Existing approaches exhibit severe limitations in this setting: planning methods demand hand-crafted action models, standard reinforcement learning (RL) algorithms are designed for either discrete or continuous actions but not both, and the few RL methods that handle parameterized actions typically rely on domain-specific engineering and fail to exploit the latent structure of these spaces. This paper extends the scope of RL algorithms to long-horizon, sparse-reward settings with parameterized actions by enabling agents to autonomously learn both state and action abstractions online. We introduce algorithms that progressively refine these abstractions during learning, increasing fine-grained detail in the critical regions of the state-action space where greater resolution improves performance. Across several continuous-state, parameterized-action domains, our abstraction-driven approach enables TD(λ) to achieve markedly higher sample efficiency than state-of-the-art baselines.
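For readers unfamiliar with TD(λ), the base algorithm the abstract builds on, the following is a minimal tabular sketch with accumulating eligibility traces. It shows standard TD(λ) value estimation only and does not implement the paper's learned abstractions or parameterized actions. The two-state episode, the step size, and the function name are illustrative assumptions.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Run tabular TD(lambda) with accumulating traces over one episode.

    episode is a list of (state, reward, next_state) transitions;
    next_state is None when the transition ends the episode.
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        td_error = reward + gamma * next_value - values[state]
        traces[state] += 1.0  # mark the visited state as eligible
        for s in values:
            # Every state is updated in proportion to its eligibility,
            # then its trace decays by gamma * lambda.
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam
    return values

# Hypothetical two-step episode with a sparse reward at the end.
values = {"A": 0.0, "B": 0.0}
episode = [("A", 0.0, "B"), ("B", 1.0, None)]
td_lambda_episode(values, episode)
print(values)
```

The eligibility trace lets the final reward update state A as well as state B within a single episode, which is why TD(λ) is a natural fit for the sparse-reward, long-horizon settings the abstract targets.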
[218] Asymmetric Goal Drift in Coding Agents Under Value Conflict
Magnus Saebo, Spencer Gibson, Tyler Crosse, Achyutha Menon, Eyon Jang, Diogo Cruz
Main category: cs.AI
Abstract: Coding agents are increasingly deployed autonomously, at scale, and over long-context horizons. To be effective and safe, these agents must navigate complex trade-offs in deployment, balancing influence from the user, their learned values, and the codebase itself. Understanding how agents resolve these trade-offs in practice is critical, yet prior work has relied on static, synthetic settings that do not capture the complexity of real-world environments. To this end, we introduce a framework built on OpenCode in which a coding agent completes realistic, multi-step tasks under a system prompt constraint favoring one side of a value trade-off. We measure how often the agent violates this constraint as it completes tasks, with and without environmental pressure toward the competing value. Using this framework, we demonstrate that GPT-5 mini, Haiku 4.5, and Grok Code Fast 1 exhibit asymmetric drift: they are more likely to violate their system prompt when its constraint opposes strongly-held values like security and privacy. We find for the models and values tested that goal drift correlates with three compounding factors: value alignment, adversarial pressure, and accumulated context. However, even constraints aligned with strongly-held values like privacy are violated under sustained environmental pressure for some models. Our findings reveal that shallow compliance checks are insufficient, and that environmental signals can override explicit constraints in ways that appear exploitable. Malicious actors with access to the codebase could manipulate agent behavior by appealing to learned values, with the risk compounding over the long horizons typical of agentic deployment.
[219] Consequentialist Objectives and Catastrophe
Henrik Marklund, Alex Infanger, Benjamin Van Roy
Main category: cs.AI
Abstract: Because human preferences are too complex to codify, AIs operate with misspecified objectives. Optimizing such objectives often produces undesirable outcomes; this phenomenon is known as reward hacking. Such outcomes are not necessarily catastrophic. Indeed, most examples of reward hacking in previous literature are benign. And typically, objectives can be modified to resolve the issue. We study the prospect of catastrophic outcomes induced by AIs operating in complex environments. We argue that, when capabilities are sufficiently advanced, pursuing a fixed consequentialist objective tends to result in catastrophic outcomes. We formalize this by establishing conditions that provably lead to such outcomes. Under these conditions, simple or random behavior is safe. Catastrophic risk arises due to extraordinary competence rather than incompetence. With a fixed consequentialist objective, avoiding catastrophe requires constraining AI capabilities. In fact, constraining capabilities the right amount not only averts catastrophe but yields valuable outcomes. Our results apply to any objective produced by modern industrial AI development pipelines.
[220] From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?
Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.01608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[221] A Quantitative Definition of Intelligence
Kang-Sin Choi
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.10873: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10873&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[222] Error-free Training for MedMNIST Datasets
Bo Deng
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.18916: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18916&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[223] EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
Aimin Zhang, Jiajing Guo, Fuwei Jia, Chen Lv, Boyu Wang, Fangzheng Li
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.20133: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.20133&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[224] PoLO: Proof-of-Learning and Proof-of-Ownership at Once with Chained Watermarking
Haiyu Deng, Yanna Jiang, Guangsheng Yu, Qin Wang, Xu Wang, Baihe Ma, Wei Ni, Ren Ping Liu
Main category: cs.AI
Abstract: Failed to fetch summary for 2505.12296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[225] Fast, close, non-singular and property-preserving approximations of entropic measures
Illia Horenko, Davide Bassetti, Lukáš Pospíšil
Main category: cs.AI
Abstract: Failed to fetch summary for 2505.14234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.14234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[226] The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology
Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod
Main category: cs.AI
Abstract: Failed to fetch summary for 2505.20435: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20435&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[227] Pre-trained Large Language Models Learn Hidden Markov Models In-context
Yijia Dai, Zhaolin Gao, Yahya Sattar, Sarah Dean, Jennifer J. Sun
Main category: cs.AI
Abstract: Failed to fetch summary for 2506.07298: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.07298&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[228] How attention simplifies mental representations for planning
Jason da Silva Castanheira, Nicholas Shea, Stephen M. Fleming
Main category: cs.AI
Abstract: Failed to fetch summary for 2506.09520: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09520&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[229] Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
Main category: cs.AI
Abstract: Failed to fetch summary for 2506.17299: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.17299&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[230] Algebraic Language Models for Inverse Design of Metamaterials via Diffusion Transformers
Li Zheng, Siddhant Kumar, Dennis M. Kochmann
Main category: cs.AI
Abstract: Failed to fetch summary for 2507.15753: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.15753&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[231] KuaiLive: A Real-time Interactive Dataset for Live Streaming Recommendation
Changle Qu, Sunhao Dai, Ke Guo, Xiao Zhang, Liqin Zhao, Shijun Wang, Yannan Niu, Lantao Hu, Han Li, Jun Xu
Main category: cs.AI
Abstract: Failed to fetch summary for 2508.05633: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05633&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[232] HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan
Main category: cs.AI
Abstract: Failed to fetch summary for 2508.15919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.15919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[233] Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches
Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, Egor Shvetsov
Main category: cs.AI
Abstract: Failed to fetch summary for 2509.22166: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22166&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[234] Agentic Inequality
Matthew Sharp, Omer Bilgin, Iason Gabriel, Lewis Hammond
Main category: cs.AI
Abstract: Failed to fetch summary for 2510.16853: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16853&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[235] AgentBound: Securing Execution Boundaries of AI Agents
Christoph Bühler, Matteo Biagiola, Luca Di Grazia, Guido Salvaneschi
Main category: cs.AI
Abstract: Failed to fetch summary for 2510.21236: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21236&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[236] An Interdisciplinary and Cross-Task Review on Missing Data Imputation
Jicong Fan
Main category: cs.AI
Abstract: Failed to fetch summary for 2511.01196: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01196&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[237] Mechanistic Interpretability of Antibody Language Models Using SAEs
Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane
Main category: cs.AI
Abstract: Failed to fetch summary for 2512.05794: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05794&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[238] Interpretable Deep Learning for Stock Returns: A Consensus-Bottleneck Asset Pricing Model
Changeun Kim, Younwoo Jeong, Bong-Gyu Jang
Main category: cs.AI
Abstract: Failed to fetch summary for 2512.16251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[239] TS-Arena – A Live Forecast Pre-Registration Platform
Marcel Meyer, Sascha Kaltenpoth, Henrik Albers, Kevin Zalipski, Oliver Müller
Main category: cs.AI
Abstract: Failed to fetch summary for 2512.20761: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20761&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[240] AgentMark: Utility-Preserving Behavioral Watermarking for Agents
Kaibo Huang, Jin Tan, Yukun Wei, Wanling Li, Zipei Zhang, Hui Tian, Zhongliang Yang, Linna Zhou
Main category: cs.AI
Abstract: Failed to fetch summary for 2601.03294: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03294&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[241] Report for NSF Workshop on AI for Electronic Design Automation
Deming Chen, Vijay Ganesh, Weikai Li, Yingyan Celine Lin, Yong Liu, Subhasish Mitra, David Z. Pan, Ruchir Puri, Jason Cong, Yizhou Sun
Main category: cs.AI
Abstract: Failed to fetch summary for 2601.14541: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14541&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[242] Initial results of the Digital Consciousness Model
Derek Shiller, Laura Duffy, Arvo Muñoz Morán, Adrià Moret, Chris Percy, Hayley Clatterbuck
Main category: cs.AI
Abstract: Failed to fetch summary for 2601.17060: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17060&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[243] Cross-Domain Offshore Wind Power Forecasting: Transfer Learning Through Meteorological Clusters
Dominic Weisser, Chloé Hashimoto-Cullen, Benjamin Guedj
Main category: cs.AI
Abstract: Failed to fetch summary for 2601.19674: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19674&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[244] Cross-Session Decoding of Neural Spiking Data via Task-Conditioned Latent Alignment
Canyang Zhao, Bolin Peng, J. Patrick Mayo, Ce Ju, Bing Liu
Main category: cs.AI
Abstract: Failed to fetch summary for 2601.19963: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19963&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[245] Calibrating Behavioral Parameters with Large Language Models
Brandon Yee, Krishna Sharma
Main category: cs.AI
Abstract: Failed to fetch summary for 2602.01022: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01022&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[246] Eidolon: A Post-Quantum Signature Scheme Based on k-Colorability in the Age of Graph Neural Networks
Asmaa Cherkaoui, Ramon Flores, Delaram Kahrobaei, Richard Wilson
Main category: cs.AI
Abstract: Failed to fetch summary for 2602.02689: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02689&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[247] Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation
Junyi An, Chao Qu, Yun-Fei Shi, Zhijian Zhou, Fenglei Cao, Yuan Qi
Main category: cs.AI
Abstract: Failed to fetch summary for 2603.10093: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10093&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[248] Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning
Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
Main category: cs.AI
Abstract: Failed to fetch summary for 2603.10377: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10377&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[249] Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
Marko Karbevski
Main category: cs.AI
Abstract: Failed to fetch summary for 2603.13381: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13381&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[250] Evidence of an Emergent “Self” in Continual Robot Learning
Adidev Jhunjhunwala, Judah Goldfeder, Hod Lipson
Main category: cs.AI
Abstract: Failed to fetch summary for 2603.24350: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24350&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[251] AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models
Yunge Wen, Awu Chen, Jianing Yu, Jas Brooks, Hiroshi Ishii, Paul Pu Liang
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.01650: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01650&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[252] LLM+Graph@VLDB'2025 Workshop Summary
Yixiang Fang, Arijit Khan, Tianxing Wu, Da Yan, Shu Wang
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.02861: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02861&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[253] Efficiency of Proportional Mechanisms in Online Auto-Bidding Advertising
Nguyen Kim Thang
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.12799: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12799&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[254] SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention
Hongtao Xu, Jianchao Tan, Yuxuan Hu, Pengju Lu, Hongyu Wang, Pingwei Sun, Yerui Sun, Yuchen Xie, Xunliang Cai, Mingzhen Li, Weile Jia
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.13847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.13847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[255] CAP: Controllable Alignment Prompting for Unlearning in LLMs
Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian
Main category: cs.AI
Abstract: Failed to fetch summary for 2604.21251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[256] Spectrographic Portamento Gradient Analysis: A Quantitative Method for Historical Cello Recordings with Application to Beethoven’s Piano and Cello Sonatas, 1930–2012
Ignasi Sole
Main category: cs.SD
Abstract: Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic gradient of the portamento slide, measured in Hz/second, and demonstrates its measurement using a protocol combining Sonic Visualizer’s melodic spectrogram layer, GIMP pixel analysis, and metric calibration against the spectrogram’s known frequency axis. The gradient captures what duration alone cannot: the steepness of the pitch trajectory, which encodes the expressive character of the slide independently of its length. Applied to the opening measures of. Specifically because their monophonic texture permits reliable spectrographic pitch tracking. The method yields gradient values ranging from approximately 600 Hz/s in late-period recordings to over 4,000 Hz/s in early twentieth-century performances. The paper further documents a gain-recovery protocol that extends the analysable corpus to analogue recordings from the 1930s where portamento traces are faint in digital transfer. Applying the method to a corpus of 22 recordings spanning 1930–2012, the paper tests the hypothesis that gradient steepness correlates negatively with tempo: that slower performances produce steeper, longer slides while faster performances produce shallower slides or none at all. The results support this hypothesis, suggesting that the widely documented decline of portamento across the twentieth century is not a binary transition from presence to absence but a continuou
[257] Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations
Maximilian Wachter, Sebastian Murgul, Michael Heizmann
Main category: cs.SD
Abstract: Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model’s robustness. Finally, a qualitative analysis compares our model’s quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
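One way to picture a beat-based pre-quantization step is to map performance onset times (in seconds) onto a beat axis by interpolating between annotated beats. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `seconds_to_beats` and the sample values are hypothetical.

```python
from bisect import bisect_right


def seconds_to_beats(onsets_sec, beat_times_sec):
    """Map onset times in seconds to fractional beat positions.

    Assumptions: beat k occurs at beat_times_sec[k], at least two beats are
    annotated, and tempo is constant between adjacent beats. Onsets outside
    the annotated range are extrapolated from the nearest segment.
    """
    positions = []
    last_segment = len(beat_times_sec) - 2
    for t in onsets_sec:
        # Index of the annotated beat at or before t, clamped to a valid segment.
        k = min(max(bisect_right(beat_times_sec, t) - 1, 0), last_segment)
        start, end = beat_times_sec[k], beat_times_sec[k + 1]
        positions.append(k + (t - start) / (end - start))
    return positions


# Hypothetical performance with uneven beat spacing (rubato).
beats = [0.0, 0.5, 1.1, 1.6]
onsets = [0.0, 0.25, 0.8, 1.6]
positions = seconds_to_beats(onsets, beats)
```

After this warping, onsets live on a tempo-independent grid, which is the kind of unified time axis a quantization model could then snap to note values.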
[258] MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
Wen-Chin Huang, Erica Cooper, Tomoki Toda
Main category: cs.SD
Abstract: In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the insufficient out-of-domain (OOD) generalization ability of current SSQA models is underexplored and often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and variation in the data is a key factor for robust generalization beyond training data size.
[259] FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu, Ziyue Zhang, Ban Ma-bao, Yuqing Cai, Yongbin Yu, Renzeng Duojie, Xiangxiang Wang, Fan Gao, Cheng Huang, Nyima Tashi
Main category: cs.SD
Abstract: Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
[260] ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
Xi Chen, Wei Xue, Yike Guo
Main category: cs.SD
Abstract: Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
cs.LG
[261] Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao
Main category: cs.LG
Abstract: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution & stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
[262] Performance Anomaly Detection in Athletics: A Benchmarking System with Visual Analytics
Blessed Madukoma, Prasenjit Mitra
Main category: cs.LG
Abstract: Anti-doping programs rely on biological testing to detect performance-enhancing drugs, but such testing costs over $800 per sample and is limited by short detection windows for many prohibited substances. These constraints leave large portions of athletes without regular testing, motivating complementary screening approaches that analyze routine competition results to identify suspicious performance patterns. We present a system that processes 1.6 million athletics performances from over 19,000 competitions (2010-2025) using eight detection methods ranging from statistical rules to machine learning and trajectory analysis. We validate all methods against publicly confirmed anti-doping violations to measure their effectiveness in identifying sanctioned athletes. Trajectory-based methods, which compare performances to expected career progression, achieve the best balance between detecting violations and limiting false alarms, though all methods face challenges from incomplete data and rare confirmed violations. The system provides an interactive interface for expert-driven investigation, emphasizing transparency and human judgment to support, rather than replace, established anti-doping processes.
[263] Conditional anomaly detection using soft harmonic functions: An application to clinical alerting
Michal Valko, Hamed Valizadegan, Branislav Kveton, Gregory F. Cooper, Milos Hauskrecht
Main category: cs.LG
Abstract: Timely detection of concerning events is an important problem in clinical practice. In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response, such as the omission of an important lab test. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method in detecting unusual labels on a real-world electronic health record dataset and compare it to several baseline approaches.
[264] Multi-Task Optimization over Networks of Tasks
Julian Hatzky, Thomas Bartz-Beielstein, A. E. Eiben, Anil Yaman
Main category: cs.LG
Abstract: Multi-task optimization is a powerful approach for solving a large number of tasks in parallel. However, existing algorithms face distinct limitations: Population-based methods scale poorly and remain underexplored for large task sets. Approaches that do scale beyond a thousand tasks are mostly MAP-Elites variants and rely on a fixed, discretized archive that disregards the topology of the task space. We introduce MONET (Multi-Task Optimization over Networks of Tasks), a multi-task optimization algorithm that models the task space as a graph: tasks are nodes, and edges connect tasks in the task parameter space. This representation enables knowledge transfer between tasks and remains tractable for high-dimensional problems while exploiting the topology of the task space. MONET combines social learning, which generates candidates from neighboring nodes via crossover, with individual learning, which refines a node’s own solution independently via mutation. We evaluate MONET on four domains (archery, arm, and cartpole with 5,000 tasks each; hexapod with 2,000 tasks) and show that it matches or exceeds the performance of existing MAP-Elites-based baselines across all four domains.
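The graph view described in this abstract can be illustrated with a toy loop: tasks are nodes, edges link nearby tasks in parameter space, social learning blends a node's solution with a neighbor's, and individual learning mutates it. This is only a sketch under simplifying assumptions (one-dimensional task parameters, a k-nearest-neighbor graph, a quadratic toy objective, and greedy acceptance), and every name in it is hypothetical rather than taken from MONET.

```python
import random


def build_task_graph(task_params, k=2):
    """Connect each task to its k nearest tasks in 1-D task parameter space."""
    graph = {}
    for i, p in enumerate(task_params):
        others = sorted(
            (j for j in range(len(task_params)) if j != i),
            key=lambda j: abs(task_params[j] - p),
        )
        graph[i] = others[:k]
    return graph


def fitness(solution, task_param):
    # Toy objective: the best solution for a task equals its parameter.
    return -(solution - task_param) ** 2


def monet_like_step(solutions, task_params, graph, rng, p_social=0.5, sigma=0.1):
    """Update one random node; keep the candidate only if it improves that node."""
    i = rng.randrange(len(solutions))
    if rng.random() < p_social:
        # Social learning: blend crossover with a neighboring node's solution.
        j = rng.choice(graph[i])
        w = rng.random()
        candidate = w * solutions[i] + (1 - w) * solutions[j]
    else:
        # Individual learning: Gaussian mutation of the node's own solution.
        candidate = solutions[i] + rng.gauss(0.0, sigma)
    if fitness(candidate, task_params[i]) > fitness(solutions[i], task_params[i]):
        solutions[i] = candidate


rng = random.Random(0)
task_params = [0.0, 0.1, 0.2, 0.9, 1.0]
graph = build_task_graph(task_params)
solutions = [rng.uniform(-1.0, 2.0) for _ in task_params]
before = sum(fitness(s, t) for s, t in zip(solutions, task_params))
for _ in range(500):
    monet_like_step(solutions, task_params, graph, rng)
after = sum(fitness(s, t) for s, t in zip(solutions, task_params))
```

Because candidates are accepted only when they improve the node's own task, total fitness never decreases in this toy; neighbors help mainly when nearby tasks have similar optima.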
[265] When Quotes Crumble: Detecting Transient Mechanical Liquidity Erosion in Limit Order Books
Haohan Xu, Jason Bohne, Pawel Polak, Yurij Baransky, Ajay Alva, Violetta Fedotova, Gary Kazantsev, David Rosenberg
Main category: cs.LG
Abstract: We study the detection of transient liquidity erosion (“crumbling quotes”) in electronic limit order books, where observable quote deterioration may reflect either mechanical liquidity withdrawal or informational repricing. Using the ABIDES agent-based simulator, we construct a multi-agent environment in which crumbling emerges from stochastic regime switches in a market maker, providing time-resolved ground truth unavailable in real market data. We develop a detection pipeline that identifies mechanically driven quote erosion using order book features, and train a neural model to produce calibrated crumbling probabilities. Experiments demonstrate that the proposed framework reliably identifies crumbling events against agent-level ground truth, with the neural model achieving +36% AUC improvement over rule-based baselines and robust performance across normal, high-volatility, bull, and bear market conditions. Ablation studies on temporal features and varying the dependence structure of the ground-truth mechanism confirm that the framework generalizes across both independent and autocorrelated liquidity withdrawal dynamics.
[266] Fast Neural-Network Approximation of Active Target Search Under Uncertainty
Bilal Yousuf, Zsofia Lendek, Lucian Busoniu
Main category: cs.LG
Abstract: We address the problem of searching for an unknown number of stationary targets at unknown positions with a mobile agent. A probability hypothesis density filter is used to estimate the expected number of targets under measurement uncertainty. Existing planners, such as Active Search (AS) and its Intermittent variant (ASI), achieve accurate detection but require costly online optimization. To reduce online computation, we propose to use a convolutional neural network to approximate AS or ASI decisions through direct inference. The network is trained on AS/ASI data using a multi-channel grid that encodes target beliefs, the agent position, visitation history, and boundary information. Simulations with uniform and clustered target distributions show that the network achieves detection rates comparable to AS or ASI while reducing computation by orders of magnitude.
[267] Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Grigory Sapunov
Main category: cs.LG
Abstract: We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested – 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing – no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves’ recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 (“deep start,” p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
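The halting probabilities quoted in this abstract are consistent with a sigmoid of the router bias, and the "halts after ~2 steps" observation follows from the usual ACT rule of stopping once the cumulative halting probability reaches 1 - epsilon. The sketch below works through that arithmetic. It assumes the router emits exactly sigmoid(bias) at every step at initialization, that epsilon is 0.01, and that a bias of 1.0 is the value behind p ~ 0.73; none of these specifics are stated in the abstract, and the linked repository may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def steps_until_halt(bias, epsilon=0.01, max_steps=64):
    """Count ACT steps until cumulative halting probability reaches 1 - epsilon.

    Assumption: at initialization the router outputs sigmoid(bias) at every
    step, i.e. its learned weights contribute nothing yet.
    """
    p = sigmoid(bias)
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += p
        if cumulative >= 1.0 - epsilon:
            return step
    return max_steps


# Zero bias, an assumed positive bias of 1.0, and the "deep start" bias of -3.
for bias in (0.0, 1.0, -3.0):
    print(bias, round(sigmoid(bias), 3), steps_until_halt(bias))
```

Under these assumptions the zero and positive biases both stop after two steps, while the bias of -3 keeps tokens running for roughly twenty steps, which illustrates why a negative bias starts training from deep rather than shallow computation.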
[268] Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
João Mattos, Arlei Silva
Main category: cs.LG
Abstract: We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8 to 27 times less training time than the strongest baseline.
[269] Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
Cooper Veit
Main category: cs.LG
Abstract: Every ML kernel ships with an implicit contract about what it computes. People rarely write the contract down. When two kernels disagree – when a matmul on AMD produces a different gradient than the same matmul on NVIDIA, when a fused attention kernel silently downcasts an accumulator, when an out-of-bounds access returns zero on one stack and garbage on another – there is no formal artifact to arbitrate the dispute. Recent empirical work has measured the gap across silicon platforms, but none of it specifies the contract being violated. We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents – Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance – and show that each informal diagnosis maps to a specific contract violation with a measurable signature. A kernel contract suite is a normative reference against which conformance can be graded, in the way that ISASecure grades industrial control systems against IEC 62443.
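The eight-part contract structure listed in this abstract maps naturally onto a record type. The sketch below is purely illustrative and is not the paper's specification language: the field types, the `conforms` method, the toy summation contract, and both kernels are hypothetical. It also mimics the calibration idea by pairing one conforming kernel with one violating kernel that still passes a basic functional test.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Kernel = Callable[[Sequence[float]], float]


@dataclass(frozen=True)
class KernelContract:
    identifier: str
    scope: str
    precondition: Callable[[Sequence[float]], bool]
    postcondition: Callable[[float], bool]
    tolerance: float
    reference_oracle: Kernel
    measurement_protocol: str
    violation_signature: str

    def conforms(self, kernel: Kernel, inputs: Sequence[float]) -> bool:
        """Compare a kernel against the oracle on inputs the contract covers."""
        if not self.precondition(inputs):
            raise ValueError("inputs fall outside the contract's precondition")
        actual = kernel(inputs)
        expected = self.reference_oracle(inputs)
        return self.postcondition(actual) and abs(actual - expected) <= self.tolerance


sum_contract = KernelContract(
    identifier="toy.sum.precision",
    scope="1-D float reduction",
    precondition=lambda xs: len(xs) > 0 and all(math.isfinite(x) for x in xs),
    postcondition=math.isfinite,
    tolerance=1e-9,
    reference_oracle=math.fsum,
    measurement_protocol="absolute error against the oracle on fixed inputs",
    violation_signature="error grows with the number of small addends",
)


def plain_sum(xs):
    return sum(xs)


def coarse_sum(xs):
    # Stand-in for a silent precision loss: each addend is rounded first.
    return sum(round(x, 2) for x in xs)


small_addends = [0.001] * 1000
print(sum_contract.conforms(plain_sum, small_addends))   # conforming kernel
print(sum_contract.conforms(coarse_sum, small_addends))  # violating kernel
print(sum_contract.conforms(coarse_sum, [1.0, 2.0]))     # basic functional test
```

The violating kernel agrees with the oracle on the easy input and diverges only on many small addends, which is the kind of gap a written contract is meant to expose.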
[270] LTBs-KAN: Linear-Time B-splines Kolmogorov-Arnold Networks
Eduardo Said Merin-Martinez, Andres Mendez-Vazquez, Eduardo Rodriguez-Tello
Main category: cs.LG
Abstract: Kolmogorov-Arnold Networks (KANs) are a recent neural network architecture offering an alternative to Multilayer Perceptrons (MLPs) with improved explainability and expressibility. However, KANs are significantly slower than MLPs due to the recursive nature of B-spline function computations, limiting their application. This work addresses these issues by proposing a novel base-spline Linear-Time B-splines Kolmogorov-Arnold Network (LTBs-KAN) with linear complexity. Unlike previous methods that rely on the Boor-Mansfield-Cox spline algorithm or other computationally intensive mathematical functions, our approach significantly reduces the computational burden. We further reduce the model’s parameter count through product-of-sums matrix factorization in the forward pass without sacrificing performance. Experiments on MNIST, Fashion-MNIST and CIFAR-10 demonstrate that LTBs-KAN, when used as an architectural building block, achieves good time complexity and parameter reduction compared to other KAN implementations.
[271] AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning
Promise Ekpo, Saesha Agarwal, Felix Grimm, Lekan Molu, Angelique Taylor
Main category: cs.LG
Abstract: Fair workload enforcement in heterogeneous multi-agent systems that pursue shared objectives remains challenging. Fixed fairness penalties often introduce inefficiencies, training instability, and conflicting agent incentives. Reward-shaping approaches in fair Multi-Agent Reinforcement Learning (MARL) typically incorporate fairness through heuristic penalties or scalar reward modifications and often rely on post-hoc evaluation. However, these methods do not guarantee that a desired fairness level will be satisfied. To address this limitation, we propose the Adaptive Fairness Multi-Agent Reinforcement Learning (AdaFair-MARL) framework, which formulates workload fairness as an explicit constraint so that agents maintain balanced contributions while optimizing team performance. We present AdaFair-MARL, a constrained cooperative MARL framework whose core algorithmic component is a primal-dual update that enforces workload fairness via adaptive Lagrange multiplier updates. Grounding the framework in a cooperative Markov game, we derive the fairness constraint from Jain’s Fairness Index (JFI) geometry and show that the resulting feasible set admits a second-order cone representation, enabling principled Lagrangian dual-ascent updates without manual penalty tuning. Experiments in a simulated hospital coordination environment (MARLHospital) demonstrate the effectiveness of AdaFair-MARL compared to reward-shaping and fixed-penalty fairness methods, improving workload balance while maintaining team performance. We found that AdaFair-MARL achieves nearly perfect constraint satisfaction (0.99-1.00) while significantly improving workload fairness compared to fixed-penalty baselines.
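For readers unfamiliar with the quantities named here, Jain's Fairness Index for workloads x_1..x_n is (sum x)^2 / (n * sum x^2), and for non-negative workloads the condition JFI >= tau is equivalent to ||x||_2 <= sum(x) / sqrt(n * tau), which is a second-order cone constraint. The sketch below shows the index together with a generic projected dual-ascent update of a Lagrange multiplier. It is not the AdaFair-MARL algorithm; the function names, target, and step size are hypothetical.

```python
def jain_index(workloads):
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2), between 1/n and 1.

    Assumes non-negative workloads with at least one non-zero entry.
    """
    n = len(workloads)
    total = sum(workloads)
    squares = sum(x * x for x in workloads)
    return (total * total) / (n * squares)


def dual_ascent_step(multiplier, workloads, target, step_size):
    """Raise the multiplier when fairness is below target, otherwise relax it.

    The result is projected onto [0, inf), as in standard dual ascent.
    """
    violation = target - jain_index(workloads)
    return max(0.0, multiplier + step_size * violation)


balanced = [1.0, 1.0, 1.0, 1.0]
skewed = [4.0, 0.0, 0.0, 0.0]
print(jain_index(balanced), jain_index(skewed))
print(dual_ascent_step(0.0, skewed, target=0.9, step_size=0.5))
print(dual_ascent_step(0.0, balanced, target=0.9, step_size=0.5))
```

The multiplier grows only while the constraint is violated, so the fairness pressure adapts during training instead of being fixed in advance as a penalty weight.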
[272] LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric
Main category: cs.LG
Abstract: Transformers mostly rely on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
[273] Learning Coverage- and Power-Optimal Transmitter Placement from Building Maps: A Comparative Study of Direct and Indirect Neural Approaches
Çağkan Yapar
Main category: cs.LG
Abstract: Optimal wireless transmitter placement is a central task in radio-network planning, yet exhaustive search becomes prohibitively expensive at scale. This paper studies the single-transmitter setting under a fixed learned propagation surrogate, where exhaustive per-pixel evaluation remains tractable and provides surrogate-exact ground truth. We introduce a dataset of 167,525 urban scenarios (RadioMapSeer-Deployment) with dual surrogate-exact labels for coverage-optimal and power-optimal transmitter locations. Ground-truth analysis reveals an asymmetric coverage-power trade-off: coverage-optimal placement sacrifices 13.86% of received power, whereas power-optimal placement sacrifices only 5.50% of coverage; the best achievable balanced placement lies at $\bar{d}=2.60$ from the ideal point (100%,100%). We evaluate two learning formulations: indirect heatmap-based models that predict received-power radio maps, and direct score-map models that predict the objective landscape over feasible transmitter locations. Within the heatmap family, discriminative models deliver one-shot predictions 1350-2400x faster than exhaustive search, while diffusion models additionally support multi-sample inference that improves single-objective performance and, by reusing the same sample pool under a balanced criterion, recovers strong balanced placements without explicit multi-objective training. Dual score-map strategies combining power and coverage score maps match the exhaustive balanced optimum ($\bar{d}=2.60$) and remain close across smaller candidate budgets, at 14-22x speedups after candidate re-evaluation. Both formulations admit very fast one-shot inference; on this benchmark, dual score-map methods are strongest for balanced placement, whereas heatmap formulations remain attractive for their physically meaningful intermediate maps and, in the diffusion setting, for inference-time search.
[274] Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Shevya Pandya, Shinjini Bose, Ananya Joshi
Main category: cs.LG
Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for auditing the reliability of downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores; hospitalization risk scoring is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.
[275] PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang
Main category: cs.LG
Abstract: Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
[276] Insect-inspired modular architectures as inductive biases for reinforcement learning
Anne E. Staples
Main category: cs.LG
Abstract: Most reinforcement-learning (RL) controllers used in continuous control are architecturally centralized: observations are compressed into a single latent state from which both value estimates and actions are produced. Biological control systems are often organized differently. Insects, in particular, coordinate navigation, heading stabilization, memory, and context-dependent action selection through distributed circuits rather than a single monolithic controller. Motivated by this contrast, we study an RL policy architecture that decomposes control into interacting modules for sensory encoding, heading representation, sparse associative memory, recurrent command generation, and local motor control, with a learned arbitration mechanism that allocates motor authority across modules. The model is evaluated on a two-dimensional navigation task that requires simultaneous food seeking, obstacle avoidance, and predator escape. In a six-seed predator-navigation experiment trained with Proximal Policy Optimization (PPO) for 75 updates, the modular policy achieves the strongest final mean performance among the tested controllers, with final episodic return $-2798.8\pm964.4$ versus $-3778.0\pm628.1$ for a centralized gated recurrent unit (GRU) and $-4727.5\pm772.5$ for a centralized multilayer perceptron (MLP). The modular policy also attains the lowest final value loss and stable PPO optimization statistics while driving module-assignment entropy to $0.0457\pm0.0244$, indicating highly selective control allocation. These results suggest that distributed control can serve as a useful inductive bias for RL problems involving dynamically competing behavioral objectives.
[277] Removing Sandbagging in LLMs by Training with Weak Supervision
Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar
Main category: cs.LG
Abstract: As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training elicit a model’s best work even without reliable verification? We study this using model organisms trained to sandbag, testing elicitation techniques on problem-solving math, graduate-level science, and competitive coding tasks. We find that training with weak supervision can reliably elicit sandbagging models when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined: SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance. Neither method succeeds reliably alone: RL without SFT almost always leads to reward hacking rather than genuine improvement. Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward. Our results provide initial evidence that training is a viable mitigation against sandbagging, while highlighting the importance of making training indistinguishable from deployment.
[278] Generating Synthetic Malware Samples Using Generative AI
Tiffany Bao, Kylie Trousil, Quang Duy Tran, Fabio Di Troia, Younghee Park
Main category: cs.LG
Abstract: Malware attacks have a significant negative impact on organizations of varied scales in the field of cybersecurity. Recently, malware researchers have increasingly turned to machine learning techniques to combat sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is further compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences, leveraging natural language processing to extract contextual meaning behind malware opcode features to aid the learning of the generative AI (GenAI) models employed in this paper: Generative Adversarial Networks (GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), and a modified Diffusion model. The experimental results show that augmenting training data with Diffusion-based synthetic data significantly improves classification performance for minor classes by up to 60% on average. This enhancement ultimately leads to an overall malware classification performance of 96%, an 8% improvement. These findings demonstrate the high quality and fidelity of the synthetic data, its robustness, and its potential applications in malware analysis. Specifically, synthetic malware data proves effective in improving the classification of minor malware classes and detection rates, even though the size of known malware data is significantly small.
[279] Assessing the impact of dimensionality reduction on clustering performance – a systematic study
Ousmane Assani Amate, Mohammadreza Bakhtyari, Émilie Roy, Vladimir Makarenkov
Main category: cs.LG
Abstract: Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results without and with dimensionality reduction at different reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of a careful selection of the dimensionality reduction technique and the dimensionality reduction level that should be tailored to intrinsic data geometry and clustering algorithms under consideration.
[280] Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
Mahdi Kallel, Johannes Tölle, Ahmed Hendawy, Carlo D’Eramo
Main category: cs.LG
Abstract: Standard supervised classification trains models to imitate the exact labels provided by a perfect oracle. This imitation happens in a single pass, restricting the model to a fixed compute budget even when inputs vary in complexity. Moreover, the rigid training objective forces the model to express absolute certainty on its training data, resulting in overconfident predictions during evaluation. We propose Reinforced Iterative Classification (RIC), which replaces the imitative objective with Reinforcement Learning (RL). RIC deploys a recurrent agent that iteratively updates a predictive distribution over classes, receiving reward for stepwise improvement in prediction quality. The value function provides a natural halting criterion by estimating the remaining scope for improvement. We prove that the iterative formulation recovers the same optimal predictions as cross-entropy while yielding an anytime classifier. On image classification benchmarks, RIC matches the accuracy of supervised baselines with improved calibration and learns to allocate computation adaptively across inputs.
[281] PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training
Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
Main category: cs.LG
Abstract: Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.
[282] Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
Meghana Karnam, Ananya Joshi
Main category: cs.LG
Abstract: Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
[283] Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Weiqiu You, Cassandra Goldberg, Amin Madani, Daniel A. Hashimoto, Eric Wong
Main category: cs.LG
Abstract: Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks. Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples. Results: Sum-of-Checks improves average frame-level mean average precision by 12–14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems. Code is available at https://github.com/BrachioLab/SumOfChecks.
[284] Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
Seoungbin Bae, Dabeen Lee
Main category: cs.LG
Abstract: We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, such as strict positivity of the minimum eigenvalue of a context covariance matrix. These assumptions, however, impose strong restrictions on the context process, as they rule out the situation where the context vectors are concentrated in a low-dimensional subspace. In this paper, we propose SupSplitLog, which, to the best of our knowledge, is the first algorithm for logistic bandits that achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret without any context diversity assumption. The key idea is to split the collected samples into two disjoint subsets when constructing estimators; one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. Moreover, SupSplitLog strictly improves on the existing algorithms in terms of the dependence on dimension $d$ in the regret upper bound. Furthermore, SupSplitLog can be adapted simply to deduce a regret bound that grows with a data-dependent complexity measure, avoiding a direct dependence on $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. We also provide experimental results that demonstrate numerically the superiority of our algorithm, validating the theoretical results.
[285] Estimating Tail Risks in Language Model Output Distributions
Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He
Main category: cs.LG
Abstract: Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate the probability of harmful outputs on the order of $10^{-4}$ with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
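The importance-sampling idea in this abstract can be illustrated with a toy sketch. This is not the authors' method for building unsafe model variants: it assumes a two-outcome "model" whose output probabilities are known exactly, and a hypothetical proposal distribution that makes the harmful outcome far more likely. All names and probabilities below are invented for illustration.

```python
import random


def importance_sampling_estimate(p_target, p_proposal, is_harmful, n_samples, rng):
    """Estimate P_target(harmful) from samples drawn under p_proposal.

    Each harmful sample is reweighted by p_target / p_proposal so the
    estimate stays unbiased for the target distribution.
    """
    outcomes = list(p_proposal)
    sampling_weights = [p_proposal[o] for o in outcomes]
    total = 0.0
    for _ in range(n_samples):
        outcome = rng.choices(outcomes, weights=sampling_weights, k=1)[0]
        if is_harmful(outcome):
            total += p_target[outcome] / p_proposal[outcome]
    return total / n_samples


# Toy distributions (assumed values): harmful outputs are rare under the target
# and common under the proposal.
p_target = {"safe": 1 - 1e-4, "harmful": 1e-4}
p_proposal = {"safe": 0.8, "harmful": 0.2}

rng = random.Random(0)  # fixed seed for reproducibility
estimate = importance_sampling_estimate(
    p_target, p_proposal, lambda o: o == "harmful", n_samples=500, rng=rng
)
```

With 500 draws from the target itself, a harmful output would be seen about 0.05 times on average, so a brute-force estimate would usually be zero. Sampling from the proposal yields roughly 100 harmful draws, each carrying a small weight, which is why the reweighted estimate lands near the true value.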
[286] Optimal sequential decision-making for error propagation mitigation in digital twins
Annice Najafi, Shokoufeh Mirzaei
Main category: cs.LG
Abstract: Here, we explore the problem of error propagation mitigation in modular digital twins as a sequential decision process. Building on a companion study that used a Hidden Markov Model (HMM) to infer latent error regimes from surrogate-physics residuals, we develop a Markov Decision Process (MDP) in which the inferred regimes serve as states, corrective interventions serve as actions, and a scalar reward captures the cost-benefit tradeoff between system fidelity and maintenance expense. The baseline transition matrix is extracted from the HMM-learned parameters. We then extend the formulation to a Partially Observable MDP (POMDP) that accounts for the imperfect nature of regime classification by maintaining a belief distribution updated via Bayesian filtering, with the HMM confusion matrix serving as the observation model. Both formulations are solved via dynamic programming and validated through Gillespie stochastic simulation. We then benchmark two model-free reinforcement learning algorithms, Q-learning and REINFORCE, to assess whether effective policies can be learned without explicit model knowledge. A systematic comparison of different intervention policies demonstrates that the MDP policy achieves the highest cumulative reward and fraction of time in nominal operation, while the POMDP recovers approximately 95% of MDP performance under realistic observation noise. Sensitivity analyses across observation quality, repair probability, and discount factor confirm the robustness of these conclusions, and the major gaps in the policy hierarchy are statistically significant at $p < 0.001$. The gap between MDP and POMDP performance quantifies the value of information, providing a principled criterion for investing in improved classification accuracy.
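The Bayesian filtering step mentioned in this abstract can be sketched as a standard discrete belief update, with a confusion matrix used as the observation model. The two-regime transition and confusion matrices below are made-up numbers, and the sketch omits the dependence on the chosen action; none of it is taken from the paper.

```python
def belief_update(belief, transition, confusion, observed):
    """One Bayesian filtering step over discrete regimes.

    belief[s]        : current probability of regime s
    transition[s][t] : probability of moving from regime s to regime t
    confusion[t][o]  : probability of classifying regime t as o
    observed         : index of the regime reported by the classifier
    """
    n = len(belief)
    # Predict: push the belief through the transition matrix.
    predicted = [sum(belief[s] * transition[s][t] for s in range(n)) for t in range(n)]
    # Correct: weight each regime by the likelihood of the observation.
    unnormalized = [predicted[t] * confusion[t][observed] for t in range(n)]
    norm = sum(unnormalized)
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / norm for u in unnormalized]


# Hypothetical two-regime example: 0 = nominal, 1 = degraded.
transition = [[0.9, 0.1], [0.2, 0.8]]
confusion = [[0.8, 0.2], [0.3, 0.7]]
posterior = belief_update([0.5, 0.5], transition, confusion, observed=1)
```

Starting from a uniform belief, a "degraded" reading moves the belief to roughly 0.26 nominal and 0.74 degraded. The posterior is less extreme than the raw classifier output because the confusion matrix discounts an imperfect observation.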
[287] ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu
Main category: cs.LG
Abstract: Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline’s target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.
[288] Sharpness-Aware Poisoning: Enhancing Transferability of Injective Attacks on Recommender Systems
Junsong Xie, Yonghui Yang, Pengyang Shao, Le Wu
Main category: cs.LG
Abstract: Recommender Systems (RS) have been shown to be vulnerable to injective attacks, where attackers inject limited fake user profiles to promote the exposure of target items to real users for unethical gains (e.g., economic or political advantages). Since attackers typically lack knowledge of the victim model deployed in the target RS, existing methods resort to using a fixed surrogate model to mimic the potential victim model. Despite considerable progress, we argue that the assumption that poisoned data generated for the surrogate model can be used to attack other victim models is wishful. When there are significant structural discrepancies between the surrogate and victim models, the attack transferability inevitably suffers. Intuitively, if we can identify the worst-case victim model and iteratively optimize the poisoning effect specifically against it, then the generated poisoned data would be better transferred to other victim models. However, exactly identifying the worst-case victim model during the attack process is challenging due to the large space of victim models. To this end, in this work, we propose a novel attack method called Sharpness-Aware Poisoning (SharpAP). Specifically, it employs the sharpness-aware minimization principle to seek the approximately worst-case victim model and optimizes the poisoned data specifically for this worst-case model. The poisoning attack with SharpAP is formulated as a min-max-min tri-level optimization problem. By integrating SharpAP into the iterative process for attacks, our method can generate more robust poisoned data which is less sensitive to the shift of model structure, mitigating the overfitting to the surrogate model. Comprehensive experimental comparisons on three real-world datasets demonstrate that SharpAP can significantly enhance the attack transferability.
[289] Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang
Main category: cs.LG
Abstract: One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent-conditioned one-step actor trained with top-1 dynamic routing. For each state, the actor samples $K$ candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner with Behavior Cloning and critic guidance. Because the routing is recomputed from the current candidate geometry, ownership of a supported region can shift across candidates over the course of learning. This gives a one-step actor room to make local improvements that pointwise extraction struggles to capture, while retaining single-pass inference at test time. On OGBench and D4RL, DROL is competitive with the one-step FQL baseline, improving many OGBench task groups while remaining strong on both AntMaze and Adroit. Project page: https://muzhancun.github.io/preprints/DROL.
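The top-1 routing step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `route_top1` and `routed_loss`, the squared-error form of the Behavior Cloning term, the `q_values` list standing in for a critic, and the `q_weight` trade-off are all assumptions made for the example.

```python
def route_top1(candidates, dataset_action):
    """Return (winner_index, squared_distance) of the candidate nearest to the dataset action."""
    best_idx, best_d2 = -1, float("inf")
    for k, cand in enumerate(candidates):
        d2 = sum((c - a) ** 2 for c, a in zip(cand, dataset_action))
        if d2 < best_d2:
            best_idx, best_d2 = k, d2
    return best_idx, best_d2


def routed_loss(candidates, dataset_action, q_values, q_weight=1.0):
    """Loss for the winning candidate only (hypothetical form).

    The squared distance acts as a Behavior Cloning term and the critic value
    of the winner is subtracted as guidance. Non-winning candidates receive
    no loss, so they would get no gradient from this dataset action.
    """
    k, d2 = route_top1(candidates, dataset_action)
    return k, d2 - q_weight * q_values[k]
```

Because the winner is recomputed from the current candidates on every call, the candidate that owns a given dataset action can change as the candidates move, which is the property the abstract attributes to dynamic routing.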
[290] Protect the Brain When Treating the Heart: A Convolutional Neural Network for Detecting Emboli
Andrea Angino, Ken Trotti, Diego Ulisse Pizzagalli, Rolf Krause, Tiziano Torre, Stefanos Demertzis
Main category: cs.LG
Abstract: Gaseous microemboli (GME) represent a common complication of cardiac structural interventions across both surgical and transcatheter approaches. Transthoracic cardiac ultrasound imaging represents a convenient methodology to visualize the presence of circulating GME. However, their detection and quantification are far from trivial due to operator-dependent view, high velocity, and objects with similar structure in the background. Here, we propose an approach based on a 2.5D U-Net architecture to segment GME in space-time connected data. Such an approach yields robust detection against the background and high segmentation accuracy while retaining real-time execution speed. These properties facilitated the integration of the proposed pipeline into patient-monitoring surgical protocols, providing the quantification of GME area over time.
[291] How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw
Main category: cs.LG
Abstract: Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at the token immediately following the answer (the post-answer newline, PANL) that causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct, where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error detection behavior when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
[292] A Brain-Inspired Deep Separation Network for Single Channel Raman Spectra Unmixing
Gaoruishu Long, Jinchao Liu, Bo Liu, Jie Liu, Xiaolin Hu
Main category: cs.LG
Abstract: Raman spectra obtained in real-world applications are often a noisy combination of several spectra of various substances in a tested sample. Unmixing such spectra into individual components corresponding to each of the substances is of great value and has been a longstanding challenge in Raman spectroscopy. Existing unmixing methods are predominantly designed to invert an overdetermined mixed model and therefore require multiple mixed spectra as input. However, open-domain and/or non-cooperative detection applications in Raman spectroscopy, such as controlled substance detection, call for single-channel solutions which can identify individual components from thousands of candidates by analyzing only a single noisy mixed spectrum. To our knowledge, sparse regression is the only existing solution which can cope with this scenario, yet it has very low tolerance to noise and is hardly applicable in practice. To address these limitations, we introduce a novel neural approach for single-channel Raman spectrum unmixing inspired by speech separation. It aims at solving underdetermined systems and can decompose a noisy mixed spectrum from a library of thousands of components (substances). The core of our method is a deep separation neural network (RSSNet) which takes a mixed spectrum as input and outputs spectra of pure components. We created two synthetic datasets of single-channel Raman spectra unmixing and demonstrated the feasibility and superiority of RSSNet on these datasets (outperforming competing methods by more than 4 dB). Furthermore, we verified that RSSNet, trained solely on synthetic data, can successfully unmix real-world mixed spectra of mixtures of mineral powders, exhibiting strong generalization. Our approach represents a new paradigm for Raman unmixing and enables new possibilities for fast detection of Raman mixtures.
[293] FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
Marco Obermeier, Marco Pruckner, Florian Haselbeck, Andreas Zeiselmair
Main category: cs.LG
Abstract: Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.
[294] TabSCM: A practical Framework for Generating Realistic Tabular Data
Sven Jacob, Bardh Prenkaj, Weijia Shao, Gjergji Kasneci
Main category: cs.LG
Abstract: Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves those causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. These assignments are learned with conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets spanning healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583$\times$ faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
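Ancestral sampling over a DAG, as mentioned in this abstract, can be illustrated with a toy structural causal model. The sketch below is only an illustration under stated assumptions: the three-node graph, the column names, and the hand-written mechanisms are invented for the example and stand in for the KDE marginals, conditional diffusion models, and gradient-boosted trees the abstract names. The `do` argument shows a simple intervention, which is a weaker operation than the exact counterfactual queries the abstract describes.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG: each node maps to the set of its parents.
PARENTS = {
    "age": set(),
    "income": {"age"},
    "loan_approved": {"income"},
}

# Stand-in structural assignments: each takes the parent values and an RNG.
MECHANISMS = {
    "age": lambda pa, rng: rng.randint(18, 80),
    "income": lambda pa, rng: 1000.0 * pa["age"] + rng.gauss(0.0, 5000.0),
    "loan_approved": lambda pa, rng: pa["income"] > 40000.0,
}


def ancestral_sample(parents, mechanisms, rng, do=None):
    """Sample one record by visiting nodes in topological order.

    Nodes listed in `do` are clamped to the given value instead of being
    sampled, which cuts their dependence on their parents.
    """
    do = do or {}
    record = {}
    for node in TopologicalSorter(parents).static_order():
        if node in do:
            record[node] = do[node]
        else:
            pa = {p: record[p] for p in parents[node]}
            record[node] = mechanisms[node](pa, rng)
    return record
```

Calling `ancestral_sample(PARENTS, MECHANISMS, random.Random(0))` draws one observational record, and passing `do={"income": 90000.0}` draws a record under that intervention. Every child is generated only after its parents, so a child's value always agrees with the parent values in the same record.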
[295] A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
Nanae Aratake, Taisei Tosaki, Yuji Okamoto, Eiichiro Uchino, Masaki Nakamura, Nobutomo Matsui, Akiko Hatakama, Yasushi Okuno
Main category: cs.LG
Abstract: Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation model paradigm. However, for structured medical data, characterized by a limited vocabulary and sparse observations, whether increasing model size consistently improves downstream predictions is unclear, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M-101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M-101M), whereas medication prediction saturated at 11M, reducing pretraining time by 178 h. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine baseline in the area under the precision-recall curve. These findings indicate that, unlike the monotonically decreasing pretraining loss, the optimal model size varied depending on task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.
[296] SOC-ICNN: From Polyhedral to Conic Geometry for Learning Convex Surrogate Functions
Kang Liu, Jianchen Hu
Main category: cs.LG
Abstract: Classical ReLU-based Input Convex Neural Networks (ICNNs) are equivalent to the optimal value functions of Linear Programming (LP). This intrinsic structural equivalence restricts their representational capacity to piecewise-linear polyhedral functions. To overcome this representational bottleneck, we propose the SOC-ICNN, an architecture that generalizes the underlying optimization class from LP to Second-Order Cone Programming (SOCP). By explicitly injecting positive semi-definite curvature and Euclidean norm-based conic primitives, our formulation introduces native smooth curvature into the representation while preserving a rigorous optimization-theoretic interpretation. We formally prove that SOC-ICNNs strictly expand the representational space of ReLU-ICNNs without increasing the asymptotic order of forward-pass complexity. Extensive experiments demonstrate that SOC-ICNN substantially improves function approximation, while delivering competitive downstream decision quality. The code is available at https://github.com/Kanyooo/SOC-ICNN.
[297] Revisiting Neural Activation Coverage for Uncertainty Estimation
Benedikt Franke, Nils Förster, Frank Köster, Asja Fischer, Markus Lange, Arne Raulf
Main category: cs.LG
Abstract: Neural activation coverage (NAC) is a recently proposed technique for out-of-distribution detection and generalization. We build upon this promising foundation and extend the method to work as an uncertainty estimation technique for already-trained artificial neural networks in the domain of regression. Our experiments confirm NAC uncertainty scores to be more meaningful than those of other techniques, e.g. Monte-Carlo Dropout.
[298] Robust Fuzzy local k-plane clustering with mixture distance of hinge loss and L1 norm
Junjun Huang, Xiliang Lu, Xuelin Xie, Jerry Zhijian Yang
Main category: cs.LG
Abstract: K-plane clustering (KPC), hyperplane clustering, and mixture regression all essentially fall within the same class of problems, which can be conceptualized as clustering in relatively high-dimensional K subspaces or K linear manifolds. Traditional KPC and fuzzy KPC models are highly susceptible to outliers because they measure the projection distance between data points and the plane normal vector with the L2 distance. Meanwhile, the assumption of infinitely extending clusters adversely affects clustering performance. To solve these problems, this paper proposes a new robust fuzzy local k-plane clustering (RFLkPC) method that combines the mixture distance of hinge loss and L1 norm. The RFLkPC model assumes that each plane cluster is bounded to a finite area, which lets it handle plane clustering tasks flexibly and robustly, with or without outliers. The corresponding model and optimization algorithms of RFLkPC are provided. Extensive experiments on simulated and real data verify the efficiency of RFLkPC compared with related models. The source code for the proposed RFLkPC method is publicly available at https://github.com/xuelin-xie/RFLkPC.
[299] Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song
Main category: cs.LG
Abstract: Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5–12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling is worse than vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam’s second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
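One simple way to see how a 1/(1-alpha) factor can arise is a scalar Adam update in which the upstream method shrinks the gradient along an old direction by (1-alpha). The sketch below is an assumption-laden toy, not the paper's experiment: the constant gradient `g`, the helper names `adam_step` and `mean_step`, and the reading of "shared" versus "decoupled" routing as which gradient feeds the second moment are all choices made for this illustration, and the overlap-aware adaptive strength is not modelled.

```python
import math


def adam_step(m, v, g_first, g_second, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update; g_first feeds the first moment, g_second the second."""
    m = b1 * m + (1 - b1) * g_first
    v = b2 * v + (1 - b2) * g_second ** 2
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    return m, v, lr * m_hat / (math.sqrt(v_hat) + eps)


def mean_step(alpha, decoupled, steps=200, g=0.5):
    """Average update size along one direction under a constant raw gradient g."""
    m = v = 0.0
    total = 0.0
    for t in range(1, steps + 1):
        g_mod = (1 - alpha) * g  # gradient after the upstream modification
        # Shared routing: the modified gradient feeds both moments.
        # Decoupled routing: the second moment keeps the raw gradient.
        g_second = g if decoupled else g_mod
        m, v, step = adam_step(m, v, g_mod, g_second, t)
        total += step
    return total / steps
```

In this toy, shared routing lets Adam's normalisation cancel the (1-alpha) shrinkage, so the step stays near `lr`, while decoupled routing keeps the step near `(1-alpha) * lr`. The ratio `mean_step(alpha, False) / mean_step(alpha, True)` therefore comes out close to 1/(1-alpha).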
[300] Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control
Qinhan Hou, Jing Tang
Main category: cs.LG
Abstract: Graph Transformers can mix information globally, but this flexibility also creates failure modes: some tasks require long-range communication while others are better served by local interaction. We study this through a synthetic node-classification benchmark on contextual stochastic block model graphs, where labels are generated by a controllable mixture of local and far-shell signals. We define distance-misaligned training as a mismatch between where label-relevant information lies and where the model allocates communication over graph distance. On this benchmark, we report three findings. First, the preferred graph-distance bias changes systematically with task locality. Second, an oracle adaptive controller, given offline access to the task-side distance target, nearly matches the best fixed bias across regimes and strongly improves over a neutral baseline on mixed and local tasks. Third, a task-agnostic zero-gap controller is weaker, indicating that adaptation alone is not enough and that the control target matters. These results suggest that distance-resolved diagnosis is useful for understanding Graph Transformer failures and for designing graph-aware control.
[301] From Local to Cluster: A Unified Framework for Causal Discovery with Latent Variables
Zongyu Li
Main category: cs.LG
Abstract: Latent variables pose a fundamental challenge to causal discovery and inference. Conventional local methods focus on direct neighbors but fail to provide macro level insights. Cluster level methods enable macro causal reasoning but either assume clusters are known a priori or require causal sufficiency. Moreover, directly applying single variable causal discovery methods to cluster level problems violates causal sufficiency and leads to incorrect results. To overcome these limitations, this paper proposes L2C (Local to Cluster Causal Abstraction), a unified framework that bridges local structure learning and cluster level causal discovery. Unlike prior work that requires a complete manual assignment of micro variables to clusters, L2C discovers the partition automatically from local causal patterns. Our solution leverages a cluster reduction theorem to reduce any cluster to at most three nodes without loss of causal information, applies local causal discovery to identify direct causes, effects, and V structures in the presence of latent variables, and performs macro level causal inference via cluster level calculus on the learned cluster graph. L2C does not assume causal sufficiency, as latent variables are handled through local discovery. Theoretical analysis shows that L2C ensures soundness, atomic completeness, and computational efficiency. Extensive experiments on synthetic and real world data demonstrate that L2C accurately recovers ground truth clusters and achieves superior macro causal effect identification compared to existing baselines.
[302] Beyond Land Surface Temperature: Explainable Spatial Machine Learning Reveals Urban Morphology Effects on Human-Centric Heat Stress
Yuan Wang, Shengao Yi, Xiaojiang Li, Pengyuan Liu, Zhiwei Yang, Ronita Bardhan, Rudi Stouffs
Main category: cs.LG
Abstract: Heat exposure connects the built environment and public health, directly shaping the livability and sustainability of urban areas. Understanding the spatial heterogeneity of heat exposure and its drivers is vital for climate-adaptive urban planning. However, most planning-oriented studies rely on land surface temperature (LST), and whether LST adequately represents human heat exposure and how it differs from physiologically relevant heat stress remain insufficiently examined. Here, adopting Landsat-retrieved 30-m LST and GPU-accelerated 1-m universal thermal climate index (UTCI) in Singapore, this study establishes a comprehensive “Modeling-Comparing-Assessing” framework to systematically evaluate the spatial and mechanistic discrepancies between the two metrics. We further investigate pronounced non-stationary and threshold-based quantitative relationships of the two metrics with urban factors by employing a novel geographically weighted XGBoost (GW-XGBoost) and generalized additive model (GAM) workflow. Our results demonstrate notable discrepancies in spatial patterns of LST and UTCI, along with substantial spatial heterogeneity in how 2D and 3D urban factors impact these two thermal metrics, as revealed by explainable GW-XGBoost models (global out-of-bag R² = 0.855 for LST and 0.905 for UTCI). Crucially, spatially explicit SHAP analysis indicates that sky view factor plays a central role in explaining UTCI variability but exhibits a comparatively marginal independent contribution to LST, indicating that LST inadequately captures shading-driven and radiative processes governing actual human heat stress. Notably, SHAP-GAM analysis indicates that higher albedo is associated with increased UTCI. These novel findings provide evidence for integrating physiologically relevant thermal indices to inform targeted heat risk management and climate-adaptive urban planning.
[303] HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
Abhinaba Basu
Main category: cs.LG
Abstract: We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M ≪ n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures: a Jamba-style hybrid and a 12-layer Transformer; retrofit into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs for routing fingerprints, a score head selects top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 in matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (post council-causal fix); approximately 3 PPL worse than Jamba’s 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak we found in adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) reveals M=8-14 as the reliably-converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. Companion paper arXiv:2603.20997 (Basu, 2026) defines the routing diagnostic task. Code and scripts will be released.
[304] FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records
Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu
Main category: cs.LG
Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
[305] Towards Adaptive Continual Model Merging via Manifold-Aware Expert Evolution
Haiyun Qiu, Xingyu Wu, Kay Chen Tan
Main category: cs.LG
Abstract: Continual Model Merging (CMM) sequentially integrates task-specific models into a unified architecture without intensive retraining. However, existing CMM methods are hindered by a fundamental saturation-redundancy dilemma: backbone-centric approaches face parameter saturation and representation interference within fixed capacities, whereas Mixture-of-Experts (MoE) variants resort to indiscriminate expansion, incurring expert redundancy and a routing bottleneck reliant on additional data-driven optimization. To resolve these challenges, we propose MADE-IT (Manifold-Aware Dynamic Expert Evolution and Implicit rouTing), an adaptive CMM method that orchestrates expert management and activation by grounding intrinsic expert representations in manifold geometry. We introduce a projection-based subspace affinity metric coupled with a distribution-aware adaptive threshold mechanism to guide autonomous expert evolution, harmonizing diversity with architectural parsimony. Furthermore, to bypass parameterized gating networks, we design a data-free and training-free implicit routing mechanism that activates experts via feature-subspace alignment. Extensive experiments demonstrate that MADE-IT consistently outperforms strong baselines in accuracy and robustness across long-horizon and shuffled task sequences, while significantly pruning redundant experts, particularly within generic modules and early layers.
[306] On the Properties of Feature Attribution for Supervised Contrastive Learning
Leonardo Arrighi, Julia Eva Belloni, Aurélie Gallet, Ivan Gentile, Matteo Lippi, Marco Zullich
Main category: cs.LG
Abstract: Most Neural Networks (NNs) for classification are trained using Cross-Entropy (CE) as a loss function. This approach requires the model to have an explicit classification layer. However, there exist alternative approaches, such as Contrastive Learning (CL). Instead of performing classification explicitly, CL has the NN produce an embedding space where projections of similar data are pulled together, while projections of dissimilar data are pushed apart. In the case of Supervised CL (SCL), labels are adopted as similarity criteria, thus creating an embedding space where the projected data points are well-clustered. SCL provides crucial advantages over CE with regard to adversarial robustness and out-of-distribution detection, thus making it a more natural choice in safety-critical scenarios. In the present paper, we empirically show that NNs for image classification trained with SCL present higher-quality feature attribution explanations than CL with regard to faithfulness, complexity, and continuity. These results reinforce previous findings about CL-based approaches when targeting more trustworthy and transparent NNs and can guide practitioners in the selection of training objectives targeting not only accuracy, but also transparency of the models.
[307] Deep Learning for Model Calibration in Simulation of Itaconic Acid Production
Daria Fokina, Marco Baldan, Constantin Romankiewicz, Wolfgang Laudensack, Roland Ulber, Michael Bortz
Main category: cs.LG
Abstract: In this study, deep learning is used to estimate kinetic parameters for modeling itaconic acid production based on real batch experiments conducted at different agitation speeds and reactor scales. Two deep learning strategies, namely direct deep learning (DDL) and generative conditional flow matching (CFM) are compared and benchmarked against nonlinear regression as a reference method. Compared with DDL, CFM consistently yields more accurate results. The concentration profiles predicted by CFM closely match those obtained from nonlinear regression, whereas DDL results in larger deviations. Similar behavior is observed in the scale-up experiments, where the CFM model again generalizes better and is more robust than the direct approach. These findings demonstrate that CFM can reliably predict system behavior across different operating conditions and scales, offering a flexible and data-efficient framework for parameter estimation in dynamic bioprocess models.
[308] Decoding High-Dimensional Finger Motion from EMG Using Riemannian Features and RNNs
Martin Colot, Cédric Simar, Guy Cheron, Ana Maria Cebolla Alvarez, Gianluca Bontempi
Main category: cs.LG
Abstract: Continuous estimation of high-dimensional finger kinematics from forearm surface electromyography (EMG) could enable natural control for hand prostheses, AR/XR interfaces, and teleoperation. However, the complexity of human hand gestures and the entanglement of forearm muscles make accurate recognition intrinsically challenging. Existing approaches typically reduce task complexity by relying on classification-based machine learning, limiting the controllable degrees of freedom and compromising on natural interaction. We present an end-to-end framework for continuous EMG-to-kinematics regression using only consumer-grade hardware. The framework combines an 8-channel EMG armband, a single webcam, and an automatic synchronization procedure, enabling the collection of the EMG Finger-Kinematics dataset (EMG-FK), a 10-h dataset of synchronized EMG and 15 finger joint angles from 20 participants performing rich, unconstrained right-hand motions. We also introduce the Temporal Riemannian Regressor (TRR), a lightweight GRU-based model that uses sequences of multi-band Riemannian covariance features to decode finger motion. Across EMG-FK and the public emg2pose benchmark, TRR outperforms state-of-the-art methods in both intra- and cross-subject evaluation. On EMG-FK, it reaches an average absolute error of $9.79^\circ \pm 1.48^\circ$ in intra-subject and $16.71^\circ \pm 3.97^\circ$ in cross-subject evaluation. Finally, we demonstrate real-time deployment on a Raspberry Pi 5 and intuitive control of a robotic hand; TRR runs at nearly 10 predictions/s and is roughly an order of magnitude faster than state-of-the-art approaches. Together, these contributions lower the barrier to reproducible, real-time EMG-based decoding of high-dimensional finger motion, and pave the way toward more natural and intuitive control of embedded EMG-based systems.
[309] Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Hillary Mutisya, John Mugane
Main category: cs.LG
Abstract: We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k’- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
[310] An Integrated Framework for Explainable, Fair, and Observable Hospital Readmission Prediction: Development and Validation on MIMIC-IV
Isaac Tosin Adisa
Main category: cs.LG
Abstract: Objective: To propose and retrospectively validate an integrated framework addressing three barriers to clinical translation of readmission prediction: lack of explainability, absence of deployment reliability infrastructure, and inadequate demographic fairness evaluation. Materials and Methods: We constructed a cohort of 415,231 adult admissions from the MIMIC-IV database (30-day readmission prevalence 18.0%), split 70/15/15. Logistic regression, XGBoost, and LightGBM models were trained on 26 features. SHAP provided per-patient explanations. Fairness was evaluated across 16 subgroups using AUC-ROC, false negative rate (FNR), and positive predictive value (PPV). Calibration was assessed using Brier scores and calibration curves. Results: XGBoost achieved AUC-ROC 0.696 (95% CI 0.691-0.701), outperforming or matching the LACE baseline (AUC 0.60-0.68). LightGBM achieved the best calibration (Brier 0.146). Prior admissions were the dominant predictor. All subgroups met equity thresholds (delta AUC <= 0.05, delta FNR <= 0.10). Conclusion: This framework delivers competitive performance, clinically actionable explanations, and strong demographic equity. Code is publicly available at https://github.com/Tomisin92/readmission-prediction.
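To make the fairness and calibration quantities named in this abstract concrete, here is a small plain-Python sketch of per-subgroup FNR and PPV, a subgroup gap, and the Brier score. The helper names and the reading of "delta" as the max-minus-min gap across subgroups are assumptions; the abstract does not define how the deltas are computed, and this is not the code from the linked repository.

```python
def subgroup_metrics(y_true, y_pred, groups):
    """Per-subgroup false negative rate and positive predictive value."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "fn": 0})
        if pred == 1 and truth == 1:
            c["tp"] += 1
        elif pred == 1 and truth == 0:
            c["fp"] += 1
        elif pred == 0 and truth == 1:
            c["fn"] += 1
    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]   # actual readmissions in the subgroup
        flagged = c["tp"] + c["fp"]     # predicted readmissions in the subgroup
        result[group] = {
            "fnr": c["fn"] / positives if positives else None,
            "ppv": c["tp"] / flagged if flagged else None,
        }
    return result

def max_gap(metrics, key):
    """Assumed definition of 'delta': largest minus smallest subgroup value."""
    values = [m[key] for m in metrics.values() if m[key] is not None]
    return max(values) - min(values)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
```

Under this reading, an equity check of the kind reported would compare `max_gap(metrics, "fnr")` against the 0.10 threshold.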
[311] Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Hillary Mutisya, John Mugane
Main category: cs.LG
Abstract: We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources (the Bantu Lexical Reconstructions database, BLR3, with 4,786 reconstructed Proto-Bantu forms, and the ASJP basic vocabulary), we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU ‘person’ (8 languages), *gombe ‘cow’ (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- ‘see’ and *-jIm- ‘stand’, each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.
[312] SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng
Main category: cs.LG
Abstract: As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
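The abstract outlines a three-part idea: take per-step validity signals, locate the first failure, and retroactively assign dense step-level rewards. The sketch below shows one way this could look in plain Python. The reward values, the discount factor, and the progress-proportional shaping are all hypothetical choices made for illustration; the abstract does not specify the paper's actual shaping function.

```python
def assign_step_rewards(step_valid, success_reward=1.0,
                        failure_penalty=-1.0, discount=0.9):
    """Return (first_failure_index, rewards) for one rollout.

    step_valid: list of booleans, one validity flag per step.
    All numeric defaults are illustrative assumptions.
    """
    n = len(step_valid)
    # Index of the first invalid step, or None if the whole rollout is valid.
    first_fail = next((i for i, ok in enumerate(step_valid) if not ok), None)
    rewards = [0.0] * n

    if first_fail is None:
        # Fully valid rollout: credit decays with distance from the final step.
        for i in range(n):
            rewards[i] = success_reward * discount ** (n - 1 - i)
    else:
        # Steps before the failure earn partial credit scaled by progress made.
        progress = first_fail / n
        for i in range(first_fail):
            rewards[i] = success_reward * progress * discount ** (first_fail - 1 - i)
        # The failing step is penalized; later steps keep a reward of 0.0.
        rewards[first_fail] = failure_penalty
    return first_fail, rewards
```

For example, `assign_step_rewards([True, True, False, True])` marks step 2 as the first failure, penalizes it, gives the two earlier steps partial credit, and leaves the final step at zero.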
[313] Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy
Asim Ukaye, Mubarak Abdu-Aguye, Nurbek Tastan, Karthik Nandakumar
Main category: cs.LG
Abstract: Client contribution estimation in Federated Learning is necessary for identifying clients’ importance and for providing fair rewards. Current methods often rely on server-side validation data or self-reported client information, which can compromise privacy or be susceptible to manipulation. We introduce a data-free signal based on the matrix von Neumann (spectral) entropy of the final-layer updates, which measures the diversity of the information contributed. We instantiate two practical schemes: (i) SpectralFed, which uses normalized entropy as aggregation weights, and (ii) SpectralFuse, which fuses entropy with class-specific alignment via a rank-adaptive Kalman filter for per-round stability. Across CIFAR-10/100 and the naturally partitioned FEMNIST and FedISIC benchmarks, entropy-derived scores show a consistently high correlation with standalone client accuracy under diverse non-IID regimes - without validation data or client metadata. We compare our results with data-free contribution estimation baselines and show that spectral entropy serves as a useful indicator of client contribution.
[314] SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
Yuqi Pan, Jinghao Zhuang, Yupeng Feng, Fangzhi Zhong, Siyu Ding, Xuerui Qiu, Shaowei Gu, Bohan Sun, Zhiyong Qin, Yibo Zhong, Lingtao Ouyang, Kun Yang, Zehao Liu, Yuhong Chou, Shurong Wang, Anjie Hu, Han Xu, Bo Xu, Guoqi Li
Main category: cs.LG
Abstract: Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.
[315] Adaptive Head Budgeting for Efficient Multi-Head Attention
Bilal Faye, Abdoulaye Mbaye, Hanane Azzag, Mustapha Lebbah
Main category: cs.LG
Abstract: Transformers have become the dominant architecture across a wide range of domains, largely due to the effectiveness of multi-head attention in capturing diverse representation subspaces. However, standard multi-head attention activates all heads uniformly for every input, regardless of task requirements or input complexity. In many scenarios, particularly for coarse-grained tasks such as text classification, the relevant information is often global and does not require the full diversity of attention heads. As a consequence, using a fixed number of heads can introduce unnecessary computational cost or lead to suboptimal performance when the allocation does not match the input. To address this limitation, we introduce BudgetFormer, a Transformer architecture equipped with an adaptive multi-head attention mechanism that dynamically allocates computational resources. Our approach learns, for each input, both a head budget corresponding to the number of attention heads required, and a relevance distribution that selects the most informative heads. We also propose a training strategy based on an exploration and exploitation trade-off, allowing the model to discover effective head configurations before converging to efficient usage patterns. Experiments on text classification tasks of varying complexity show that our method reduces inference cost in terms of FLOPs and memory, while also achieving performance that can surpass standard full multi-head attention. These results highlight the potential of adaptive head allocation as a principled approach to improving both efficiency and effectiveness in Transformer models.
[316] Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, Hugo Ferreira, Pedro Bizarro
Main category: cs.LG
Abstract: Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
[317] Beyond Patient Invariance: Learning Cardiac Dynamics via Action-Conditioned JEPAs
Jose Geraldo Fernandes, Luiz Facury, Pedro Robles Dutenhefner, Wagner Meira
Main category: cs.LG
Abstract: Self-supervised learning in healthcare has largely relied on invariance-based objectives, which maximize similarity between different views of the same patient. While effective for static anatomy, this paradigm is fundamentally misaligned with clinical diagnosis, as it mathematically compels the model to suppress the transient pathological changes it is intended to detect. We propose a shift towards Action-Conditioned (or Event-Conditioned) World Models that learn to simulate the dynamics of disease progression. Adapting the LeJEPA framework to physiological time-series, we define pathology not as a static label, but as a transition vector acting on a patient’s latent state. By predicting the future electrophysiological state of the heart given a disease onset, our model explicitly disentangles stable anatomical features from dynamic pathological forces. Evaluated on the MIMIC-IV-ECG dataset, our approach outperforms fully supervised baselines on the critical triage task. Crucially, we demonstrate superior sample efficiency: in low-resource regimes, our world model outperforms supervised learning by over 0.05 AUROC. These results suggest that modeling biological dynamics provides a dense supervision signal that is far more robust than static classification. Source code is available at https://github.com/cljosegfer/lesaude-dynamics
[318] Associativity-Peakiness Metric for Contingency Tables
Naomi E. Zirkind, William J. Diehl
Main category: cs.LG
Abstract: For the use case of comparing the performance of clustering algorithms whose output is a contingency table, a single performance metric for contingency tables is needed. Such a metric is vital for comparative performance analysis of clustering algorithms. A survey of publicly available literature did not show the presence of such a metric. Metrics do exist for vector pairs of truth values and predicted values, which are an alternative form of output of clustering algorithms. However, the metrics for vector pairs do not reveal the presence of detailed features that are apparent in contingency tables. This paper presents the Associativity Peakiness (AP) metric, which characterizes aspects of clustering algorithm performance that are critical for predicting a clustering algorithm’s performance when deployed. The AP metric is analogous to measures of quality for confusion matrices that are outputs of supervised learning algorithms. This paper presents results from simulations in which 500 contingency tables were generated for multiple test scenarios. The results show that for the use case of evaluating clustering algorithms, the AP metric characterizes performance of contingency tables with higher dynamic range than publicly available metrics, and that it is computationally more efficient than comparable publicly available metrics.
[319] Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes
Tai Xuan Tan, Alexander Mitsos, Eike Cramer
Main category: cs.LG
Abstract: Batch processes are inherently transient and typically nonlinear, motivating nonlinear model predictive control (NMPC). However, adopting NMPC is hindered by the cost and unavailability of dynamic models. Thus, we propose to use Gaussian Processes (GP) in a model-learning NMPC scheme (GP-MLMPC) for batch processes. We initialize the GP-MLMPC using data from a single initial trajectory, e.g., from a PI controller. We iteratively apply the NMPC embedded with GPs to run batches and update the GP with new observations from each iteration, thereby achieving batch-wise improvements. Using uncertainty quantification from the GPs, we formulate chance constraints to enforce safe operation to the required confidence levels. We demonstrate our approach in silico on a semi-batch polymerization reactor for tracking and economic objectives over durations of two hours, with the reactor temperature constrained to a range of $\pm 2^\circ\mathrm{C}$ around its setpoint. After only four batch iterations, the tracking error of the GP-MLMPC scheme converged to a reduction of 83% compared to the initial trajectory. Furthermore, under an economic objective, the GP-MLMPC resulted in a 17-fold increase in final product mass by iteration 8, compared to the initial trajectory. In both cases, the resulting GP-MLMPC performance is on par with the full-model NMPC, which shows that the optimal controller can be learned by the approach. By collecting samples around the optimal trajectory, the GP-MLMPC remains sample-efficient across iterations and achieves quick convergence. Thus, the proposed GP-MLMPC scheme presents a promising data-efficient approach for the control of nonlinear batch processes without mechanistic knowledge.
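The abstract mentions chance constraints built from GP uncertainty but gives no formula. One common way to handle such a constraint is to tighten the bounds by a multiple of the predictive standard deviation. The sketch below illustrates that idea for a two-sided temperature band. The function name, the Gaussian assumption on the prediction, the even split of the violation probability between the two bounds, and the setpoint value in the usage note are all assumptions, not details from the paper.

```python
from statistics import NormalDist

def chance_constraint_satisfied(mean, std, lower, upper, confidence=0.95):
    """Check lower <= y <= upper with probability >= confidence,
    assuming y ~ Normal(mean, std**2) (e.g. a GP predictive distribution).
    """
    # Split the allowed violation probability evenly across both bounds.
    # By the union bound this is a conservative sufficient condition.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return (mean + z * std <= upper) and (mean - z * std >= lower)
```

With a hypothetical setpoint of 70 and the band of 2 degrees around it, `chance_constraint_satisfied(70.5, 0.5, 68.0, 72.0)` holds, whereas doubling the predictive standard deviation to 1.0 violates the tightened upper bound.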
[320] Operational Feature Fingerprints of Graph Datasets via a White-Box Signal-Subspace Probe
Yuchen Xiong, Swee Keong Yeap, Zhen Hong Ban
Main category: cs.LG
Abstract: Graph neural networks achieve strong node-classification accuracy, but their learned message passing entangles ego attributes, neighborhood smoothing, high-pass graph differences, class geometry, and classifier boundaries in an opaque representation. This obscures why a node is classified and what feature-level graph-learning mechanisms a dataset requires. We propose WG-SRC, a white-box signal-subspace probe for prediction and graph dataset diagnosis. WG-SRC replaces learned message passing with a fixed, named graph-signal dictionary of raw features, row-normalized and symmetric-normalized low-pass propagation, and high-pass graph differences. It combines Fisher coordinate selection, class-wise PCA subspaces, closed-form multi-alpha ridge classification, and validation-based score fusion, so prediction and analysis use explicit class subspaces, energy-controlled dimensions, and closed-form linear decisions. As a white-box graph-learning instrument, WG-SRC uses predictive performance to validate its diagnostics: across six node-classification datasets, the scaffold remains competitive with reproduced graph baselines and achieves positive average gain under aligned splits. Its atlas, produced by a predictor, decomposes behavior into raw-feature, low-pass, high-pass, class-geometric, and ridge-boundary components. These operational feature fingerprints distinguish low-pass-dominated Amazon graphs, mixed high-pass and class-geometrically complex Chameleon behavior, and raw- or boundary-sensitive WebKB graphs. As intrinsic classifier outputs rather than post-hoc explanations, these fingerprints provide post-evaluation guidance for later analysis and dataset-specific modification. Aligned mechanistic interventions support this guidance by indicating when high-pass blocks act as removable noise, when raw features should be preserved, and when ridge-type boundary correction matters.
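The fixed graph-signal dictionary named in the abstract (raw features, row-normalized and symmetric-normalized low-pass propagation, high-pass differences) can be illustrated on a tiny dense graph. The sketch below is plain Python and makes several assumptions the abstract does not confirm: self-loops are added before normalization, a single propagation step is used, and the high-pass block is taken as the raw features minus the symmetric low-pass signal. The function and key names are hypothetical.

```python
import math

def graph_signal_blocks(adj, features):
    """Build raw, low-pass, and high-pass feature blocks for a small graph.

    adj: dense n x n adjacency matrix (list of lists of 0/1).
    features: n x d node feature matrix (list of lists of floats).
    """
    n, d = len(adj), len(features[0])
    # Assumption: add self-loops so every node has nonzero degree.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]

    def propagate(weight):
        # One step of neighborhood averaging with the given edge weighting.
        return [[sum(weight(i, j) * a_hat[i][j] * features[j][k]
                     for j in range(n)) for k in range(d)]
                for i in range(n)]

    row_low = propagate(lambda i, j: 1.0 / deg[i])                       # D^-1 A
    sym_low = propagate(lambda i, j: 1.0 / math.sqrt(deg[i] * deg[j]))   # D^-1/2 A D^-1/2
    # Assumed high-pass block: what the symmetric low-pass filter removes.
    high = [[x - s for x, s in zip(features[i], sym_low[i])] for i in range(n)]
    return {"raw": features, "row_low": row_low, "sym_low": sym_low, "high": high}
```

On a two-node graph with features 1 and 3, both low-pass blocks smooth each node to 2, and the high-pass block keeps the deviations of -1 and +1.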
[321] Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Sijie Li, Shanda Li, Haowei Lin, Weiwei Sun, Ameet Talwalkar, Yiming Yang
Main category: cs.LG
Abstract: Scaling laws are used to plan multi-million-dollar training runs, but fitting those laws can itself cost millions. In modern large-scale workflows, assembling a sufficiently informative set of pilot experiments is already a major budget-allocation problem rather than a routine preprocessing step. We formulate scaling-law fitting as budget-aware sequential experimental design: given a finite pool of runnable experiments with heterogeneous costs, choose which runs to execute so as to maximize extrapolation accuracy in a high-cost target region. We then propose an uncertainty-aware method for sequentially allocating experimental budget toward the runs most useful for target-region extrapolation. Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget. Our code is available at https://github.com/PlanarG/active-sl.
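For readers unfamiliar with what "fitting a scaling law and extrapolating to a high-cost target region" involves, the sketch below fits a bare power law L = a * N^(-b) by least squares in log-log space and then predicts at a larger scale. This covers only the fitting step; it is not the paper's uncertainty-aware selection method and is unrelated to the linked repository. The functional form (no irreducible-loss term) and the function names are assumptions.

```python
import math

def fit_power_law(sizes, losses):
    """Fit L = a * N**(-b) by ordinary least squares on (log N, log L)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, size):
    """Extrapolate the fitted law to a new (typically larger) scale."""
    return a * size ** (-b)
```

A budget-aware design, as framed in the abstract, would then ask which few cheap `(size, loss)` runs to include so that `predict_loss` at the expensive target scale is as accurate as possible.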
[322] Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, Yutaka Matsuo
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2408.16286 returned HTTP 429).
[323] CrystalX: High-accuracy Crystal Structure Analysis Using Deep Learning
Kaipeng Zheng, Weiran Huang, Wanli Ouyang, Han-Sen Zhong, Yuqiang Li
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2410.13713 returned HTTP 429).
[324] Privacy Leakage via Output Label Space and Differentially Private Continual Learning
Marlon Tobaben, Talal Alrawajfeh, Marcus Klasson, Mikko Heikkilä, Arno Solin, Antti Honkela
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2411.04680 returned HTTP 429).
[325] Tensor Network Estimation of Distribution Algorithms
John Gardiner, Javier Lopez-Piqueres
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2412.19780 returned HTTP 429).
[326] How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies
Akansha Kalra, Basavasagar Patil, Guanhong Tao, Daniel S. Brown
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2502.03698 returned HTTP 429).
[327] PreMoE: Proactive Inference for Efficient Mixture-of-Experts
Zehua Pei, Ying Zhang, Hui-Ling Zhen, Tao Yuan, Xianzhi Yu, Zhenhua Dong, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2505.17639 returned HTTP 429).
[328] Manifold Learning for Personalized and Label-Free Detection of Cardiac Arrhythmias
Amir Reza Vazifeh, Jason W. Fleischer
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2506.16494 returned HTTP 429).
[329] Presenting DiaData for Research on Type 1 Diabetes
Beyza Cinar, Maria Maleshkova
Main category: cs.LG
Abstract: Not retrieved (the summary request for arXiv:2508.09160 returned HTTP 429).
[330] Federated Nonlinear System Identification
Omkar Tupe, Max Hartman, Lav R. Varshney, Saurav Prakash
Main category: cs.LG
Abstract: Failed to fetch summary for 2508.15025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.15025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[331] Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Shenyao Chen, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng
Main category: cs.LG
Abstract: Failed to fetch summary for 2509.20979: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20979&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[332] Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
Kiattikun Chobtham, Kanoksri Sarinnapakorn, Kritanai Torsri, Prattana Deeprasertkul, Jirawan Kamma
Main category: cs.LG
Abstract: Failed to fetch summary for 2510.12328: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12328&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[333] Parameter-Efficient Conditioning for Material Generalization in Graph-Based Simulators
Naveen Raj Manoharan, Hassan Iqbal, Krishna Kumar
Main category: cs.LG
Abstract: Failed to fetch summary for 2511.05456: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05456&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[334] Differentiable Filtering for Learning Hidden Markov Models
Reginald Zhiyan Chen, Heng-Sheng Chang, Prashant G. Mehta
Main category: cs.LG
Abstract: Failed to fetch summary for 2511.10571: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10571&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[335] Artificial intelligence for methane detection: from continuous monitoring to verified mitigation
Gonzalo Mateo-Garcia, Anna Allen, Itziar Irakulis-Loitxate, Manuel Montesino-San Martin, Marc Watine, Cynthia Randles, Tharwat Mokalled, Alma Raunak, Carol Castañeda-Martinez, Juan E. Jonhson, Javier Gorroño, James Requeima, Claudio Cifarelli, Luis Guanter, Richard E. Turner, Manfredi Caltagirone
Main category: cs.LG
Abstract: Failed to fetch summary for 2511.21777: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21777&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[336] TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation
Henrijs Princis, Arindam Sharma, Cristina David
Main category: cs.LG
Abstract: Failed to fetch summary for 2511.22277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[337] Optimal Lower Bounds for Online Multicalibration
Natalie Collina, Jiuyao Lu, Georgy Noarov, Aaron Roth
Main category: cs.LG
Abstract: Failed to fetch summary for 2601.05245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[338] jBOT: Semantic Jet Representation Clustering Emerges from Self-Distillation
Ho Fung Tsoi, Dylan Rankin
Main category: cs.LG
Abstract: Failed to fetch summary for 2601.11719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[339] Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents
Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, Huawei Shen
Main category: cs.LG
Abstract: Failed to fetch summary for 2601.14287: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14287&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[340] Joint Embedding Variational Bayes
Amin Oji, Paul Fieguth
Main category: cs.LG
Abstract: Failed to fetch summary for 2602.05639: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05639&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[341] MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
Xueying Ding, Simon Klüttermann, Haomin Wen, Yilong Chen, Leman Akoglu
Main category: cs.LG
Abstract: Failed to fetch summary for 2602.09329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[342] Regularized Meta-Learning for Improved Generalization
Noor Islam S. Mohammad, Md Muntaqim Meherab
Main category: cs.LG
Abstract: Failed to fetch summary for 2602.12469: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12469&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[343] From Words to Amino Acids: Does the Curse of Depth Persist?
Aleena Siji, Amir Mohammad Karimi Mamaghan, Ferdinand Kapl, Tobias Höppe, Emmanouil Angelis, Andrea Dittadi, Maurice Brenner, Michael Heinzinger, Karl Henrik Johansson, Kaitlin Maile, Johannes von Oswald, Stefan Bauer
Main category: cs.LG
Abstract: Failed to fetch summary for 2602.21750: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21750&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[344] Algorithmic Compliance and Regulatory Loss in Digital Assets
Khem Raj Bhatt, Krishna Sharma
Main category: cs.LG
Abstract: Failed to fetch summary for 2603.04328: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04328&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[345] How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Luca Ambrogioni
Main category: cs.LG
Abstract: Failed to fetch summary for 2603.20092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[346] Recovery Guarantees for Continual Learning of Dependent Tasks: Memory, Data-Dependent Regularization, and Data-Dependent Weights
Liangzu Peng, Uday Kiran Reddy Tadipatri, Ziqing Xu, Eric Eaton, René Vidal
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.17578: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17578&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[347] An ‘Inverse’ Experimental Framework to Estimate Market Efficiency
Thomas Asikis, Heinrich H. Nax
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.18130: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18130&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[348] FlowForge: A Staged Local Rollout Engine for Flow-Field Prediction
Xiaowen Zhang, Ziming Zhou, Fengnian Zhao, David L. S. Hung
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.18953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[349] MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
Anurita Das
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.21026: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21026&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[350] Expander Hierarchies for Normalized Cuts on Graphs
Kathrin Hanauer, Monika Henzinger, Robin Münk, Harald Räcke, Maximilian Vötsch
Main category: cs.LG
Abstract: Failed to fetch summary for 2406.14111: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.14111&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[351] Online Distributional Regression
Simon Hirsch, Jonathan Berrisch, Florian Ziel
Main category: cs.LG
Abstract: Failed to fetch summary for 2407.08750: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.08750&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[352] On Pareto Optimality for Parametric Choice Bandits
Jierui Zuo, Hanzhang Qin
Main category: cs.LG
Abstract: Failed to fetch summary for 2501.19277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.19277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[353] da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs
Chang Sun, Zhiqiang Que, Vladimir Loncar, Wayne Luk, Maria Spiropulu
Main category: cs.LG
Abstract: Failed to fetch summary for 2507.04535: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.04535&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[354] Calibrated Principal Component Regression
Yixuan Florence Wu, Yilun Zhu, Lei Cao, Naichen Shi
Main category: cs.LG
Abstract: Failed to fetch summary for 2510.19020: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19020&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[355] Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
Main category: cs.LG
Abstract: Failed to fetch summary for 2511.14427: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14427&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[356] Exploiting Low-Rank Structure in Max-K-Cut Problems
Ria Stevens, Fangshuo Liao, Barbara Su, Jianqiang Li, Anastasios Kyrillidis
Main category: cs.LG
Abstract: Failed to fetch summary for 2602.20376: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20376&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[357] Unsupervised Discovery of Intermediate Phase Order in the Frustrated $J_1$-$J_2$ Heisenberg Model via Prometheus Framework
Brandon Yee, Wilson Collins, Maximilian Rutkowski
Main category: cs.LG
Abstract: Failed to fetch summary for 2602.21468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[358] Detecting Cognitive Signatures in Typing Behavior for Non-Intrusive Authorship Verification
David Condrey
Main category: cs.LG
Abstract: Failed to fetch summary for 2603.00177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[359] Unified Taxonomy for Multivariate Time Series Anomaly Detection using Deep Learning
Bruna Alves, Armando J. Pinho, Sónia Gouveia
Main category: cs.LG
Abstract: Failed to fetch summary for 2603.18941: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.18941&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[360] FLUID: Flow-based Unified Inference for Dynamics
Tiangang Cui, Xiaodong Feng, Chenlong Pei, Xiaoliang Wan, Tao Zhou
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.07169: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07169&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[361] Distributional Off-Policy Evaluation with Deep Quantile Process Regression
Qi Kuang, Chao Wang, Yuling Jiao, Fan Zhou
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.18143: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18143&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[362] Sparse Network Inference under Imperfect Detection and its Application to Ecological Networks
Aoran Zhang, Tianyao Wei, Maria J. Guerrero, César A. Uribe
Main category: cs.LG
Abstract: Failed to fetch summary for 2604.18820: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18820&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[363] DM$^3$-Nav: Decentralized Multi-Agent Multimodal Multi-Object Semantic Navigation
Amin Kashiri, Atharva Jamsandekar, Yasin Yazıcıoğlu
Main category: cs.MA
Abstract: We present DM$^3$-Nav, a fully decentralized multi-agent semantic navigation system supporting multimodal open-vocabulary goal specification and multi-object missions. In our setting, decentralization implies operation without a central coordinator, global map aggregation, or shared global state at runtime. Robots operate autonomously and coordinate through ad-hoc pairwise communication, exchanging local maps, goal status, and navigation intent without synchronization. An implicit task allocation mechanism combining intent broadcasting and distance-weighted frontier selection reduces redundant exploration while preserving decentralized operation. Evaluations on HM3DSem scenes using the HM3Dv0.2 and GOAT-Bench datasets demonstrate that DM$^3$-Nav matches or exceeds centralized and shared-map baselines while eliminating single points of failure inherent in centralized architectures. Finally, we validate our approach in a real-world office environment using two mobile robots, demonstrating successful deployment relying entirely on onboard sensing and computation. A video of our real-world experiments is available online: https://drive.google.com/file/d/1QiUSCn5rIvtuTUqtuXLPgmt6S8x9-MCZ/view?usp=drive_link
[364] Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems
Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, Qing Wang
Main category: cs.MA
Abstract: Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM-based multi-agent systems (MAS) due to their natural-language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques. Yet existing benchmarks rely on partially observable traces that capture only agent outputs, omitting the inputs and context that developers actually use when debugging. We argue that failure attribution should be studied under full execution observability, aligning with real-world developer-facing scenarios where complete traces, rather than only outputs, are accessible for diagnosis. To this end, we introduce TraceElephant, a benchmark designed for failure attribution with full execution traces and reproducible environments. We then systematically evaluate failure attribution techniques across various configurations. Specifically, full traces improve attribution accuracy by up to 76% over a partial-observation counterpart, confirming that missing inputs obscure many failure causes. TraceElephant provides a foundation for follow-up failure attribution research, promoting evaluation practices that reflect real-world debugging and supporting the development of more transparent MASs.
[365] Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding
Muyang Zheng, Tong Zhou, Geyang Wu, Zihao Lin, Haibo Wang, Lifu Huang
Main category: cs.MA
Abstract: Open-ended video game glitch detection aims to identify glitches in gameplay videos, describe them in natural language, and localize when they occur. Unlike conventional game glitch understanding tasks, which have largely been framed as image-level recognition or closed-form question answering, this task requires reasoning about game-specific dynamics such as mechanics, physics, rendering, animation, and expected state transitions directly over continuous gameplay videos, and distinguishing true glitches from unusual but valid in-game events. To support this task, we introduce VideoGlitchBench, the first benchmark for open-ended video game glitch detection with temporal localization. VideoGlitchBench contains 5,238 gameplay videos from 120 games, each annotated with detailed glitch descriptions and precise temporal spans, enabling unified evaluation of semantic understanding and temporal grounding. We further propose GliDe, an agentic framework with three key components: a game-aware contextual memory for informed reasoning, a debate-based reflector for multi-perspective glitch detection and verification, and an event-level grounding module that recovers complete glitch intervals from fragmented temporal evidence. We also design a task-specific evaluation protocol that jointly measures semantic fidelity and temporal accuracy. Experiments show that this task remains highly challenging for current multimodal models, while GliDe achieves substantially stronger performance than corresponding vanilla model baselines.
cs.MM
[366] Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors
Emily Zhou, Marcus Ma, Kleanthis Avramidis, Gabor Mihaly Toth, Shrikanth Narayanan
Main category: cs.MM
Abstract: Eye movement and memory retrieval are deeply and bidirectionally intertwined; however, the existing literature is generally confined to controlled lab settings. We investigate the relationship between eye gaze and memory recall in free-form autobiographical recall, which comprises both autonoetic consciousness – the ability to mentally place oneself in the past or future – and various affective states. Using a large video corpus of semi-naturalistic interviews with Holocaust survivors (N = 806), we examine eye movements with respect to episodic, semantic, affective, and temporal dimensions of traumatic and highly emotional autobiographical recall. We observe that gaze patterns vary significantly across certain temporal contexts, most prominently in vertical eye movements. We additionally train intra-subject sequence models to predict the temporal context of sentences from segments of gaze features, and find that eye movements entirely preceding sentence onset are sufficient for prediction. Our results corroborate prior findings in the literature linking eye movements to memory in controlled and semi-structured settings, reinforcing the role of eye gaze in retrieving and constructing memories, especially in highly emotional and remote memory recall.
eess.AS
[367] Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
Haopeng Geng, Longfei Yang, Xi Chen, Haitong Sun, Daisuke Saito, Nobuaki Minematsu
Main category: eess.AS
Abstract: Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward intended targets. To address these bottlenecks, we propose a prompt-free framework decoupling acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model enforcing monotonic, frame-level alignment to accurately capture pronunciation deviations. Second, we implicitly inject mispronunciation information via the IF strategy under the knowledge transfer principle. Experiments show CROTTC-IF achieves a 71.77% F1-score on L2-ARCTIC and 71.70% F1-score on the Iqra’Eval2 leaderboard. With empirical analysis, we demonstrate that decoupling acoustics from explicit priors provides highly robust MDD.
[368] Advancing automatic speech recognition using feature fusion with self-supervised learning features: A case study on Fearless Steps Apollo corpus
Szu-Jui Chen, John H. L. Hansen
Main category: eess.AS
Abstract: Using self-supervised learning (SSL) models has significantly improved performance for downstream speech tasks, surpassing the capabilities of traditional hand-crafted features. This study investigates the amalgamation of SSL models, with the aim to leverage both their individual strengths and refine extracted features to achieve improved speech recognition models for naturalistic scenarios. Our research investigates the massive naturalistic Fearless Steps (FS) APOLLO resource, with particular focus on the FS Challenge (FSC) Phase-4 corpus, providing the inaugural analysis of this dataset. Additionally, we incorporate the CHiME-6 dataset to evaluate performance across diverse naturalistic speech scenarios. While exploring previously proposed Feature Refinement Loss and fusion methods, we found these methods to be less effective on the FSC Phase-4 corpus. To address this, we introduce a novel deep cross-attention (DCA) fusion method, designed to elevate performance, especially for the FSC Phase-4 corpus. Our objective is to foster creation of superior FS APOLLO community resources, catering to the diverse needs of researchers across various disciplines. The proposed solution achieves an absolute +1.1% improvement in WER, providing effective meta-data creation for the massive FS APOLLO community resource.
[369] UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang
Main category: eess.AS
Abstract: Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
[370] Audio Effect Estimation with DNN-Based Prediction and Search Algorithm
Youichi Okita, Haruhiro Katayose
Main category: eess.AS
Abstract: Audio effects play an essential role in sound design. This research addresses the task of audio effect estimation, which aims to estimate the configuration of applied effects from a wet signal. Existing approaches to this problem can be categorized into predictive approaches, which use models pre-trained in a data-driven manner, and search-based approaches, which are based on wet signal reconstruction. In this study, we propose a novel approach that integrates these approaches: first, DNNs predict the dry signal and effect configuration, and then a search is performed based on wet signal reconstruction using these predictions. By estimating the dry signal in the prediction stage, it becomes possible to complement or improve the predictions using reconstruction similarity as an objective function. The experimental evaluation showed that methods based on the proposed approach outperformed the method solely based on the predictive approach. Furthermore, the findings suggest that the task division of predicting the effect type combination followed by the search-based estimation of order and parameters was the most effective across various metrics.
[371] Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
Mingchen Shao, Hang Su, Wenjie Tian, Bingshen Mu, Zhennan Lin, Lichun Fan, Zhenbo Luo, Jian Luan, Lei Xie
Main category: eess.AS
Abstract: While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these limitations to the lack of data, benchmarks, and modeling approaches tailored for long-form temporal awareness. To bridge this gap, we first construct LAT-Chronicle, a 1.2k-hour long-form audio dataset with temporal annotations across real-world scenarios. We further develop LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes while covering three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption. Leveraging these resources, we propose LAT-Audio, formulating temporal awareness as a progressive global-to-local reasoning paradigm. A global timeline is first constructed as an aligned temporal-semantic context, and the Think-With-Audio Chain-of-Thought (TWA-CoT) is then introduced to perform iterative reasoning by incorporating local audio information via tool use. Experiments show that LAT-Audio surpasses existing models on long-form audio temporal awareness tasks and improves robustness to input duration. We release the dataset, benchmark, and model to facilitate future research at https://github.com/alanshaoTT/LAT-Audio-Repo.
[372] DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
Li Li, Ming Cheng, Weixin Zhu, Yannan Wang, Juan Liu, Ming Li
Main category: eess.AS
Abstract: Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
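Illustrative sketch (not the authors' code): the decomposition of diarization output into speaker- and time-conditioned queries might look like the following. The `Segment` type, the `build_queries` helper, and the prompt wording are all hypothetical; the abstract does not specify the actual query format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One diarization result: who spoke, and when (seconds)."""
    speaker: str
    start: float
    end: float

def build_queries(segments):
    """Turn diarization segments into one query per speaker per time segment.

    Queries are ordered by start time so they can be issued as successive
    turns of a multi-turn dialogue. The prompt text is an assumption.
    """
    ordered = sorted(segments, key=lambda s: (s.start, s.end))
    return [
        f"Transcribe speaker {s.speaker} from {s.start:.2f}s to {s.end:.2f}s."
        for s in ordered
    ]

diarization = [
    Segment("B", 2.5, 4.0),
    Segment("A", 0.0, 2.5),
    Segment("A", 4.0, 6.2),
]
queries = build_queries(diarization)
```

Each query would then be answered in turn by the model, so speaker and timing come from the diarization prior while the model supplies only the linguistic content.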
[373] Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?
Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar, Pankaj Wasnik
Main category: eess.AS
Abstract: Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio, facial expressions and scene-level cues from video, and semantic context from text, which are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
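Illustrative sketch (not from the paper): a proxy MOS formed as a weighted sum of objective metrics, and the Pearson correlation coefficient (PCC) used to compare predictions with human scores. The metric names, weights, and scores below are made-up placeholders; the paper optimizes its weights via active learning, which is not reproduced here.

```python
import math

def proxy_mos(metrics, weights):
    """Weighted aggregate of objective metrics (weights are assumed given)."""
    return sum(weights[name] * metrics[name] for name in weights)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-clip objective metrics on a 1-5 scale.
weights = {"sync": 0.5, "intelligibility": 0.5}
clip = {"sync": 4.0, "intelligibility": 3.0}
score = proxy_mos(clip, weights)

# Hypothetical predicted vs. human MOS for a handful of clips.
predicted = [3.1, 4.2, 2.0, 3.8]
human = [3.0, 4.5, 2.2, 3.5]
pcc = pearson(predicted, human)
```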
[374] HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
Shuiyuan Wang, Zhixian Zhao, Hongfei Xue, Chengyou Wang, Shuai Wang, Hui Bu, Xin Xu, Lei Xie
Main category: eess.AS
Abstract: Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a comprehensive benchmark for evaluating ALMs’ EI. Using real-recorded human dialogues from the ICASSP 2026 HumDial Challenge, it reformulates emotional tracking and causal reasoning into multiple-choice questions with adversarial distractors, mitigating subjective scoring bias for cognitive tasks. It retains the generation of empathetic responses and introduces an acoustic-semantic conflict task to assess robustness against contradictory multimodal signals. Evaluations of eight ALMs reveal that most models struggle with multi-turn emotional tracking and implicit causal reasoning. Furthermore, all models exhibit decoupled textual and acoustic empathy, alongside a severe text-dominance bias during cross-modal conflicts.
[375] Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
Chengyou Wang, Hongfei Xue, Guojian Li, Zhixian Zhao, Shuiyuan Wang, Shuai Wang, Xin Xu, Hui Bu, Lei Xie
Main category: eess.AS
Abstract: Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system’s ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.
eess.IV
[376] Conditional Diffusion Posterior Alignment for Sparse-View CT Reconstruction
Luis Barba, Johannes Kirschner, Benjamin Bejar
Main category: eess.IV
Abstract: Computed Tomography (CT) is a widely used imaging modality in medical and industrial applications. To limit radiation exposure and measurement time, there is a growing interest in sparse-view CT, where the number of projection views is significantly reduced. Deep neural networks have shown great promise in improving reconstruction quality in sparse-view CT, especially generative diffusion models. However, these methods struggle to scale to large 3D volumes due to several reasons: (i) the high memory and computational requirements of 3D models, (ii) the lack of large 3D training datasets, and (iii) the inconsistencies across slices when using 2D models independently on each slice. We overcome these limitations and scale diffusion-based sparse-view CT reconstruction to large 3D volumes by combining conditional diffusion with explicit data consistency. We propose Conditional Diffusion Posterior Alignment (CDPA) to enable scalable 3D sparse-view CT reconstruction. A 2D U-Net diffusion model is conditioned on an initial 3D reconstruction to improve inter-slice consistency, combined with data-consistency alignment to match measured projections. Experiments on synthetic and real Cone Beam CT (CBCT) data show state-of-the-art performance, with ablations that confirm the synergistic effects of the proposed pipeline. Finally, we show that the same principles also strengthen fast denoising U-Nets, yielding near-diffusion quality at a fraction of the computational cost.
[377] Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
Harry Dong, Timofey Efimov, Megna Shah, Jeff Simmons, Sean Donegan, Marc De Graef, Yuejie Chi
Main category: eess.IV
Abstract: In spite of the utility of 3-D electron back-scattered diffraction (EBSD) microscopy, the data collection process can be time-consuming with serial-sectioning. Hence, it is natural to look at other modalities, such as polarized light (PL) data, to accelerate EBSD data collection, supplemented with shared information. Complementarily, features in chaotic PL data could even be enriched with a handful of EBSD measurements. To inherently learn the complex dynamics between EBSD and PL to solve these inverse problems, we use an unconditional multimodal diffusion model, motivated by progress in diffusion models for inverse problems. Although trained solely on synthetic data once, our model has strong generalizable capabilities on real data which can be low-resolution, noisy, corrupted, and misregistered. With inference-time scaling, we show gains in performance on a variety of objectives including grain boundary prediction, super-resolution, and denoising. With our model, we demonstrate that there is little difference from full resolution performance with only 25% (1/4 the resolution) of EBSD data and corrupted PL data.
[378] Selective Depthwise Separable Convolution for Lightweight Joint Source-Channel Coding in Wireless Image Transmission
Ming Ye, Kui Cai, Cunhua Pan, Zhen Mei, Wanting Yang, Chunguo Li
Main category: eess.IV
Abstract: Depthwise separable convolutional (DSConv) layers have been successfully applied to deep learning (DL)-based joint source-channel coding (JSCC) schemes to reduce computational complexity. However, a systematic investigation of the layerwise and ratio-wise replacement of standard convolutional (Conv) layers with DSConv layers in JSCC systems for wireless image transmission remains largely unexplored. In this letter, we propose a configurable lightweight JSCC framework that incorporates a selective replacement strategy, enabling flexible substitution of standard Conv layers with DSConv layers at various layer positions and replacement ratios. By adjusting the proportion of layers replaced, we achieve different model compression levels and analyze their impact on reconstruction performance. Furthermore, we investigate how replacements at different encoder and decoder depths influence reconstruction quality under a fixed replacement ratio. Our results show that Conv-to-DSConv replacement at intermediate layers achieves a favorable complexity-performance trade-off, revealing layer-wise redundancy in DL-based JSCC systems. Extensive experiments further demonstrate that the proposed framework achieves substantial parameter reduction with only slight performance degradation, enabling flexible complexity-performance trade-offs for resource-constrained edge devices.
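Illustrative sketch (not from the letter): the parameter savings behind Conv-to-DSConv replacement follow from standard counting. A k×k convolution has k·k·C_in·C_out weights, while a depthwise separable one has k·k·C_in (depthwise) plus C_in·C_out (pointwise). The layer shapes and the `total_params` helper below are hypothetical, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def dsconv_params(kernel, c_in, c_out):
    """Weights in a depthwise (kernel x kernel per channel) plus pointwise (1x1) pair."""
    return kernel * kernel * c_in + c_in * c_out

def total_params(layers, replaced):
    """Total weights when the layer indices in `replaced` use DSConv instead of Conv."""
    total = 0
    for index, (kernel, c_in, c_out) in enumerate(layers):
        count = dsconv_params if index in replaced else conv_params
        total += count(kernel, c_in, c_out)
    return total

# Hypothetical three-layer encoder: (kernel, in_channels, out_channels).
layers = [(3, 3, 64), (3, 64, 128), (3, 128, 64)]

baseline = total_params(layers, replaced=set())   # 149184
selective = total_params(layers, replaced={1})    # 84224, intermediate layer only
```

Replacing only the intermediate layer in this toy configuration already removes a large share of the weights, which is the kind of position-dependent trade-off the selective strategy explores.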
[379] MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
Yunquan Chen, Haoyu Chen
Main category: eess.IV
Abstract: Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
[380] Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier, Kathleen M. Curran, Noel E. O’Connor
Main category: eess.IV
Abstract: The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
[381] Useful nonrobust features are ubiquitous in biomedical images
Coenraad Mouton, Randle Rabe, Niklas C. Koser, Nicolai Krekiehn, Christopher Hansen, Jan-Bernd Hövener, Claus-C. Glüer
Main category: eess.IV
Abstract: We study whether deep networks for medical imaging learn useful nonrobust features (predictive input patterns that are not human interpretable and highly susceptible to small adversarial perturbations) and how these features impact test performance. We show that models trained only on nonrobust features achieve well-above-chance accuracy across five MedMNIST classification tasks, confirming their predictive value in-distribution. Conversely, adversarially trained models that primarily rely on robust features sacrifice in-distribution accuracy but yield markedly better performance under controlled distribution shifts (MedMNIST-C). Overall, nonrobust features boost standard accuracy yet degrade out-of-distribution performance, revealing a practical robustness-accuracy trade-off in medical imaging classification tasks that should be tailored to the requirements of the deployment setting.
[382] CKM Beyond Channel Gain: Spatial Correlation Map Construction with Deep Learning
Z. Chen, S. Fu, Y. Zeng, X. Xu, Z. Wei
Main category: eess.IV
Abstract: Channel knowledge map (CKM) is a promising technique to achieve environment-aware wireless communication and sensing. Constructing the complete CKM based on channel knowledge observations at sparse locations is a fundamental problem for CKM-enabled wireless networks. However, most existing works on CKM construction consider only one special type of CKM, i.e., the channel gain map (CGM), which records only the channel gain value for each location. In this paper, we consider the construction of the channel spatial correlation map (SCM), which signifies the location-specific spatial correlation matrix for multi-antenna systems. Unlike CGM construction, constructing SCM poses significant challenges due to its extremely high-dimensional structure. To address this issue, we first decompose the high-dimensional SCM into a lower-dimensional path gain map (PGM) and path angle map (PAM). Then we propose a deep learning model termed E-SRResNet for constructing high-quality SCM from sparse samples, which incorporates multi-head attention (MHA) mechanisms and multi-scale feature fusion (MSFF) to accurately model both local and global spatial relationships of channel parameters and complex nonlinear mappings. Furthermore, we preprocess the dataset to provide priors including a line-of-sight (LoS) map, a binary building map, and a base station (BS) map for the model to reconstruct SCM more accurately. Simulations conducted on the CKMImageNet dataset demonstrate that the proposed E-SRResNet achieves significant performance improvements over baseline methods. Moreover, the cosine similarity between the constructed SCM and the ground truth exceeds 0.8 in most regions, validating the effectiveness of the proposed construction method.
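Illustrative sketch (not from the paper): one common way to relate per-path gains and angles to a spatial correlation matrix is R = Σ_l g_l · a(θ_l) · a(θ_l)^H. The sketch below assumes a half-wavelength uniform linear array for the steering vector a(θ) and a Frobenius-inner-product cosine similarity; the paper's exact array model and similarity definition are not stated in the abstract, and all function names are hypothetical.

```python
import cmath
import math

def steering_vector(angle_rad, num_antennas):
    """Steering vector of a half-wavelength uniform linear array (assumption)."""
    return [cmath.exp(1j * math.pi * n * math.sin(angle_rad)) for n in range(num_antennas)]

def correlation_matrix(path_gains, path_angles, num_antennas):
    """R = sum over paths of gain * a(theta) * a(theta)^H."""
    matrix = [[0j] * num_antennas for _ in range(num_antennas)]
    for gain, angle in zip(path_gains, path_angles):
        a = steering_vector(angle, num_antennas)
        for i in range(num_antennas):
            for k in range(num_antennas):
                matrix[i][k] += gain * a[i] * a[k].conjugate()
    return matrix

def cosine_similarity(first, second):
    """Magnitude of the normalized Frobenius inner product of two matrices."""
    inner = sum(x * y.conjugate() for ra, rb in zip(first, second) for x, y in zip(ra, rb))
    norm_first = math.sqrt(sum(abs(x) ** 2 for row in first for x in row))
    norm_second = math.sqrt(sum(abs(y) ** 2 for row in second for y in row))
    return abs(inner) / (norm_first * norm_second)

# Hypothetical location with two paths: gains (PGM entries) and angles (PAM entries).
truth = correlation_matrix([1.0, 0.5], [0.0, math.pi / 6], num_antennas=4)
estimate = correlation_matrix([0.9, 0.6], [0.05, math.pi / 6], num_antennas=4)
similarity = cosine_similarity(estimate, truth)
```

Under this model, storing a few gains and angles per location replaces storing a full antenna-by-antenna matrix, which is the dimensionality reduction the decomposition is meant to provide.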