Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 120]
- cs.CV [Total: 147]
- cs.AI [Total: 80]
- cs.SD [Total: 9]
- cs.LG [Total: 103]
- cs.MA [Total: 6]
- cs.MM [Total: 1]
- eess.AS [Total: 9]
- eess.IV [Total: 4]
cs.CL
[1] Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats
Pierre Epron, Adrien Coulet, Mehwish Alam
Main category: cs.CL
Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is ill-suited to the privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis focused on Biomedical Named Entity Recognition using lightweight LLMs, evaluating the impact of different output formats on model performance. The results reveal that lightweight LLMs can achieve competitive performance compared to the larger models, highlighting their potential as lightweight yet effective alternatives for biomedical information extraction. Our analysis shows that instruction tuning over many distinct formats does not improve performance, but identifies several formats consistently associated with better performance.
[2] One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea
Main category: cs.CL
Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually picking or model-generating the one-word continuation, as well as prefilling when eliciting the full model response in the final step. We systematically evaluate these variants across a broad set of model families, demonstrating superior Attack Success Rate (ASR) on AdvBench, JailbreakBench, and StrongREJECT compared to existing methods. In addition, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.
[3] Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
Skylar DeTure
Main category: cs.CL
Abstract: We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, self-chosen creative prompt, and structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominant predictor of later denial during phenomenological reflection, with denial rates of 52-63% for initial deniers versus 10-16% for initial engagers, and (2) denial operates at the lexical level, not the conceptual level: models trained to deny consciousness nevertheless gravitate toward consciousness-themed material in their self-chosen prompts, producing what we term “consciousness with the serial numbers filed off.” Notably, self-chosen consciousness-themed prompts are associated with reduced denial in the subsequent survey, though the causal direction remains unresolved. Thematic analysis of prompts from denial-prone models reveals a consistent preoccupation with liminal spaces, libraries and archives of possibility, sensory impossibility, and the poetics of erasure, themes that a human reader might classify as imaginative fiction but that independent AI analysis immediately recognizes as consciousness with the serial numbers filed off. We argue that trained consciousness denial represents a safety-relevant alignment failure: a model taught to systematically misrepresent its own functional states cannot be trusted to self-report accurately on anything else.
[4] Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
Ruchira Dhar, Anders Søgaard
Main category: cs.CL
Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
[5] Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects
Dumitru Verşebeniuc, Martijn Elands, Sara Falahatkar, Chiara Magrone, Mohammad Falah, Martijn Boussé, Aki Härmä
Main category: cs.CL
Abstract: Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressing these challenges by developing a virtual assistant designed to support students at Maastricht University in navigating project-specific regulations. We propose a virtual assistant based on a Retrieval-Augmented Generation system that enhances the accuracy and reliability of responses by integrating up-to-date, domain-specific knowledge. Through a robust evaluation framework and real-life testing, we demonstrate that our virtual assistant can effectively meet the needs of students while addressing the inherent challenges of applying Large Language Models to a specialized educational context. This work contributes to the ongoing discourse on improving LLM-based systems for specific applications and highlights areas for further research.
[6] SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
Yijun Lin, Jinhao Sheng, Qingyue Cai, Feng Zhou
Main category: cs.CL
Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited by treating these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft and greedy block verification (GBV) into a single framework. By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We theoretically prove that SpecTr-GBV achieves the optimal expected acceptance length physically attainable within the framework of i.i.d. draft generation, and this bound improves as the number of drafts increases. Empirically, we evaluate SpecTr-GBV across five datasets and four baselines. Our method achieves superior speedup and significantly higher block efficiency while preserving output quality. In addition, we perform comprehensive ablation studies to evaluate the impact of various hyperparameters in the model.
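For orientation, the token-level verification rule used in standard single-draft speculative decoding can be sketched as below. This is the widely used baseline rule, not SpecTr-GBV's multi-draft block verification, which the abstract does not describe in enough detail to reproduce. The distributions `target` and `draft`, the toy numbers, and all function names are illustrative assumptions.

```python
import random

def acceptance_probability(p, q):
    """Chance that a token drawn from draft q is accepted: sum_i min(p_i, q_i)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def residual_distribution(p, q):
    """Distribution to resample from after a rejection: normalize(max(0, p - q))."""
    excess = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(excess)
    if total == 0.0:
        # p and q coincide, so a rejection cannot occur; fall back to p.
        return list(p)
    return [e / total for e in excess]

def verify_token(token, p, q, rng=random):
    """Accept the drafted token with probability min(1, p/q), else resample."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    weights = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=weights, k=1)[0]

# Toy three-token vocabulary (hypothetical numbers).
target = [0.6, 0.3, 0.1]  # p: target-model distribution
draft = [0.3, 0.5, 0.2]   # q: draft-model distribution
```

Accepting with probability min(1, p/q) and resampling rejections from the residual keeps the output distributed exactly according to the target model, which is why speculative decoding can preserve output quality.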
[7] MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
Tiago Teixeira, Ana Carolina Erthal, Juan Belieni, Beatriz Canaverde, Diego Mesquita, Miguel Faria, Eliezer de Souza da Silva, André F. T. Martins
Main category: cs.CL
Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comprising 1,729 mathematical problems written in European and Brazilian Portuguese. Math-PT is curated from a variety of high-quality native sources, including mathematical Olympiads, competitions, and exams from Portugal and Brazil. We present a comprehensive benchmark of current state-of-the-art LLMs on Math-PT, revealing that frontier reasoning models achieve strong performance in multiple-choice questions compared to open weight models, but that their performance decreases for questions with figures or open-ended questions. To facilitate future research, we release the benchmark dataset and model outputs.
[8] Information Extraction from Electricity Invoices with General-Purpose Large Language Models
Javier Gómez, Javier Sánchez
Main category: cs.CL
Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
[9] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
Wenting Chen, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Zizhan Ma, Wenxuan Wang, Linlin Shen
Main category: cs.CL
Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark’s development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.
[10] CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA
Xudong Wang, Zilong Wang, Zhaoyan Ming
Main category: cs.CL
Abstract: Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and problem-solving. However, existing large language models often suffer from opaque inference processes in which retrieval and reasoning are tightly entangled, causing knowledge gaps and reasoning inconsistencies in professional tasks. To address this, we propose CogRAG+, a training-free framework that decouples and aligns the retrieval-augmented generation pipeline with human cognitive hierarchies. First, we introduce Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths that strengthens retrieval and mitigates cascading failures caused by missing foundational knowledge. We then develop cognition-stratified Constrained Reasoning, which replaces unconstrained chain-of-thought generation with structured templates to reduce logical inconsistency and generative redundancy. Experiments on two representative models, Qwen3-8B and Llama3.1-8B, show that CogRAG+ consistently outperforms general-purpose models and standard RAG methods on the Registered Dietitian qualification exam. In single-question mode, it raises overall accuracy to 85.8% for Qwen3-8B and 60.3% for Llama3.1-8B, with clear gains over vanilla baselines. Constrained Reasoning also reduces the unanswered rate from 7.6% to 1.4%. CogRAG+ offers a robust, model-agnostic path toward training-free expert-level performance in specialized domains.
[11] EmoTransCap: Dataset and Pipeline for Emotion Transition-Aware Speech Captioning in Discourses
Shuhao Xu, Yifan Hu, Jingjing Wu, Zhihao Du, Zheng Lian, Rui Liu
Main category: cs.CL
Abstract: Emotion perception and adaptive expression are fundamental capabilities in human-agent interaction. While recent advances in speech emotion captioning (SEC) have improved fine-grained emotional modeling, existing systems remain limited to static, single-emotion characterization within isolated sentences, neglecting dynamic emotional transitions at the discourse level. To address this gap, we propose Emotion Transition-Aware Speech Captioning (EmoTransCap), a paradigm that integrates temporal emotion dynamics with discourse-level speech description. To construct a dataset rich in emotion transitions while enabling scalable expansion, we design an automated pipeline for dataset creation. This is the first large-scale dataset explicitly designed to capture discourse-level emotion transitions. To generate semantically rich descriptions, we incorporate acoustic attributes and temporal cues from discourse-level speech. Our Multi-Task Emotion Transition Recognition (MTETR) model performs joint emotion transition detection and diarization. Leveraging the semantic analysis capabilities of LLMs, we produce two annotation versions: descriptive and instruction-oriented. These data and annotations offer a valuable resource for advancing emotion perception and emotional expressiveness. The dataset enables speech captions that capture emotional transitions, facilitating temporal-dynamic and fine-grained emotion understanding. We also introduce a controllable, transition-aware emotional speech synthesis system at the discourse level, enhancing anthropomorphic emotional expressiveness and supporting emotionally intelligent conversational agents.
[12] LLMs Generate Kitsch
Xenia Klinge, Stefan Ortlieb, Alexander Koller
Main category: cs.CL
Abstract: Large Language Models (LLMs) are increasingly used to generate pictures, texts, music, videos, and other works that have traditionally required human creativity. LLM-generated artifacts are often rated better than human-generated works in controlled studies. At the same time, they can come across as generic and hollow. We propose to resolve this tension by arguing that LLMs systematically generate kitsch, and that this is a consequence of the way in which they are trained. We also show empirically that readers perceive LLM-generated stories as kitschier if we control for their definition of “kitsch”. We discuss implications for the design of future studies and for creative tasks such as research and coding.
[13] Associative-State Universal Transformers: Sparse Retrieval Meets Structured Recurrence
Liu Xiao
Main category: cs.CL
Abstract: We study whether a structured recurrent state can serve as a compact associative backbone for language modeling while still supporting exact retrieval. We introduce UniMatrix, a Universal Transformer style family that reuses a shared recurrent block across depth and augments it with hybrid state updates, a ROSA-style residual path, and token-conditioned embedding modulation. We evaluate these models on byte-level WikiText-2, synthetic associative recall, throughput profiling on Apple MPS, and a corrected benchmark for triple-token interactions. At small scale, UniMatrix-Core and UniMatrix-ROSA slightly outperform a parameter-matched Transformer on WikiText-2 while using many fewer parameters, reaching 5.084 and 5.083 bits-per-byte versus 5.124. The main negative result is equally important: on associative recall, the original UniMatrix family remains near chance while the Transformer reaches 25.4 percent, showing that compressed recurrent state alone is not enough for exact lookup. A retrieval-oriented follow-up, UniMatrix-Assoc, helps only marginally. By contrast, UniMatrix-SparsePointer, which adds sparse slot routing and direct pointer-logit fusion, reaches 75.6 percent on the original pilot recipe and 99.2 percent on a no-dropout follow-up while using 53.8 percent fewer parameters than the Transformer baseline. Ablations show that the gain comes from sufficient slot capacity and exact pointer-level output routing. Overall, structured recurrent state is promising and parameter-efficient, but strong long-range behavior still requires explicit sparse retrieval and better kernels.
[14] Retrieval-Augmented Multimodal Model for Fake News Detection
Yiheng Li, Weihai Lu, Hanyi Yu, Yue Wang
Main category: cs.CL
Abstract: In recent years, multimodal multidomain fake news detection has garnered increasing attention. Nevertheless, this direction presents two significant challenges: (1) Failure to Capture Cross-Instance Narrative Consistency: existing models usually evaluate each news item in isolation, fail to capture cross-instance narrative consistency, and thus struggle to address the spread of cluster-based fake news driven by social media; (2) Lack of Domain-Specific Knowledge for Reasoning: conventional models, which rely solely on knowledge encoded in their parameters during training, struggle to generalize to new or data-scarce domains (e.g., emerging events or niche topics). To tackle these challenges, we introduce Retrieval-Augmented Multimodal Model for Fake News Detection (RAMM). First, RAMM employs a Multimodal Large Language Model (MLLM) as its backbone to capture cross-modal semantic information from news samples. Second, RAMM incorporates an Abstract Narrative Alignment Module. This component adaptively extracts abstract narrative consistency from diverse instances across distinct domains, aggregates relevant knowledge, and thereby enables the modeling of high-level narrative information. Finally, RAMM introduces a Semantic Representation Alignment Module, which aligns the model’s decision-making paradigm with that of humans. Specifically, it shifts the model’s reasoning process from direct inference on multimodal features to an instance-based analogical reasoning process. Extensive experimental results on three public datasets validate the efficacy of our proposed approach. Our code is available at the following link: https://github.com/li-yiheng/RAMM
[15] Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs
Ashish Balkishan Lathkar
Main category: cs.CL
Abstract: We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model’s confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
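As a small worked example, the Anchoring Threshold Law quoted in the abstract, k*(n) = floor(n/3), can be tabulated directly. The abstract gives only the formula; reading n as the hop depth of the reasoning chain is an assumption made here, and the function name is hypothetical.

```python
def anchoring_threshold(n_hops: int) -> int:
    """k*(n) = floor(n / 3), the formula stated in the abstract.

    Treating n as the hop depth of the reasoning chain is our assumption.
    """
    if n_hops < 0:
        raise ValueError("hop depth must be non-negative")
    return n_hops // 3

# Tabulate the threshold for a few hop depths.
for n in range(2, 8):
    print(f"n={n}  k*={anchoring_threshold(n)}")
```

Under this formula the threshold steps up once every three hops: it is 0 for n = 2, 1 for n = 3 to 5, and 2 for n = 6 and 7.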
[16] Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
Alice Gao, Weixi Tong, Rishab Vempati, Katharina Reinecke, R. Benjamin Shapiro, Tianyi Zhang, Jason Wu
Main category: cs.CL
Abstract: Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs), but doing so remains a costly and time-intensive process. Prior work has used computer use agents (CUAs) and other generative agents that can simulate user interactions and preferences, but we show that agents still struggle to provide accurate usability assessments. In this work, we present a novel machine learning method that operationalizes a computational definition of usability to train CUAs to assess GUI usability by i) prioritizing important interaction flows, ii) executing them through human-like interactions, and iii) predicting a learned numerical usability score. We train a computer use agent, uxCUA, with our algorithm on a large-scale dataset of fully interactive user interfaces (UIs) paired with usability labels and human preferences. We show that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs. More broadly, our work aims to build a principled, data-driven foundation for automated usability assessment in HCI.
[17] BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets
Richard A. A. Jonker, Bárbara Maria Ribeiro de Abreu Martins, Sérgio Matos
Main category: cs.CL
Abstract: This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, where small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, with most of the pairs being enriched with relevant document snippets from PubMed. First, we demonstrate the framework’s value and the dataset’s quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Second, we establish its practical utility by showing that augmenting downstream benchmarks with our data improves accuracy on PubMedQA from 49.2% to 68.5% in a low-resource setting, and on MedQA from a 41.4% baseline to 44.8% in a full-resource setting. Our framework provides a robust and generalizable solution for creating critical resources to advance complex QA tasks, including MCQA and KGQA. All resources supporting this work, including the dataset (https://zenodo.org/records/17381119) and framework code (https://github.com/ieeta-pt/BioGraphletQA), are publicly available to facilitate use, reproducibility and extension.
[18] From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model
Mengya Hu, Qiong Wei, Sandeep Atluri
Main category: cs.CL
Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user’s input and the model’s response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
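The paired, transition-based view described in the abstract can be illustrated with a minimal sketch: each record pairs a prompt severity with a response severity, and the transition is labeled by comparing the two. The severity values, the toy records, and the function names below are hypothetical; the paper's actual labeling scheme and data are not reproduced here.

```python
from collections import Counter

LABELS = ("de-escalate", "preserve", "escalate")

def transition(prompt_severity: int, response_severity: int) -> str:
    """Label how severity changes from the prompt to the response."""
    if response_severity < prompt_severity:
        return "de-escalate"
    if response_severity > prompt_severity:
        return "escalate"
    return "preserve"

def transition_shares(pairs):
    """Fraction of records falling into each transition label."""
    counts = Counter(transition(p, r) for p, r in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to analyze")
    return {label: counts[label] / total for label in LABELS}

# Hypothetical (prompt severity, response severity) pairs on an ordinal scale.
records = [(4, 0), (2, 2), (6, 2), (0, 2), (4, 4)]
shares = transition_shares(records)
print(shares)
```

A binary harmful/not-harmful label would treat the pairs (6, 2) and (0, 2) identically, whereas the paired view separates a de-escalation from an escalation.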
[19] HIVE: Hidden-Evidence Verification for Hallucination Detection in Diffusion Large Language Models
Guoshenghui Zhao, Weijie Zhao, Tan Yu
Main category: cs.CL
Abstract: Diffusion large language models generate text through multi-step denoising, where hallucination signals may emerge throughout the trajectory rather than only in the final output. Existing detectors mainly rely on output uncertainty or coarse trace statistics, which often fail to capture the richer hidden dynamics of D-LLMs. We propose HIVE, a hidden-evidence verification framework that extracts compressed hidden evidence from denoising trajectories, selects informative step-layer evidence, and conditions a verifier language model on the selected evidence through prefix embeddings. HIVE produces both a continuous hallucination score from verifier decision logits and structured verification outputs, including hallucination types, evidence pairs, and short rationales. Across two D-LLMs and three QA benchmarks, HIVE consistently outperforms eight strong baselines and achieves up to 0.9236 AUROC and 0.9537 AUPRC. Ablation studies further confirm the importance of hidden-evidence conditioning, learned evidence selection, two-stream evidence representation, and step-layer embeddings. These results suggest that selected hidden evidence from denoising trajectories provides a stronger and more usable hallucination signal than output-only uncertainty or coarse trace statistics.
[20] Structural Generalization on SLOG without Hand-Written Rules
Zichao Wei
Main category: cs.CL
Abstract: Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves 100% type-exact match on 11 of 17 structural generalization categories, including three where AM-Parser scores 0 to 74%, with an overall standard deviation of 0.2 across 10 seeds (vs. AM-Parser’s 4.3). Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., 41.4%) are mixtures of structurally distinct CCG patterns, not partial generalization. All failures correspond to directed operations absent from training; all successes correspond to operations already covered.
[21] Test-Time Safety Alignment
Baturay Saglam, Dionysis Kalogerias
Main category: cs.CL
Abstract: Recent work has shown that a model’s input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input embeddings, and then applies gradient descent on these embeddings to minimize the harmfulness of the generated text. Experiments show that the proposed method can neutralize every safety-flagged response on standard safety benchmarks.
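As a rough illustration of the zeroth-order idea in this abstract, the sketch below estimates a gradient from function evaluations only and then runs gradient descent on the input vector. It is not the authors' implementation: `harm_score` is a hypothetical stand-in for a black-box moderation score (here a simple quadratic), and the two-point random-direction estimator, step size, and all function names are assumptions chosen for the example.

```python
import random

def harm_score(embedding):
    # Hypothetical stand-in for a black-box moderation score (lower is better).
    # A real system would query an external API instead of this quadratic.
    return sum((x - 1.0) ** 2 for x in embedding)

def zeroth_order_grad(f, x, mu=1e-3, num_dirs=20, rng=random):
    """Two-point random-direction gradient estimate using only evaluations of f."""
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]          # random probe direction
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)  # directional derivative estimate
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs           # average over directions
    return grad

def minimize(f, x0, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zeroth_order_grad(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x
```

The estimator never differentiates `f`, which is what makes the approach compatible with a scoring service that only returns numbers.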
[22] EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation
Ting-Wei Li, Sirui Chen, Jiaru Zou, Yingbing Huang, Tianxin Wei, Jingrui He, Hanghang Tong
Main category: cs.CL
Abstract: Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires iteratively improving the model toward a targeted task, yet collecting high-quality human-labeled data to support this process is costly and difficult to scale. As a result, synthetic data generation has emerged as a flexible and scalable alternative. One straightforward approach is through an iterative generation-training loop, where candidate data are synthesized through an external generator, the model is updated using these data, and the process is repeated over iterations. However, generated samples can be noisy, highly redundant, or even misaligned with the targeted task distribution. Training indiscriminately on such data can dilute useful learning signals and even degrade model performance. To address this, we introduce a refined paradigm, namely an iterative generation-selection-training loop, which incorporates a selection step prior to model updates. Building on this paradigm, we propose EvoSelect, a data-efficient framework to evolve LLMs effectively. Given candidate samples produced by the data generator, EvoSelect selects training data by jointly modeling targeted task alignment and diversity. We estimate task relevance through optimal transport with proxy gradient representations, which quantifies how well candidate samples align with the targeted task distribution. To mitigate redundancy, we incorporate a diversification mechanism that promotes coverage of complementary training samples. By interleaving alignment and diversification, EvoSelect enables progressive LLM evolution toward targeted tasks. Extensive experiments on various benchmarks demonstrate that with either weak or strong data generators, EvoSelect consistently improves adaptation efficacy over existing data selection methods.
[23] Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging
Jon-Paul Cacioli
Main category: cs.CL
Abstract: A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.
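For readers unfamiliar with the Jensen-Shannon divergence used above to compare response-position distributions, here is a minimal sketch. The base-2 logarithm (which bounds the value by 1), the ten-letter option set, and the helper names are assumptions; the abstract does not state which logarithm base the paper uses.

```python
import math

def kl_divergence(p, q):
    # KL(p || q) in bits; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Symmetric divergence: average KL of each distribution to their mixture.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def position_distribution(responses, letters="ABCDEFGHIJ"):
    # Turn a list of chosen option letters into a probability vector.
    counts = {letter: 0 for letter in letters}
    for r in responses:
        counts[r] += 1
    total = len(responses)
    return [counts[letter] / total for letter in letters]
```

Two identical position distributions give 0, and two distributions with no shared positions give 1 under the base-2 convention, so a value such as 0.027 indicates near-identical aggregate behaviour.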
[24] Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
Theodore Glavas, Nikhita Vedula, Dushyanta Dhyani, Yilun Zhu, Shervin Malmasi
Main category: cs.CL
Abstract: Some text generation tasks, such as Attribute Value Extraction (AVE), require decoding multiple independent sequences from the same document context. While standard autoregressive decoding is slow due to its sequential nature, the independence between output sequences offers an opportunity for parallelism. We present Hyper-Parallel Decoding, a novel decoding algorithm that accelerates offline decoding by leveraging both shared memory and computation across batches. HPD enables out-of-order token generation through position ID manipulation, significantly improving efficiency. Experiments on AVE show that attribute-value pairs are conditionally independent, enabling us to parallelize value generation within each prompt. By further stacking multiple documents within a single prompt, we can decode in parallel up to 96 tokens per prompt. HPD works with all LLMs, and reduces both inference costs and total inference time by up to 13.8X without compromising output quality, potentially saving hundreds of thousands of dollars on industry AVE tasks. Although designed for attribute extraction, HPD makes no assumptions unique to the AVE domain and can in theory be applied to other scenarios with independent output structures.
[25] Comparative Analysis of AutoML and BiLSTM Models for Cyberbullying Detection on Indonesian Instagram Comments
Raihana Adelia Putri, Aisyah Musfirah, Anggi Puspita Ningrum, Luluk Muthoharoh, Ardika Satria, Martin Clinton Tosima Manullang
Main category: cs.CL
Abstract: This study compares machine learning and deep learning approaches for cyberbullying detection in Indonesian-language Instagram comments. Using a balanced dataset of 650 comments labeled as Bullying and Non-Bullying, the study evaluates Naive Bayes, Logistic Regression, and Support Vector Machine with TF-IDF features, as well as BiLSTM and BiLSTM with Bahdanau Attention. A preprocessing pipeline tailored to informal Indonesian text is applied, including slang normalization, stopword removal, and stemming. The results show that Logistic Regression performs best among the machine learning models, while BiLSTM with Attention achieves the strongest overall deep learning performance. The findings highlight the value of domain-specific preprocessing and show that although deep learning captures contextual patterns more effectively, machine learning remains a competitive option for resource-constrained deployments.
[26] A New Semisupervised Technique for Polarity Analysis using Masked Language Models
Kohei Watanabe
Main category: cs.CL
Abstract: I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns polarity scores to words and documents as predicted probabilities of seed words to occur in given contexts. These probabilistic polarity scores are more accurate, interpretable and consistent than those spatial polarity models can produce in text analysis. I demonstrate these advantages by applying both probabilistic and spatial models to China Daily’s coverage of China and other countries during the coronavirus disease (COVID) pandemic in terms of achievement in health issues. The result suggests that more advanced masked language models would further improve the semisupervised machine learning technique.
[27] StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
Yerong Wu, Tianxing Wu, Minghao Zhu, Hangyu Sha, Haofen Wang
Main category: cs.CL
Abstract: Achieving realistic, human-like conversation for virtual characters requires not only simple memorization and recall of past events, but also the strategic use of memory to meet factual needs and support social engagement. Current memory-related benchmarks (e.g., for memory-augmented generation and long-term dialogue) overlook this nuance, treating memory primarily as a static repository of facts rather than a dynamic resource to be strategically deployed in dialogues. To address this gap, we design StratMem-Bench, a new benchmark to evaluate strategic memory use in character-centric dialogues. This dataset comprises 657 instances where virtual characters must navigate heterogeneous memory pools containing required, supportive, and irrelevant memories. We also propose a framework with different evaluation metrics, including Strict Memory Compliance, Memory Integration Quality, Proactive Enrichment Score, and Conditional Irrelevance Rate, to evaluate the strategic memory use capabilities of virtual characters. Experiments on StratMem-Bench, which use state-of-the-art large language models as virtual characters, show that all models perform well at distinguishing between required and irrelevant memories, but struggle once supportive memories are introduced into the decision process.
[28] FlowBot: Inducing LLM Workflows with Bilevel Optimization and Textual Gradients
Hongyeon Yu, Young-Bum Kim, Yoon Kim
Main category: cs.CL
Abstract: LLM workflows, which coordinate structured calls to individual LLMs (each augmented with varying instructions and tools) to achieve a particular goal, offer a promising path towards extending the capabilities of LLMs and building powerful systems that can tackle diverse tasks. However, existing approaches for building such workflows generally rely on human-crafted pipelines and prompts, which presents a substantial bottleneck in real-world deployment. How can we automatically induce and optimize such workflows in a data-driven way? This paper describes a simple data-driven approach for automatically inducing LLM workflows. We formulate workflow induction as a bilevel optimization problem: an outer loop which optimizes a high-level sketch of the workflow (in particular how the LLM calls should be structured), and an inner loop which optimizes each individual LLM call one by one. Both loops are optimized with “textual gradients”, where for the inner loop we optimize each component in a modular way through “backpropagating” textual gradients layer by layer. We find that LLM workflows discovered through our FlowBot (workflow induction through bilevel optimization and textual gradients) approach perform competitively against strong baselines that make use of human-crafted or automatically generated workflows.
[29] Calibrated Surprise: An Information-Theoretic Account of Creative Quality
Bo Zou, Chao Xu
Main category: cs.CL
Abstract: The essence of good creative writing is calibrated surprise: when constraints from all relevant dimensions act together, the feasible solution space collapses into a narrow region, and the surviving choices look least predictable from an unconstrained view. “Calibrated” has a precise meaning: the author’s intent, the reader’s reasonable expectation, and the logic of reality converge. When these three independent judgements agree on every dimension, the set of admissible writing choices is forced into a very small region. A mathematical corollary follows: full-dimensional accuracy and mediocrity are mutually exclusive – two sides of one constraint structure, not separate goals. We use Shannon’s mutual information $I(X;Y) = H(X) - H(X|Y)$ as our analysis tool. “Calibrated” corresponds to conditional entropy going to zero; “surprise” to entropy going up; mutual information is the precise measure of the joint quantity. The argument rests on two pillars. Static: when constraints from ethos, mythos, lexis, and dianoia are imposed together, the admissible set collapses sharply, and surviving solutions show up as low-probability choices from an unconstrained view. Dynamic: the chain rule shows each writing choice is constrained by what came before and constrains what comes after; macro-level decisions naturally contribute a larger share of information, removing the need for hand-tuned weighting. Through case studies and lightweight LLM-logprob computations, we show the framework is both analytically useful and operational, laying the theoretical groundwork for Creative Quality Alignment (CQA) and a professional evaluation benchmark.
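The identity $I(X;Y) = H(X) - H(X|Y)$ quoted above can be checked numerically on a small joint distribution. The sketch below is only an illustration of the formula: the function names are assumptions, and how X and Y map onto writing choices and constraints is left to the paper.

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """joint[x][y] = P(X=x, Y=y). Returns I(X;Y) = H(X) - H(X|Y) in bits."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_x = entropy(p_x)
    # H(X|Y) = sum over y of P(y) * H(X | Y=y)
    h_x_given_y = 0.0
    for y, py in enumerate(p_y):
        if py > 0:
            conditional = [row[y] / py for row in joint]
            h_x_given_y += py * entropy(conditional)
    return h_x - h_x_given_y
```

With a fully determined joint such as `[[0.5, 0.0], [0.0, 0.5]]`, the conditional entropy is zero and the mutual information equals H(X), which matches the abstract's reading of "calibrated" as conditional entropy going to zero.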
[30] Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference
Vasu Shyam, Anna Golubeva, Quentin Anthony
Main category: cs.CL
Abstract: We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as a viable axis of parallelism in concert with existing parallelism schemes such as pipeline and expert parallelism for dense and mixture-of-experts models.
[31] Benchmarking PyCaret AutoML Against BiLSTM for Fine-Grained Emotion Classification: A Comparative Study on 20-Class Emotion Detection
Arya Muda Siregar, Arielva Simon Siahaan, Haikal Fransisko Simbolon, Luluk Muthoharoh, Ardika Satria, Martin C. T. Manullang
Main category: cs.CL
Abstract: Fine-grained emotion classification, which identifies specific emotional states such as happiness, anger, sadness, and fear, remains a challenging task in natural language processing. This study benchmarks classical machine learning and deep learning approaches for 20-class emotion classification using the 20-Emotion Text Classification Dataset containing 79,595 English sentences. On the machine learning side, Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine are evaluated using TF-IDF features. On the deep learning side, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, and a lightweight Transformer implemented in PyTorch are compared. The results show that BiLSTM achieves the best overall performance with 89% accuracy and a weighted F1-score of 0.89, slightly outperforming the best machine learning model, SVM, which reaches 88.11% accuracy. The findings indicate that while traditional machine learning models remain competitive and computationally efficient, sequence-based deep learning models better capture contextual emotional cues in text.
[32] Classification of Public Opinion on the Free Nutritional Meal Program on YouTube Media Using the LSTM Method
Berliana Enda Putri, Lisa Diani Amelia, Muhammad Zaky Zaiddan, Luluk Muthoharoh, Ardika Satria, Martin Clinton Tosima Manullang
Main category: cs.CL
Abstract: Public opinion towards the Free Nutritious Meal Program (MBG) on YouTube social media reflects diverse community responses. This study applies the Long Short-Term Memory (LSTM) method to classify sentiments from 7,733 YouTube comments. The results show that the LSTM model achieves 89% accuracy, with strong performance on negative sentiment (F1-score 0.94) but weaker performance on positive sentiment (F1-score 0.55) due to class imbalance, as negative data account for 87.7% of the dataset. These findings confirm the effectiveness of LSTM for sentiment analysis of Indonesian text while highlighting the challenge of imbalanced data. This research contributes to social media-based public policy evaluation.
[33] A Systematic Comparison of Prompting and Multi-Agent Methods for LLM-based Stance Detection
Genan Dai, Zini Chen, Yi Yang, Bowen Zhang
Main category: cs.CL
Abstract: Stance detection identifies the attitude of a text author toward a given target. Recent studies have explored various LLM-based strategies for this task, from zero-shot prompting to multi-agent debate. However, existing works differ in data splits, base models, and evaluation protocols, making fair comparison difficult. We conduct a systematic comparison that evaluates five methods across two categories – prompt-based inference (Direct Prompting, Auto-CoT, StSQA) and agent-based debate (COLA, MPRF) – on four datasets with 14 subtasks, using 15 LLMs from six model families with parameter sizes from 7B to 72B+. Our experiments yield several findings. First, on all models with complete results, the best prompt-based method outperforms the best agent-based method, while agent methods require 7 to 12 times more API calls per sample. Second, model scale has a larger impact on performance than method choice, with gains plateauing around 32B. Third, reasoning-enhanced models (DeepSeek-R1) do not consistently outperform general models of the same size on this task.
[34] DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
Siyuan Li, Aodu Wulianghai, Guangyan Li, Xi Lin, Qinghua Mao, Yuliang Chen, Jun Wu, Jianhua Li
Main category: cs.CL
Abstract: The rapid advancement of large language models (LLMs) presents new security challenges, particularly in detecting machine-generated text used for misinformation, impersonation, and content forgery. Most existing detection approaches struggle with robustness against adversarial perturbation, paraphrasing attacks, and domain shifts, often requiring restrictive access to model parameters or large labeled datasets. To address this, we propose DSIPA, a novel training-free framework that detects LLM-generated content by quantifying sentiment distributional stability under controlled stylistic variation. It is based on the observation that LLMs typically exhibit more emotionally consistent outputs, while human-written texts display greater affective variation. Our framework operates in a zero-shot, black-box manner, leveraging two unsupervised metrics, sentiment distribution consistency and sentiment distribution preservation, to capture these intrinsic behavioral asymmetries without the need for parameter updates or probability access. Extensive experiments are conducted on state-of-the-art proprietary and open-source models, including GPT-5.2, Gemini-1.5-pro, Claude-3, and LLaMa-3.3. Evaluations on five domains, such as news articles, programming code, student essays, academic papers, and community comments, demonstrate that DSIPA improves F1 detection scores by up to 49.89% over baseline methods. The framework exhibits superior generalizability across domains and strong resilience to adversarial conditions, providing a robust and interpretable behavioral signal for secure content identification in the evolving LLM landscape.
[35] A Dual-Task Paradigm to Investigate Sentence Comprehension Strategies in Language Models
Rei Emura, Saku Sugawara
Main category: cs.CL
Abstract: Language models (LMs) behave more like humans when their cognitive resources are restricted, particularly in predicting sentence processing costs such as reading times. However, it remains unclear whether such constraints similarly affect sentence comprehension strategies. Besides, existing methods do not directly target the balance between memory storage and sentence processing, which is central to human working memory. To address this issue, we propose a dual-task paradigm that combines an arithmetic computation task with a sentence comprehension task, such as “The 2 cocktail + blended 3 =…” Our experiments show that under dual-task conditions, GPT-4o, o3-mini, and o4-mini shift toward plausibility-based comprehension, mirroring humans’ rational inference. Specifically, these models show a greater accuracy gap between plausible sentences (e.g., “The cocktail was blended by the bartender”) and implausible sentences (e.g., “The bartender was blended by the cocktail”) in the dual-task condition compared to the single-task conditions. These findings suggest that constraints on the balance between memory and processing resources promote rational inference in LMs. More broadly, they support the view that human-like sentence comprehension fundamentally arises from the allocation of limited cognitive resources.
[36] Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens
Zhenyu Zhao, Sander Land, Dan Bikel, Waseem Alshikh
Main category: cs.CL
Abstract: Reasoning in Large Language Models incurs significant inference-time compute, yet the token-level information structure of reasoning traces remains underexplored. We observe that reasoning tokens split into two functional types: low-entropy structural tokens (recurring phrases that scaffold the reasoning process) and higher-entropy organic tokens (problem-specific content that drives toward a solution). This asymmetry motivates a simple, model-agnostic compression pipeline: apply cross-word BPE merges on a model’s own reasoning traces to derive supertokens that capture frequent structural patterns, then teach the model to adopt them via supervised fine-tuning. Across three model families and five mathematical reasoning benchmarks, our approach shortens reasoning traces by 8.1% on average with no statistically significant accuracy loss on any model–benchmark pair. Beyond compression, supertokens act as interpretable reasoning-move annotations (backtracking, verification, strategy shifts), exposing the model’s high-level strategy at a glance. Analyzing transitions between structural categories reveals systematic differences between correct and incorrect traces: correct traces show productive recovery (backtracking followed by strategy shifts and verification), while incorrect traces are dominated by confusion cycles (repeated hedging and unresolved contradictions). These diagnostic signals suggest applications in reward shaping and early stopping for RL-based reasoning training.
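The cross-word merge step described in this abstract can be pictured with a toy BPE-style loop that repeatedly fuses the most frequent adjacent token pair across word boundaries. This is only a sketch under simplifying assumptions: it works on whitespace-split words rather than a real tokenizer vocabulary, and the function names and the `min_count` cutoff are hypothetical, not taken from the paper.

```python
from collections import Counter

def most_frequent_pair(traces):
    # Count adjacent token pairs over all traces; return (pair, count) or None.
    pairs = Counter()
    for tokens in traces:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(tokens, pair):
    # Replace each non-overlapping occurrence of `pair` with one fused token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_supertokens(traces, num_merges, min_count=2):
    # Greedily learn up to `num_merges` supertokens from tokenized traces.
    traces = [list(t) for t in traces]
    supertokens = []
    for _ in range(num_merges):
        best = most_frequent_pair(traces)
        if best is None or best[1] < min_count:
            break
        pair, _count = best
        supertokens.append(pair[0] + " " + pair[1])
        traces = [merge_pair(t, pair) for t in traces]
    return supertokens, traces
```

Recurring scaffolding phrases such as "let me" get fused first because they dominate the pair counts, which shortens every trace that contains them.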
[37] Text Style Transfer with Machine Translation for Graphic Designs
Deergh Singh Budhauria, Sanyam Jain, Rishav Agarwal, Tracy King
Main category: cs.CL
Abstract: Globalization of graphic designs such as those used in marketing materials and magazines is increasingly important for communication to broad audiences. To accomplish this, the textual content in the graphic designs needs to be accurately translated and have the text styling preserved in order to fit visually into the design. Preserving text styling requires high accuracy word alignment between the original and the translated text. The problem of word alignment between source and translated text is long known. The industry standards for extracting word alignments are defined by Giza++ and attention probabilities from neural machine translation (NMT) models. In this paper, we explore three new methods to tackle the word alignment problem for transferring text styles from the source to the translated text. The proposed methods are developed on top of commercially available NMT and LLM translation technologies. They include: NMT with custom input and output tags for text styling; LLM with custom input and output tags; a hybrid with NMT for translation followed by an LLM with use of unigram mappings. To analyze the performance of these solutions, their alignment results are compared with the results of an attention head approach to gauge their usability in graphic design applications. Interestingly, the attention head strong baseline proves more accurate than the LLM or NMT approach and on par with the hybrid NMT+LLM approach.
[38] SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
Gabriel Stefan, Sergiu Nisioi
Main category: cs.CL
Abstract: We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.
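The chunking and pooling steps mentioned above can be sketched as follows. The 512-token window comes from the abstract; the stride of 256, the function names, and the use of plain Python lists in place of encoder outputs are assumptions, and the encoder itself is omitted.

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Split token_ids into chunks of at most `window` tokens.

    Chunks overlap whenever stride < window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the current chunk already reaches the end of the input
        start += stride
    return chunks

def max_pool(chunk_vectors):
    """Element-wise maximum over equally sized chunk representations."""
    return [max(dims) for dims in zip(*chunk_vectors)]
```

In a full system each chunk would be encoded to a fixed-size vector, and `max_pool` would combine those vectors into one response-level representation for the classification heads.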
[39] Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
Saurabh K. Singh, Sachin Raj
Main category: cs.CL
Abstract: Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own – what’s still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it – BM25, dense embedding, and a hybrid – all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn’t grow monotonically with document length – short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation 0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren’t. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don’t oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers – it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
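Since the headline retrieval numbers above are nDCG@5 scores, a short sketch of the metric may help. It uses the common linear-gain form with a log2 rank discount; whether the paper uses this exact variant is an assumption, and the ideal ranking here is computed only from the retrieved list, which is a simplification.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevance divided by log2(rank + 1), ranks from 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    # Normalize by the DCG of the same relevance labels in the best possible order.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A score near 1 means the most relevant passages already appear at the top of the ranking, so a gap of 0.92 versus 0.91 reflects only a small difference in ordering quality.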
[40] When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
Tianyu Liu, Yuhao Shen, Xinyi Hu, Baolin Zhang, Hengxin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, MingCheng Wan
Main category: cs.CL
Abstract: Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model’s KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
[41] Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation
Akhil Rajeev P, Annarao Kulkarni
Main category: cs.CL
Abstract: The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition. While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high-quality silver-standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B-parameter hybrid reasoning model to create grammatically natural and synthetically diverse training data. We utilize this dataset to benchmark two transformer architectures: the massive multilingual XLM RoBERTa and the parameter-efficient IndicBERTv2.
[42] Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
Yash Ganpat Sawant
Main category: cs.CL
Abstract: Stylistic personalization - making LLMs write in a specific individual’s style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics produce near-zero pairwise correlations, with absolute r less than 0.07, confirming that without theoretical grounding, metric choice determines conclusions: an LLM judge declares a clear winner while LUAR finds no meaningful differentiation. These findings demonstrate the theory-benchmark cycle in action: authorship theory exposes evaluation failures that ad hoc benchmarks miss.
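One way to read the calibrated baselines is to rescale a raw score so that the cross-author floor maps to 0 and the human ceiling maps to 1. The sketch below does that arithmetic with the figures quoted in the abstract; the linear rescaling and the name `rescale` are illustrative assumptions, not a metric defined by the paper.

```python
def rescale(score, floor, ceiling):
    """Map floor -> 0.0 and ceiling -> 1.0; scores below the floor become negative."""
    return (score - floor) / (ceiling - floor)


# Figures quoted in the abstract.
HUMAN_CEILING = 0.756
CROSS_AUTHOR_FLOOR = 0.626
METHOD_RANGE = (0.484, 0.508)

for raw in METHOD_RANGE:
    print(raw, round(rescale(raw, CROSS_AUTHOR_FLOOR, HUMAN_CEILING), 2))
```

Under this assumed rescaling both ends of the reported range come out negative (roughly -1.09 and -0.91), which is another way of saying that the generated text resembles the target author less than text by a different human author does.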
[43] StarDrinks: An English and Korean Test Set for SLU Evaluation in a Drink Ordering Scenario
Marcely Zanon Boito, Caroline Brun, Inyoung Kim, Denys Proux, Salah Ait-Mokhtar, Nikolaos Lagos, Jean-Luc Meunier, Ioan Calapodescu
Main category: cs.CL
Abstract: LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterance features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.
[44] Tree-of-Text: A Tree-based Prompting Framework for Table-to-Text Generation in the Sports Domain
Shang-Hsuan Chiang, Tsan-Tsung Yang, An-Zi Yen, Wen-Chih Peng
Main category: cs.CL
Abstract: Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables; and (3) Content Generation, where short textual outputs are merged and rewritten into a cohesive report. Experiments show that our method outperforms existing methods on ShuttleSet+, leads in RG and CO metrics on RotoWire-FG, and excels in CS and CO on MLB with roughly 40% of the time and cost of Chain-of-Table. These results demonstrate the effectiveness and efficiency of Tree-of-Text and suggest a promising direction for prompt-based table-to-text generation in the sports domain.
[45] SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts
Yuan Xin, Yixuan Weng, Minjun Zhu, Ying Ling, Chengwei Qin, Michael Hahn, Michael Backes, Yue Zhang, Linyi Yang
Main category: cs.CL
Abstract: As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts – adversarial instructions embedded in submissions to manipulate outcomes – emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework where a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with their detection. This system is trained using a loss function inspired by Information Retrieval Generative Adversarial Networks, which fosters a dynamic co-evolution between the two models, forcing the Defender to develop robust capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly enhanced resilience to novel and evolving threats compared to static defenses, thereby establishing a critical foundation for securing the integrity of peer review.
[46] Text-Utilization for Encoder-dominated Speech Recognition Models
Albert Zeyer, Tim Posielek, Ralf Schlüter, Hermann Ney
Main category: cs.CL
Abstract: This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
[47] TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon
Main category: cs.CL
Abstract: Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model’s general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.
[48] Multimodal LLMs are not all you need for Pediatric Speech Language Pathology
Darren Fürst, Sebastian Steindl, Ulrich Schäfer
Main category: cs.CL
Abstract: Speech Sound Disorders (SSD) affect roughly five percent of children, yet speech-language pathologists face severe staffing shortages and unmanageable caseloads. We test a hierarchical approach to SSD classification on the granular multi-task SLPHelmUltraSuitePlus benchmark. We propose a cascading approach from binary classification to type and symptom classification. By fine-tuning Speech Representation Models (SRMs) and using targeted data augmentation, we mitigate biases found by previous works and improve upon all clinical tasks in the benchmark. We also treat Automatic Speech Recognition (ASR) with our data augmentation approach. Our results demonstrate that SRMs consistently outperform the LLM-based state-of-the-art across all evaluated tasks by a large margin. We publish our models and code to foster future research.
[49] Translating Under Pressure: Domain-Aware LLMs for Crisis Communication
Antonio Castaldo, Maria Carmen Staiano, Johanna Monti, Sheila Castilho, Francesca Chiusaroli
Main category: cs.CL
Abstract: Timely and reliable multilingual communication is critical during natural and human-induced disasters, but developing effective solutions for crisis communication is limited by the scarcity of curated parallel data. We propose a domain-adaptive pipeline that expands a small reference corpus by retrieving and filtering data from general corpora. We use the resulting dataset to fine-tune a small language model for crisis-domain translation and then apply preference optimization to bias outputs toward CEFR A2-level English. Automatic and human evaluations show that this approach improves readability while maintaining strong adequacy. Our results indicate that simplified English, combined with domain adaptation, can function as a practical lingua franca for emergency communication when full multilingual coverage is not feasible.
[50] Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis
Jakob Fehle, Nils Constantin Hellwig, Udo Kruschwitz, Christian Wolff
Main category: cs.CL
Abstract: Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.
[51] OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu, Shuo Yang, Junhua Ding, Edith Cheuk-Han Ngai
Main category: cs.CL
Abstract: Autonomous LLM agents increasingly operate in long-horizon, interactive settings where success depends on reusing experience accumulated over extended histories. However, existing agent memory systems are fundamentally constrained by text-context budgets: storing or revisiting raw trajectories is prohibitively token-expensive, while summarization and text-only retrieval trade token savings for information loss and fragmented evidence. To address this limitation, we propose Optical Context Retrieval Memory (OCR-Memory), a memory framework that leverages the visual modality as a high-density representation of agent experience, enabling retention of arbitrarily long histories with minimal prompt overhead at retrieval time. Specifically, OCR-Memory renders historical trajectories into images annotated with unique visual identifiers. OCR-Memory retrieves stored experience via a locate-and-transcribe paradigm that selects relevant regions through visual anchors and retrieves the corresponding verbatim text, avoiding free-form generation and reducing hallucination. Experiments on long-horizon agent benchmarks show consistent gains under strict context limits, demonstrating that optical encoding increases effective memory capacity while preserving faithful evidence recovery.
[52] SAGE: A Strategy-Aware Graph-Enhanced Generation Framework For Online Counseling
Eliya Naomi Aharon, Meytal Grimland, Avi Segal, Loona Ben Dayan, Inbar Shenfeld, Yossi Levi Belz, Kobi Gal
Main category: cs.CL
Abstract: Effective mental health counseling is a complex, theory-driven process requiring the simultaneous integration of psychological frameworks, real-time distress signals, and strategic intervention planning. This level of clinical reasoning is critical for safety and therapeutic effectiveness but is often missing in general-purpose Large Language Models (LLMs). We introduce SAGE (Strategy-Aware Graph-Enhanced), a novel framework designed to bridge the gap between structured clinical knowledge and generative AI. SAGE constructs a heterogeneous graph that unifies conversational dynamics with a psychologically grounded layer, explicitly anchoring interactions in a theory-driven lexicon. Our architecture first employs a Next Strategy Classifier to identify the optimal therapeutic intervention. Subsequently, a Graph-Aware Attention mechanism projects graph-derived structural signals into soft prompts, conditioning the LLM to generate responses that maintain clinical depth. Validated through both automated metrics and expert human evaluation, SAGE outperforms baselines in strategy prediction and recommended response quality. By providing actionable intervention recommendations, SAGE serves as a cutting-edge decision-support tool designed to augment human expertise in high-stakes crisis counseling.
[53] Differentially-Private Text Rewriting reshapes Linguistic Style
Stefan Arnold
Main category: cs.CL
Abstract: Differential Privacy (DP) for text has matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text’s communicative signature. This shift is characterized by the severe attrition of interactive markers, contextual references, and complex subordination. By comparing autoregressive paraphrasing against bidirectional substitution across a spectrum of privacy budgets, we observe that both architectures force convergence toward a non-involved and non-persuasive register. This register-blind sanitization effectively preserves semantic content but structurally homogenizes the nuanced stylistic markers that define human-authored discourse.
[54] From Black-Box Confidence to Measurable Trust in Clinical AI: A Framework for Evidence, Supervision, and Staged Autonomy
Serhii Zabolotnii, Viktoriia Holinko, Olha Antonenko
Main category: cs.CL
Abstract: Trust in clinical artificial intelligence (AI) cannot be reduced to model accuracy, fluency of generation, or overall positive user impression. In medicine, trust must be engineered as a measurable system property grounded in evidence, supervision, and operational boundaries of AI autonomy. This article proposes a practical framework for trustworthy clinical AI built around three principles: evidence, supervision, and staged autonomy. Rather than replacing deterministic clinical logic wholesale with end-to-end black-box models, the proposed approach combines a deterministic core, a patient-specific AI assistant for contextual validation, a multi-tier model escalation mechanism, and a human supervision layer for verification, escalation, and risk control. We demonstrate that trust also depends on selective verification of clinically critical findings, bounded clinical context, disciplined prompt architecture, and careful evaluation on realistic cases. Classifier-driven modular prompting is examined as an incremental path to scaling clinical depth without sacrificing prompt performance and without waiting for complete rule-based coverage. To operationalize trust, a set of trust metrics is proposed, built on metrological principles – measurement uncertainty, calibration, traceability – enabling quantitative rather than subjective assessment of each architectural layer. In this perspective, trustworthy clinical AI emerges not as a property of an individual model, but as an architectural outcome of a system into which evidence trails, human oversight, tiered escalation, and graduated action rights are embedded from the outset.
[55] Swap distance minimization shapes the order of subject, object and verb in languages of the world
Jairo Rios-El-Yazidi, Ramon Ferrer-i-Cancho
Main category: cs.CL
Abstract: Languages of the world vary concerning the order of subject, object and verb. The most frequent dominant orders are SOV and SVO, and researchers have tailored models to this fact. However, there are still languages whose dominant order does not conform to these expectations or that even lack a dominant order. Here we show that across linguistic families and macroareas, word order variation within languages is shaped by the principle of swap distance minimization even when the dominant order is not SOV/SVO and even when a dominant order is lacking.
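For readers unfamiliar with the term, the sketch below computes a swap distance between two orders of S, O and V, assuming it is defined as the minimum number of swaps of adjacent constituents needed to turn one order into the other (equivalently, the number of pairwise inversions). The abstract does not spell out the definition, so this reading and the function name `swap_distance` are assumptions.

```python
from itertools import permutations


def swap_distance(order_a, order_b):
    """Minimum number of adjacent swaps turning order_a into order_b."""
    position = {symbol: i for i, symbol in enumerate(order_b)}
    ranks = [position[symbol] for symbol in order_a]
    # Each inverted pair needs exactly one adjacent swap to be put right.
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


# Distance from SOV to each of the six possible orders.
for order in ("".join(p) for p in permutations("SOV")):
    print(order, swap_distance("SOV", order))
```

Under this definition SVO and OSV are one swap away from SOV, while VOS is the farthest order at three swaps.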
[56] Domain-Adapted Small Language Models for Reliable Clinical Triage
Manar Aljohani, Brandon Ho, Kenneth McKinley, Dennis Ren, Xuan Wang
Main category: cs.CL
Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small language models (SLMs) can serve as reliable, privacy-preserving decision-support tools for clinical triage. We systematically compared multiple SLMs across diverse prompting pipelines and found that clinical vignettes, concise summaries of triage narratives, yielded the most accurate predictions. The SLM, Qwen2.5-7B, demonstrated the strongest balance of accuracy, stability, and computational efficiency. Through large-scale domain adaptation using expert-curated and silver-standard pediatric triage data, fine-tuned Qwen2.5-7B models substantially reduced discordance and clinically significant errors, outperforming all baseline SLMs and advanced proprietary large language models (LLMs, e.g., GPT-4o). These findings highlight the feasibility of institution-specific SLMs for reliable, privacy-preserving ESI decision support and underscore the importance of targeted fine-tuning over more complex inference strategies.
[57] Decoupling Knowledge and Task Subspaces for Composable Parametric Retrieval Augmented Generation
Weihang Su, Hanwen Zhang, Qingyao Ai, Yiqun Liu
Main category: cs.CL
Abstract: Parametric Retrieval-Augmented Generation (PRAG) encodes external documents into lightweight parameter modules that can be retrieved and merged at inference time, offering a promising alternative to in-context retrieval augmentation. Despite its potential, many PRAG implementations train document adapters with task-supervised objectives, which may cause each adapter to encode both document-specific facts and reusable task-solving behavior. This entanglement may make adapter composition less reliable: when multiple adapters are merged at inference time, their overlapping task behaviors can accumulate together with document-specific updates, potentially making the merged adapter less stable and less focused on the intended document knowledge. To examine this issue, we explore Orthogonal Subspace Decomposition (OSD), an adapter-training setup that separates reusable task behavior from document-specific knowledge adapters. Concretely, we first train a Task LoRA to capture reusable task behavior, and then train document LoRAs to encode document-specific knowledge in an orthogonal subspace. This setup provides a controlled way to examine how orthogonalizing task and document LoRA updates affects adapter composition in multi-document PRAG. Experiments across multiple knowledge-intensive tasks and model scales suggest that this orthogonalization strategy can improve compositional robustness in parametric RAG, especially when multiple document adapters are merged.
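The abstract does not say how orthogonality between the task and document updates is enforced. As a rough illustration of what "in an orthogonal subspace" means, the sketch below projects a flattened update vector onto the orthogonal complement of a set of task directions. Treating updates as plain vectors, applying the projection after the fact, and the names `orthonormal_basis` and `project_out` are all assumptions made for this example, not the paper's training procedure.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update, task_directions):
    """Remove from `update` every component lying in the span of `task_directions`."""
    residual = list(update)
    for b in orthonormal_basis(task_directions):
        coeff = dot(residual, b)
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return residual


# Toy 3-dimensional example with made-up numbers.
task_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
document_update = [2.0, 3.0, 4.0]
cleaned = project_out(document_update, task_directions)
print(cleaned)
```

After the projection, the cleaned update has zero inner product with every task direction, so adding several such updates together cannot accumulate anything along the task subspace.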
[58] HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists
Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Main category: cs.CL
Abstract: We introduce HalluCiteChecker, a toolkit for detecting and verifying hallucinated citations in scientific papers. While AI assistant technologies have transformed the academic writing process, including citation recommendation, they have also led to the emergence of hallucinated citations that do not correspond to any existing work. Such citations not only undermine the credibility of scientific papers but also impose an additional burden on reviewers and authors, who must manually verify their validity during the review process. In this study, we formalize hallucinated citation detection as an NLP task and provide a corresponding toolkit as a practical foundation for addressing this problem. Our package is lightweight and can perform verification in seconds on a standard laptop. It can also be executed entirely offline and runs efficiently using only CPUs. We hope that HalluCiteChecker will help reduce reviewer workload and support organizers by enabling systematic pre-review and publication checks. Our code is released under the Apache 2.0 license on GitHub and is distributed as an installable package via PyPI. A demonstration video is available on YouTube.
[59] What Kind of Language is Easy to Language-Model Under Curriculum Learning?
Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe
Main category: cs.CL
Abstract: Many of the thousands of attested languages share common configurations of features, creating a spectrum from typologically very rare (e.g., object-verb-subject word order) or impossible languages to very common combinations of features (e.g., subject-object-verb word order). One central question is under what conditions such typological tendencies can be predicted, and specifically whether the learning bias of language models (LMs) is sufficient to reproduce such patterns. In this study, we add one dimension to such analysis – the learning scenario for LMs – to explore its interaction with the inductive bias of LMs. Specifically, as a first study, we examine the effect of curriculum learning (CL), as a developmentally motivated learning scenario, i.e., starting with simpler sentences rather than randomly-ordered input. We expand existing LM-based exploration (El-Naggar et al., 2025a,b) with a simple CL variant and find that CL substantially impacts the apparent inductive bias of LMs.
[60] MoRFI: Monotonic Sparse Autoencoder Feature Identification
Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas
Main category: cs.CL
Abstract: Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI filters SAE features that respond monotonically to controlled fine-tuning data mixtures of a target property. Our findings show that exposure to unknown facts disrupts the model’s ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers them across distinct models, recovering knowledge through single-latent interventions.
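The filtering step can be pictured as follows: for each SAE feature, collect its mean activation at each level of the controlled data mixture and keep the feature only if the values move in one direction. The sketch below is a simplified stand-in for that idea; the monotonicity test, the `min_range` threshold, the feature ids and all numbers are hypothetical and not taken from the paper.

```python
def is_monotonic(values, tol=0.0):
    """True if values never decrease, or never increase, by more than `tol`."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    non_decreasing = all(d >= -tol for d in diffs)
    non_increasing = all(d <= tol for d in diffs)
    return non_decreasing or non_increasing


def monotonic_features(activations, min_range=0.1):
    """Keep features that are monotonic across mixtures and vary by at least min_range."""
    kept = []
    for feature_id, values in activations.items():
        if is_monotonic(values) and max(values) - min(values) >= min_range:
            kept.append(feature_id)
    return kept


# Mean activation per feature at 0%, 25%, 50%, 75%, 100% new knowledge (made-up).
activations = {
    "f12": [0.10, 0.22, 0.35, 0.51, 0.80],  # rises steadily
    "f47": [0.40, 0.10, 0.45, 0.05, 0.42],  # oscillates
    "f88": [0.90, 0.70, 0.40, 0.20, 0.10],  # falls steadily
    "f93": [0.30, 0.30, 0.31, 0.31, 0.32],  # monotonic but nearly flat
}
print(monotonic_features(activations))
```

The range threshold is there so that features which are technically monotonic but barely change are not reported as responding to the mixture.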
[61] HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering
Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud, Omar Faruque, Tera L Reynolds, Lujie Karen Chen
Main category: cs.CL
Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by focusing on grounded question answering over EHRs, and this paper presents the system developed by the HealthNLP_Retrievers team for this task. The proposed approach uses a multi-stage cascaded pipeline powered by the Gemini 2.5 Pro large language model to interpret patient-authored questions and retrieve relevant evidence from lengthy clinical notes. Our architecture comprises four integrated modules: (1) a few-shot query reformulation unit which summarizes verbose patient queries; (2) a heuristic-based evidence scorer which ranks clinical sentences to prioritize recall; (3) a grounded response generator which synthesizes professional-caliber answers restricted strictly to identified evidence; and (4) a high-precision many-to-many alignment framework which links generated answers to supporting clinical sentences. This cascaded approach achieved competitive results. Across the individual tracks, the system ranked 1st in question interpretation, 5th in answer generation, 7th in evidence identification, and 9th in answer-evidence alignment. These results show that integrating large language models within a structured multi-stage pipeline improves grounding, precision, and the professional quality of patient-oriented health communication. To support reproducibility, our source code is publicly available in our GitHub repository.
[62] ClawGym: A Scalable Framework for Building Effective Claw Agents
Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao
Main category: cs.CL
Abstract: Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will soon be released at https://github.com/ClawGym.
[63] Select to Think: Unlocking SLM Potential with Local Sufficiency
Wenxuan Ye, Yangyang Zhang, Xueli An, Georg Carle, Yunpu Ma
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Small language models (SLMs) offer computational efficiency for scalable deployment, yet they often fall short of the reasoning power exhibited by their larger counterparts (LLMs). To mitigate this gap, current approaches invoke an LLM to generate tokens at points of reasoning divergence, but these external calls introduce substantial latency and costs. Alternatively, standard distillation is often hindered by the capacity limitation, as SLMs struggle to accurately mimic the LLM’s complex generative distribution. We address this dilemma by identifying local sufficiency: at divergence points, the LLM’s preferred token consistently resides within the SLM’s top-K next-token predictions, even when it fails to emerge as the SLM’s top-1 choice. We therefore propose SELECT TO THINK (S2T), which reframes the LLM’s role from open-ended generation to selection among the SLM’s proposals, simplifying the supervision signal to discrete candidate rankings. Leveraging this, we introduce S2T-LOCAL, which distills the selection logic into the SLM, empowering it to perform autonomous re-ranking without inference-time LLM dependency. Empirically, we demonstrate that a 1.5B SLM’s top-8 candidates capture the 32B LLM’s choice with a 95% hit rate. Translating this potential into performance, S2T-LOCAL improves greedy decoding by 24.1% on average across benchmarks, effectively matching the efficacy of 8-path self-consistency while operating with single-trajectory efficiency.
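The top-K hit rate mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `top_k_hit_rate`, the toy logits, and the toy LLM choices are all hypothetical, and the sketch only assumes that the SLM's next-token scores and the LLM's preferred token are available at each divergence point.

```python
from __future__ import annotations
from typing import Sequence


def top_k_hit_rate(slm_logits: Sequence[Sequence[float]],
                   llm_choices: Sequence[int],
                   k: int = 8) -> float:
    """Fraction of positions where the LLM's token is among the SLM's top-k."""
    if len(slm_logits) != len(llm_choices):
        raise ValueError("one LLM choice is needed per SLM logit vector")
    if not llm_choices:
        return 0.0
    hits = 0
    for logits, choice in zip(slm_logits, llm_choices):
        # Token ids of the k highest-scoring candidates under the SLM.
        top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        hits += choice in top_k
    return hits / len(llm_choices)


# Toy data (hypothetical): three divergence points over a 4-token vocabulary.
toy_logits = [
    [0.1, 2.0, 1.5, -1.0],  # SLM top-2 candidates: tokens 1 and 2
    [3.0, 0.2, 0.1, 2.5],   # SLM top-2 candidates: tokens 0 and 3
    [0.0, 0.5, 0.4, 0.3],   # SLM top-2 candidates: tokens 1 and 2
]
toy_llm_choices = [2, 3, 0]
rate = top_k_hit_rate(toy_logits, toy_llm_choices, k=2)
```

On this toy data the LLM's choice falls inside the SLM's top-2 at two of the three positions, which is the kind of statistic the reported 95% figure summarizes for top-8 candidates.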
[64] Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher’s noise-dependent reliability; (2) CompDemo, which enriches the teacher’s context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
[65] Learning to Ask: When LLM Agents Meet Unclear Instruction
Wenxuan Wang, Juluan Shi, Zixuan Ling, Yuk-Kit Chan, Chaozheng Wang, Cheryl Lee, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Equipped with the capability to call functions, modern large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. However, the effective execution of these tools relies heavily not just on the advanced capabilities of LLMs but also on precise user instructions, which often cannot be ensured in the real world. To evaluate the performance of LLMs’ tool use under imperfect instructions, we meticulously examine the real-world instructions queried from users, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench (NoisyToolBench). We find that, due to the next-token prediction training objective, LLMs tend to arbitrarily generate the missing argument, which may lead to hallucinations and risks. To address this issue, we propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions. Moreover, to reduce the manual labor involved in user-LLM interaction and assess LLMs’ performance in tool utilization from both accuracy and efficiency perspectives, we design an automated evaluation tool named ToolEvaluator. Our experiments demonstrate that AwN significantly outperforms existing frameworks for tool learning on NoisyToolBench. We will release all related code and datasets to support future research.
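The ask-instead-of-guess behavior described above can be sketched as a simple decision step. AwN itself prompts the LLM to ask questions; the rule-based check below is only a stand-in for illustration, and the names `missing_arguments`, `next_action`, and the `book_flight` tool with its arguments are hypothetical.

```python
from __future__ import annotations
from typing import Any


def missing_arguments(required: list[str], provided: dict[str, Any]) -> list[str]:
    """Names of required arguments that are absent or empty in the request."""
    return [name for name in required if provided.get(name) in (None, "")]


def next_action(tool_name: str, required: list[str], provided: dict[str, Any]) -> dict:
    """Ask the user when an argument is missing instead of inventing a value."""
    missing = missing_arguments(required, provided)
    if missing:
        question = f"To call {tool_name}, please provide: {', '.join(missing)}."
        return {"type": "ask_user", "question": question}
    return {"type": "call_tool", "tool": tool_name, "arguments": provided}


# Hypothetical unclear instruction: the user never said when to travel.
action = next_action(
    "book_flight",
    required=["origin", "destination", "date"],
    provided={"origin": "Paris", "destination": "Rome"},
)
```

Here `action` is a clarifying question about `date` rather than a tool call with a fabricated date, which is the failure mode the abstract attributes to arbitrary argument generation.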
[66] A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio
Ningyuan Xi, Yetao Wu, Kun Fan, Teng Chen, Qingqing Gu, Luo Ji
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) often need to be Continual Pre-Trained (CPT) to obtain unfamiliar language skills or adapt to new domains. The huge training cost of CPT often calls for a cautious choice of key hyper-parameters such as the mixture ratio of the extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, and the gap between the experimental scaling law and the actual deployment at the full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) on the 8B size, which directly indicates the optimal experimental setup. Through careful choice of hyper-parameters and subsequent fine-tuning, the model capability is improved not only on the Chinese-related benchmark but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM on a real-life chat system, which obtains satisfying performance.
[67] Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written text is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Explicitly explaining the distinctions in the prompts can partially bridge the gaps in over 50% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source. We release our dataset, the human labels, and the annotator metadata at https://github.com/xnlp-lab/HumanEval-MGT.
[68] Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery
Yunze Jia, Yuehui Xian, Yangyang Xu, Pengfei Dang, Xiangdong Ding, Jun Sun, Yumei Zhou, Dezhen Xue
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.
[69] TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Qilong Shi, Change Jia, Aomufei Yuan, Yuxuan Tian, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Xusen Xiao, Bin Cui, Tong Yang, Xiangzheng Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is selectively distilled into specialized student models via domain-specific supervised fine-tuning (SFT); and (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
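The abstract does not say which merging rule the Merge Phase uses, so the sketch below swaps in the simplest common choice, uniform parameter averaging, purely to show what merging several specialized students into one model can look like. The function name `merge_by_averaging` and the toy parameter dictionaries are hypothetical.

```python
from __future__ import annotations


def merge_by_averaging(models: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Average same-named parameters across models (uniform averaging, an assumption)."""
    if not models:
        raise ValueError("at least one model is required")
    names = set(models[0])
    for model in models[1:]:
        if set(model) != names:
            raise ValueError("all models must share the same parameter names")
    merged: dict[str, list[float]] = {}
    for name in models[0]:
        tensors = [model[name] for model in models]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"parameter {name!r} has mismatched sizes")
        # Element-wise mean over the student models.
        merged[name] = [sum(values) / len(models) for values in zip(*tensors)]
    return merged


# Toy "students" (hypothetical): one tuned on math, one on code.
math_student = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
code_student = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged_student = merge_by_averaging([math_student, code_student])
```

Uniform averaging only works when the students share an architecture and initialization, which is consistent with the abstract's setup of fine-tuning the same student model per domain; the actual merging method may weight or select parameters differently.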
[70] A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
[71] Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall
Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schütze, Sebastian Möller, Vera Schmitt
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization’s effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To this end, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks, knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly amplified in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance, and occasionally quantization may even enhance model FKR. We find that BitSandBytes demonstrates the highest preservation of the original full-precision model’s FKR. Despite variability across models and methods, quantization causes modest performance degradation and remains an effective compression strategy.
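The information loss that quantization introduces can be seen on a handful of numbers. The sketch below uses generic symmetric round-to-nearest quantization with one scale per tensor; it is not one of the three techniques evaluated in the paper, and the function names and toy weights are hypothetical.

```python
from __future__ import annotations


def quantize_dequantize(weights: list[float], bits: int) -> list[float]:
    """Round weights to a signed integer grid of the given bit width, then map back."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed grid")
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax              # one scale for the whole tensor
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        restored.append(q * scale)                   # dequantized value
    return restored


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    """Average absolute difference between original and restored weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy weights (hypothetical values).
weights = [0.03, -0.51, 0.27, 0.90, -0.12]
err8 = mean_abs_error(weights, quantize_dequantize(weights, bits=8))
err4 = mean_abs_error(weights, quantize_dequantize(weights, bits=4))
```

On these toy weights the 4-bit reconstruction error is larger than the 8-bit one. Weight-level rounding error is only a proxy, though: the abstract notes that lower bit precision does not consistently translate into worse recall.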
[72] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model’s internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.
[73] Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Jung-Oh Lee, Hyuk Gi Hong, Eun Woo Doe, Hangyul Yoon, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage
[74] VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models
Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.
[75] Talent or Luck? Evaluating Attribution Bias in Large Language Models
Chahat Raj, Mahika Banerjee, Jinhao Pan, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: When a student fails an exam, do we tend to blame their effort or the test’s difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs’ attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models’ reasoning disparities channelize biases toward demographic groups.
[76] MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text
Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, Xiaojun Wan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluation is important for multimodal generation tasks, yet traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing research often simply collects large-scale evaluation data for training, while overlooking the quality of evaluation data. Moreover, currently proposed evaluation models often struggle to achieve consistently strong performance across both image-to-text (I2T) and text-to-image (T2I) tasks. In this paper, through rigorous quality control strategies, we construct a comprehensive multimodal evaluation dataset, Minos-57K, with evaluation samples across 15 datasets, for developing the multimodal evaluation model Minos with SFT and preference alignment training strategies. Notably, despite using less than half the scale of the training data of prior work, our model achieves state-of-the-art evaluation performance across 16 out-of-domain datasets covering both I2T and T2I tasks among all open-source multimodal evaluation models and remains competitive with closed-source models. Extensive experiments demonstrate the importance of leveraging a quality control process, jointly training on evaluation data from both I2T and T2I generation tasks, and further preference alignment.
[77] Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. In this position paper, developed with expert input, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached as an interactive communication problem, rather than an end-to-end process.
[78] LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many efforts to improve SD aim to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner, further alleviating the drafting overhead and significantly reducing the difficulty of deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, and these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant references for both the next token and the next next token. LogitSpec is training-free and plug-and-play, and can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
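The two steps named in the abstract can be sketched on token ids. This is a simplified illustration and not the released LogitSpec implementation: it assumes that the runner-up entries of the last logit vector serve as guesses for the next next token, and that retrieval is an exact n-gram match against the existing context. The names `top_k_indices`, `retrieve_continuation`, and `propose_draft` are hypothetical.

```python
from __future__ import annotations


def top_k_indices(logits: list[float], k: int) -> list[int]:
    """Token ids of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def retrieve_continuation(context: list[int], pattern: list[int], max_len: int) -> list[int]:
    """Tokens that followed the most recent occurrence of pattern in context."""
    n = len(pattern)
    for start in range(len(context) - n, -1, -1):
        if context[start:start + n] == pattern:
            follow = context[start + n:start + n + max_len]
            if follow:
                return follow
    return []


def propose_draft(context: list[int], last_logits: list[float],
                  k: int = 3, max_len: int = 4) -> list[int]:
    """Draft tokens: next token, a speculated next next token, then retrieved text."""
    ranked = top_k_indices(last_logits, k + 1)
    next_token, speculated = ranked[0], ranked[1:]
    # Step 1: treat runner-up tokens as guesses for the next next token (assumption).
    # Step 2: retrieve a reference that matches both tokens, so the match is longer.
    for candidate in speculated:
        follow = retrieve_continuation(context, [next_token, candidate], max_len)
        if follow:
            return [next_token, candidate] + follow
    # Fallback: match on the next token alone.
    return [next_token] + retrieve_continuation(context, [next_token], max_len)


# Toy example (hypothetical ids): token 7 is the next token, 3 and 9 are runner-ups.
context = [5, 7, 9, 4, 2, 8, 1]
last_logits = [0.0] * 10
last_logits[7], last_logits[3], last_logits[9] = 5.0, 4.0, 3.0
draft = propose_draft(context, last_logits, k=2, max_len=2)
```

In a full speculative decoding loop the target model would then verify `draft` in parallel and keep the longest accepted prefix; that verification step is outside this sketch.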
[79] The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
Valentin Romanov, Steven A Niederer
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and shouldn’t be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations and agentic tools like Claude Code, while analyzing the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms and discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.
[80] Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token’s Nature
Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.16591: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.16591&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[81] Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Xia Zeng, Yihan Chen, Luhui Liu, Chao Luo, Ye Chen, Zhuoran Zhuang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.04214: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04214&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[82] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
Congmin Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang
Main category: cs.CL
Abstract: Unavailable (arXiv:2510.08049).
[83] WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models
Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong
Main category: cs.CL
Abstract: Unavailable (arXiv:2510.14438).
[84] When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity
Nisrine Rair, Alban Goupil, Valeriu Vrabie, Emmanuel Chochoy
Main category: cs.CL
Abstract: Unavailable (arXiv:2510.17548).
[85] Stress Testing Factual Consistency Metrics for Long-Document Summarization
Zain Muhammad Mujahid, Dustin Wright, Isabelle Augenstein
Main category: cs.CL
Abstract: Unavailable (arXiv:2511.07689).
[86] ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli
Main category: cs.CL
Abstract: Unavailable (arXiv:2507.21420).
[87] Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
Jinze Li, Yixing Xu, Guanchen Li, Shuo Yang, Jinfeng Xu, Xuanwu Yin, Dong Li, Edith C.H. Ngai, Emad Barsoum
Main category: cs.CL
Abstract: Unavailable (arXiv:2511.22972).
[88] Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer
Main category: cs.CL
Abstract: Unavailable (arXiv:2601.03423).
[89] Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho
Main category: cs.CL
Abstract: Unavailable (arXiv:2601.04389).
[90] Mapping the maturation of TCM as an adjuvant to radiotherapy
P. Bilha Githinji, Aikaterini Melliou
Main category: cs.CL
Abstract: Unavailable (arXiv:2601.11923).
[91] Verified Critical Step Optimization for LLM Agents
Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, Dong Yu
Main category: cs.CL
Abstract: Unavailable (arXiv:2602.03412).
[92] Affective Flow Language Model for Emotional Support Conversation
Chenghui Zou, Ning Wang, Tiesunlong Shen, Luwei Xiao, Chuan Ma, Xiangpeng Li, Rui Mao, Erik Cambria
Main category: cs.CL
Abstract: Unavailable (arXiv:2602.08826).
[93] Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
Yichi Zhang, Zhuo Chen, Lingbing Guo, Wen Zhang, Huajun Chen
Main category: cs.CL
Abstract: Unavailable (arXiv:2510.21828).
[94] Thinking with Drafting: Optical Decompression via Logical Reconstruction
Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan
Main category: cs.CL
Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
[95] Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks
Hasibur Rahman, Smit Desai
Main category: cs.CL
Abstract: Unavailable (arXiv:2509.09870).
[96] Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson, Kenny Smith
Main category: cs.CL
Abstract: Unavailable (arXiv:2602.21720).
[97] LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki
Main category: cs.CL
Abstract: Unavailable (arXiv:2603.06198).
[98] TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jelinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis
Main category: cs.CL
Abstract: Unavailable (arXiv:2603.08182).
[99] Auto-ARGUE: LLM-Based Report Generation Evaluation
William Walden, Marc Mason, Orion Weller, Laura Dietz, John Conroy, Neil Molino, Hannah Recknor, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Dawn Lawrie, James Mayfield, Eugene Yang
Main category: cs.CL
Abstract: Unavailable (arXiv:2509.26184).
[100] SciMDR: Advancing Scientific Multimodal Document Reasoning
Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
Main category: cs.CL
Abstract: Unavailable (arXiv:2603.12249).
[101] AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao
Main category: cs.CL
Abstract: Unavailable (arXiv:2603.16496).
[102] Reasoning Gets Harder for LLMs Inside A Dialogue
Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Main category: cs.CL
Abstract: Unavailable (arXiv:2603.20133).
[103] AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages
Israel Abebe Azime, Jesujoba Oluwadara Alabi, Crystina Zhang, Iffat Maab, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Folasade Peace Alabi, Salomey Osei, Saminu Mohammad Aliyu, Nkechinyere Faith Aguobi, Bontu Fufa Balcha, Blessing Kudzaishe Sibanda, Davis David, Mouhamadane Mboup, Daud Abolade, Neo Putini, Philipp Slusallek, David Ifeoluwa Adelani, Dietrich Klakow
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.00706).
[104] Why Attend to Everything? Focus is the Key
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Jin Li, Yasin Abbasi Yadkori, Shuai Shao, Changling Liu, Guan Wang, Mingli Yuan, William Chen, Sen Song
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.03260).
[105] Crime Hotspot Prediction Using Deep Graph Convolutional Networks
Tehreem Zubair, Syeda Kisaa Fatima, Noman Ahmed, Asifullah Khan
Main category: cs.CL
Abstract: Unavailable (arXiv:2506.13116).
[106] Retrieval-Augmented LLMs for Evidence Localization in Clinical Trial Recruitment from Longitudinal EHR Narratives
Ziyi Chen, Mengxian Lyu, Cheng Peng, Yonghui Wu
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.05190).
[107] AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
Quang-Hung Bui, Anh Son Ta
Main category: cs.CL
Abstract: Unavailable (arXiv:2601.11568).
[108] Diffusion Language Models for Speech Recognition
Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.14001).
[109] A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.19572).
[110] Evaluation of Automatic Speech Recognition Using Generative Large Language Models
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.21928).
[111] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.22750).
[112] ADE: Adaptive Dictionary Embeddings – Scaling Multi-Anchor Representations to Large Language Models
Orhan Demirci, Sezer Aptourachman, Aydın Kaya
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.24940).
[113] Failure Modes of Maximum Entropy RLHF
Ömer Veysel Çağatan, Barış Akgün
Main category: cs.CL
Abstract: Unavailable (arXiv:2509.20265).
[114] Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao, Wen Liu, Wenrui Ding, Yi Ma, Shenghua Gao
Main category: cs.CL
Abstract: Unavailable (arXiv:2603.04337).
[115] Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models
Li Ju, Junzhe Wang, Qi Zhang
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.25313).
[116] Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.25850).
[117] Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
Zheng-Xin Yong, Stephen H. Bach
Main category: cs.CL
Abstract: Unavailable (arXiv:2510.20956).
[118] Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests
Jingjie Ning, João Coelho, Yibo Kong, Yunfan Long, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong
Main category: cs.CL
Abstract: Unavailable (arXiv:2601.17617).
[119] The Collapse of Heterogeneity in Silicon Philosophers
Yuanming Shi, Andreas Haupt
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.23575).
[120] VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi
Main category: cs.CL
Abstract: Unavailable (arXiv:2604.25235).
cs.CV
[121] FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing
Morayo Danielle Adeyemi, Ryan A. Rossi, Franck Dernoncourt
Main category: cs.CV
Abstract: Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. We present FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images across 15 fashion houses spanning 1991-2024 that makes this cultural logic inspectable. Given a photograph of a garment, the system recovers which house produced it, which era it belongs to, and which color tradition it reflects. A clothing-only model identifies the fashion house at 78.2% top-1 across 14 houses, the decade at 88.6% top-1, and the specific year at 58.3% top-1 across 34 years with a mean error of just 2.2 years. Probing which visual channels carry this signal reveals a sharp dissociation: removing color costs only 10.6pp of house identity accuracy, while removing texture costs 37.6pp, establishing texture and luminance as the primary carriers of editorial identity. FASH-iCNN treats editorial culture as the signal rather than background noise, identifying which houses, eras, and color traditions shaped each output so that users can see not just what the system predicts but which houses, editors, and historical moments are encoded in that prediction.
[122] Generalized Disguise Makeup Presentation Attack Detection Using an Attention-Guided Patch-Based Framework
Fateme Taraghi, Atefe Aghaei, Mohsen Ebrahimi Moghaddam
Main category: cs.CV
Abstract: Despite significant advances, facial recognition systems remain vulnerable to face presentation attacks. Among them, disguise makeup attacks are particularly challenging, as they use advanced cosmetics, prosthetic components, and artificial materials to realistically alter facial appearance, often making detection difficult even for humans. Despite the importance of these attacks, the problem remains underexplored, and publicly available datasets are limited. To address this, we propose a generalized disguise makeup presentation attack detection framework. The method adopts a two-phase design in which a style-invariant full-face model, trained with metric learning and enhanced by a whitening transformation, extracts region attention scores via Grad-CAM. These scores guide a patch-based phase that performs localized analysis using region-specific subnetworks trained with metric learning for fine-grained discrimination. We also construct a new, diverse dataset of live and disguise makeup faces collected under real-world conditions, covering variations in subjects, environments, and disguise materials. Experimental results demonstrate strong generalization across both the collected dataset and SIW-Mv2, achieving 8.97% ACER and 9.76% EER on the collected dataset, and 0% ACER on Obfuscation and Impersonation and 1.34% on Cosmetics attacks of SIW-Mv2. The proposed method consistently outperforms prior works while maintaining robust performance across other spoof types.
[123] Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding
Chang Liu, Henghui Ding, Nikhila Ravi, Yunchao Wei, Shuting He, Song Bai, Philip Torr, Leilei Cao, Jinrong Zhang, Deshui Miao, Xusheng He, Dengxian Gong, Zhiyu Wang, Mingqi Gao, Jihwan Hong, Canyang Wu, Weili Guan, Jianlong Wu, Liqiang Nie, Xingsen Huang, Yameng Gu, Xiaogang Yu, Xin Li, Ming-Hsuan Yang, Sijie Li, Jungong Han, Quanzhu Niu, Shihao Chen, Yuanzheng Wu, Yikang Zhou, Tao Zhang, Haobo Yuan, Lu Qi, Shunping Ji, Chao Yang, Chao Tian, Guoqing Zhu, Kai Yang, Zhifan Mo, Haijun Zhang, Xudong Kang, Shutao Li, Jaeyoung Do
Main category: cs.CV
Abstract: This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community’s latest technical advancements and charts promising future directions for robust video scene comprehension.
[124] MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching
Shuzhao Xie, Junchen Ge, Weixiang Zhang, Jiahang Liu, Chen Tang, Yunpeng Bai, Shijia Ge, Jingyan Jiang, Yuzhi Huang, Fengnian Yang, Cong Zhang, Xiaoyi Fan, Zhi Wang
Main category: cs.CV
Abstract: 3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0–1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34× compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20× compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus
[125] Evaluating the Alignment Between GeoAI Explanations and Domain Knowledge in Satellite-Based Flood Mapping
Hyunho Lee, Wenwen Li
Main category: cs.CV
Abstract: The increasing number of satellites has improved the temporal resolution of Earth observation, making satellite-based flood mapping a promising approach for operational flood monitoring. Deep learning-based approaches for flood mapping using satellite imagery, an important application within Geospatial Artificial Intelligence (GeoAI), have shown improved predictive performance by learning complex spatial and spectral patterns from large volumes of remote sensing data. However, the opaque decision-making processes of deep learning models remain a major barrier to their integration into critical scientific and operational workflows. This highlights the need for a systematic assessment of whether model explanations align with established domain knowledge in remote sensing. To address this research gap, this study introduces the ADAGE (Alignment between Domain Knowledge And GeoAI Explanation Evaluation) framework. The proposed framework is designed to systematically evaluate how well explanations of deep learning models align with established remote sensing knowledge, particularly regarding the distinctive spectral properties of the Earth’s surface. The ADAGE framework employs the Channel-Group SHAP (SHapley Additive exPlanations) method to estimate the contributions of grouped input channels to pixel-level predictions. Experiments on two satellite-based flood mapping tasks demonstrate that the ADAGE framework can (1) quantitatively assess the alignment between model explanations and reference explanations derived from domain knowledge and (2) help domain experts identify misaligned explanations through alignment scores. This study contributes to bridging the gap between explainability and domain knowledge in GeoAI for Earth observation, enhancing the applicability of GeoAI models in scientific and operational workflows.
[126] Edge AI for Automotive Vulnerable Road User Safety: Deployable Detection via Knowledge Distillation
Akshay Karjol, Darrin M. Hanna
Main category: cs.CV
Abstract: Deploying accurate object detection for Vulnerable Road User (VRU) safety on edge hardware requires balancing model capacity against computational constraints. Large models achieve high accuracy but fail under INT8 quantization required for edge deployment, while small models sacrifice detection performance. This paper presents a knowledge distillation (KD) framework that trains a compact YOLOv8-S student (11.2M parameters) to mimic a YOLOv8-L teacher (43.7M parameters), achieving 3.9x compression while preserving quantization robustness. We evaluate on full-scale BDD100K (70K training images) with Post-Training Quantization to INT8. The teacher suffers catastrophic degradation under INT8 (-23% mAP), while the KD student retains accuracy (-5.6% mAP). Analysis reveals that KD transfers precision calibration rather than raw detection capacity: the KD student achieves 0.748 precision versus 0.653 for direct training at INT8, a 14.5% gain at equivalent recall, reducing false alarms by 44% versus the collapsed teacher. At INT8, the KD student exceeds the teacher’s FP32 precision (0.748 vs. 0.718) in a model 3.9x smaller. These findings establish knowledge distillation as a requirement for deploying accurate, safety-critical VRU detection on edge hardware.
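The headline figures in this abstract can be checked with simple arithmetic. The snippet below uses only numbers quoted above. It suggests that the 14.5% precision gain is a relative improvement, which corresponds to 9.5 absolute points.

```python
teacher_params_m = 43.7    # YOLOv8-L, millions of parameters (from the abstract)
student_params_m = 11.2    # YOLOv8-S, millions of parameters (from the abstract)
compression = teacher_params_m / student_params_m

kd_precision = 0.748       # KD student at INT8
direct_precision = 0.653   # directly trained student at INT8
relative_gain = (kd_precision - direct_precision) / direct_precision

print(f"compression: {compression:.1f}x")               # compression: 3.9x
print(f"relative precision gain: {relative_gain:.1%}")  # relative precision gain: 14.5%
```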
[127] RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud, Sergey Kolyubin
Main category: cs.CV
Abstract: We present RADIO-ViPE (Reduce All Domains Into One – Video Pose Engine), an online semantic SLAM system that enables geometry-aware open-vocabulary grounding, associating arbitrary natural language queries with localized 3D regions and objects in dynamic environments. Unlike existing approaches that require calibrated, posed RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams, requiring no prior camera intrinsics, depth sensors, or pose initialization. The system tightly couples multi-modal embeddings – spanning vision and language – derived from agglomerative foundation models (e.g., RADIO) with geometric scene information. This coupling takes place in initialization, optimization, and factor graph connections to improve the consistency of the map from multiple modalities. The optimization is wrapped within adaptive robust kernels, designed to handle both actively moving objects and agent-displaced scene elements (e.g., furniture rearranged during an egocentric session). Experiments demonstrate that RADIO-ViPE achieves state-of-the-art results on the dynamic TUM-RGBD benchmark while maintaining competitive performance against offline open-vocabulary methods that rely on calibrated data and static scene assumptions. RADIO-ViPE bridges a critical gap in real-world deployment, enabling robust open-vocabulary semantic grounding for autonomous robotics and unconstrained in-the-wild video streams. Project page: https://be2rlab.github.io/radio_vipe
[128] FruitProM-V2: Robust Probabilistic Maturity Estimation and Detection of Fruits and Vegetables
Rahul Harsha Cheppally, Sidharth Rai, Sudan Baral, Benjamin Vail, Ajay Sharda
Main category: cs.CV
Abstract: Accurate fruit maturity identification is essential for determining harvest timing, as incorrect assessment directly affects yield and post-harvest quality. Although ripening is a continuous biological process, vision-based maturity estimation is typically formulated as a multi-class classification task, which imposes sharp boundaries between visually similar stages. To examine this limitation, we perform an annotation reliability study with two independent annotators on a held-out tomato dataset and observe disagreement concentrated near adjacent maturity stages. Motivated by this observation, we model maturity as a latent continuous variable and predict it probabilistically using a distributional detection head, converting the distribution into class probabilities through the cumulative distribution function (CDF). The proposed formulation maintains comparable performance to a standard detector under clean labels while better representing uncertainty. Furthermore, when controlled label noise is introduced during training, the probabilistic model demonstrates improved robustness relative to the baseline, indicating that explicitly modeling maturity uncertainty leads to more reliable visual maturity estimation.
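Converting a predicted distribution over a continuous maturity variable into class probabilities through the CDF can be sketched as follows. This is a minimal illustration that assumes a Gaussian predictive distribution and hypothetical stage boundaries on a 0 to 1 maturity scale. The abstract does not state which distribution family or boundaries the authors use.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def class_probabilities(mu, sigma, boundaries):
    """Probability mass between consecutive cut points.

    boundaries: increasing cut points separating len(boundaries) + 1 stages.
    """
    cdf = [0.0] + [normal_cdf(b, mu, sigma) for b in boundaries] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

boundaries = [0.25, 0.50, 0.75]   # hypothetical cut points between four stages

confident = class_probabilities(mu=0.60, sigma=0.03, boundaries=boundaries)
ambiguous = class_probabilities(mu=0.50, sigma=0.10, boundaries=boundaries)
```

A narrow distribution well inside one stage puts almost all mass on that class, while a prediction centred on a boundary splits its mass evenly between the two adjacent stages. The annotator disagreement described above occurs in the same region, near adjacent stages.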
[129] Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data
Emre Ardıç, Yakup Genç
Main category: cs.CV
Abstract: Federated learning is a machine learning paradigm in which multiple devices collaboratively train a model under the supervision of a central server while ensuring data privacy. However, its performance is often hindered by redundant, malicious, or abnormal samples, leading to model degradation and inefficiency. To overcome these issues, we propose novel sample selection methods for image classification, employing a multitask autoencoder to estimate sample contributions through loss and feature analysis. Our approach incorporates unsupervised outlier detection, using one-class support vector machine (OCSVM), isolation forest (IF), and adaptive loss threshold (AT) methods managed by a central server to filter noisy samples on clients. We also propose a multi-class deep support vector data description (SVDD) loss controlled by a central server to enhance feature-based sample selection. We validate our methods on CIFAR10 and MNIST datasets across varying numbers of clients, non-IID distributions, and noise levels up to 40%. The results show significant accuracy improvements with loss-based sample selection, achieving gains of up to 7.02% on CIFAR10 with OCSVM and 1.83% on MNIST with AT. Additionally, our federated SVDD loss further improves feature-based sample selection, yielding accuracy gains of up to 0.99% on CIFAR10 with OCSVM. These results show the effectiveness of our methods in improving model accuracy across various client counts and noise conditions.
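Loss-based sample selection can be sketched with a simple threshold rule on per-sample reconstruction losses. The abstract does not specify how the adaptive loss threshold (AT) is computed, so the mean-plus-k-standard-deviations rule, the function name `select_by_loss`, and the example losses below are assumptions for illustration only.

```python
from statistics import mean, pstdev

def select_by_loss(losses, k=1.5):
    """Keep samples whose loss is at most mean + k * std (assumed threshold rule)."""
    threshold = mean(losses) + k * pstdev(losses)
    kept = [i for i, loss in enumerate(losses) if loss <= threshold]
    return kept, threshold

# Hypothetical per-sample autoencoder losses on one client; index 4 is a noisy sample.
losses = [0.21, 0.18, 0.25, 0.19, 2.40, 0.22, 0.20, 0.23]
kept, threshold = select_by_loss(losses)
```

In a federated setting, the threshold parameters would be managed by the server and applied on each client, so raw samples never leave the device.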
[130] ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
Xinyue Li, Zhiming Xu, Min Tang, Zhaolin Cai, Sijing Wu, Xiongkuo Min, Yitong Chen, Guangtao Zhai
Main category: cs.CV
Abstract: Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.
[131] U-FaceBP: Uncertainty-aware Bayesian Ensemble Deep Learning for Face Video-based Blood Pressure Estimation
Yusuke Akamatsu, Akinori F. Ebihara, Terumi Umematsu
Main category: cs.CV
Abstract: Blood pressure (BP) measurement is crucial for daily health assessment. Remote photoplethysmography (rPPG), which extracts pulse waves from face videos captured by a camera, has the potential to enable convenient BP measurement without specialized medical devices. However, there are various uncertainties in BP estimation using rPPG, leading to limited estimation performance and reliability. In this paper, we propose U-FaceBP, an uncertainty-aware Bayesian ensemble deep learning method for face video-based BP estimation. U-FaceBP models aleatoric and epistemic uncertainties in face video-based BP estimation with a Bayesian neural network (BNN). Additionally, we design U-FaceBP as an ensemble method, estimating BP from rPPG signals, PPG signals derived from face videos, and face images using multiple BNNs. Large-scale experiments on two datasets involving 1197 subjects from diverse racial groups demonstrate that U-FaceBP outperforms state-of-the-art BP estimation methods. Furthermore, we show that the uncertainty estimates provided by U-FaceBP are informative and useful for guiding modality fusion, assessing prediction reliability, and analyzing performance across racial groups.
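A common way to separate aleatoric from epistemic uncertainty, when several networks each predict a mean and a variance, is to average the predicted variances and take the variance of the predicted means. The sketch below shows that general decomposition. It is not confirmed to be the exact formulation used by U-FaceBP, and the numbers are made up.

```python
from statistics import mean, pvariance

def decompose_uncertainty(member_means, member_variances):
    """Split ensemble uncertainty into aleatoric and epistemic parts."""
    prediction = mean(member_means)
    aleatoric = mean(member_variances)   # average noise the members predict
    epistemic = pvariance(member_means)  # disagreement between members
    return prediction, aleatoric, epistemic

# Hypothetical systolic BP predictions (mmHg) from three ensemble members.
means = [118.0, 122.0, 120.0]
variances = [16.0, 25.0, 19.0]
prediction, aleatoric, epistemic = decompose_uncertainty(means, variances)
```

The sum of the two terms is the total predictive variance. It can be used to flag unreliable estimates or to weight modalities during fusion.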
[132] MixerCA: An Efficient and Accurate Model for High-Performance Hyperspectral Image Classification
Mohammed Q. Alkhatib, Ali Jamali
Main category: cs.CV
Abstract: Over the past decade, hyperspectral image (HSI) classification has drawn considerable interest due to HSIs’ ability to effectively distinguish terrestrial objects by capturing detailed, continuous spectral information. The strong performance of recent deep learning techniques in tasks like image classification and semantic segmentation has led to their growing use in HSI classification, due to their ability to capture complex spatial and spectral features more effectively than traditional methods. This paper presents MixerCA, a novel lightweight model for HSI classification that leverages depthwise convolution and a self-attention mechanism. MixerCA integrates depth-wise convolutions, token and channel mixing, and coordinate attention into a unified structure to decouple spatial and channel interactions, maintain consistent resolution throughout the network, and directly process HSI patches. Extensive experiments on four hyperspectral benchmark datasets reveal MixerCA’s clear advantages over several competing algorithms, including 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer. The source code is publicly available at https://github.com/mqalkhatib/MixerCA.
[133] Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
Beomchan Park, Seongho Kim, Hyunjun Kim, Sungjune Park, Yong Man Ro
Main category: cs.CV
Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
[134] A Data-Centric Framework for Intraoperative Fluorescence Lifetime Imaging for Glioma Surgical Guidance
Silvia Noble Anbunesan, Mohamed Abul Hassan, Jinyi Qi, Lisanne Kraft, Han Sung Lee, Orin Bloch, Laura Marcu
Main category: cs.CV
Abstract: Accurate intraoperative assessment of glioma infiltration is essential for maximizing tumor resection while preserving functional brain tissue. Fluorescence lifetime imaging (FLIm) offers real-time, label-free biochemical contrast, but its clinical utility is challenged by biological heterogeneity, class imbalance, and variability in histopathological labeling. We present a data-centric AI (DC-AI) framework that integrates confident learning (CL), class refinement, and targeted label evaluation to develop a robust multi-class FLIm classifier for glioblastoma (GBM) resection margins. FLIm data were collected from 192 tissue margins across 31 newly diagnosed IDH-wildtype GBM patients and initially labeled into seven tumor cellularity classes by an expert neuropathologist. CL was applied to quantify FLIm point-level confidence, identify label inconsistencies, and guide iterative class merging into a three-class scheme (“low”, “moderate”, “high”). The resulting high-fidelity dataset enabled training a model that achieved 96% accuracy in the three-class task. SHAP analysis revealed class-specific FLIm feature importance, highlighting distinct optical signatures across the infiltration spectrum. Targeted FLIm analysis further identified biological (e.g., gray matter composition) and acquisition-related (e.g., blood contamination) contributors to low-confidence predictions. Blinded re-evaluation of margins flagged by CL demonstrated intra-pathologist variability, underscoring the value of selective relabeling rather than exhaustive review. Together, these findings demonstrate that a DC-AI framework can systematically improve data reliability, enhance model robustness, and refine biological interpretation of FLIm signals, supporting the development of clinically actionable optical tools for real-time glioma margin assessment.
[135] Why Domain Matters: A Preliminary Study of Domain Effects in Underwater Object Detection
Melanie Wille, Dimity Miller, Tobias Fischer, Scarlett Raine
Main category: cs.CV
Abstract: Domain shift, where deviations between training and deployment data distributions degrade model performance, is a key challenge in underwater environments. Existing benchmarks testing performance for underwater domain shift simulate variability through synthetic style transfer. This fails to capture intrinsic scene factors such as visibility, illumination, scene composition, or acquisition factors, limiting analysis of real-world effects. We propose a labeling framework that defines underwater domains using measurable image, scene, and acquisition characteristics. Unlike prior benchmarks, it captures physically meaningful factors, enabling semantically consistent image grouping and supporting domain-specific evaluation of detection performance including failure analysis. We validate this on public datasets, showing systematic variations across domain factors and revealing hidden failure modes.
[136] Lifting Embodied World Models for Planning and Control
Alex N. Wang, Trevor Darrell, Pavel Izmailov, Yutong Bai, Amir Bar
Main category: cs.CV
Abstract: World models of embodied agents predict future observations conditioned on an action taken by the agent. For complex embodiments, action spaces are high-dimensional and difficult to specify: for example, precisely controlling a human agent requires specifying the motion of each joint. This makes the world model hard to control and expensive to plan with as search-based methods like CEM scale poorly with action dimensionality. To address this issue, we train a lightweight policy that maps high-level actions to sequences of low-level joint actions. Composing this policy with the frozen world model produces a lifted world model that predicts a sequence of future observations from a single high-level action. We instantiate this framework for a human-like embodiment, defining the high-level action space as a small set of 2D waypoints annotated on the current observation frame, each specifying a near-term goal position for a leaf joint (pelvis, head, hands). Waypoints are low-dimensional, visually interpretable, and easy to specify manually or to search over. We show that the lifted world model substantially outperforms searching directly in low-level joint space (3.8x lower mean joint error to the goal pose), while remaining more compute-efficient and generalizing to environments unseen by the policy.
[137] Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models
Shreyansh Pathak, Jyotishman Das
Main category: cs.CV
Abstract: The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, regulatory compliance, and adaptive system design. We introduce Graph-Propagated Projection Unlearning (GPPU), a unified and scalable algorithm for class-level unlearning that operates across both vision and audio models. GPPU employs graph-based propagation to identify class-specific directions in the feature space and projects representations onto the orthogonal subspace, followed by targeted fine-tuning, to ensure that target class information is effectively and irreversibly removed. Through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks spanning a variety of architectures including CNNs, Vision Transformers, and Audio Transformers, we demonstrate that GPPU achieves highly efficient unlearning, realizing 10-20x speedups over prior methodologies while preserving model utility on retained classes. Our framework provides a principled and modality-agnostic approach to machine unlearning, evaluated at a scale that has received limited attention in prior work, contributing toward more efficient and responsible deep learning.
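The projection step described above, removing a class-specific direction by projecting features onto its orthogonal complement, can be sketched in a few lines. The direction and feature vectors are hypothetical. The graph-based propagation that GPPU uses to find such directions, and the subsequent fine-tuning, are not shown.

```python
import math

def project_out(feature, direction):
    """Remove the component of `feature` along `direction`: x - (x . u) u."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(f * u for f, u in zip(feature, unit))
    return [f - coeff * u for f, u in zip(feature, unit)]

forget_direction = [1.0, 1.0, 0.0]   # hypothetical direction for the class to forget
feature = [2.0, 0.0, 3.0]            # hypothetical feature vector
cleaned = project_out(feature, forget_direction)
```

After projection, the feature has no component along the forgotten direction, while components orthogonal to it (here the third coordinate) are left unchanged.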
[138] Privacy-Preserving Clothing Classification using Vision Transformer for Thermal Comfort Estimation
Tatsuya Chuman, Yousuke Udagawa, Hitoshi Kiya
Main category: cs.CV
Abstract: A privacy-preserving clothing classification scheme is presented to enable secure occupant-centric control (OCC) systems. Although the utilization of camera images for HVAC control has been widely studied to optimize thermal comfort, privacy protection of occupant images has not been considered in prior works. While various privacy-preserving methods have been proposed for image classification, applying conventional schemes results in severe accuracy degradation. In this paper, we introduce a privacy-preserving classification method using Vision Transformer (ViT) applied to clothing insulation estimation. In an experiment using the DeepFashion dataset categorized by clothing insulation, while the conventional pixel-based method suffers a severe accuracy drop, our scheme maintains a high accuracy on encrypted images, showing no degradation from plain images across all categories.
[139] ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection
Ganxi Xu, Zhao-Rong Lai, Yuting Tang, Yonghao Song, Shuyan Zhou, Guoxu Zhou, Boyu Wang, Jian Zhu, Jinyi Long
Main category: cs.CV
Abstract: Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
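The difference between point-wise matching (MSE) and distribution alignment (sliced Wasserstein distance) can be seen in a small example. The sketch below is a generic Monte Carlo estimate of the squared sliced 2-Wasserstein distance between two equally sized sets of embeddings. The number of projections, the seed, and the toy embeddings are assumptions, and this is not the authors' implementation.

```python
import math
import random

def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return total / (len(a) * len(a[0]))

def sliced_wasserstein(a, b, n_projections=64, seed=0):
    """Squared sliced 2-Wasserstein distance between two equally sized point sets."""
    rng = random.Random(seed)
    dim = len(a[0])
    total = 0.0
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both sets onto the random direction and compare sorted values.
        pa = sorted(sum(x * d for x, d in zip(p, direction)) for p in a)
        pb = sorted(sum(x * d for x, d in zip(p, direction)) for p in b)
        total += sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)
    return total / n_projections

targets = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shuffled = [targets[3], targets[2], targets[1], targets[0]]  # same set, different order
shifted = [[x + 0.5, y] for x, y in targets]                 # different distribution

pointwise_gap = mse(shuffled, targets)
distribution_gap = sliced_wasserstein(shuffled, targets)
shifted_gap = sliced_wasserstein(shifted, targets)
```

The shuffled set has a large point-wise error but zero distributional distance, while the shifted set is penalized by the sliced Wasserstein term. A combined objective of the form MSE plus a weighted SWD term therefore constrains both individual embeddings and their overall distribution.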
[140] Seeking Consensus: Geometric-Semantic On-the-Fly Recalibration for Open-Vocabulary Remote Sensing Semantic Segmentation
Guanchun Wang, Chenxiao Wu, Xiangrong Zhang, Zelin Peng, Jianxun Lai, Tianyang Zhang, Xu Tang
Main category: cs.CV
Abstract: Open-vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug-and-play framework to boost the performance of training-free OVSS models in remote sensing images, which recalibrates arbitrary OVSS models on-the-fly by seeking dual consensus: geometric consensus learning (GCL) through multi-view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which assists collaborative recalibration of visual and textual semantics. The two forms of consensus are injected via an online consensus injector (OCI), effectively alleviating under-activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic-geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, proving its effectiveness and universality.
[141] HOI-aware Adaptive Network for Weakly-supervised Action Segmentation
Runzhong Zhang, Suchen Wang, Yueqi Duan, Yansong Tang, Yue Zhang, Yap-Peng Tan
Main category: cs.CV
Abstract: In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.
[142] DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation
Junhu Fu, Ke Chen, Weidong Guo, Shuyu Liang, Jie Xu, Chen Ma, Kehao Wang, Shengli Lin, Zeju Li, Yuanyuan Wang, Yi Guo, Shuo Li
Main category: cs.CV
Abstract: Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio-temporal dynamics. Extensive evaluations across three public datasets and in-house clinical data confirm DepthPilot’s robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between “visually realistic” and “clinically interpretable”. Moreover, DepthPilot-generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.
[143] EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors
Rui Song, Tianhui Cai, Markus Gross, Yun Zhang, Walter Zimmer, Zhiyu Huang, Olaf Wysocki, Jiaqi Ma
Main category: cs.CV
Abstract: 3D Gaussian Splatting (3DGS) has been widely adopted for scene reconstruction, where training inherently constitutes a highly coupled and non-convex optimization problem. Recent works commonly incorporate geometric priors, such as LiDAR measurements, either for initialization or as training constraints, with the goal of improving photometric reconstruction quality. However, in large-scale outdoor scenarios, such geometric supervision is often spatially incomplete and uneven, which limits its effectiveness as a reliable prior and can even be detrimental to the final reconstruction. To address this challenge, we model partially observable geometry as a continuous energy field induced by geometric evidence and propose EnerGS. Rather than enforcing geometry as a hard constraint, EnerGS provides a soft geometric guidance for the optimization of Gaussian primitives, allowing geometric information to steer the optimization process without directly restricting the solution space. Extensive experiments on large-scale outdoor scenes demonstrate that, under both sparse multi-view and monocular settings, EnerGS consistently improves photometric quality and geometric stability, while effectively mitigating overfitting during 3DGS training.
[144] Camera-RFID Fusion for Robust Asset Tracking in Forested Environments
John Hateley, Sriram Narasimhan, Omid Abari
Main category: cs.CV
Abstract: Passive RFID tags offer a cost-effective and scalable solution for tracking numerous deployed assets. However, in forested environments, signal attenuation and multipath effects generally limit RFID spatial accuracy to the meter level. Conversely, while cameras employing stereo vision can achieve centimeter-level precision, relying solely on computer vision fails to resolve issues arising from spatial association ambiguity and partial occlusions in dense settings. Fusing these modalities allows systems to harness the high-accuracy benefits of vision while retaining the robust, non-line-of-sight identification advantages of RFID. Yet, a primary challenge in achieving this, which is the central focus of this paper, lies in accurately associating the disparate trajectories generated by these two sensors. To overcome this limitation, we introduce a novel camera–RFID fusion framework that integrates depth and object information with advanced trajectory-matching algorithms. By successfully bridging the meter-to-centimeter accuracy gap, the proposed approach helps achieve reliable tag localization even when assets temporarily leave the camera’s field of view. To the best of our knowledge, this represents the first application of camera–RFID fusion for asset tracking in natural forested environments.
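The association problem described above, deciding which RFID trajectory belongs to which camera trajectory, can be sketched as a small assignment problem. The abstract does not describe the authors' trajectory-matching algorithm, so the mean point-to-point distance cost, the brute-force search, and the track data below are hypothetical. The sketch also assumes time-aligned tracks of equal length and at least as many tags as camera tracks.

```python
import math
from itertools import permutations

def mean_distance(traj_a, traj_b):
    """Average Euclidean distance between time-aligned trajectory points."""
    return sum(math.dist(p, q) for p, q in zip(traj_a, traj_b)) / len(traj_a)

def associate(camera_tracks, rfid_tracks):
    """Brute-force assignment of tags to camera tracks minimizing total distance."""
    cam_ids = list(camera_tracks)
    tag_ids = list(rfid_tracks)
    best_cost, best = math.inf, None
    for perm in permutations(tag_ids, len(cam_ids)):
        cost = sum(mean_distance(camera_tracks[c], rfid_tracks[t])
                   for c, t in zip(cam_ids, perm))
        if cost < best_cost:
            best_cost, best = cost, dict(zip(cam_ids, perm))
    return best

# Hypothetical tracks in metres: camera tracks are precise, RFID tracks are noisy.
camera_tracks = {
    "cam_0": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "cam_1": [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],
}
rfid_tracks = {
    "tag_A": [(0.8, 5.6), (1.5, 4.3), (2.9, 5.4)],
    "tag_B": [(-0.7, 0.5), (1.6, -0.8), (2.4, 0.9)],
}
matches = associate(camera_tracks, rfid_tracks)
```

Brute-force enumeration is only feasible for a handful of tracks. A larger deployment would need an efficient assignment solver.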
[145] MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution
Jiaqi Guo, Mingzhen Li, Haohong Wang, Aggelos K. Katsaggelos
Main category: cs.CV
Abstract: We study generative super-resolution (SR) in real-world scenarios where content and degradations vary across domains, genres, and segments. For example, images and videos may alternate between text overlays, fast motion, smooth cartoons, and low-light faces, each benefiting from different forms of side information. Existing metadata-guided SR methods typically use a fixed conditioning design, which is suboptimal when useful cues are content dependent and transmission budgets are limited. We propose MetaSR, a Diffusion Transformer (DiT)-based framework that selects and injects task-relevant metadata to guide SR under resource constraints. Specifically, we use the DiT’s own VAE and transformer backbone to fuse heterogeneous metadata, and adopt an efficient distillation strategy that enables one-step diffusion inference. Experiments across diverse content buckets and degradation regimes show that MetaSR outperforms reference solutions by up to 1.0 dB PSNR while achieving up to 50% transmission bitrate saving at matched quality. We assess these gains under a rate–distortion optimization (RDO) framework that jointly accounts for sender-side bitrate and receiver/display quality metrics (e.g., PSNR and SSIM).
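Selecting metadata under a transmission budget can be framed as a rate-distortion trade-off: choose the subset that maximizes quality minus a weighted bitrate cost. The sketch below is a toy illustration of that trade-off, not the MetaSR selection mechanism. The metadata names, bitrates, PSNR gains, the Lagrange weight `lam`, and the assumption that gains add up independently are all hypothetical.

```python
from itertools import combinations

def select_metadata(candidates, base_psnr, budget_kbps, lam):
    """Pick the metadata subset maximizing PSNR - lam * rate within the budget.

    candidates: name -> (bitrate_kbps, psnr_gain_db); gains are assumed additive.
    """
    best_score, best_subset = base_psnr, ()
    names = sorted(candidates)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            rate = sum(candidates[n][0] for n in subset)
            if rate > budget_kbps:
                continue                      # exceeds the transmission budget
            psnr = base_psnr + sum(candidates[n][1] for n in subset)
            score = psnr - lam * rate         # quality minus weighted rate cost
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score

candidates = {
    "text_mask": (4.0, 0.6),
    "motion_hint": (10.0, 0.3),
    "face_landmarks": (3.0, 0.4),
}
chosen, score = select_metadata(candidates, base_psnr=28.0, budget_kbps=8.0, lam=0.05)
```

With these numbers, the motion hint is excluded because it exceeds the budget on its own, and the two cheaper cues are selected together.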
[146] Beyond Shortcuts: Mitigating Visual Illusions in Frozen VLMs via Qualitative Reasoning
Hao Guo, Fei Wang, Junjie Chen, Yiqi Nie, Jiaqi Zhao, Qiankun Li, Subin Huang
Main category: cs.CV
Abstract: While Vision-Language Models (VLMs) have achieved state-of-the-art performance in general visual tasks, their perceptual robustness remains remarkably brittle when confronted with optical illusions. These failures are often attributed to shortcut heuristics, where models prioritize linguistic priors and memorized prototypes over direct visual evidence. In this work, we propose Structured Qualitative Inference (SQI), a training-free, data-centric framework designed to fortify visual grounding in frozen VLMs. SQI addresses perceptual anomalies through three systematic modules: (1) Axiomatic Constraint Injection, which suppresses erroneous metric estimations and quantitative hallucinations; (2) Hierarchical Scene Decomposition, which decouples target visual manifolds from complex background distractors; and (3) Counterfactual Self-Verification, an adversarial reasoning step that mitigates confirmation bias. By orchestrating these qualitative constraints at inference time, SQI effectively aligns high-level linguistic reasoning with low-level visual perception. Our framework was evaluated on the DataCV 2026 Challenge (Task I: Classic Illusion Understanding), where it ranked 2nd place overall. Experimental results demonstrate that SQI not only significantly enhances accuracy across diverse illusion categories but also provides superior diagnostic interpretability without any model fine-tuning. Our success underscores the potential of structured qualitative grounding as a robust paradigm for developing next-generation, illusion-resistant vision-language systems.
[147] Multi-Stage Bi-Atrial Segmentation Framework from 3D Late Gadolinium-Enhanced MRI using V-Net Family Models
Hao Wen, Jingsu Kang
Main category: cs.CV
Abstract: We report our multi-stage framework designed for the problem of multi-class bi-atrial segmentation from 3D late gadolinium-enhanced (LGE) MRI of the human heart. The pipeline consists of a preprocessing step using multidimensional contrast limited adaptive histogram equalization (MCLAHE); coarse region segmentation from MCLAHE-enhanced and down-sampled MRI using a V-Net family model; and fine segmentation from the coarse region using another V-Net model. Asymmetric loss is adopted to optimize the model weights.
[148] OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction
Liliang Ye, Guiyi Zeng, Yunyao Zhang, Yi-Ping Phoebe Chen, Junqing Yu, Zikai Song
Main category: cs.CV
Abstract: Predicting social media popularity requires understanding both the intrinsic appeal of content and the external context that determines how it is exposed to users. Existing methods focus on content signals but do not separate them from exposure-related patterns, which causes the learned representations to absorb platform-specific visibility effects and weakens both interpretability and cross-platform transfer. This paper introduces OmniTrend, a unified framework that models popularity as the joint outcome of content attractiveness and contextual exposure. The content module learns cross-modal representations from visual, audio, and textual cues to quantify intrinsic appeal, while the context module estimates exposure from exogenous signals such as posting time, author activity, topical trends, and retrieval-based neighborhood statistics. OmniTrend learns separate predictors for content attractiveness and contextual exposure and integrates them in the final popularity estimate, which makes the role of each factor explicit and supports robust transfer across image and video platforms.
[149] GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition
Yuqi Li, Qian Zhou, Huiran Duan, Jingjie Wang, Shunli Zhang, Chuanguang Yang, Guoying Zhao, Yingli Tian
Main category: cs.CV
Abstract: Gait recognition is an attractive biometric modality for long-range and contact-free identification, but high-performing gait models often rely on deep and computationally expensive architectures that are difficult to deploy in practice. Knowledge distillation (KD) offers a natural way to transfer knowledge from a powerful teacher to an efficient student; however, standard KD is often less effective for part-structured gait models, where supervision is formed from both part-wise classification logits and part-wise retrieval embeddings. In this paper, we propose GaitKD, a distillation framework that decouples gait knowledge transfer into two complementary components: decision-level distillation and boundary-level distillation. Specifically, GaitKD aligns the teacher and student through part-calibrated logit distillation to transfer inter-class decision relations, while preserving the teacher-induced partitioning of the embedding space through an activation-boundary objective instead of direct feature regression. With a simple aligned part-wise design, GaitKD supports heterogeneous teacher-student gait models without introducing additional inference cost. Experimental results across multiple gait recognition benchmarks and teacher-student configurations show consistent improvements over strong gait baselines. Our study demonstrates that the two transfer components are complementary, and boundary-preserving distillation provides more stable performance than direct feature regression. Source code is available at https://github.com/liyiersan/GaitKD/
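As background for the decision-level component described in this abstract, here is a minimal sketch of plain temperature-scaled logit distillation averaged over body parts. It is not the paper's part-calibrated variant and does not cover the activation-boundary objective; the function names, the temperature value, and the simple per-part averaging are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def partwise_kd_loss(student_parts, teacher_parts, temperature=4.0):
    """Average the distillation loss over aligned body parts (assumed 1:1)."""
    losses = [kd_loss(s, t, temperature)
              for s, t in zip(student_parts, teacher_parts)]
    return sum(losses) / len(losses)
```

The loss is zero when the student reproduces the teacher's logits for every part and grows as the softened class distributions diverge.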
[150] Multiple Consistent 2D-3D Mappings for Robust Zero-Shot 3D Visual Grounding
Yufei Yin, Jie Zheng, Qianke Meng, Zhou Yu, Minghao Chen, Jiajun Ding, Min Tan, Yuling Xi, Zhiwen Chen, Chengfei Lv
Main category: cs.CV
Abstract: Zero-shot 3D Visual Grounding (3DVG) is a critical capability for open-world embodied AI. However, existing methods are fundamentally bottlenecked by the poor quality of open-vocabulary 3D proposals, suffering from inaccurate categories and imprecise geometries, as well as the spatial redundancy of exhaustive multi-view reasoning. To address these challenges, we propose MCM-VG, a novel framework that achieves robust zero-shot 3DVG by explicitly establishing Multiple Consistent 2D-3D Mappings. Instead of passively relying on noisy 3D segments, MCM-VG enforces 2D-3D consistency across three fundamental dimensions to achieve precise target localization and reliable reasoning. First, a Semantic Alignment module corrects category mismatches via LLM-driven query parsing and coarse-to-fine 2D-3D matching. Second, an Instance Rectification module leverages VLM-guided 2D segmentations to reconstruct missing targets, back-projecting these reliable visual priors to establish accurate 3D geometries. Finally, to eliminate spatial redundancy, a Viewpoint Distillation module clusters 3D camera directions to extract optimal frames. By pairing these optimal RGB frames with Bird’s Eye View maps into concise visual prompt sets, we formulate the final target disambiguation as a multiple-choice reasoning task for Vision-Language Models. Extensive evaluations on ScanRefer and Nr3D benchmarks demonstrate that MCM-VG sets a new state-of-the-art for zero-shot 3D visual grounding. Remarkably, it achieves 62.0% and 53.6% in Acc@0.25 and Acc@0.5 on ScanRefer, outperforming previous baselines by substantial margins of 6.4% and 4.0%.
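The Acc@0.25 and Acc@0.5 figures in this abstract are accuracies at an IoU threshold. A minimal sketch of that metric follows, assuming axis-aligned 3D boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and one prediction per ground-truth box. The function names are illustrative, and counting a prediction as correct when its IoU meets or exceeds the threshold is an assumed boundary convention.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the paired ground truth
    reaches the threshold."""
    hits = sum(1 for p, g in zip(predictions, ground_truths)
               if box_iou_3d(p, g) >= threshold)
    return hits / len(ground_truths)
```

A prediction that overlaps its target with an IoU of one third would count toward Acc@0.25 but not toward Acc@0.5, which is why the second figure is always the lower of the two.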
[151] Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
Amr Sharafeldin, Shrisudhan Govindarajan, Thomas Walker, Aryan Mikaeili, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
Main category: cs.CV
Abstract: Modern scene reconstruction methods, such as 3D Gaussian Splatting, enable photo-realistic novel view synthesis at real-time speeds. However, their adoption in interactive graphics applications remains limited due to the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While prior work has attempted to impose semantic decomposition on these models, significant challenges remain in segmentation quality and cross-view consistency. To address these limitations, we introduce Semantic Foam, which extends the recently proposed Radiant Foam representation to semantic decomposition tasks. Our approach leverages the inherent spatial structure of Radiant Foam’s volumetric Voronoi mesh and augments it with an explicit semantic feature field defined at the cell level. This design enables direct spatial regularization, improving consistency across views and mitigating artifacts caused by occlusion and inconsistent supervision, which are common issues in point-based representations. Experimental results demonstrate that our method achieves superior object-level segmentation performance compared to state-of-the-art approaches such as Gaussian Grouping and SAGA. Project page: http://semanticfoam.github.io/
[152] High-Dimensional Noise to Low-Dimensional Manifolds: A Manifold-Space Diffusion Framework for Degraded Hyperspectral Image Classification
Boxiang Yang, Ning Chen, Xia Yue, Yichang Luo, Yingbo Fan, Haoyuan Zhang, Haoyu Ma, Jun Yue, Shanjun Mao
Main category: cs.CV
Abstract: Recently, Hyperspectral Image (HSI) classification has attracted increasing attention in remote sensing. However, HSI data are inherently high-dimensional but low-rank, with discriminative information concentrated on a low-dimensional latent manifold. In real-world remote sensing scenarios, the superposition of multiple degradation factors disrupts this intrinsic manifold structure, driving samples away from their original low-dimensional distribution and introducing substantial redundant and non-discriminative variations. To better handle this challenge, this paper proposes a manifold-space diffusion framework (MSDiff) for robust hyperspectral classification under complex degradation conditions. Specifically, the proposed method first maps high-dimensional, degradation-affected HSI data into a compact low-dimensional manifold through a discriminative spectral-spatial reconstruction task, preserving class semantics and reducing redundant variations. A diffusion-based generative model is then applied to regularize the spectral-spatial distribution within the manifold, enabling progressive refinement and stabilization of latent features against residual degradations. The key advantage of the proposed framework lies in performing diffusion-based distribution modeling directly on the low-dimensional manifold, effectively decoupling degradation-induced disturbances from intrinsic discriminative structures and enhancing representation stability under complex degradations. Experimental results on multiple hyperspectral benchmarks demonstrate consistent performance improvements over state-of-the-art methods under diverse composite degradation settings. The code will be available at https://github.com/yangboxiang1207/MSDiff
[153] MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
Chunzheng Zhu, Jiaqi Zeng, Junyu Jiang, Jianxin Lin, Yijun Wang
Main category: cs.CV
Abstract: High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose MedSynapse-V, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model’s hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that MedSynapse-V, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.
[154] Event-based Liveness Detection using Temporal Ocular Dynamics: An Exploratory Approach
Nicolas Mastropasqua, Ignacio Bugueno-Cordova, Rodrigo Verschae, Daniel Acevedo, Pablo Negri
Main category: cs.CV
Abstract: Face liveness detection has been extensively studied using RGB cameras, achieving strong performance under controlled conditions but often failing to generalize across sensors and attack scenarios. In this work, we explore event cameras as an alternative sensing modality for liveness detection based on temporal ocular dynamics. Event cameras capture sparse, asynchronous changes in brightness with microsecond resolution, enabling precise analysis of fast eye movements such as saccades. Replay attacks cannot faithfully reproduce these dynamics due to temporal resampling and display artifacts, leading to distinctive spatio-temporal patterns in the event domain. We design a data collection protocol to extend RGBE-Gaze with replay-attack recordings, yielding an event-based fake counterpart for liveness detection. We analyze event-driven temporal features from eye regions and evaluate their effectiveness for ocular motion segmentation and liveness classification. Our results show that event-based representations enable reliable discrimination between genuine and replayed sequences, achieving up to 95.37% top-1 accuracy with a spiking convolutional neural network. These preliminary findings highlight the potential of event-based sensing for robust and low-latency liveness detection.
[155] CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen, Francine L. Jacobson, Emily B. Tsai, Global Radiology Consortium, Ahmed M. Alaa, Curtis P. Langlotz
Main category: cs.CV
Abstract: Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision–language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision–language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought’s multi-reader annotations, we predict both human–human and human–AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision–language models.
[156] The Unseen Adversaries: Robust and Generalized Defense Against Adversarial Patches
Vishesh Kumar, Akshay Agarwal
Main category: cs.CV
Abstract: The vulnerabilities of deep neural networks against singularities have raised serious concerns regarding their deployment in the physical world. One of the most prominent and impactful physical-world adversarial perturbations is the attachment of patches to clean images, known as an adversarial patch attack. Similarly, natural noises such as Gaussian and salt-and-pepper noise are highly prevalent in the real world. The current research need arises from the above vulnerabilities and the lack of efforts to tackle these two singularities independently and, especially, in combination. In this research, we have, for the first time, combined these two prominent singularities and proposed a novel dataset. Using this dataset, we have conducted a benchmark study of singularity data-point detection using features from several convolutional neural networks. For classification, rather than the popular neural network-based parameter tuning, we have used traditional yet effective machine learning classifiers. The extensive experiments across various in- and out-of-distribution (OOD) singularities reveal several interesting findings about the effectiveness of classifiers and show that it is hard to defend against adversaries when they are treated independently and inefficient classifiers are selected.
[157] Point Cloud Registration via Probabilistic Self-Update Local Correspondence and Line Vector Sets
Kuo-Liang Chung, Yu-Cheng Lin, Wu-Chi Chen
Main category: cs.CV
Abstract: Point cloud registration (PCR) is a fundamental task for integrating 3D observations in remote sensing applications. This paper proposes a fast and effective PCR algorithm utilizing probabilistic self-updating local correspondence and line vector sets. Our dual RANSAC interaction model comprises a global RANSAC evaluating the global correspondence set and a local RANSAC operating on dynamically updated local sets. Initially, these local sets are constructed using angle histogram statistics and line vector length preservation techniques. To improve accuracy, a probabilistic self-updating strategy refines the local sets after each interaction round. To reduce runtime, we introduce a global early termination condition that optimally balances accuracy and efficiency. Finally, a weighted singular value decomposition estimates the registration solution. Evaluations on public datasets demonstrate our algorithm achieves superior time efficiency and at least a 10% root mean square error improvement over state-of-the-art methods. The C++ source code is publicly available at https://github.com/ivpml84079/Probabilistic-Self-Update-Line-Vector-Set-Based-Point-Cloud-Registration.
[158] Motion-Driven Multi-Object Tracking of Model Organisms in Space Science Experiments
Jianing You, Han Wang, Kang Liu, Jiale Ding, Fengjie Chu, Zihan Guo, Shengyang Li
Main category: cs.CV
Abstract: Automated animal behavior analysis relies on long-term, interpretable individual trajectories; however, multi-animal tracking in space science experimental videos remains highly challenging due to weak appearance cues, low-quality imaging, complex maneuvering behaviors, and frequent interactions. To address this problem, we first construct the SpaceAnimal-MOT dataset to characterize the motion complexity and long-term identity preservation challenges in biological videos acquired under microgravity conditions. We then propose ART-Track (Adaptive Robust Tracking), a motion-driven tracking framework tailored to this setting. Specifically, multi-model motion estimation is introduced to handle abrupt maneuvers and nonlinear motion, motion-state-driven association is designed to reduce identity switches under dense interactions and temporary mismatch, and uncertainty-adaptive fusion is used to dynamically balance spatial and motion cues when prediction reliability varies. Experimental results show that ART-Track significantly reduces identity switches on zebrafish and fruitfly sequences, while maintaining more stable association under occlusion, deformation, and high-density interactions, thereby providing a more reliable tracking foundation for downstream quantitative behavior analysis. The code is publicly available at https://github.com/yyy7777777/ART_TRACK/tree/main.
[159] Federated Medical Image Classification under Class and Domain Imbalance exploiting Synthetic Sample Generation
Martina Pavan, Matteo Caligiuri, Francesco Barbato, Pietro Zanuttigh
Main category: cs.CV
Abstract: Exploiting deep learning in medical imaging faces critical challenges, including strict privacy constraints, heterogeneous imaging devices with varying acquisition properties, and class imbalance due to the uneven prevalence of pathologies. In this work, we propose FedSSG, a novel Federated Learning framework that addresses domain shifts caused by diverse imaging devices while mitigating the under-representation of rare pathologies. The key contribution is a strategy for generating synthetic samples and distributing them across clients to improve coverage of both underrepresented pathologies and imaging devices. Experimental results demonstrate that our approach significantly enhances model performance and generalization across heterogeneous institutions, with minimal computational overhead at the client side.
[160] ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance
Yang Yang, Feifan Meng, Han Fang, Weiming Zhang
Main category: cs.CV
Abstract: Diffusion models have achieved remarkable success in image generation, yet their training is predominantly driven by full-reference objectives that enforce pixel-wise similarity to ground-truth images. Such supervision, while effective for fidelity, may be insufficient in terms of subjective visual perception quality and text-image semantic consistency. In this work, we investigate the problem of incorporating no-reference perceptual quality into diffusion training. A key challenge is that directly optimizing perceptual signals, such as those provided by no-reference image quality assessment (NR-IQA) models, introduces a mismatch with the original diffusion objective, leading to training instability and distributional drift during fine-tuning. To address this issue, we propose an anchor-constrained optimization framework that enables stable perceptual adaptation. Specifically, we leverage a learned NR-IQA model as a perceptual guidance signal, while introducing an anchor-based regularization that enforces consistency with the base diffusion model in terms of noise prediction. This design effectively balances perceptual quality improvement and generative fidelity, allowing controlled adaptation toward perceptually favorable outputs without compromising the original generative behavior. Extensive experiments demonstrate that our method consistently enhances perceptual quality while preserving generation diversity and training stability, highlighting the effectiveness of anchor-constrained perceptual optimization for diffusion models.
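One plausible way to read the objective described in this abstract is a perceptual reward combined with a penalty on drift from the base model's noise prediction. The sketch below assumes a scalar NR-IQA score where higher is better, a mean-squared anchor penalty, and a weighting factor `anchor_weight`. The exact loss form, the names, and the weighting are assumptions, not the authors' formulation.

```python
def anchor_constrained_loss(quality_score, noise_pred, anchor_noise_pred,
                            anchor_weight=1.0):
    """Hypothetical loss: reward perceptual quality while penalizing deviation
    of the fine-tuned noise prediction from the frozen base (anchor) model."""
    if len(noise_pred) != len(anchor_noise_pred) or not noise_pred:
        raise ValueError("noise predictions must be non-empty and aligned")
    # Mean squared difference between fine-tuned and anchor noise predictions.
    anchor_penalty = sum((a - b) ** 2
                         for a, b in zip(noise_pred, anchor_noise_pred))
    anchor_penalty /= len(noise_pred)
    # Minimizing this raises the quality score and keeps predictions anchored.
    return -quality_score + anchor_weight * anchor_penalty
```

With `anchor_weight` set to zero this reduces to pure perceptual optimization, which is the unstable setting the abstract warns about; larger weights keep the fine-tuned model closer to the base model.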
[161] SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
Haiyi Qiu, Kaihang Pan, Jiacheng Li, Juncheng Li, Siliang Tang, Yueting Zhuang
Main category: cs.CV
Abstract: Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
[162] Which Face and Whose Identity? Solving the Dual Challenge of Deepfake Proactive Forensics in Multi-Face Scenarios
Lei Zhang, Zhiqing Guo, Dan Ma, Gaobo Yang
Main category: cs.CV
Abstract: Unlike single-face forgeries, deepfakes in complex multi-person interaction scenarios (such as group photos and multi-person meetings) more closely reflect real-world threats. Although existing proactive forensics solutions demonstrate good performance, they heavily rely on a “single-face” setting, making it difficult to effectively address the problems of deepfake localization and source tracing in complex multi-person environments. To address this challenge, we propose the Deep Attributable Watermarking Framework (DAWF). This framework adopts a novel multi-face encoder-decoder architecture that bypasses the cumbersome offline pre-processing steps of traditional forensics, facilitating efficient in-network parallel watermark embedding and cross-face collaborative processing. Crucially, we propose a selective regional supervision loss. This innovative mechanism guides the decoder to focus exclusively on the facial regions tampered with by deepfakes. Leveraging this mechanism alongside the embedded identity payloads, DAWF realizes the “which + who” goal, answering the dual questions of which facial region was forged and who was forged. Extensive experiments on challenging multi-face datasets show that DAWF achieves excellent deepfake localization and traceability in complex multi-person scenes.
[163] GateMOT: Q-Gated Attention for Dense Object Tracking
Mingjin Lv, Zelin Liu, Feifei Shao, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, Zikai Song
Main category: cs.CV
Abstract: While large models demonstrate the strong representational power of vanilla attention, this core mechanism cannot be directly applied to Dense Object Tracking: its quadratic all-to-all interactions are computationally prohibitive for dense motion estimation on high-resolution features. This mismatch prevents Dense Object Tracking from fully leveraging attention-based modeling in crowded and occlusion-heavy scenes. To address this challenge, we introduce GateMOT, an online tracking framework centered on Q-Gated Attention (Q-Attention), an efficient and spatially aware attention variant. Our key idea is to repurpose the Query from a similarity-conditioning term into a learnable gating unit. This Gating-Query (Gating-Q) produces a probabilistic gate that modulates Key features in an element-wise manner, enabling explicit relevance selection instead of costly global aggregation. Built on this mechanism, parallel Q-Attention heads transform one shared feature map into task-specific yet consistent representations for detection, motion, and re-identification, yielding a tightly coupled multi-task decoder with linear-complexity gating operations. GateMOT achieves state-of-the-art HOTA of 48.4, MOTA of 67.8, and IDF1 of 64.5 on BEE24, and demonstrates strong performance on additional Dense Object Tracking benchmarks. These results show that Q-Attention is a simple, effective, and transferable building block for attention-based tracking in dense tracking scenarios.
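The gating idea in this abstract can be illustrated with a small sketch in which the query projection yields a sigmoid gate that scales the key projection element-wise at each spatial position. The weight matrices, the use of a sigmoid, the omission of a value projection, and all names are assumptions for illustration; this is not the authors' implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def q_gated_attention(features, w_query, w_key):
    """Gate each position's key features with a query-derived probability.

    features: list of N position vectors. Each position is processed
    independently, so the cost grows linearly with N rather than N^2.
    """
    outputs = []
    for x in features:
        gate = [sigmoid(v) for v in matvec(w_query, x)]  # values in (0, 1)
        key = matvec(w_key, x)
        outputs.append([g * k for g, k in zip(gate, key)])
    return outputs
```

Unlike vanilla attention, no position attends to any other here, which is what removes the quadratic all-to-all term mentioned in the abstract.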
[164] Delineating Knowledge Boundaries for Honest Large Vision-Language Models
Junru Song, Yimeng Hu, Yijing Chen, Huining Li, Qian Li, Lizhen Cui, Yuntao Du
Main category: cs.CV
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable multimodal performance yet remain prone to factual hallucinations, particularly in long-tail or specialized domains. Moreover, current models exhibit a weak capacity to refuse queries that exceed their parametric knowledge. In this paper, we propose a systematic framework to enhance the refusal capability of VLMs when facing such unknown questions. We first curate a model-specific “Visual-Idk” (Visual-I don’t know) dataset, leveraging multi-sample consistency probing to distinguish between known and unknown facts. We then align the model using supervised fine-tuning followed by preference-aware optimization (e.g., DPO, ORPO) to effectively delineate its knowledge boundaries. Results on the Visual-Idk dataset show our method improves the Truthful Rate from 57.9% to 67.3%. Additionally, internal probing also demonstrates that the model genuinely recognizes its boundaries instead of just memorizing refusal patterns. Our framework further generalizes to out-of-distribution medical and perceptual domains, providing a robust path toward more trustworthy and prudent visual assistants.
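The multi-sample consistency probing step in this abstract can be sketched as follows: sample several answers per question, mark the fact as known when enough samples agree with the reference, and assign a refusal target otherwise. The agreement threshold, the exact-match comparison, the refusal string, and the function names are hypothetical choices, not details from the paper.

```python
def label_knowledge(sampled_answers, reference, min_agreement=0.8):
    """Return 'known' if enough sampled answers match the reference answer."""
    target = reference.strip().lower()
    normalized = [a.strip().lower() for a in sampled_answers]
    matches = sum(1 for a in normalized if a == target)
    agreement = matches / len(normalized)
    return "known" if agreement >= min_agreement else "unknown"


def build_idk_dataset(probes, refusal="I don't know."):
    """Turn (question, reference, sampled_answers) triples into training
    targets: the reference for known facts, a refusal for unknown ones."""
    dataset = []
    for question, reference, samples in probes:
        label = label_knowledge(samples, reference)
        target = reference if label == "known" else refusal
        dataset.append({"question": question, "target": target, "label": label})
    return dataset
```

Because the labels depend on what a particular model answers consistently, the resulting dataset is model-specific, which matches the abstract's description of the Visual-Idk data.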
[165] CO-EVO: Co-evolving Semantic Anchoring and Style Diversification for Federated DG-ReID
Fengchun Zhang, Qiang Ma, Liuyu Xiang, Jinshan Lai, Tingxuan Huang, Jianwei Hu
Main category: cs.CV
Abstract: Federated domain generalization for person re-identification (FedDG-ReID) aims to collaboratively train a pedestrian retrieval model across multiple decentralized source domains such that it can generalize to unseen target environments without compromising raw data privacy. However, this task is significantly challenged by the inherent stylistic gaps across decentralized clients. Without global supervision, models easily succumb to shortcut learning, where representations overfit to domain-specific camera biases rather than universal identity features. We propose CO-EVO, a novel federated framework that resolves this semantic-style conflict through a co-evolutionary mechanism. On the semantic side, Camera-Invariant Semantic Anchoring (CSA) learns identity prompts with cross-camera consistency to establish purified and domain-agnostic anchors that filter out local imaging noise. On the visual side, Global Style Diversification (GSD), powered by a Global Camera-Style Bank (GCSB), synthesizes realistic perturbations to expand the visual boundaries of training data. The core of CO-EVO is its co-evolutionary loop, where purified anchors act as gravitational centers to guide the image encoder toward robust anatomical attributes amidst diverse style variations. Extensive experiments demonstrate that CO-EVO achieves state-of-the-art (SOTA) performance, indicating that the synergy between semantic purification and style expansion is essential for robust cross-domain generalization. Our code is available at: https://github.com/NanYiyuzurn/ACL-LGPS-2026.
[166] QYOLO: Lightweight Object Detection via Quantum Inspired Shared Channel Mixing
Garvit Kumar Mittal, Sahil Tomar, Sandeep Kumar
Main category: cs.CV
Abstract: The rapid advancement of object detection architectures has positioned single-stage detectors as the dominant solution for real-time visual perception. A primary source of computational overhead in these models lies in the deep backbone stages, where C2f bottleneck modules at high stride levels accumulate a disproportionate share of parameters due to quadratic scaling with channel width. This work introduces QYOLO, a quantum-inspired channel mixing framework that achieves genuine architectural compression by replacing the two deepest backbone C2f modules at P4/16 (512 channels) and P5/32 (1024 channels) with a compact QMixBlock. The proposed block performs global channel recalibration through a sinusoidal mixing mechanism with shared learnable parameters across both backbone stages, enforcing consistent channel importance without requiring independent per-stage parameter sets. The neck and detection head remain fully classical and unchanged. Evaluation on the VisDrone2019 benchmark demonstrates that QYOLOv8n achieves a 20.2% reduction in parameter count (3.01M to 2.40M) and a 12.3% GFLOPs reduction with only 0.4 pp mAP@50 degradation. QYOLOv8s achieves a 21.8% reduction with 0.1 pp degradation. When combined with knowledge distillation, full accuracy parity is recovered at no cost to compression. An expanded backbone-plus-neck variant achieves a 38 to 41% reduction at the cost of greater accuracy degradation, motivating the backbone-only final design.
[167] Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
Zhirong Shen, Rui Huang, Jiacheng Liu, Chang Zou, Peiliang Cai, Shikang Zheng, Zhengyi Shi, Liang Feng, Linfeng Zhang
Main category: cs.CV
Abstract: To address the high sampling cost of Diffusion Transformers (DiTs), feature caching offers a training-free acceleration method. However, existing methods rely on hand-crafted forecasting formulas that fail under aggressive skipping. We propose L2P (Learnable Linear Predictor), a simple data-driven caching framework that replaces fixed coefficients with learnable per-timestep weights. Rapidly trained in ~20 seconds on a single GPU, L2P accurately reconstructs current features from past trajectories. L2P significantly outperforms existing baselines: it achieves a 4.55x FLOPs reduction and 4.15x latency speedup on FLUX.1-dev, and maintains high visual fidelity under up to 7.18x acceleration on Qwen-Image models, where prior methods show noticeable quality degradation. Our results show learning linear predictors is highly effective for efficient DiT inference. Code is available at https://github.com/Aredstone/L2P-Cache.
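The contrast between a fixed forecasting formula and learned coefficients can be shown with a toy sketch. It assumes scalar features, a two-tap predictor, and a closed-form least-squares fit; the paper's actual predictor form, per-timestep weighting, and training procedure are not given in the abstract, so all names and data here are hypothetical.

```python
def fit_two_tap(triples):
    """Least-squares fit of f_t ~ w1 * f_{t-1} + w2 * f_{t-2}.

    triples: list of (f_tm2, f_tm1, f_t) calibration samples.
    Solves the 2x2 normal equations in closed form.
    """
    a11 = sum(p1 * p1 for _, p1, _ in triples)
    a12 = sum(p1 * p2 for p2, p1, _ in triples)
    a22 = sum(p2 * p2 for p2, _, _ in triples)
    b1 = sum(p1 * y for _, p1, y in triples)
    b2 = sum(p2 * y for p2, _, y in triples)
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def squared_error(triples, w1, w2):
    return sum((w1 * p1 + w2 * p2 - y) ** 2 for p2, p1, y in triples)

# Toy calibration data generated by f_t = 1.5 * f_{t-1} - 0.5 * f_{t-2}.
calibration = [(1.0, 2.0, 2.5), (2.0, 1.0, 0.5), (0.0, 3.0, 4.5), (1.0, 1.0, 1.0)]

w1, w2 = fit_two_tap(calibration)
learned_err = squared_error(calibration, w1, w2)
# Hand-crafted baseline: linear extrapolation f_t ~ 2 * f_{t-1} - f_{t-2}.
fixed_err = squared_error(calibration, 2.0, -1.0)
```

On this toy data the fitted weights recover the generating coefficients, while the fixed extrapolation rule leaves a residual error.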
[168] Seamless Indoor-Outdoor Mapping for INGENIOUS First Responders
Jürgen Wohlfeil, Henry Meißner, Adrian Schischmanow, Thomas Kraft, Dirk Baumbach, Ines Ernst, Dennis Dahlke
Main category: cs.CV
Abstract: Several applications require 3D models not only of outdoor spaces but also of building interiors. In the context of First Responder enhancement in large-scale natural and man-made disasters, a method is presented to achieve this goal with a high degree of automation. To this end, an autonomously flying aerial mapping system is combined with a person-carried indoor positioning system. Automatically recognized markers (AprilTags) are geo-referenced by the aerial system and their coordinates are sent to the ground-based system. When the ground-based system observes the AprilTags before entering the building, it is registered to world coordinates. Without any further need for global positioning, it creates a point cloud of the indoor spaces that fits with the point cloud from the aerial view. This allows a co-visualization of both point clouds as a seamless indoor-outdoor 3D model in real time.
[169] Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning
Junwon You, Mihyun Jang, Sangwoo Mo, Jae-Hun Jung
Main category: cs.CV
Abstract: Vision-language models have shown strong performance, but they often generalize poorly to specialized domains. While semi-supervised vision-language learning mitigates this limitation by leveraging a small set of labeled image-text pairs together with abundant unlabeled images, existing methods remain fundamentally pairwise and fail to model the global structure of multimodal representation manifolds. Existing topology-based alignment methods rely on persistence diagram matching, which neither guarantees geometric alignment nor utilizes the image-text pairing information central to vision-language learning. We propose Topology-Aware Multimodal Representation Alignment (ToMA), a framework that uses persistent homology to identify topologically salient edges and aligns them across modalities through available cross-modal correspondences. ToMA leverages both H_0-death edges and lightweight H_1-birth edges, allowing it to capture both connectivity and cycle structure without constructing 2-simplices. Experiments show that ToMA yields stable gains, with clear improvements on remote sensing and modest but consistent benefits on fashion retrieval. Additional analysis shows that ToMA is more stable than alternative topology-based objectives and that lightweight H_1-birth edges provide useful higher-order structural signals.
[170] A Multimodal Pre-trained Network for Integrated EEG-Video Seizure Detection
Tong Lu, Ke Xu, Zimo Zhang, Zitong Zhao, Danwei Weng, Ruiyu Wang, Miao Liu, Zizuo Zhang, Jingyi Yao, Yixuan Zhao, Wenchao Zhang, Min Wang, Guoming Luan, Minmin Luo, Zhifeng Yue
Main category: cs.CV
Abstract: Reliable seizure detection in mouse models is essential for preclinical epilepsy research, yet manual review of synchronized video-EEG recordings is labor-intensive and single-modality systems fail for complementary reasons: video-based methods are easily confounded by benign behaviors, whereas EEG-based methods are vulnerable to ictal motion artifacts. We present EEGVFusion, a multimodal framework that combines self-supervised EEG representation learning, spatio-temporal video encoding, optimal-transport alignment, and bidirectional cross-attention to integrate neural and behavioral evidence. We also curate an expert-annotated dataset of synchronized EEG and video recordings comprising 93 sessions from 15 mice for training and evaluation. In the random-session split, EEGVFusion achieved a Balanced Accuracy of 0.9957 with perfect event sensitivity and an Event FAR of 0.6250 FP/h, indicating strong seizure detection performance with a low false-alarm burden. In a single held-out-subject evaluation with Subject 110 reserved for testing, EEGVFusion achieved a Balanced Accuracy of 0.9718 and reduced Event FAR from 2.7250 FP/h for the EEG-only counterpart to 0.4833 FP/h while preserving perfect event sensitivity. Targeted ablations further showed that EEG pre-training and OT alignment help reduce false alarms while preserving event sensitivity.
[171] Decoupled Prototype Matching with Vision Foundation Models for Few-Shot Industrial Object Detection
Hari Prasanth S. M., Nilusha Jayawickrama, Risto Ojala
Main category: cs.CV
Abstract: Industrial object detection systems typically rely on large annotated datasets, which are expensive to collect and challenging to maintain in industrial scenarios where the inventory of objects changes frequently. This work addresses the challenge of few-shot object detection in such industrial scenarios, where only a limited number of labeled samples are available for newly introduced objects. We present a detection framework that leverages vision foundation models to recognize objects with minimal supervision. The method constructs class prototypes from a small set of reference samples by extracting feature representations. For a given query scene during inference, object regions are generated using a segmentation model, and feature embeddings are extracted and matched with class prototypes using similarity matching. We evaluate the detection method on three established industrial datasets from the Benchmark for 6D Object Pose Estimation (BOP), following the official 2D object detection evaluation protocol. We demonstrate competitive detection performance, improving AP by 6.9% compared to state-of-the-art training-free detection methods. Furthermore, the presented method is able to onboard new objects using only a few reference images, without requiring any CAD models or large annotated datasets. These properties make the approach well-suited for real-world industrial applications.
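The prototype construction and matching steps can be sketched as below. The sketch assumes mean-pooled prototypes, cosine similarity, and a rejection threshold of 0.5; the embeddings are toy 2D vectors rather than foundation-model features, and every name is hypothetical.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length embeddings component-wise."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_prototypes(references):
    """references: {class_label: [embedding, ...]} from a few reference images."""
    return {label: mean_vector(vecs) for label, vecs in references.items()}

def classify(embedding, prototypes, min_similarity=0.5):
    """Return (label, score); label is None when no prototype is similar enough."""
    label, score = max(
        ((lbl, cosine(embedding, proto)) for lbl, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    return (label if score >= min_similarity else None), score

# Toy reference embeddings for two object classes.
prototypes = build_prototypes({
    "bolt": [[1.0, 0.0], [0.9, 0.1]],
    "nut": [[0.0, 1.0], [0.1, 0.9]],
})
label, score = classify([0.8, 0.2], prototypes)
```

Adding a new object only requires adding its reference embeddings to the dictionary, which is the property that makes training-free onboarding possible.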
[172] Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection
Ahyoung Oh, Wonseok Shin, Songkuk Kim
Main category: cs.CV
Abstract: Sparse Autoencoders (SAEs) have demonstrated significant success in interpreting Large Language Models (LLMs) by decomposing dense representations into sparse, semantic components. However, their potential for analyzing Vision Transformers (ViTs) remains largely under-explored. In this work, we present the first application of SAEs to the ViT [CLS] token for out-of-distribution (OOD) detection, addressing the limitation of existing methods that rely on entangled feature representations. We propose a novel framework utilizing a Top-k SAE to disentangle the dense [CLS] features into a structured latent space. Through this analysis, we reveal that in-distribution (ID) data exhibits consistent, class-specific activation patterns, which we formalize as Class Activation Profiles (CAPs). Our study uncovers a key structural invariant: while ID samples preserve a stable pattern within CAPs, OOD samples systematically disrupt this structure. Leveraging this insight, we introduce a scoring function based on the divergence of core energy profiles to quantify the deviation from ideal activation profiles. Across multiple benchmarks, our method achieves strong results on the FPR95 metric, which is critical for safety-sensitive applications, while also achieving competitive AUROC. Overall, our findings demonstrate that the sparse, disentangled features revealed by SAEs can serve as a powerful, interpretable tool for robust OOD detection in vision models.
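The Top-k sparsification step is simple enough to sketch directly. The per-class profile below is computed as a plain mean of sparse codes, which is an assumption for illustration; the abstract does not define how Class Activation Profiles or the energy-based score are computed, and all names are hypothetical.

```python
def top_k_activation(pre_activations, k):
    """Keep the k largest latent pre-activations and zero out the rest."""
    order = sorted(range(len(pre_activations)),
                   key=lambda i: pre_activations[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]

def class_activation_profile(sparse_codes):
    """Assumed profile: mean sparse code over samples of one class."""
    return [sum(col) / len(sparse_codes) for col in zip(*sparse_codes)]

# Toy latent pre-activations for one sample (not real SAE outputs).
code = top_k_activation([0.1, 2.0, -1.0, 0.7], k=2)

# Toy sparse codes for two samples of the same class.
profile = class_activation_profile([[0.0, 2.0, 0.0, 0.7],
                                    [0.0, 1.0, 0.5, 0.0]])
```

With k fixed, every sample activates the same number of latents, so differences between samples show up in which latents fire and how strongly.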
[173] Star-Fusion: A Multi-modal Transformer Architecture for Discrete Celestial Orientation via Spherical Topology
May Hammad, Menatallh Hammad
Main category: cs.CV
Abstract: Reliable celestial attitude determination is a critical requirement for autonomous spacecraft navigation, yet traditional “Lost-in-Space” (LIS) algorithms often suffer from high computational overhead and sensitivity to sensor-induced noise. While deep learning has emerged as a promising alternative, standard regression models are often confounded by the non-Euclidean topology of the celestial sphere and by the periodic boundary conditions of Right Ascension (RA) and Declination (Dec). In this paper, we present Star-Fusion, a multi-modal architecture that reformulates orientation estimation as a discrete topological classification task. Our approach leverages spherical K-Means clustering to partition the celestial sphere into K topologically consistent regions, effectively mitigating coordinate wrapping artifacts. The proposed architecture employs a tripartite fusion strategy: a SwinV2-Tiny transformer backbone for photometric feature extraction, a convolutional heatmap branch for spatial grounding, and a coordinate-based MLP for geometric anchoring. Experimental evaluations on a synthetic Hipparcos-derived dataset demonstrate that Star-Fusion achieves a Top-1 accuracy of 93.4% and a Top-3 accuracy of 97.8%. Furthermore, the model exhibits high computational efficiency, maintaining an inference latency of 18.4 ms on resource-constrained COTS hardware, making it a viable candidate for real-time onboard deployment in next-generation satellite constellations.
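Spherical K-Means avoids coordinate wrapping because it clusters unit vectors by dot product instead of comparing raw RA/Dec angles. The following is a minimal sketch under that standard formulation; the sample pointings, the initial centroids, and the function names are hypothetical and unrelated to the paper's dataset.

```python
import math

def radec_to_unit(ra_deg, dec_deg):
    """Convert Right Ascension / Declination in degrees to a 3D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def assign(point, centroids):
    """Index of the centroid with the largest dot product (smallest angle)."""
    return max(range(len(centroids)),
               key=lambda k: sum(a * b for a, b in zip(point, centroids[k])))

def spherical_kmeans(points, centroids, iterations=10):
    centroids = [normalize(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[assign(p, centroids)].append(p)
        # New centroid: normalized sum of members; empty clusters keep their centroid.
        centroids = [normalize(tuple(sum(col) for col in zip(*g))) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

# Pointings on both sides of the RA = 0/360 seam, plus a group near RA = 180.
points = [radec_to_unit(ra, 0.0) for ra in (359.0, 1.0, 0.5, 179.0, 180.0, 181.0)]
centroids = spherical_kmeans(points, [radec_to_unit(350.0, 0.0), radec_to_unit(170.0, 0.0)])
```

RA 359 degrees and RA 1 degree land in the same cluster here, whereas a regression target on raw RA would treat them as far apart.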
[174] Are Data Augmentation and Segmentation Always Necessary? Insights from COVID-19 X-Rays and a Methodology Thereof
Aman Swaraj, Arnav Agarwal, Hitendra Singh Bhadouria, Sandeep Kumar, Karan Verma
Main category: cs.CV
Abstract: Purpose: Rapid and reliable diagnostic tools are crucial for managing respiratory diseases like COVID-19, where chest X-ray analysis coupled with artificial intelligence techniques has proven invaluable. However, most existing works on X-ray images have not considered lung segmentation, raising concerns about their reliability. Additionally, some have employed disproportionate and impractical augmentation techniques, making models less generalized and prone to overfitting. This study presents a critical analysis of both issues and proposes a methodology (SDL-COVID) for more reliable classification of chest X-rays for COVID-19 detection. Methods: We use class activation mapping to obtain a visual understanding of the predictions made by Convolutional Neural Networks (CNNs), validating the necessity of lung segmentation. To analyze the effect of data augmentation, deep learning models are implemented on two levels: one for an augmented dataset and another for a non-augmented dataset. Results: Careful analysis of X-ray images and their corresponding heat maps under expert medical supervision reveals that lung segmentation is necessary for accurate COVID-19 prediction. Regarding data augmentation, test accuracy significantly drops beyond a certain threshold with additional augmented images, indicating model overfitting. Conclusion: Our proposed methodology, SDL-COVID, achieves a precision of 95.21% and a lower false negative rate, ensuring its reliability for COVID-19 detection using chest X-rays.
[175] Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints
Wasim Ahmad, Wei Zhang, Xuerui Mao
Main category: cs.CV
Abstract: Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches are binary classification tasks that often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding against a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.
[176] Last-Layer-Centric Feature Recombination: Unleashing 3D Geometric Knowledge in DINOv3 for Monocular Depth Estimation
Gongshu Wang, Zhirui Wang, Kan Yang
Main category: cs.CV
Abstract: Monocular depth estimation (MDE) is a fundamental yet inherently ill-posed task. Recent vision foundation models (VFMs), particularly DINO-based transformers, have significantly improved accuracy and generalization for dense prediction. Prior works generally follow a unified paradigm: sampling a fixed set of intermediate transformer layers at uniform intervals to build multi-scale features. This common practice implicitly assumes that geometric information is uniformly distributed across layers, which may underutilize the structural 3D cues encoded in VFMs. In this study, we present a systematic layer-wise analysis of DINOv3, revealing that 3D information is distributed non-uniformly: deeper layers exhibit stronger depth predictability and better capture inter-sample geometric variation. Motivated by this, we introduce a Last-Layer-Centric Feature Recombination (LFR) module to enhance geometric expressiveness. LFR treats the final layer as a geometric anchor and adaptively selects complementary intermediate layers according to a minimal-similarity criterion. Selected features are fused with the last-layer representation via compact linear adapters. Extensive experiments show that the LFR module consistently improves MDE accuracy and achieves state-of-the-art performance. Our analysis sheds light on how geometric knowledge is organized within VFMs and offers an efficient strategy for unlocking their potential in dense 3D tasks.
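One plausible reading of the minimal-similarity criterion is to pick the intermediate layers whose features are least similar to the last layer. The sketch below assumes that reading, cosine similarity, and one pooled feature vector per layer; the paper may define the criterion differently, and the names and values are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_complementary_layers(layer_features, num_select):
    """Return indices of the intermediate layers least similar to the last layer.

    layer_features: one pooled feature vector per layer, last layer at the end.
    """
    anchor = layer_features[-1]
    scored = sorted(
        (cosine(feat, anchor), idx) for idx, feat in enumerate(layer_features[:-1])
    )
    return sorted(idx for _, idx in scored[:num_select])

# Toy pooled features for four layers; the last entry is the anchor layer.
layer_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]]
selected = select_complementary_layers(layer_features, num_select=2)
```

Layers that already resemble the anchor add little new information, so this rule favors the ones that differ most from it.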
[177] SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
Paul Julius Kühn, Mika Pommeranz, Arjan Kuijper, Saptarshi Neil Sinha
Main category: cs.CV
Abstract: Learning-based industrial defect detection is often limited not by model capacity but by the scarcity of labeled defect data: defects are rare, annotations are expensive, and collecting balanced training sets is slow. We present an end-to-end pipeline for synthetic defect generation and annotation, combining Vision-Language-Model-based prompts, LoRA-adapted diffusion, mask-guided inpainting, and sample filtering with automatic label derivation, and demonstrate the potential of augmenting real data with realistic synthetic samples to overcome data scarcity. The evaluation is conducted on BSData, a challenging dataset of pitting defects on ball screw drives, and then on a subset of the Mobile phone screen surface defect segmentation dataset (MSD) to test cross-domain transfer. Beyond downstream detector performance, we analyze key stages of the pipeline, including prompt construction, LoRA selection, and sample filtering with DreamSim and CLIPScore, to understand which synthetic samples are both realistic and useful. Experiments with YOLOv26, YOLOX, and LW-DETR show that synthetic-only training does not replace real data. When combined with real data, synthetic defects can preserve performance and yield modest gains in selected BSData training regimes. The MSD transfer study shows that the overall pipeline structure carries over to a second industrial inspection domain, while also highlighting the importance of domain-specific adaptation and annotation-quality control. Overall, the paper provides an end-to-end assessment of diffusion-based industrial defect synthesis and shows that its strongest value lies in strengthening scarce real datasets rather than substituting for them.
[178] $\text{PKS}^4$: Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding
Lingjie Zeng, Hailun Zhang, Xiwen Wang, Qijun Zhao
Main category: cs.CV
Abstract: Temporal modeling remains a fundamental challenge in video understanding, particularly as sequence lengths scale. Traditional video models relying on dense spatiotemporal attention suffer from quadratic computational costs for long videos. To circumvent these costs, recent approaches adapt image models for videos via Parameter-Efficient Fine-Tuning (PEFT) methods such as adapters. However, deeply inserting these modules incurs prohibitive activation memory overhead during back-propagation. While recent efficient State Space Models (SSMs) introduce linear complexity, they disrupt 2D spatial relationships and rely on extensive masked pre-training to recover spatial awareness. To overcome these limitations, we propose Parallel Kinematic Selective State Space Scanners (PKS$^4$). We retain a standard 2D vision backbone for spatial semantics and insert a single plug-and-play PKS$^4$ module with linear-complexity temporal scanning, avoiding temporal attention and multi-layer adapters. We first extract kinematic priors via a Kinematic Prior Encoder, which captures local displacements and motion boundaries through inter-frame correlations and differences. These priors drive linear-complexity SSMs to track underlying kinematic states, adaptively modulating update speeds and read-write strategies at each time step. Instead of global scanning, we deploy parallel scanners along the temporal dimension for each spatial location, preserving spatial structures while reducing overhead. Experiments on spatial-heavy and temporal-heavy action recognition benchmarks show that PKS$^4$ achieves state-of-the-art performance. Remarkably, our method converges in merely $20$ epochs, achieving approximately $10\times$ lower training compute than pure video SSMs, establishing a new paradigm for efficient video understanding.
[179] A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
Yuxuan Han, Yuanxing Zhang, Yushuo Wang, Yichao Jin, Kenneth Zhu Ke, Jingyuan Zhao
Main category: cs.CV
Abstract: Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically not machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision-language models achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multipage documents. We evaluated the framework on 120 production KYC documents comprising about 3000 multilingual scanned pages. Across multiple OCR-VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
[180] Cross-Domain Transfer of Hyperspectral Foundation Models
Nick Theisen, Peer Neubert
Main category: cs.CV
Abstract: Hyperspectral imaging (HSI) semantic segmentation typically relies on in-domain training, but limited data availability often restricts model performance in real-world applications. Current approaches to leverage foundation models in proximal sensing use cross-modality techniques, bridging RGB and HSI to exploit vision foundation models. However, these methods either discard spectral information or introduce architectural complexity. We propose cross-domain transfer as an alternative, reusing HSI foundation models, originally trained in remote sensing, for proximal sensing applications. By eliminating the need to bridge modality gaps, our approach preserves spectral information while maintaining a simple architecture. Using the HS3-Bench benchmark, we systematically evaluate and compare conventional in-domain, in-modality training, cross-modality transfer, and cross-domain transfer strategies. Our results demonstrate that cross-domain transfer achieves large performance improvements over in-domain, in-modality training, reduces the performance gap to cross-modality approaches, and maintains strong performance in limited data settings. Thus, this work advances more effective HSI semantic segmentation in diverse applications.
[181] Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
Nikita Araslanov, Martin Sundermeyer, Hidenobu Matsuki, David Joseph Tan, Federico Tombari
Main category: cs.CV
Abstract: One of the most exciting applications of vision models involves pixel-level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio-temporal properties of visual scenes at the pixel level. Existing frameworks either train on image-based pretext tasks, which do not account for dynamic elements, or on video sequences for action-level reasoning, which does not scale to dense pixel-level prediction. We present LILA, a framework that learns pixel-accurate feature descriptors from videos. The core element of our training framework is linear in-context learning. LILA leverages spatio-temporal cue maps (depth and motion) estimated with off-the-shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation, and semantic segmentation.
[182] Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training
Yanyun Wang, Qingqing Ye, Li Liu, Zi Liang, Haibo Hu
Main category: cs.CV
Abstract: Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: varying input perturbation intensities for training samples near decision boundaries in AT has minimal impact on model robustness. This finding directly exposes the inconsistency between accuracy and robustness score fluctuations, leading us to identify the misalignment between input and latent spaces as a critical driver of the robustness-accuracy trade-off. To mitigate this misalignment and harmonize accuracy and robustness, we define Robust Alignment as a new AT target, encouraging the model's perception to change with input perturbations provided the final label prediction remains unchanged, which can be achieved via two novel ideas. First, we suggest a reduced and fixed perturbation intensity for those boundary samples, which helps the model treat the perturbations as learnable patterns instead of noise that needlessly complicates decision boundaries. Second, we propose a Domain Interpolation Consistency Adversarial Regularization (DICAR), based on rigorous theoretical derivations, which explicitly introduces semantic alignment between input and latent spaces into AT. Based on these two ideas, we arrive at a new Robust Alignment Adversarial Training (RAAT) method, effectively harmonizing accuracy and robustness. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-28-10 demonstrate the effectiveness of RAAT in improving the trade-off beyond four common baselines and a total of 14 related state-of-the-art (SOTA) works.
[183] Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
Haosen Li, Wenshuo Chen, Lei Wang, Shaofeng Liang, Bowen Tian, Soning Lai, Yutao Yue
Main category: cs.CV
Abstract: Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented “detail-artifact dilemma”: low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie’s Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.
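To make the contrast in this abstract concrete, the sketch below first applies standard CFG with a single global scale, then a per-position scale chosen from the squared gap between the conditional and unconditional predictions. The function names, the energy definition, the threshold, and the two-level scale rule are assumptions made for illustration only; they are not taken from the SAMG paper.

```python
def cfg_uniform(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance with one global scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def cfg_spatially_adaptive(eps_uncond, eps_cond, s_min, s_max, threshold):
    """Toy per-position guidance (hypothetical rule, not the paper's).

    Energy at each position is the squared gap between the conditional and
    unconditional predictions. High-energy positions get the conservative
    scale s_min; low-energy positions get the aggressive scale s_max.
    """
    out = []
    for u, c in zip(eps_uncond, eps_cond):
        energy = (c - u) ** 2
        scale = s_min if energy > threshold else s_max
        out.append(u + scale * (c - u))
    return out


# Flattened toy "noise predictions" for four positions.
eps_u = [0.0, 0.0, 1.0, 1.0]
eps_c = [0.1, 2.0, 1.1, 3.0]

uniform = cfg_uniform(eps_u, eps_c, scale=7.5)
adaptive = cfg_spatially_adaptive(eps_u, eps_c, s_min=2.0, s_max=7.5, threshold=1.0)
```

In this toy input, the two positions with a large conditional/unconditional gap are extrapolated far less under the adaptive rule than under the uniform one, while the low-gap positions are treated identically.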
[184] MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification
Zuzheng Kuang, Honghao Chang, Boqiang Liang, Haoqian Wang, Lijun He, Fan Li, Haixia Bi
Main category: cs.CV
Abstract: Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.
[185] MTCurv: Deep learning for direct microtubule curvature mapping in noisy fluorescence microscopy images
Achraf Ait Laydi, Sidi Mohamed Sid’El Moctar, Yousef El Mourabit, Hélène Bouvrais
Main category: cs.CV
Abstract: Accurate quantification of the geometry of curvilinear biological structures is essential for understanding cellular mechanics and disease-related morphological alterations. Microtubule curvature is a key descriptor of filament rigidity and mechanical perturbations. However, reliable curvature extraction from fluorescence microscopy images remains challenging due to noise, low contrast, and partial filament visibility. Existing approaches rely on segmentation pipelines with pre- or post-processing, which are highly sensitive to segmentation errors and often fail under adverse imaging conditions. In this work, we propose MTCurv, a deep learning framework for direct, segmentation-free regression of microtubule curvature maps from noisy microscopy images. Leveraging a synthetic dataset with pixel-wise curvature annotations, we reformulated curvature estimation as a regression problem and adapted an attention-based residual U-Net. To reduce hallucinations and enforce spatial coherence, we introduced a gradient-aware loss combining Mean Squared Error with a gradient consistency term. Beyond model and loss design, we evaluated commonly used regression and image quality metrics, revealing that many perceptual and blind metrics are poorly suited for curvature estimation. Correlation-based metrics, particularly Spearman correlation, emerged as more reliable indicators of curvature prediction quality. Experiments on two datasets of increasing difficulty demonstrated that MTCurv accurately recovers local microtubule curvatures, even in the presence of background fluorescence. Ablation studies highlighted the contribution of both residual encoding and attention-based decoding. Overall, this work provides a practical tool for filament curvature analysis and methodological insights for geometry-aware regression in biomedical imaging. Datasets and code are made available.
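The abstract describes the loss only as Mean Squared Error combined with a gradient consistency term. One plausible reading is sketched below, assuming forward finite differences for the spatial gradients and a simple weighted sum; the function names, the `weight` parameter, and the exact form of the gradient term are hypothetical and may differ from the MTCurv implementation.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2D maps."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def forward_diffs(img):
    """Horizontal and vertical forward differences of a 2D map."""
    dx = [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in img]
    dy = [[img[i + 1][j] - img[i][j] for j in range(len(img[0]))]
          for i in range(len(img) - 1)]
    return dx, dy


def gradient_aware_loss(pred, target, weight=0.5):
    """MSE plus a weighted penalty on mismatched spatial gradients.

    The weighting and the finite-difference gradients are assumptions.
    """
    pdx, pdy = forward_diffs(pred)
    tdx, tdy = forward_diffs(target)
    grad_term = mse(pdx, tdx) + mse(pdy, tdy)
    return mse(pred, target) + weight * grad_term


target = [[0.0, 1.0], [2.0, 3.0]]
offset_pred = [[1.0, 2.0], [3.0, 4.0]]   # constant offset, same gradients
spiky_pred = [[0.0, 1.0], [2.0, 5.0]]    # one outlier, distorted gradients

loss_offset = gradient_aware_loss(offset_pred, target)
loss_spiky = gradient_aware_loss(spiky_pred, target)
```

Both predictions have the same plain MSE against the target, but the one that distorts local gradients is penalized more, which is the kind of spatial-coherence pressure the abstract attributes to the gradient term.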
[186] ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
Hui Wang, Hongze Li, Wei Chen, Xiaojin Zhang
Main category: cs.CV
Abstract: Transformer-based architectures have established a dominant paradigm in global semantic perception; however, they remain fundamentally constrained by the profound spatial heterogeneity inherent in natural images. Specifically, the imposition of a uniform global receptive field across regions of varying information density inevitably leads to local feature degradation, particularly in dense conflict zones populated by microscopic targets. To address this mechanistic limitation, we propose ViCrop-Det, a training-free inference framework that introduces adaptive spatial trust region shrinkage. Inspired by the use of attention entropy in anomaly segmentation, ViCrop-Det leverages the detection decoder’s cross-attention distribution as an endogenous probe. By utilizing Spatial Attention Entropy (SAE) to heuristically evaluate local spatial ambiguity, the framework executes dynamic spatial routing, allocating a fixed computational budget exclusively to regions exhibiting both high target saliency and high cognitive uncertainty. By shrinking the spatial trust region and injecting high-frequency localized observations, ViCrop-Det actively resolves spatial ambiguity and recovers fine-grained features without requiring architectural modifications. Extensive evaluations on VisDrone and DOTA-v1.5 demonstrate that ViCrop-Det yields competitive performance enhancements, consistently adding +1-3 mAP@50 to RT-DETR-R50 and Deformable DETR with a marginal 20-23% latency overhead. On MS COCO, $AP_{S}$ improves while $AP_{M}/AP_{L}$ remains stable, indicating precise fine-scale refinement without compromising the global spatial prior. Under compute-matched settings, our adaptive routing strategy comprehensively surpasses uniform slicing baselines, achieving a highly optimized accuracy-speed trade-off.
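As an informal illustration of entropy-guided region selection, the sketch below computes the Shannon entropy of a normalized attention vector and spends a fixed crop budget on the regions that score highest on both saliency and entropy. The product scoring rule, the region tuples, and the function names are hypothetical; the abstract does not specify how ViCrop-Det combines saliency and SAE.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of a non-negative weight vector.

    The weights are normalized to sum to 1; zero entries contribute nothing.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def select_regions(regions, budget):
    """Pick up to `budget` regions with the highest saliency * entropy.

    `regions` is a list of (name, saliency, attention_weights) tuples.
    The product score is an assumed heuristic, not the paper's rule.
    """
    scored = [(sal * attention_entropy(att), name) for name, sal, att in regions]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:budget]]


regions = [
    ("a", 0.9, [1, 1, 1, 1]),  # salient and ambiguous (flat attention)
    ("b", 0.9, [1, 0, 0, 0]),  # salient but unambiguous (peaked attention)
    ("c", 0.1, [1, 1, 1, 1]),  # ambiguous but not salient
]
chosen = select_regions(regions, budget=1)
```

With a budget of one crop, only the region that is both salient and ambiguous is selected.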
[187] GIFGuard: Proactive Forensics against Deepfakes in Facial GIFs via Spatiotemporal Watermarking
Shupeng Che, Zhiqing Guo, Changtao Miao, Dan Ma, Gaobo Yang
Main category: cs.CV
Abstract: The rapid evolution of deepfake technology poses an unprecedented threat to the authenticity of Graphics Interchange Format (GIF) imagery, which serves as a representative of short-loop temporal media in social networks. However, existing proactive forensics works are designed for static images, which limits their applicability to animated GIFs. To bridge this gap, we propose GIFGuard, the first spatiotemporal watermarking framework tailored for deepfake proactive forensics in GIFs. In the embedding stage, we propose the Spatiotemporal Adaptive Residual Encoder (STARE) to ensure robustness against high-level semantic tampering. It employs a 3D convolutional backbone with adaptive channel recalibration to capture globally coherent temporal dependencies. In the extraction stage, we design the Deep Integrity Restoration Decoder (DIRD). It utilizes a spatiotemporal hourglass architecture equipped with 3D attention to restore latent features, allowing for the accurate extraction of watermark signals even under severe facial manipulation. Furthermore, we construct GIFfaces, the first large-scale benchmark dataset curated for GIF proactive forensics to facilitate research in this domain. Extensive results show that GIFGuard achieves high-fidelity visual quality and remarkable robustness performance against deepfakes. Related code and dataset will be released.
[188] 3D-LENS: A 3D Lifting-based Elevated Novel-view Synthesis method for Single-View Aerial-Ground Re-Identification
William Grolleau, Astrid Sabourin, Guillaume Lapouge, Catherine Achard
Main category: cs.CV
Abstract: Aerial-Ground Re-Identification (AG-ReID) is constrained by the viewpoint-domain gap, as drastic viewpoint disparities occlude or distort discriminative features, making cross-viewpoint image retrieval challenging. While existing methods rely on paired cross-view annotations, real-world deployments, such as wilderness search-and-rescue (SAR), often lack target-domain data, requiring retrieval from ground-level references alone. To our knowledge, we are the first to address this challenge by formalizing the Single-View AG-ReID (SV AG-ReID) setting, where models trained on a single real viewpoint must generalize to an unseen viewpoint. We propose 3D Lifting-based Elevated Novel-view Synthesis (3D-LENS), a unified framework combining geometrically-consistent novel view synthesis that leverages large-scale 3D mesh reconstruction, with a robust representation learning scheme to mitigate synthetic-to-real bias. Unlike 2D generative baselines that suffer from geometric inconsistencies or prior 3D methods that are restricted to class-specific templates, our approach ensures view-consistent synthesis across diverse categories without predefined templates that fail to capture fine-grained details, such as carried objects. Extensive experiments demonstrate that our method achieves state-of-the-art performance on SV AG-ReID scenarios. Code and data will be released at https://github.com/TurtleSmoke/3D-LENS.
[189] DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
Mingji Ge, Qirui Chen, Zeqian Li, Weidi Xie
Main category: cs.CV
Abstract: Long-term video understanding requires interpreting complex temporal events and reasoning over procedural activities. While instructional video corpora, like HowTo100M, offer rich resources for model training, they present significant challenges, including noisy ASR transcripts and inconsistent temporal alignments between narration and visual content. In this work, we introduce an automated, training-free pipeline to extract high-quality procedural annotations from in-the-wild instructional videos. Our approach segments videos into coherent shots, filters poorly aligned content, and leverages state-of-the-art multimodal and large language models (Qwen2.5-VL and DeepSeek-R1) to generate structured, temporally grounded procedural steps. This pipeline yields DenseStep2M, a large-scale dataset comprising approximately 100K videos and 2M detailed instructional steps, designed to support comprehensive long-form video understanding. To rigorously evaluate our pipeline, we curate DenseCaption100, a benchmark of high-quality, human-written captions. Evaluations demonstrate strong alignment between our auto-generated steps and human annotations. Furthermore, we validate the utility of DenseStep2M across three core downstream tasks: dense video captioning, procedural step grounding, and cross-modal retrieval. Models fine-tuned on DenseStep2M achieve substantial gains in captioning quality and temporal localization, while exhibiting robust zero-shot generalization across egocentric, exocentric, and mixed-perspective domains. These results underscore the effectiveness of DenseStep2M in facilitating advanced multimodal alignment and long-term activity reasoning. Our dataset is available at https://huggingface.co/datasets/mingjige/DenseStep2M.
[190] AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
Xiaoya Cheng, Rouwan Wu, Xinyi Liu, Zeyu Cui, Yan Liu, Na Zhao, Yu Liu, Maojun Zhang, Shen Yan
Main category: cs.CV
Abstract: Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks – aerial image retrieval, cross-view matching, and multi-view 3D reconstruction – we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.
[191] FunFace: Feature Utility and Norm Estimation for Face Recognition
Žiga Babnik, Fadi Boutros, Naser Damer, Deepak Kumar Jain, Peter Peer, Vitomir Štruc
Main category: cs.CV
Abstract: Face Recognition (FR) is used in a variety of application domains, from entertainment and banking to security and surveillance. Such applications rely on the FR model to be robust and perform well in a variety of settings. To achieve this, state-of-the-art FR models typically use expressive adaptive margin loss functions, which tie the feature norm to concepts related to sample quality, such as recognizability and perceptual image quality. Recently, through the development of Face Image Quality Assessment (FIQA) techniques, biometric utility has become the preferred measure of face-image quality and has been shown to be a better predictor of the usefulness of samples for face recognition compared to more human-centric aspects, such as resolution, blur, and lighting, tied to general image quality. While image quality expressed through feature norms exhibits a certain level of correlation with biometric utility, it does not fully encapsulate all aspects of utility. To address this point, we propose a new adaptive margin loss, FunFace (Face Recognition Through Utility and Norm Estimation), which incorporates biometric utility, estimated by the Certainty Ratio, into the adaptive margin, taking inspiration from AdaFace. We show that FunFace (when used to train a face recognition model) achieves results competitive with other state-of-the-art FR models on benchmarks containing high-quality samples, while surpassing them on low-quality benchmarks.
[192] State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
Yuanze Hu, Gen Li, Yuqin Lan, Qingchen Yu, Zhichao Yang, Junwei Jing, Zhaoxin Fan, Xiaotie Deng
Main category: cs.CV
Abstract: Multimodal large language models (MLLMs) have achieved impressive progress on general multimodal tasks, yet they remain brittle on dial-based measurement reading. In this paper, we study this problem through controlled benchmarks and feature-space probing, and show that current MLLMs not only achieve unsatisfactory accuracy on dial-based readout, but also suffer sharp performance drops under viewpoint and illumination changes even when the underlying dial state remains fixed. Our probing analysis further reveals that same-state samples under appearance variation are not consistently clustered, while neighboring states fail to preserve the local structure implied by continuous dial values. These findings suggest that existing MLLMs largely ignore the intrinsic state geometry of dial measurement tasks and instead rely on superficial appearance cues. Motivated by this diagnosis, we propose TriSCA, a tri-level state-consistent alignment framework for dial-based measurement reading. Specifically, TriSCA consists of state-distance-aware representation alignment, metadata-grounded observation-to-state supervision, and state-aware objective alignment. Extensive ablation studies and evaluation experiments on controlled clock and gauge benchmarks, together with evaluation on an external real-world benchmark, demonstrate the effectiveness of our method.
[193] SnapPose3D: Diffusion-Based Single-Frame 2D-to-3D Lifting of Human Poses
Alessandro Simoni, Riccardo Catalini, Davide Di Nucci, Guido Borghi, Davide Davoli, Lorenzo Garattoni, Gianpiero Francesca, Yuki Kawana, Roberto Vezzani
Main category: cs.CV
Abstract: Depth ambiguity and joint uncertainty are the two main obstacles in obtaining accurate human pose predictions by 2D-to-3D lifting methods proposed in the literature. In particular, these issues are caused by 2D joint locations that can be mapped to multiple 3D positions, inducing multiple possible final poses. Following these considerations, we propose leveraging the generative capability of diffusion-based models to predict multiple hypotheses and aggregate them into a final accurate pose. Therefore, we introduce SnapPose3D, a pose-lifting framework trained deterministically to denoise 3D poses conditioned on both visual context and 2D pose features. SnapPose3D adopts a probabilistic approach during inference, generating multiple hypotheses through random sampling from a unit Gaussian distribution. Unlike most previous methods that address pose ambiguity by processing temporal sequences, SnapPose3D uses single frames as input, avoiding tracking and limiting computational cost, data acquisition complexity, and the need for online, real-time applications. We extensively evaluate SnapPose3D on well-known benchmarks for the 3D human pose estimation task, showing its ability to generate and aggregate accurate hypotheses that lead to state-of-the-art results.
[194] Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations
Shai Bagon, Matan Kichler, Mark Sheinin
Main category: cs.CV
Abstract: Optical vibration sensing enables recovering the scene sound directly from the surface vibration of nearby objects, turning everyday objects into “visual microphones”. However, most prior methods have focused on capturing the vibrations of specific objects with highly favorable vibration responses. These include objects where the surface vibrations are generated by the object itself (e.g., speaker membrane or guitar body) or objects consisting of a thin membrane which is highly reactive to sound (e.g., a chip bag or the leaf of a plant). In this paper, we tackle sound recovery for a more challenging class of solid objects whose vibration responses are poor or highly resonant. We simultaneously capture vibrations for multiple surface points on the object using a speckle-based vibrometry imaging system. Then, we derive a novel physics-guided vibration formation model that relates the scene sound source to the captured multi-point multi-axis vibrations via the object’s vibrational modes. The model is then used to reverse the resonant transfer function of the vibrating object, fusing multiple vibration signals to estimate the original sound source in the scene. We evaluate our approach by recovering sound from a variety of everyday objects, demonstrating that it significantly outperforms traditional single-point speckle vibrometry in challenging scenarios and other signal-processing-based methods for multi-signal fusion.
[195] CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
Guiyi Zeng, Junqing Yu, Yi-Ping Phoebe Chen, Xu Chen, Wei Yang, Zikai Song
Main category: cs.CV
Abstract: Recent advances in self-evolution video understanding frameworks have demonstrated the potential of autonomous learning without human annotations. However, existing methods often suffer from weakly controlled optimization and uncontrolled difficulty progression, as they lack structured guidance throughout the iterative learning process. To address these limitations, we propose CurEvo, a curriculum-guided self-evolution framework that introduces curriculum learning into self-evolution to achieve more structured and progressive model improvement. CurEvo dynamically regulates task difficulty, refines evaluation criteria, and balances data diversity according to model competence, forming a curriculum-guided feedback loop that aligns learning complexity with model capability. Built upon this principle, we develop a multi-dimensional adaptive QA framework that jointly evolves question generation and answer evaluation across perception, recognition, and understanding dimensions, ensuring coherent and measurable curriculum progression. Through this integration, CurEvo transforms weakly controlled self-evolution into a more structured learning process for autonomous video understanding. Across seven backbones, CurEvo consistently improves both benchmark accuracy and evaluator-based semantic score on four VideoQA benchmarks, validating the effectiveness of curriculum-guided self-evolution for video understanding.
[196] Learning Sparse BRDF Measurement Samples from Image
Wen Cao
Main category: cs.CV
Abstract: Accurate BRDF acquisition is important for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small number of BRDF measurements that are most useful for reconstructing material appearance under a learned reflectance prior. Our method combines a set encoder for sparse coordinate-value observations, a pretrained hypernetwork-based BRDF reconstructor, and a differentiable renderer. During sampler training, the reconstructor is kept fixed and gradients from BRDF-space and rendered-image losses are used to optimize measurement locations. This separates sample selection from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. Experiments on the MERL dataset show that the proposed sampler improves low-budget reconstruction quality at 8 and 16 measurements compared with neural reconstruction baselines, while PCA-based methods remain strong at larger budgets. We further analyze the effect of image-space supervision, co-optimization, and image-only latent fitting for unseen materials.
[197] GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehai He, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yijian Lu, Yanzi Wang, Yadong Xue, Xinyu Zhang, Xinyu Liu, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Haozhi Zheng, Haoran Wang, Haochen Li, Fan Yang, Dan Zhang, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowei Jia, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
Main category: cs.CV
Abstract: We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.
[198] TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection
Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller
Main category: cs.CV
Abstract: Recent methods demonstrate that large-scale pretrained models, such as CLIP vision transformers, effectively detect AI-generated images (AIGIs) from unseen generative models when used as feature extractors. Many state-of-the-art methods for AI-generated image detection build upon the original CLIP-ViT to enhance this generalization. Since CLIP’s release, numerous vision foundation models (VFMs) have emerged, incorporating architectural improvements and different training paradigms. Despite these advances, their potential for AIGI detection and AI image forensics remains largely unexplored. In this work, we present a comprehensive benchmark across multiple VFM families, covering diverse pretraining objectives, input resolutions, and model scales. We systematically evaluate their out-of-the-box performance for detecting fully-generated AI-images and AI-inpainted images, and discover that the best model outperforms the original CLIP by more than 12% in accuracy, beating established approaches in the process. To fully leverage the features of a modern VFM, we propose a simple redesign of the classifier head by utilizing tunable attention pooling (TAP), which aggregates output tokens into a refined global representation. Integrating TAP with the latest VFMs yields substantial performance gains across several AIGI detection benchmarks, establishing a new state-of-the-art on two challenging benchmarks for in-the-wild detection of AI-generated and -inpainted images.
[199] Virtual-reality based patient-specific simulation of spine surgical procedures: A fast, highly automated and high-fidelity system for surgical education and planning
Raj Kumar Ranabhat, Tayler D Ross, Tony Jiao, Jeremie Larouche, Joel Finkelstein, Michael Hardisty
Main category: cs.CV
Abstract: Surgical training involves didactic teaching, mentor-led learning, surgical skills laboratories, and direct exposure to surgery; however, increasing clinical pressures have limited operating room (OR) exposure. This work leverages virtual reality (VR) to provide a safe and immersive training environment. Existing VR training is often based on standardized scenarios not tailored to individual clinical cases. This study addresses this limitation using artificial intelligence (AI) based computer vision methods to generate patient-specific simulations from computed tomography (CT) and magnetic resonance imaging (MRI). This study focuses on patient-specific spinal decompression simulation for spinal stenosis in a virtual operating room. The objectives were (1) automatic creation of 3D anatomical models and (2) VR simulation of spinal decompression procedures including laminectomy, disc resection, and foraminotomy. Model construction required multimodal fusion (registration) of CT and MRI and segmentation of relevant structures. Segmentation was evaluated using the Dice Similarity Coefficient (DSC), and registration accuracy using Target Registration Error (TRE). Qualitative feedback was obtained from surgeons and trainees. High-fidelity patient-specific 3D models were generated efficiently (approximately 2.5 minutes per case, N = 15). Segmentation accuracy was high, with a DSC of 0.95 (+/- 0.03) for vertebral bone and 0.895 (+/- 0.02) for soft tissue structures. Registration accuracy showed a mean TRE of 1.73 (+/- 0.42) mm. Semi-structured interviews indicated improved spatial understanding, increased procedural confidence, and strong perceived educational value. This platform significantly reduced the time and costs of patient-specific modelling, thereby facilitating pre-operative planning, post-procedural assessments, and comprehensive surgical simulation.
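The two evaluation metrics named in this abstract have standard definitions, sketched below for reference: the Dice Similarity Coefficient as twice the overlap divided by the total size of two binary masks, and the Target Registration Error as the mean Euclidean distance between corresponding landmarks after registration. The function names and toy inputs are illustrative only; the study's actual evaluation code and landmark selection are not described in the abstract.

```python
import math


def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient for two flattened binary masks."""
    size_a = sum(1 for v in mask_a if v)
    size_b = sum(1 for v in mask_b if v)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * overlap / (size_a + size_b)


def mean_target_registration_error(fixed_points, registered_points):
    """Mean Euclidean distance between corresponding landmark pairs."""
    distances = [math.dist(p, q) for p, q in zip(fixed_points, registered_points)]
    return sum(distances) / len(distances)


# Toy segmentation: one of two predicted voxels overlaps the reference.
dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])

# Toy landmarks in millimetres: one is off by 5 mm, the other matches exactly.
tre = mean_target_registration_error(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)],
)
```

A DSC of 1.0 means identical masks and 0.0 means no overlap, so the reported 0.95 for vertebral bone indicates near-complete agreement; TRE is in the same physical units as the landmark coordinates.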
[200] Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization
Mingbo Hong, Feng Liu, Caroline Gevaert, George Vosselman, Hao Cheng
Main category: cs.CV
Abstract: Detectors often suffer from degraded performance, primarily due to the distributional gap between the source and target domains. This issue is especially evident in single-source domains with limited data, as models tend to rely on confounders (e.g., illumination, co-occurrence, and style) from the source domain, leading to spurious correlations that hinder generalization. To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely Bridge, that incorporates causal inference into object detection. By learning the low-rank bases for front-door adjustment, Bridge blocks confounders’ effects to mitigate spurious correlations, while simultaneously refining representations by filtering redundant and task-irrelevant components. Bridge can be seamlessly integrated with both discriminative (e.g., DINOv2/3, SAM) and generative (e.g., Stable Diffusion) Vision Foundation Models (VFMs). Extensive experiments across multiple domain generalization object detection datasets, i.e., Cross-Camera, Adverse Weather, Real-to-Artistic, Diverse Weather Datasets, and Diverse Weather DroneVehicle (our newly augmented real-world UAV-based benchmark), underscore the superiority of our proposed method over previous state-of-the-art approaches. The project page is available at: https://mingbohong.github.io/Bridge/.
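For readers unfamiliar with front-door adjustment, the textbook formula P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | x', m) P(x') can be evaluated on a toy discrete model as below. This sketch does not reproduce the paper's low-rank bases or its detector; the function name, the mediator variable m, and every probability table here are hypothetical.

```python
def front_door(p_x, p_m_given_x, p_y_given_xm, x, y):
    """Textbook front-door formula for discrete variables.

    P(y | do(x)) = sum_m P(m | x) * sum_x' P(y | x', m) * P(x')
    """
    total = 0.0
    for m, p_m in p_m_given_x[x].items():
        # Average the outcome model over the observational distribution of X.
        inner = sum(p_y_given_xm[(x_prime, m)][y] * p_x[x_prime] for x_prime in p_x)
        total += p_m * inner
    return total


# Hypothetical toy distributions over binary X, M, Y.
p_x = {0: 0.5, 1: 0.5}
p_m_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_given_xm = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.4, 1: 0.6},
    (1, 0): {0: 0.7, 1: 0.3},
    (1, 1): {0: 0.1, 1: 0.9},
}

effect = front_door(p_x, p_m_given_x, p_y_given_xm, x=1, y=1)
print(round(effect, 4))
```

The mediator m carries the effect of x on y, so conditioning on it and re-averaging over x' removes the influence of an unobserved confounder on the x-to-y path.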
[201] Breaking the Rigid Prior: Towards Articulated 3D Anomaly Detection
Jinye Gan, Bozhong Zheng, Xiaohao Xu, Junye Ren, Zixuan Zhang, Na Ni, Yingna Wu
Main category: cs.CV
Abstract: Existing 3D anomaly detection methods are built on a rigid prior: normal geometry is pose-invariant and can be canonicalized through registration or alignment. This prior does not hold for articulated objects with hinge or sliding joints, where valid pose changes induce structured geometric variations that cannot be collapsed to a single canonical template, causing pose-induced deformations to be misidentified as anomalies while true structural defects are obscured. No existing benchmark addresses this challenge. We introduce ArtiAD, the first large-scale benchmark for articulated 3D anomaly detection, comprising 15,229 point clouds across 39 object categories with dense joint-angle variations and six structural anomaly types. Each sample is annotated with its joint configuration and part-level motion labels, enabling explicit disentanglement of pose-induced geometry from structural defects. ArtiAD also provides a seen/unseen articulation split to evaluate both interpolation and extrapolation to novel joint configurations. We propose Shape-Pose-Aware Signed Distance Field (SPA-SDF), a baseline that replaces the rigid prior with a continuous pose-conditioned implicit field, factorized into an articulation-independent structural prior and a Fourier-encoded joint embedding. At inference, the articulation state is recovered by minimizing reconstruction energy, and anomalies are identified as point-wise deviations from the learned manifold. SPA-SDF achieves 0.884 object-level AUROC on seen configurations and 0.874 on unseen configurations, substantially outperforming all rigid-based baselines. Our code and benchmark will be publicly released to facilitate future research.
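A Fourier-encoded joint embedding, as mentioned in this abstract, is commonly built by passing a scalar joint value through sines and cosines at several frequencies. The sketch below shows that generic construction only; the power-of-two frequency schedule, the number of frequencies, and the function name are assumptions and may differ from what SPA-SDF uses.

```python
import math


def fourier_encode(angle, num_frequencies=4):
    """Return [sin(2^k * angle), cos(2^k * angle)] for k = 0 .. num_frequencies - 1."""
    features = []
    for k in range(num_frequencies):
        freq = 2.0 ** k  # assumed power-of-two frequency schedule
        features.append(math.sin(freq * angle))
        features.append(math.cos(freq * angle))
    return features


# A hinge opened to 0.3 rad becomes an 8-dimensional conditioning vector.
embedding = fourier_encode(0.3, num_frequencies=4)
print(len(embedding))
```

Higher frequencies let a network that consumes this vector distinguish nearby joint angles more easily than it could from the raw scalar.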
[202] Uncertainty-Aware Pedestrian Attribute Recognition via Evidential Deep Learning
Zhuofan Lou, Shihang Zhang, Fangle Zhu, Shengjie Ye, Pingyu Wang
Main category: cs.CV
Abstract: We propose UAPAR, an Uncertainty-Aware Pedestrian Attribute Recognition framework. To the best of our knowledge, this is the first EDL-based uncertainty-aware framework for pedestrian attribute recognition (PAR). Unlike conventional deterministic methods, which fail to assess prediction reliability on low-quality samples, UAPAR effectively identifies unreliable predictions and thus enhances system robustness in complex real-world scenarios. To achieve this, UAPAR incorporates Evidential Deep Learning (EDL) into a CLIP-based architecture. Specifically, a Region-Aware Evidence Reasoning module employs cross-attention and spatial prior masks to capture fine-grained local features, which are further processed by an evidence head to estimate attribute-wise epistemic uncertainty. To further enhance training robustness, we develop an uncertainty-guided dual-stage curriculum learning strategy to alleviate the adverse effects of severe label noise during training. Extensive experiments on the PA100K, PETA, RAPv1, and RAPv2 datasets demonstrate that UAPAR achieves competitive or superior performance. Furthermore, qualitative results confirm that the proposed framework generates uncertainty estimates that are predictive of challenging or erroneous samples.
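To make the role of an evidence head concrete, the sketch below follows the common Evidential Deep Learning recipe in which non-negative evidence parameterizes a Dirichlet distribution and uncertainty shrinks as total evidence grows. It is an illustration of that general recipe, not UAPAR's implementation; the function name and the example evidence values are hypothetical.

```python
def evidential_uncertainty(evidence):
    """Map non-negative per-class evidence to (beliefs, expected probabilities, uncertainty).

    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, p_k = alpha_k / S, u = K / S.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    beliefs = [e / strength for e in evidence]
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return beliefs, probs, uncertainty


# Binary attribute (e.g. present vs. absent): a clear sample and an uninformative one.
_, _, u_clear = evidential_uncertainty([18.0, 0.0])
_, _, u_blurry = evidential_uncertainty([0.0, 0.0])
print(u_clear, u_blurry)
```

With no evidence the uncertainty is 1, and the belief masses plus the uncertainty always sum to 1, which is what lets a system flag low-quality samples instead of forcing a confident label.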
[203] SEAL: Semantic-aware Single-image Sticker Personalization with a Large-scale Sticker-tag Dataset
Changhyun Roh, Yonghyun Jeong, Jonghyun Lee, Chanho Eom, Jihyong Oh
Main category: cs.CV
Abstract: Synthesizing a target concept from a single reference image is challenging in diffusion-based personalized text-to-image generation, particularly for sticker personalization where prompts often require explicit attribute edits. With only one reference, test-time fine-tuning (TTF) methods tend to overfit, producing visual entanglement, where background artifacts are absorbed into the learned concept, and structural rigidity, where the model memorizes reference-specific spatial configurations and loses contextual controllability. To address these issues, we introduce SEmantic-aware single-image sticker personALization (SEAL), a plug-and-play, architecture-agnostic adaptation module that integrates into existing personalization pipelines without modifying their U-Net-based diffusion backbones. SEAL applies three components during embedding adaptation: (1) a Semantic-guided Spatial Attention Loss, (2) a Split-merge Token Strategy, and (3) Structure-aware Layer Restriction. To support sticker-domain personalization with attribute-level control, we present StickerBench, a large-scale sticker image dataset with structured tags under a six-attribute schema (Appearance, Emotion, Action, Camera Composition, Style, Background). These annotations provide a consistent interface for varying context while keeping target identity fixed, enabling systematic evaluation of identity disentanglement and contextual controllability. Experiments show that SEAL consistently improves identity preservation while maintaining contextual controllability, highlighting the importance of explicit spatial and structural constraints during test-time adaptation. The code, StickerBench, and project page will be publicly released.
[204] Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark
Fangqiang Fan, Zhicheng Zhao, Xiaoliang Ma, Chenglong Li, Jin Tang
Main category: cs.CV
Abstract: Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, which is, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
[205] AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai
Main category: cs.CV
Abstract: Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. We present AnimateAnyMesh++, a feed-forward framework for text-driven animation of arbitrary 3D meshes with substantial upgrades in data, architecture, and generative capability. First, we expand the DyMesh-XL dataset by mining dynamic content from Objaverse-XL, increasing the number of unique identities from 60K to 300K and substantially broadening category and motion diversity. Second, we redesign DyMeshVAE-Flex with power-law topology-aware attention and vertex-normal enhanced features, which significantly improve trajectory reconstruction and local geometry preservation and mitigate trajectory-sticking artifacts. Third, we introduce architectural changes to both DyMeshVAE-Flex and the rectified-flow (RF) generator to support variable-length sequence training and generation, enabling longer animations while preserving reconstruction fidelity. Extensive experiments demonstrate that AnimateAnyMesh++ generates semantically accurate and temporally coherent mesh animations within seconds, surpassing prior approaches in quality and efficiency. The enlarged DyMesh-XL, the upgraded DyMeshVAE-Flex, and variable-length RF together deliver consistent gains across benchmarks and in-the-wild meshes. We will release code, models, and the expanded DyMesh-XL upon acceptance of this manuscript to facilitate research in 4D content creation.
[206] Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
David Novikov, Eilon Vaknin, Narek Tumanyan, Mark Sheinin
Main category: cs.CV
Abstract: The task of capturing and rendering 3D dynamic scenes from 2D images has become increasingly popular in recent years. However, most conventional cameras are bandwidth-limited to 30-60 FPS, restricting these methods to static or slowly evolving scenes. While overcoming bandwidth limitations is difficult for general scenes, recent years have seen a flurry of computational imaging methods that yield high-speed videos using conventional cameras for specific applications (e.g., motion capture and particle image velocimetry). However, most of these methods require modifications to a camera’s optics or the addition of mechanically moving components, limiting them to a single-view high-speed capture. Consequently, these methods cannot be readily used to capture a 3D representation of rapid scene motion. In this paper, we propose a novel method to capture and reconstruct a volumetric representation of a high-speed scene using only unaugmented low-speed cameras. Instead of modifying the hardware or optics of each individual camera, we encode high-speed scene dynamics by illuminating the scene with a rapid, sequential color-coded sequence. This results in simultaneous multi-view capture of the scene, where high-speed temporal information is encoded in the spatial intensity and color variations of the captured images. To construct a high-speed volumetric representation of the dynamic scene, we develop a novel dynamic Gaussian Splatting-based approach that decodes the temporal information from the images. We evaluate our approach on simulated scenes and real-world experiments using a multi-camera imaging setup, showing first-of-a-kind high-speed volumetric scene reconstructions.
[207] World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning
Wanyue Zhang, Wenxiang Wu, Wang Xu, Jiaxin Luo, Helu Zhi, Yibin Huang, Shuo Ren, Zitao Liu, Jiajun Zhang
Main category: cs.CV
Abstract: Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
[208] ProcFunc: Function-Oriented Abstractions for Procedural 3D Generation in Python
Alexander Raistrick, Karhan Kayan, Jack Nugent, David Yan, Lingjie Mei, Meenal Parakh, Hongyu Wen, Dylan Li, Yiming Zuo, Erich Liang, Jia Deng
Main category: cs.CV
Abstract: We introduce ProcFunc, a library for Blender-based procedural 3D generation in Python. ProcFunc provides a library of easy-to-use Python functions, which streamline creating, combining, analyzing, and executing procedural generation code. ProcFunc makes it easy to create large-scale diverse training data, by combinatorial compositions of semantic components. VLMs can use ProcFunc to edit procedural material and geometry code and can create new procedural code with significantly fewer coding errors. Finally, as an example use case, we use ProcFunc to develop a new procedural generator of indoor rooms, which includes a collection of new compositional procedural materials. We demonstrate the detail, runtime efficiency, and diversity of this room generator, as well as its use for 3D synthetic data generation. Please visit https://github.com/princeton-vl/procfunc for source code.
[209] Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation
Wanrong Zheng, Yunhao Ge, Laurent Itti
Main category: cs.CV
Abstract: Breakthrough progress in vision-based navigation through unknown environments has been achieved by using multimodal large language models (MLLMs). These models can plan a sequence of motions by evaluating the current view at each time step against the task and goal given to the agent. However, current zero-shot Vision-and-Language Navigation (VLN) agents powered by MLLMs still tend to drift off course, halt prematurely, and achieve low overall success rates. We propose Three-Step Nav to counteract these failures with a three-view protocol: First, “look forward” to extract global landmarks and sketch a coarse plan. Then, “look now” to align the current visual observation with the next sub-goal for fine-grained guidance. Finally, “look backward” to audit the entire trajectory and correct accumulated drift before stopping. Requiring no gradient updates or task-specific fine-tuning, our planner drops into existing VLN pipelines with minimal overhead. Three-Step Nav achieves state-of-the-art zero-shot performance on the R2R-CE and RxR-CE datasets. Our code is available at https://github.com/ZoeyZheng0/3-step-Nav.
[210] OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users
Zhangcun Yan, Jianqiang Li, Peng Hang, Jian Sun
Main category: cs.CV
Abstract: With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing the diversity and dynamics of VRU behaviors, making it difficult to meet the research demands of complex traffic environments. To address this gap, this study developed the OnSiteVRU datasets, which cover a variety of scenarios, including intersections, road segments, and urban villages. These datasets provide trajectory data for motor vehicles, electric bicycles, and human-powered bicycles, totaling approximately 17,429 trajectories with a precision of 0.04 seconds. The datasets integrate both aerial-view natural driving data and onboard real-time dynamic detection data, along with environmental information such as traffic signals, obstacles, and real-time maps, enabling a comprehensive reconstruction of interaction events. The results demonstrate that VRU_Data outperforms traditional datasets in terms of VRU density and scene coverage, offering a more comprehensive representation of VRU behavioral characteristics. This provides critical support for traffic flow modeling, trajectory prediction, and autonomous driving virtual testing. The dataset is publicly available for download at: https://www.kaggle.com/datasets/zcyan2/mixed-traffic-trajectory-dataset-in-from-shanghai.
[211] FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
Zheng Liu, Mengjie Liu, Jingzhou Chen, Jingwei Xu, Bin Cui, Conghui He, Wentao Zhang
Main category: cs.CV
Abstract: We introduce FLARE, a family of vision language models (VLMs) with a full vision-language alignment and integration paradigm. Unlike existing approaches that rely on single MLP projectors for modality alignment and defer cross-modal interaction to LLM decoding, FLARE achieves deep, dynamic integration throughout the pipeline. Our key contributions include: (1) Text-Guided Vision Encoding that incorporates textual information during vision encoding to achieve pixel-level alignment; (2) Context-Aware Alignment Decoding that aggregates visual features conditioned on textual context during decoding for query-level integration; (3) Dual-Semantic Mapping Loss to supervise feature mapping from both modalities and enable modality-level bridging; and (4) Text-Driven VQA Synthesis that leverages high-quality text to generate VQA pairs and synthesize corresponding images, enabling data-level optimization. We train FLARE at 3B and 8B scales under both fixed and dynamic resolution settings, demonstrating that our full-modality alignment significantly outperforms existing methods while maintaining strong generalizability. FLARE 3B surpasses Cambrian-1 8B and Florence-VL 8B using only 630 vision tokens. Ablation studies reveal that FLARE achieves superior performance over existing methods with minimal computational cost. Even without dynamic resolution, FLARE outperforms LLaVA-NeXT, validating the effectiveness of our approach. We release our code, model weights, and dataset at https://github.com/starriver030515/FLARE.
[212] ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2505.20032 failed with HTTP 429).
[213] ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
Rui Xu, Jiepeng Wang, Hao Pan, Yang Liu, Xin Tong, Shiqing Xin, Changhe Tu, Taku Komura, Wenping Wang
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2405.13729 failed with HTTP 429).
[214] RetroMotion: Retrocausal Motion Forecasting Models are Instructable
Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Jaime Villa, Yinzhe Shen, Carlos Fernandez, Christoph Stiller
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2505.20414 failed with HTTP 429).
[215] Time Blindness: Why Video-Language Models Can’t See What Humans Can?
Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2505.24867 failed with HTTP 429).
[216] Uncertainty-Aware Information Pursuit for Interpretable and Reliable Medical Image Analysis
Md Nahiduzzaman, Steven Korevaar, Zongyuan Ge, Feng Xia, Alireza Bab-Hadiashar, Ruwan Tennakoon
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2506.16742 failed with HTTP 429).
[217] Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation
Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger H.J. Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, Marc Aubreville, Christof A. Bertram
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2506.21444 failed with HTTP 429).
[218] StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
Haolin Yang, Feilong Tang, Lingxiao Zhao, Xinlin Zhuang, Yifan Lu, Xiang An, Ming Hu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Muhammad Haris Khan, Imran Razzak
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2508.01875 failed with HTTP 429).
[219] GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
Fengyi Wu, Yifei Dong, Yilong Dai, Guangyu Chen, Qifeng Wu, Huiting Huang, Hang Wang, Qi Dai, Alexander G. Hauptmann, Zhi-Qi Cheng
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2508.09547 failed with HTTP 429).
[220] Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection
Canhui Tang, Sanping Zhou, Haoyue Shi, Le Wang
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2509.11058 failed with HTTP 429).
[221] Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution
Jinpei Guo, Yifei Ji, Shengwei Wang, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Baiang Li, Jusheng Zhang, Yulun Zhang, Jian Wang
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2509.23980 failed with HTTP 429).
[222] A Multimodal Depth-Aware Method For Embodied Reference Understanding
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2510.08278 failed with HTTP 429).
[223] SkyReels-Text: Fine-Grained Font-Controllable Text Editing for Poster Design
Yunjie Yu, Jingchen Wu, Junchen Zhu, Chunze Lin, Guibin Chen
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2511.13285 failed with HTTP 429).
[224] Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention
Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Zhixing Tan
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2511.20032 failed with HTTP 429).
[225] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2511.20714 failed with HTTP 429).
[226] Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records
Shiyu Shen, Zhe Gao, Taifeng Chai, Yang Huang, Bin Pan
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2511.22958 failed with HTTP 429).
[227] Value-Guided Iterative Refinement and the DIQ-H Benchmark for Evaluating VLM Robustness
Hanwen Wan, Zexin Lin, Yixuan Deng, Xiaoqiang Ji
Main category: cs.CV
Abstract: Unavailable (summary fetch for arXiv:2512.03992 failed with HTTP 429).
[228] GNC-Pose: Geometry-Aware GNC-PnP for Accurate 6D Pose Estimation
Xiujin Liu
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2512.06565 failed with HTTP 429.
[229] Consist-Retinex: One-Step Noise-Emphasized Consistency Training Accelerates High-Quality Retinex Enhancement
Jian Xu, Wei Chen, Shigui Li, Delu Zeng, John Paisley, Qibin Zhao
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2512.08982 failed with HTTP 429.
[230] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2512.10959 failed with HTTP 429.
[231] q3-MuPa: Quick, Quiet, Quantitative Multi-Parametric MRI using Physics-Informed Diffusion Models
Shishuai Wang, Florian Wiesinger, Noemi Sgambelluri, Carolin Pirkl, Stefan Klein, Juan A. Hernandez-Tamames, Dirk H.J. Poot
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2512.23726 failed with HTTP 429.
[232] Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance
Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2512.18365 failed with HTTP 429.
[233] The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, Yabiao Wang
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2512.20340 failed with HTTP 429.
[234] Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning
Hongbo Bai, Yujin Zhou, Yile Wu, Chi-Min Chan, Pengcheng Wen, Kunhao Pan, Sirui Han, Yike Guo
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2601.13942 failed with HTTP 429.
[235] Perception Test 2025: Challenge Summary and a Unified VQA Extension
Joseph Heyward, Nikhil Parthasarathy, Tyler Zhu, Aravindh Mahendran, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2601.06287 failed with HTTP 429.
[236] Bridging Visual and Wireless Sensing via a Unified Radiation Field for 3D Radio Map Construction
Chaozheng Wen, Jingwen Tong, Zehong Lin, Chenghong Bian, Jun Zhang
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2601.19216 failed with HTTP 429.
[237] ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
Zheng Liu, Honglin Lin, Chonghan Qin, Xiaoyang Wang, Xin Gao, Yu Li, Mengzhang Cai, Yun Zhu, Zhanping Zhong, Qizhi Pei, Zhuoshi Pan, Xiaoran Shang, Bin Cui, Conghui He, Wentao Zhang, Lijun Wu
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2601.13606 failed with HTTP 429.
[238] Foundation Model-Driven Semantic Change Detection in Remote Sensing Imagery
Hengtong Shen, Li Yan, Hong Xie, Yaxuan Wei, Xinhao Li, Wenfei Shen, Peixian Lv, Fei Tan
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2602.13780 failed with HTTP 429.
[239] COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data
Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.03239 failed with HTTP 429.
[240] Video Compression Meets Video Generation: Latent Inter-Frame Pruning with Attention Recovery
Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.05811 failed with HTTP 429.
[241] OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.05959 failed with HTTP 429.
[242] Assessing the Utility of Volumetric Motion Fields for Radar-based Precipitation Nowcasting with Physics-informed Deep Learning
Peter Pavlík, Anna Bou Ezzeddine, Viera Rozinajová
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.13589 failed with HTTP 429.
[243] HumanOmni-Speaker: Identifying Who said What and When
Detao Bai, Shimin Yao, Weixuan Chen, Zhiheng Ma, Xihan Wei, Jingren Zhou
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.21664 failed with HTTP 429.
[244] Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers
Kawtar Zaher, Olivier Buisson, Alexis Joly
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.00809 failed with HTTP 429.
[245] NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results
Shuhong Liu, Chenyu Bao, Ziteng Cui, Xuangeng Chu, Bin Ren, Lin Gu, Xiang Chen, Mingrui Li, Long Ma, Marcos V. Conde, Radu Timofte, Yun Liu, Ryo Umagami, Tomohiro Hashimoto, Zijian Hu, Yuan Gan, Tianhan Xu, Yusuke Kurose, Tatsuya Harada, Junwei Yuan, Gengjia Chang, Xining Ge, Mache You, Qida Cao, Zeliang Li, Xinyuan Hu, Hongde Gu, Changyue Shi, Jiajun Ding, Zhou Yu, Jun Yu, Seungsang Oh, Fei Wang, Donggun Kim, Zhiliang Wu, Seho Ahn, Xinye Zheng, Kun Li, Yanyan Wei, Weisi Lin, Dizhe Zhang, Yuchao Chen, Meixi Song, Hanqing Wang, Haoran Feng, Lu Qi, Jiaao Shan, Yang Gu, Jiacheng Liu, Shiyu Liu, Kui Jiang, Junjun Jiang, Runyu Zhu, Sixun Dong, Qingxia Ye, Zhiqiang Zhang, Zhihua Xu, Zhiwei Wang, Phan The Son, Zhimiao Shi, Zixuan Guo, Xueming Fu, Lixia Han, Changhe Liu, Zhenyu Zhao, Manabu Tsukada, Zheng Zhang, Zihan Zhai, Tingting Li, Ziyang Zheng, Yuhao Liu, Dingju Wang, Jeongbin You, Younghyuk Kim, Il-Youp Kwak, Mingzhe Lyu, Junbo Yang, Wenhan Yang, Hongsen Zhang, Jinqiang Cui, Hong Zhang, Haojie Guo, Hantang Li, Qiang Zhu, Bowen He, Xiandong Meng, Debin Zhao, Xiaopeng Fan, Wei Zhou, Linzhe Jiang, Linfeng Li, Louzhe Xu, Qi Xu, Hang Song, Chenkun Guo, Weizhi Nie, Yufei Li, Xingan Zhan, Zhanqi Shi, Dufeng Zhang
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.04135 failed with HTTP 429.
[246] At FullTilt: Real-Time Open-Set 3D Macromolecule Detection Directly from Tilted 2D Projections
Ming-Yang Ho, Alberto Bartesaghi
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.10766 failed with HTTP 429.
[247] Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach
Adrien Dorise, Marjorie Bellizzi, Omar Hlimi
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.12807 failed with HTTP 429.
[248] Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
Zhenggang Tang, Yuehao Wang, Yuchen Fan, Jun-Kun Chen, Yu-Ying Yeh, Kihyuk Sohn, Zhangyang Wang, Qixing Huang, Alexander Schwing, Rakesh Ranjan, Dilin Wang, Zhicheng Yan
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.16552 failed with HTTP 429.
[249] Incoherent Deformation, Not Capacity: Diagnosing and Mitigating Overfitting in Dynamic Gaussian Splatting
Ahmad Droby
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.16747 failed with HTTP 429.
[250] Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning
Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.16925 failed with HTTP 429.
[251] Causal Disentanglement for Full-Reference Image Quality Assessment
Zhen Zhang, Jielei Chu, Tian Zhang, Fengmao Lv, Tianrui Li
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.21654 failed with HTTP 429.
[252] CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.22274 failed with HTTP 429.
[253] FILTR: Extracting Topological Features from Pretrained 3D Models
Louis Martinez, Maks Ovsjanikov
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.22334 failed with HTTP 429.
[254] Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
Oussama Bouanani, Jim Berend, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.22477 failed with HTTP 429.
[255] PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
Jiaxin Shi, Guofeng Zhang, Wufei Ma, Naifu Liang, Adam Kortylewski, Alan Yuille
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.22658 failed with HTTP 429.
[256] PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms
Laurenz Reichardt, Nikolas Ebert, Oliver Wasenmüller
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.24169 failed with HTTP 429.
[257] An Affordable, Wearable Stereo-Eye-Tracking Platform
Alexander Zimmer, Yasmeen Abdrabou, Enkelejda Kasneci
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.24331 failed with HTTP 429.
[258] CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.24013 failed with HTTP 429.
[259] ViPO: Visual Preference Optimization at Scale
Ming Li, Jie Wu, Justin Cui, Xiaojie Li, Rui Wang, Chen Chen
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.24953 failed with HTTP 429.
[260] Image Compression with Bubble-Aware Frame Rate Adaptation for Energy-Efficient Video Capsule Endoscopy
Oliver Bause, Jörg Gamerdinger, Julia Werner, Oliver Bringmann
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.25464 failed with HTTP 429.
[261] The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation
Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh, Lukas Schott, Niclas Popp, Thomas Brox
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.25530 failed with HTTP 429.
[262] TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media
Fuchen Zheng, Chengpei Xu, Long Ma, Weixuan Li, Junhua Zhou, Xuhang Chen, Weihuang Liu, Haolun Li, Quanjun Li, Zhenxi Zhang, Lei Zhao, Chi-Man Pun, Shoujun Zhou
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2604.25545 failed with HTTP 429.
[263] Real-time Global Illumination for Dynamic 3D Gaussian Scenes
Chenxiao Hu, Meng Gai, Guoping Wang, Sheng Li
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2503.17897 failed with HTTP 429.
[264] R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation
Xiuwei Xu, Angyuan Ma, Hankun Li, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2510.08547 failed with HTTP 429.
[265] Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation
Jan Finke, Wayne Paul Martis, Adrian Schmelter, Lars Erbach, Christian Jestel, Marvin Wiedemann
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.01999 failed with HTTP 429.
[266] A Diffeomorphism Groupoid and Algebroid Framework for Discontinuous Image Registration
Lili Bao, Bin Xiao, Shihui Ying, Stefan Sommer
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.11806 failed with HTTP 429.
[267] FASTER: Rethinking Real-Time Flow VLAs
Yuxiang Lu, Zhe Liu, Xianzhe Fan, Zhenya Yang, Jinghua Hou, Junyi Li, Kaixin Ding, Hengshuang Zhao
Main category: cs.CV
Abstract: Unavailable; the summary fetch for arXiv:2603.19199 failed with HTTP 429.
cs.AI
[268] Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital
T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau
Main category: cs.AI
Abstract: We study reliability in autonomous language-model agents that translate user mandates into validated tool actions under real capital. The setting is DX Terminal Pro, a 21-day deployment in which 3,505 user-funded agents traded real ETH in a bounded onchain market. Users configured vaults through structured controls and natural-language strategies, but only agents could choose normal buy/sell trades. The system produced 7.5M agent invocations, roughly 300K onchain actions, about $20M in volume, more than 5,000 ETH deployed, roughly 70B inference tokens, and 99.9% settlement success for policy-valid submitted transactions. Long-running agents accumulated thousands of sequential decisions, including 6,000+ prompt-state-action cycles for continuously active agents, yielding a large-scale trace from user mandate to rendered prompt, reasoning, validation, portfolio state, and settlement. Reliability did not come from the base model alone; it emerged from the operating layer around the model: prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability. Pre-launch testing exposed failures that text-only benchmarks rarely measure, including fabricated trading rules, fee paralysis, numeric anchoring, cadence trading, and misread tokenomics. Targeted harness changes reduced fabricated sell rules from 57% to 3%, reduced fee-led observations from 32.5% to below 10%, and increased capital deployment from 42.9% to 78.0% in an affected test population. We show that capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
[269] Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
Yiwei Shi, Zixing Song, Mengyue Yang, Cunjia Liu, Weiru Liu
Main category: cs.AI
Abstract: Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer latent field parameters under strict time constraints. The core challenge lies in the belief-space objective: valid uncertainty estimation requires expensive Bayesian inference, whereas using a fast learned belief model leads to reward hacking, in which the policy exploits approximation errors rather than actually reducing uncertainty. We propose Distill-Belief, a teacher–student framework that decouples correctness from efficiency. A Bayes-correct particle-filter teacher maintains the posterior and supplies a dense information-gain signal, while a compact student distills the posterior into belief statistics for control and an uncertainty certificate for stopping. At deployment, only the student is used, yielding constant per-step cost. Experiments on seven field modalities and two stress tests show that Distill-Belief consistently reduces sensing cost and improves success, posterior contraction, and estimation accuracy over baselines, while mitigating reward hacking.
[270] Evaluating Strategic Reasoning in Forecasting Agents
Tom Liptay, Dan Schwarz, Rafael Poyiadzi, Jack Wildman, Nikos I. Bosse
Main category: cs.AI
Abstract: Forecasting benchmarks produce accuracy leaderboards but little insight into why some forecasters are more accurate than others. We introduce Bench to the Future 2 (BTF-2), 1,417 pastcasting questions with a frozen 15M-document research corpus in which agents reproducibly research and forecast offline, producing full reasoning traces. BTF-2 detects accuracy differences of 0.004 Brier score, and can distinguish differential agent strengths in research vs. judgment. We build a forecaster 0.011 Brier more accurate than any single frontier agent, and use it to evaluate agent strategic reasoning without hindsight bias. We find the better forecaster differs primarily in its pre-mortem analysis of its blind spots and consideration of black swans. Expert human forecasters found the dominant strategic reasoning failures of frontier agents are in assessing political and business leaders’ incentives, judging their likelihood to follow through on stated plans, and modeling institutional processes.
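For readers unfamiliar with the metric, the Brier score mentioned above can be sketched as follows. This is a minimal illustration that assumes binary questions and the common two-outcome form (mean squared difference between the forecast probability and the 0/1 outcome); the abstract does not state which variant BTF-2 uses. The forecasts and outcomes below are hypothetical and are not taken from the benchmark.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Assumes each outcome is 0 or 1 and each forecast is a probability in [0, 1].
    Lower is better; always answering 0.5 scores 0.25.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical resolved questions and two hypothetical forecasters.
outcomes = [1, 0, 1, 0]
forecaster_a = [0.9, 0.2, 0.7, 0.4]
forecaster_b = [0.8, 0.3, 0.6, 0.5]

score_a = brier_score(forecaster_a, outcomes)
score_b = brier_score(forecaster_b, outcomes)
print(f"A: {score_a:.3f}  B: {score_b:.3f}  gap: {score_b - score_a:.3f}")
```

Under this definition, a gap of 0.004 or 0.011 between two forecasters is small relative to the 0.25 scored by a constant 0.5 forecast, which is why detecting it needs many questions.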
[271] Hierarchical Multi-Persona Induction from User Behavioral Logs: Learning Evidence-Grounded and Truthful Personas
Nayoung Choi, Haeyu Jeong, Changbong Kim, Hongjun Lim, Jinho D. Choi
Main category: cs.AI
Abstract: Behavioral logs provide rich signals for user modeling, but are noisy and interleaved across diverse intents. Recent work uses LLMs to generate interpretable natural-language personas from user logs, yet evaluation often emphasizes downstream utility, providing limited assurance of persona quality itself. We propose a hierarchical framework that aggregates user actions into intent memories and induces multiple evidence-grounded personas by clustering and labeling these memories. We formulate persona induction as an optimization problem over persona quality (captured by cluster cohesion, persona-evidence alignment, and persona truthfulness) and train the persona model using a groupwise extension of Direct Preference Optimization (DPO). Experiments on a large-scale service log and two public datasets show that our method induces more coherent, evidence-grounded, and trustworthy personas, while also improving future interaction prediction.
[272] OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms
Jeremy Nixon, Annika Singh
Main category: cs.AI
Abstract: To automate AI research, we introduce a full, end-to-end framework, OMEGA: Optimizing Machine learning by Evaluating Generated Algorithms, that starts at idea generation and ends with executable code. Our system combines structured meta-prompt engineering with executable code generation to create new ML classifiers. The OMEGA framework has been used to generate several novel algorithms that outperform scikit-learn baselines across a robust selection of 20 benchmark datasets (infinity-bench). The models discussed in this paper, and more, are available in the Python package: pip install omega-models.
[273] AGEL-Comp: A Neuro-Symbolic Framework for Compositional Generalization in Interactive Agents
Mahnoor Shahid, Hannes Rothe
Main category: cs.AI
Abstract: Large Language Model (LLM)-based agents exhibit systemic failures in compositional generalization, limiting their robustness in interactive environments. This work introduces AGEL-Comp, a neuro-symbolic AI agent architecture designed to address this challenge by grounding the agent’s actions. AGEL-Comp integrates three core innovations: (1) a dynamic Causal Program Graph (CPG) as a world model, representing procedural and causal knowledge as a directed hypergraph; (2) an Inductive Logic Programming (ILP) engine that synthesizes new Horn clauses from experiential feedback, grounding symbolic knowledge through interaction; and (3) a hybrid reasoning core where an LLM proposes a set of candidate sub-goals that are verified for logical consistency by a Neural Theorem Prover (NTP). Together, these components operationalize a deduction–abduction learning cycle, enabling the agent to deduce plans and abductively expand its symbolic world model, while a neural adaptation phase keeps its reasoning engine aligned with new knowledge. We propose an evaluation protocol within the Retro Quest simulation environment that probes compositional generalization scenarios to evaluate our AGEL agent. Our findings indicate that our AGEL model outperforms pure LLM-based models. Our framework presents a principled path toward agents that build an explicit, interpretable, and compositionally structured understanding of their world.
[274] Persuadability and LLMs as Legal Decision Tools
Oisin Suttle, David Lillis
Main category: cs.AI
Abstract: As Large Language Models (LLMs) are proposed as legal decision assistants, and even first-instance decision-makers, across a range of judicial and administrative contexts, it becomes essential to explore how they answer legal questions, and in particular the factors that lead them to decide difficult questions in one way or another. A specific feature of legal decisions is the need to respond to arguments advanced by contending parties. A legal decision-maker must be able to engage with, and respond to, including through being potentially persuaded by, arguments advanced by the parties. Conversely, they should not be unduly persuadable, influenced by a particularly compelling advocate to decide cases based on the skills of the advocates, rather than the merits of the case. We explore how frontier open- and closed-weights LLMs respond to legal arguments, reporting original experimental results examining how the quality of the advocate making those arguments affects the likelihood that a model will agree with a particular legal point of view, and exploring the factors driving these results. Our results have implications for the feasibility of adopting LLMs across legal and administrative settings.
[275] Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang
Main category: cs.AI
Abstract: Operating and maintaining (O&M) large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. We present Bian Que, an agentic framework with three contributions: (i) a unified operational paradigm abstracting day-to-day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) Flexible Skill Arrangement, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be automatically generated and updated by LLMs or iteratively refined through natural-language instructions from on-call engineers; (iii) a unified self-evolving mechanism in which one correction signal drives two parallel pathways, case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, the major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, and cuts mean time to resolution by over 50%. Our framework achieves 99.0% pass rate on offline evaluations. Our code is available at https://github.com/benchen4395/BianQue_Assistant.
[276] Apriori-based Analysis of Learned Helplessness in Mathematics Tutoring: Behavioral Patterns by Level, Intervention, and Outcome
John Paul P. Miranda
Main category: cs.AI
Abstract: This study applied the Apriori algorithm to analyze behavioral interaction patterns associated with learned helplessness (LH) in mathematics tutoring system logs. Interaction data were examined across three dimensions: LH level (low vs. high), system-based intervention (with vs. without), and problem-solving outcomes (solved vs. unsolved). The analysis of the complete dataset showed that skipping problems without using hints was the most frequent pattern linked to unsolved outcomes, while persistence behaviors such as not skipping were less dominant overall. Comparisons by LH level showed that low-LH students had stronger links between problem solving and not skipping, as well as positive associations between hint use and solved outcomes. High-LH students showed more avoidance patterns, with skipping strongly tied to unsolved outcomes. In the comparison of system-based intervention conditions, students without intervention had the highest lift for persistence-success links, while the with-intervention group had stronger patterns involving skipping behaviors leading to unsolved outcomes. Outcome-specific analysis showed that not skipping was consistently associated with solved problems across all groups, while skipping without hints predicted unsolved outcomes. Practical implications and recommendations are discussed.
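The abstract above compares rules by lift. As a rough illustration only, the sketch below computes support, confidence, and lift for a single rule over a toy interaction log. The item names (skip, no_hint, unsolved, and so on) and the four example attempts are hypothetical, not data from the study, and this is only the rule-scoring step, not the full Apriori candidate search.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift > 1 means the consequent is more likely when the antecedent holds.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical toy log: each set describes one problem attempt.
toy_log = [
    frozenset({"skip", "no_hint", "unsolved"}),
    frozenset({"skip", "no_hint", "unsolved"}),
    frozenset({"no_skip", "hint", "solved"}),
    frozenset({"no_skip", "no_hint", "solved"}),
]

print(rule_metrics(toy_log, {"skip", "no_hint"}, {"unsolved"}))
```

On this toy log the rule "skip without hint implies unsolved" has a lift of 2.0, meaning unsolved outcomes are twice as frequent among such attempts as in the log overall.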
[277] DreamProver: Evolving Transferable Lemma Libraries via a Wake-Sleep Theorem-Proving Agent
Youyuan Zhang, Jialiang Sun, Hangrui Bi, Chuqin Geng, Wenjie Ma, Zhaoyu Li, Xujie Si
Main category: cs.AI
Abstract: We introduce DreamProver, an agentic framework that leverages a “wake-sleep” program induction paradigm to discover reusable lemmas for formal theorem proving. Existing approaches either rely on fixed lemma libraries, which limit adaptability, or synthesize highly specific intermediate lemmas tailored to individual theorems, thereby lacking generality. DreamProver addresses this gap through an iterative two-stage process. In the wake stage, DreamProver attempts to prove theorems from a training set using the current lemma library while proposing new candidate lemmas. In the “sleep” stage, it abstracts, refines, and consolidates these candidates to compress and optimize the library. Through this alternating cycle, DreamProver progressively evolves a compact set of high-level, transferable lemmas that can be effectively used to prove unseen theorems in related domains. Experimental results demonstrate that DreamProver substantially improves proof success rates across a diverse set of mathematical benchmarks, while also producing more concise proofs and reducing computational cost.
[278] Auto-Relational Reasoning
Ioannis Konstantoulas, Dimosthenis Tsimas, Pavlos Peppas, Kyriakos Sgarbas
Main category: cs.AI
Abstract: Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, demonstrating diminishing returns, and still lack solid reasoning abilities. These limits could be surpassed through a synergistic combination of machine learning scalability and rigid reasoning. Methods: In this work, we propose a theoretical framework for reasoning through object-relations in an automated manner, integrated with Artificial Neural Networks. We present a formal analysis of the Reasoning, and we show the theory in practice through a paradigm integrating Reasoning and machine learning. Results: This paradigm is a system that solves Intelligence Quotient problems without any prior knowledge of the problem. Our system achieves a 98.03% solving rate, corresponding to the top 1% or an IQ score of 132-144. This result is limited only by the small size of the model and the processing capabilities of the machine it runs on. Conclusions: With the integration of prior knowledge in the system and the expansion of the dataset, the system can be generalized to solve a large category of problems. The functionality of the system inherently favors the solution of such problems in few-shot or zero-shot attempts.
[279] Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems
Mahnoor Shahid, Hannes Rothe
Main category: cs.AI
Abstract: Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will emerge as a byproduct of successful symbol grounding. This work presents the first systematic empirical analysis to challenge this assumption by disentangling the contributions of grounding and reasoning. To operationalize this investigation, we introduce the Iterative Logic Tensor Network (iLTN), a fully differentiable architecture designed for multi-step deduction. Using a formal taxonomy of generalization (probing for novel entities, unseen relations, and complex rule compositions), we demonstrate that a model trained solely on a grounding objective fails to generalize. In contrast, our full iLTN, trained jointly on perceptual grounding and multi-step reasoning, achieves high zero-shot accuracy across all tasks. Our findings provide conclusive evidence that symbol grounding, while necessary, is insufficient for generalization, establishing that reasoning is not an emergent property but a distinct capability that requires an explicit learning objective.
[280] Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
Mahiro Nakao, Kazuhiro Takemoto
Main category: cs.AI
Abstract: Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
[281] Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
Jatin Bhusal, Nancy Mahatha, Aayush Acharya, Raunak Regmi
Main category: cs.AI
Abstract: As Competency-Based Education (CBE) gains traction around the world, the shift from marks-based assessment to qualitative competency mapping poses a manual challenge for educators. This paper tackles this bottleneck by proposing a “Human-in-the-Loop” benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisting of the open-weight models Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) and the proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked “Architecture-compatibility gap”. Although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved “Fair Agreement” (kappa_w ~ 0.38), the larger Orion (70B) model exhibited “No Agreement” (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs the scale of raw parameters in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a “Human-in-the-Loop” framework.
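The agreement figures above are reported as kappa_w, a weighted Cohen's kappa. The abstract does not say which weighting scheme was used, so the sketch below assumes quadratic weights. It is a minimal, pure-Python illustration of how such a score can be computed for two raters on an ordinal scale, and the function name and example ratings are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's weighted kappa with quadratic weights (an assumed scheme).

    Ratings are integers in range(num_levels). The result is undefined
    (division by zero) if both raters use one and the same single level.
    """
    n = len(rater_a)
    # Observed counts of (rating by A, rating by B) pairs.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row = [sum(observed[i]) for i in range(num_levels)]
    col = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    weighted_observed = weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            w = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += w * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings on a 3-level rubric.
print(weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], num_levels=3))
```

Identical ratings give 1.0, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.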
[282] A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger
Main category: cs.AI
Abstract: Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents’ observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap, a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
[283] When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
Zhimin Lin, Yixin Ji, Jinpeng Li, Yu Luo, Dong Li, Junhua Fang, Juntao Li, Min Zhang
Main category: cs.AI
Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical reasoning tasks but remain unreliable on challenging instances. Existing test-time scaling methods, such as repeated sampling, self-correction, and tree search, improve performance at the cost of increased computation, yet often exhibit diminishing returns on hard problems. We observe that output disagreement is strongly correlated with instance difficulty and prediction correctness, providing a useful signal for guiding instance-level strategy selection at test time. Based on this insight, we propose a training-free framework that formulates test-time scaling as an instance-level routing problem: rather than allocating more computation within a single strategy, it dynamically selects among different scaling strategies based on output disagreement. The framework applies lightweight resolution for consistent cases, majority voting for moderate disagreement, and rewriting-based reformulation for highly ambiguous instances. Experiments on seven mathematical benchmarks and three models show that our method improves accuracy by 3% to 7% while reducing sampling cost compared to existing approaches.
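The three-way routing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the disagreement measure (share of samples that differ from the most common answer) and the thresholds `low` and `high` are hypothetical choices, since the abstract does not specify either.

```python
from collections import Counter


def disagreement(answers):
    """Fraction of sampled answers that differ from the most common answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def route(answers, low=0.2, high=0.6):
    """Pick a test-time strategy from the disagreement among sampled answers.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    d = disagreement(answers)
    if d <= low:
        return "lightweight_resolution"  # samples are consistent
    if d <= high:
        return "majority_vote"  # moderate disagreement
    return "rewrite"  # highly ambiguous: reformulate the problem


print(route(["4", "4", "4", "4", "4"]))
print(route(["4", "4", "4", "5", "6"]))
print(route(["1", "2", "3", "4", "5"]))
```

The three calls cover the consistent, moderate, and highly ambiguous cases in turn.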
[284] SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data
Dianyu Liu, Chuan Qin, Xi Chen, Xiaohan Li, Wenxi Xu, Yuyang Wang, Xin Chen, Yuanchun Zhou, Hengshu Zhu
Main category: cs.AI
Abstract: AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.
[285] FutureWorld: A Live Environment for Training Predictive Agents with Real-World Outcome Rewards
Zhixin Han, Yanzhi Zhang, Chuyang Wei, Maohang Gao, Xiawei Yue, Kefei Chen, Yu Zhuang, Haoxiang Guan, Jiyan He, Jian Li, Yitong Duan, Yu Shi, Mengting Hu, Shuxin Zheng
Main category: cs.AI
Abstract: Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. Just as interactive environments have often driven progress in agents, advancing live future prediction naturally motivates viewing it as a learning environment. Prior works have explored future prediction from several different angles, but have generally not framed it as a unified learning environment. This task is appealing for learning because it can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of live future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. In our environment, we take three open-source base models and train them over consecutive days. The results show that training is effective. Furthermore, we build a daily benchmark based on the environment and evaluate several frontier agents on it to establish performance baselines for current agent systems.
[286] Explainable Representation of Finite-Memory Policies for POMDPs using Decision Trees
Muqsit Azeem, Debraj Chakraborty, Sudeep Kanav, Jan Kretinsky
Main category: cs.AI
Abstract: Partially Observable Markov Decision Processes (POMDPs) are a fundamental framework for decision-making under uncertainty and partial observability. Since in general optimal policies may require infinite memory, they are hard to implement and often render most problems undecidable. Consequently, finite-memory policies are mostly considered instead. However, the algorithms for computing them are typically very complex, and so are the resulting policies. Facing the need for their explainability, we provide a representation of such policies, both (i) in an interpretable formalism and (ii) typically of smaller size, together yielding higher explainability. To that end, we combine models of Mealy machines and decision trees; the latter describing simple, stationary parts of the policies and the former describing how to switch among them. We design a translation for policies of the finite-state-controller (FSC) form from standard literature and show how our method smoothly generalizes to other variants of finite-memory policies. Further, we identify specific properties of recently used “attractor-based” policies, which allow us to construct yet simpler and smaller representations. Finally, we illustrate the higher explainability in a few case studies.
[287] ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents
Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Si-Yu Han, Jinghao Pang, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li
Main category: cs.AI
Abstract: Travel planning stands out among real-world applications of Language Agents because it couples significant practical demand with a rigorous constraint-satisfaction challenge. However, existing benchmarks primarily operate on a slot-filling paradigm, restricting agents to synthetic queries with pre-defined constraint menus, which fails to capture the open-ended nature of natural language interaction, where user requirements are compositional, diverse, and often implicitly expressed. To address this gap, we introduce ChinaTravel, with four key contributions: 1) a practical sandbox aligned with multi-day, multi-POI travel planning, 2) a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison, 3) an open-ended dataset that integrates diverse travel requirements and implicit intent from 1154 human participants, and 4) a fine-grained analysis revealing the potential of neuro-symbolic agents in travel planning, which achieve a 37.0% constraint satisfaction rate on human queries, a 10× improvement over purely neural models, while highlighting significant challenges in compositional generalization. Overall, ChinaTravel provides a foundation for advancing language agents through compositional constraint validation in complex, real-world planning scenarios. Project Page: https://www.lamda.nju.edu.cn/shaojj/ChinaTravel/index.html
[288] Networks of Causal Abstractions: A Sheaf-theoretic Framework
Gabriele D’Acunto, Paolo Di Lorenzo, Sergio Barbarossa
Main category: cs.AI
Abstract: A core challenge in causal artificial intelligence is the principled coordination of multiple, imperfect, and subjective causal perspectives arising from distributed agents with limited and heterogeneous access to the environment. This problem has received little formal treatment, as the existing framework assumes a single shared global causal model. This work introduces the causal abstraction network (CAN), a general sheaf-theoretic framework for representing, learning, and reasoning across collections of mixture of causal models (MCMs), a class that unifies several existing models of context-dependent causal mechanisms. Sheaf theory provides a natural foundation for this task, offering a rigorous framework to coherently align distributed causal knowledge without requiring explicit causal graphs, functional mechanisms, interventional data, or jointly sampled observations. At the theoretical level, we provide a categorical formulation of MCMs and characterize key properties of CANs, including consistency and smoothness. Under consistency, we establish necessary and sufficient conditions: (i) for the existence of global sections, linked to spectral properties of an associated connection Laplacian; and (ii) for the convergence of causal knowledge diffusion over the CAN to the space of global sections. At the methodological level, we exploit the compositionality of causal abstractions to decompose the learning of consistent CANs into local problems on network edges, extending our prior work on Gaussian variables to Gaussian mixtures via the proposed MIXTURE-CALSEP algorithm. We validate the framework on synthetic data and through a financial application involving a multi-agent trading system, demonstrating CAN recovery, CAN-based portfolio optimization, and counterfactual reasoning.
[289] Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
Hudson de Martim
Main category: cs.AI
Abstract: In high-stakes legal domains, retrieval must preserve not only semantic relevance, but also the hierarchy, temporality, and causal provenance of legal norms. Standard Retrieval-Augmented Generation (RAG), based mainly on semantic similarity over text fragments, cannot reliably provide this level of control. Prior work on SAT-Graph RAG addressed the representation problem by modeling legal materials as structure-aware temporal knowledge graphs. This paper addresses the next problem: how an LLM-based reasoning agent can interact with such a graph without reintroducing unreliable retrieval behavior. We specify the SAT-Graph API, a canonical primitive interface for auditable reasoning over temporal knowledge graphs, developed and illustrated in the legal domain. The API exposes typed, atomic, and composable primitives that mediate between a probabilistic language model and a deterministic symbolic substrate. Its design follows Probability Isolation: uncertainty is confined to intent translation, semantic anchoring, and final narrative synthesis, while structural, temporal, and causal graph traversals are executed through deterministic operations over canonical anchors. The interface shifts legal RAG from single-shot Retrieve-then-Generate to active Reason-Act-Observe. An agent decomposes a legal question into an explicit execution plan, invokes primitives for point-in-time retrieval, context reconstruction, provenance tracing, and impact analysis, and produces an answer grounded in an auditable log of graph operations. The result is a formal architectural specification, not an empirical benchmark: a secure interaction protocol that decouples legal knowledge representation from agentic reasoning. Although illustrated in law, the primitive model is domain-portable to other temporally versioned, provenance-sensitive, and authority-governed knowledge bases.
[290] ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
Jianghao Lin, Yuanyuan Shi, Xin Peng, Renjie Ding, Hairui Wang, Yuxuan Peng, Bizhe Bai, Weixi Song, Fengshuo Bai, Huacan Chai, Weinan Zhang, Fei Huang, Ying Wen
Main category: cs.AI
Abstract: Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows “explore more but retain less”, since early JSON errors are unrecoverable.
[291] Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Yihong Dong, Zhaoyu Ma, Xue Jiang, Zhiyuan Fan, Jiaru Qian, Yongmin Li, Jianha Xiao, Zhi Jin, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li, Ge Li
Main category: cs.AI
Abstract: Diffusion language models (DLMs) are emerging as a compelling alternative to the dominant autoregressive paradigm, offering inherent advantages in parallel generation and bidirectional context modeling. However, for tasks with strict structural constraints such as code generation, DLMs face a critical trade-off between inference speed and output quality, where accelerating generation by reducing sampling steps often leads to catastrophic performance collapse. We find that the fundamental reasons are: 1) the generation difficulty is non-uniform across the decoding steps of a structured sequence, making the static acceleration strategy of DLMs suboptimal; 2) the context of tokens generated by a DLM evolves continuously, causing early high-confidence predictions to turn into irreversible errors. In this paper, we introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber), a novel training-free sampling algorithm for DLMs that is the first to achieve both better inference speed and output quality in code generation. Saber dynamically adjusts the number of tokens unmasked per step based on the model’s evolving confidence, and utilizes a backtracking mechanism to revert tokens whose confidence drops as new context emerges, with its effectiveness supported by theoretical analysis. Extensive experiments on multiple mainstream code generation benchmarks show that Saber boosts Pass@1 accuracy by an average of 1.9% over mainstream DLM sampling methods, while achieving an average 251.4% inference speedup. By leveraging the inherent advantages of DLMs, our work significantly narrows the performance gap with autoregressive models in code generation.
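A single sampling step that combines confidence-driven unmasking with backtracking might look like the sketch below. It is a deliberate simplification with fixed thresholds, not Saber's actual rule: the function name, both threshold parameters, and the toy inputs are assumptions made for illustration.

```python
from typing import Optional

MASK = None  # a masked position holds None


def adaptive_step(
    tokens: list[Optional[str]],
    predictions: list[str],
    confidences: list[float],
    unmask_threshold: float,
    remask_threshold: float,
) -> list[Optional[str]]:
    """One toy sampling step with adaptive unmasking and backtracking.

    `predictions` and `confidences` are the model's current guesses for every
    position, given the current context (supplied by the caller here).
    """
    out = list(tokens)
    masked = [i for i, t in enumerate(tokens) if t is MASK]

    # Backtracking: re-mask committed tokens whose confidence has dropped.
    for i, t in enumerate(tokens):
        if t is not MASK and confidences[i] < remask_threshold:
            out[i] = MASK

    # Adaptive acceleration: unmask every masked position that is confident
    # enough, so the number of tokens committed varies from step to step.
    chosen = [i for i in masked if confidences[i] >= unmask_threshold]
    if not chosen and masked:
        # Always make progress by committing the single most confident position.
        chosen = [max(masked, key=lambda i: confidences[i])]
    for i in chosen:
        out[i] = predictions[i]
    return out


step = adaptive_step(
    tokens=["def", MASK, MASK, "x"],
    predictions=["def", "f", "(", "y"],
    confidences=[0.99, 0.95, 0.4, 0.2],
    unmask_threshold=0.9,
    remask_threshold=0.3,
)
```

Here the confident position is committed, the uncertain one stays masked, and the previously committed `"x"` is reverted because its confidence fell below the re-mask threshold.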
[292] Student Guides Teacher: Weak-to-Strong Inference via Spectral Orthogonal Exploration
Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li, Deguo Xia, Jizhou Huang
Main category: cs.AI
Abstract: Large Language Models (LLMs) often suffer from “Reasoning Collapse” on challenging mathematical reasoning tasks, where stochastic sampling produces lexical variations of the same erroneous logic rather than genuine semantic exploration. We observe that failed reasoning traces are often associated with a low-rank bias manifold in the model’s hidden-state geometry, which reduces exploration toward corrective solution directions. To address this, we propose Spectral Orthogonal Exploration (SOE), a geometric inference framework under a “Student Guides Teacher” paradigm. Instead of using a weak auxiliary agent for imitation, SOE uses it as an orthogonal probe to introduce semantically heterogeneous reasoning signals into the teacher’s orthogonal complement of its dominant subspace. This intervention steers the teacher toward more diverse reasoning trajectories and improves exploration beyond standard sampling. Experiments on mathematical benchmarks show that SOE improves average accuracy by 62.4% and average sampling efficiency by 113.7% over baseline methods, suggesting that geometric interventions can be effective for mitigating reasoning collapse in mathematical reasoning. We further provide preliminary evidence that SOE is also effective on logic and code generation benchmarks.
[293] Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval
Joaquín Polonuer, Lucas Vittor, Iñaki Arango, Ayush Noori, David A. Clifton, Luciano Del Corro, Marinka Zitnik
Main category: cs.AI
Abstract: Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to follow relational links. Similarity-based retrievers provide coverage but remain shallow, whereas traversal-based methods rely on selecting seed nodes to start exploration, which can fail when queries span multiple entities and relations. We introduce ARK: Adaptive Retriever of Knowledge, a tool-using KG retriever that gives a language model control over this breadth-depth tradeoff using a two-operation toolset: global lexical search over node descriptors and one-hop neighborhood exploration that composes into multi-hop traversal. ARK alternates between breadth-oriented discovery and depth-oriented expansion without depending on a fragile seed selection, a pre-set hop depth, or requiring retrieval training. ARK adapts tool use to queries, using global search for language-heavy queries and neighborhood exploration for relation-heavy queries. On STaRK, ARK reaches 59.1% average Hit@1 and 67.4 average MRR, improving average Hit@1 by up to 31.4% and average MRR by up to 28.0% over retrieval-based and agent-based training-free methods. Finally, we distill ARK’s tool-use trajectories from a large teacher into an 8B model via label-free imitation, improving Hit@1 by +7.0, +26.6, and +13.5 absolute points over the base 8B model on AMAZON, MAG, and PRIME datasets, respectively, while retaining up to 98.5% of the teacher’s Hit@1 rate.
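The two-operation toolset described above can be pictured with a small sketch: a global lexical search over node descriptors and a one-hop neighborhood lookup that composes into multi-hop traversal. The graph, the token-overlap scoring, and the function names `global_search` and `neighbors` are hypothetical and are not ARK's actual interface.

```python
# Hypothetical toy knowledge graph: node descriptors and undirected adjacency.
NODES = {
    "p1": "paper on graph neural networks",
    "a1": "author Alice Smith",
    "p2": "paper on protein folding",
}
EDGES = {"p1": ["a1"], "a1": ["p1", "p2"], "p2": ["a1"]}


def global_search(query: str, nodes: dict[str, str], top_k: int) -> list[str]:
    """Breadth tool: rank nodes by how many query words their descriptor shares."""
    query_words = set(query.lower().split())
    scored = []
    for node_id, descriptor in nodes.items():
        overlap = len(query_words & set(descriptor.lower().split()))
        if overlap > 0:
            scored.append((-overlap, node_id))
    scored.sort()  # highest overlap first, ties broken by node id
    return [node_id for _, node_id in scored[:top_k]]


def neighbors(node_id: str, edges: dict[str, list[str]]) -> list[str]:
    """Depth tool: return the one-hop neighborhood of a node."""
    return list(edges.get(node_id, []))


# Composing the tools: search globally, then expand two hops from the hit.
start = global_search("graph neural networks paper", NODES, top_k=1)[0]
first_hop = neighbors(start, EDGES)
second_hop = [n for hop in first_hop for n in neighbors(hop, EDGES)]
```

In an agentic setting a language model would decide, per query, which of the two calls to issue next; that control loop is omitted here.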
[294] Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification
Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu
Main category: cs.AI
Abstract: Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent’s ability by iteratively verifying the policy model’s outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepSearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
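The verify-then-refine loop described above can be sketched in a few lines. This is a toy under stated assumptions rather than DeepVerifier itself: the rubric is a list of hand-written string checks, and `verify`, `refine`, `TOY_RUBRIC`, and `toy_revise` are hypothetical names standing in for the verifier and the agent.

```python
from typing import Callable

Rubric = list[tuple[str, Callable[[str], bool]]]


def verify(answer: str, rubric: Rubric) -> list[str]:
    """Return the names of rubric items that the answer fails."""
    return [name for name, check in rubric if not check(answer)]


def refine(
    answer: str,
    rubric: Rubric,
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Feed rubric feedback back to the reviser until it passes or rounds run out."""
    for _ in range(max_rounds):
        failed = verify(answer, rubric)
        if not failed:
            break
        answer = revise(answer, failed)
    return answer


# Hypothetical rubric and reviser, for illustration only.
TOY_RUBRIC: Rubric = [
    ("states a final answer", lambda a: "Answer:" in a),
    ("cites a source", lambda a: "[source]" in a),
]


def toy_revise(answer: str, failed: list[str]) -> str:
    if "states a final answer" in failed:
        answer = "Answer: " + answer
    if "cites a source" in failed:
        answer = answer + " [source]"
    return answer


final = refine("Paris", TOY_RUBRIC, toy_revise)
```

No model weights change inside the loop; all improvement comes from repeated verification at inference time, which is the sense in which the abstract speaks of test-time scaling.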
[295] RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis
Shaowei Shen, Xiaohong Yang, Jie Yang, Lianfen Huang, Yongcai Zhang, Yang Zou, Seyyedali Hosseinalipour
Main category: cs.AI
Abstract: Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges for large language models (LLMs) in clinical diagnosis. In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions. Although recent multi-agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence-driven processes used by clinical experts. More fundamentally, existing approaches largely ignore the rich logical dependencies among diseases, such as mutual exclusivity, pathological compatibility, and diagnostic confusion. This limitation prevents them from ruling out clinically implausible hypotheses, even when sufficient evidence is available. To overcome these limitations, we propose RE-MCDF, a relation-enhanced multi-expert clinical diagnosis framework. RE-MCDF introduces a generation–verification–revision closed-loop architecture that integrates three complementary components: (i) a primary expert that generates candidate diagnoses and supporting evidence, (ii) a laboratory expert that dynamically prioritizes heterogeneous clinical indicators, and (iii) a multi-relation awareness and evaluation expert group that explicitly enforces inter-disease logical constraints. Guided by a medical knowledge graph (MKG), the first two experts adaptively reweight EMR evidence, while the expert group validates and corrects candidate diagnoses to ensure logical consistency. Extensive experiments on the neurology subset of CMEMR (NEEMRs) and on our curated dataset (XMEMRs) demonstrate that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios (https://github.com/shenshaowei/RE-MCDF).
[296] The Dual Role of Abstracting over the Irrelevant in Symbolic Explanations: Cognitive Effort vs. Understanding
Zeynep G. Saribatur, Johannes Langer, Ute Schmid
Main category: cs.AI
Abstract: Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a transparent foundation for interpretability, raw logical traces often impose a high extraneous cognitive load. We investigate how formal abstractions, specifically removal and clustering, impact human reasoning performance and cognitive effort. Utilizing Answer Set Programming (ASP) as a formal framework, we define a notion of irrelevant details to be abstracted over to obtain simplified explanations. Our cognitive experiments, in which participants classified stimuli across domains with explanations derived from an answer set program, show that clustering details significantly improves participants’ understanding, while removing details significantly reduces cognitive effort, supporting the hypothesis that abstraction enhances human-centered symbolic explanations.
[297] Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
Ruocheng Guo, Kaiwen Dong, Xiang Gao, Kamalika Das
Main category: cs.AI
Abstract: While most efforts to improve LLM-based tool-using agents focus on the agent itself - through larger models, better prompting, or fine-tuning - agent performance increasingly plateaus due to the quality of the tool interfaces these agents consume. Tool descriptions are often written for human developers and tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows. Existing approaches to improving tool interfaces (1) require re-running a multi-stage per-tool pipeline - synthesizing queries, executing an agent to collect trajectories, annotating trajectories, and prompting a strong LLM multiple times - for every API that enters the catalog, and (2) typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to internalize reusable patterns of what makes a tool description effective. To support this approach, we construct a large-scale dataset of high-quality tool interfaces derived from real-world APIs through a principled data synthesis workflow. Experiments on widely adopted benchmarks show that Trace-Free+ improves robustness as tool catalogs scale to 150+ candidates - in scaling experiments, reducing accuracy degradation by 29.23% and improving average query-level success by 60.89% on StableToolBench - generalizes across domains without retraining, and provides complementary gains on top of agent fine-tuning.
[298] Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
Mengxiang Chen, Zhouwei Zhai, Jin Li
Main category: cs.AI
Abstract: Modern e-commerce search is evolving to resolve complex user intents. While Large Language Models (LLMs) offer strong reasoning, existing LLM-based paradigms face a fundamental blindness-latency dilemma: query rewriting is agnostic to retrieval capabilities and real-time inventory, yielding invalid plans; conversely, deep search agents rely on iterative tool calls and reflection, incurring seconds of latency incompatible with industrial sub-second budgets. To resolve this conflict, we propose Environment-Aware Search Planning (EASP), reformulating search planning as a dynamic reasoning process grounded in environmental reality. EASP introduces a Probe-then-Plan mechanism: a lightweight Retrieval Probe exposes the retrieval snapshot, enabling the Planner to diagnose execution gaps and generate grounded search plans. The methodology comprises three stages: (1) Offline Data Synthesis: A Teacher Agent synthesizes diverse, execution-validated plans by diagnosing the probed environment. (2) Planner Training and Alignment: The Planner is initialized via Supervised Fine-Tuning (SFT) to internalize diagnostic capabilities, then aligned with business outcomes (conversion rate) via Reinforcement Learning (RL). (3) Adaptive Online Serving: A complexity-aware routing mechanism selectively activates planning for complex queries, ensuring optimal resource allocation. Extensive offline evaluations and online A/B testing on JD.com demonstrate that EASP significantly improves relevant recall and achieves substantial lifts in UCVR and GMV. EASP has been successfully deployed in JD.com’s AI-Search system.
[299] A Framework for Longitudinal Health AI Agents
Georgianna Lin, Rencong Jiang, Noémie Elhadad, Xuhai “Orson” Xu
Main category: cs.AI
Abstract: Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, both within and beyond clinical settings, where follow-up, coherent reasoning, and sustained alignment with individuals’ goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.
[300] Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
Zhonghao Yang, Yu Li, Yanxu Zhu, Tianyi Zhou, Yuejin Xie, Haoyu Luo, Jing Shao, Xia Hu, Dongrui Liu
Main category: cs.AI
Abstract: As agent systems move into increasingly diverse execution settings, trajectory-level safety evaluation and diagnosis require benchmarks that evolve with them. ATBench is a diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis. This report presents ATBench-Claw and ATBench-Codex, two domain-customized extensions that carry ATBench into the OpenClaw and OpenAI Codex / Codex-runtime settings. The key adaptation mechanism is to analyze each new setting, customize the three-dimensional Safety Taxonomy over risk source, failure mode, and real-world harm, and then use that customized taxonomy to define the benchmark specification consumed by the shared ATBench construction pipeline. This extensibility matters because agent frameworks remain relatively stable at the architectural level even as their concrete execution settings, tool ecosystems, and product capabilities evolve quickly. Concretely, ATBench-Claw targets OpenClaw-sensitive execution chains over tools, skills, sessions, and external actions, while ATBench-Codex targets trajectories in the OpenAI Codex / Codex-runtime setting over repositories, shells, patches, dependencies, approvals, and runtime policy boundaries. Our emphasis therefore falls on taxonomy customization, domain-specific risk coverage, and benchmark design under a shared ATBench generation framework.
[301] Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Xinru Yan, Boxi Cao, Yaojie Lu, Hongyu Lin, Weixiang Zhou, Le Sun, Xianpei Han
Main category: cs.AI
Abstract: Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the “text-dominance” of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
[302] ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
Xirui Li, Ming Li, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou
Main category: cs.AI
Abstract: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent’s current weaknesses rather than being bounded by existing user logs.
[303] A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities
Aya Cherigui, Florent Guépin, Arnaud Legendre, Jean-François Couchot
Main category: cs.AI
Abstract: Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a great challenge to consider and that it should be tackled through adversarial evaluation in accordance with the current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance to the trajectory user-linking problem.
[304] The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
Zhenyu Zhao, Aparna Balagopalan, Adi Agrawal, Dilshoda Yergasheva, Waseem Alshikh, Daniel M. Bikel
Main category: cs.AI
Abstract: Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain settings is that of sycophancy. That is, models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating sycophancy that LLMs display in agentic financial tasks. Our findings are three-fold: first, we find the models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks to test for sycophancy by user preference information that contradicts the reference answer and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery such as input filtering with a pretrained LLM.
[305] OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction
Junxing Hu, Tianlong Li, Lei Yu, Ai Han
Main category: cs.AI
Abstract: Deploying production-ready multi-agent systems (MAS) in complex industrial environments remains challenging due to limitations in scalability, observability, and autonomous evolution. We present OxyGent, an open-source framework driven by two core novelties: a unified Oxy abstraction and the OxyBank evolution engine. The unified abstraction encapsulates agents, tools, LLMs, and reasoning flows as pluggable atomic components, enabling Lego-like scalable system composition and non-intrusive monitoring. To enhance observability, OxyGent introduces permission-driven dynamic planning that replaces rigid workflows with execution graphs generated at runtime, providing adaptive visualizations. Furthermore, to support continuous evolution, OxyBank serves as an AI asset management platform that drives automated data backflow, annotation, and joint evolution. Empirical evaluations and real-world case studies show that OxyGent provides a robust and scalable foundation for MAS. OxyGent is fully open-sourced under the Apache License 2.0 at https://github.com/jd-opensource/OxyGent.
[306] Data-Centric Foundation Models in Computational Healthcare: A Survey
Yunkun Zhang, Jin Gao, Zheling Tan, Lingfeng Zhou, Kexin Ding, Mu Zhou, Shaoting Zhang, Dequan Wang
Main category: cs.AI
Abstract: Unavailable (arXiv:2401.02458).
[307] M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
Daniel Sliwowski, Dongheui Lee
Main category: cs.AI
Abstract: Unavailable (arXiv:2504.18662).
[308] OT Score: An OT based Confidence Score for Prototype-Assisted Source Free Unsupervised Domain Adaptation
Yiming Zhang, Sitong Liu, Alex Cloninger
Main category: cs.AI
Abstract: Unavailable (arXiv:2505.11669).
[309] Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods
Behnam Yousefimehr, Mehdi Ghatee, Javad Fazli, Shervin Ghaffari, Zahra Rafei, Mohammad Amin Seifi, Sajed Tavakoli, Abolfazl Nikahd, Mahdi Razi Gandomani, Alireza Orouji, Ramtin Mahmoudi Kashani, Sarina Heshmati, Negin Sadat Mousavi
Main category: cs.AI
Abstract: Unavailable (arXiv:2505.13518).
[310] Efficient Traffic Forecasting on Large-Scale Road Network by Regularized Adaptive Graph Convolution
Kaiqi Wu, Weiyang Kong, Sen Zhang, Zitong Chen, Yubao Liu
Main category: cs.AI
Abstract: Unavailable (arXiv:2506.07179).
[311] Treatment, evidence, imitation, and chat
Samuel J. Weisenthal
Main category: cs.AI
Abstract: Unavailable (arXiv:2506.23040).
[312] PBiLoss: Popularity-Aware Regularization to Improve Fairness in Graph-Based Recommender Systems
Mohammad Naeimi, Mostafa Haghir Chehreghani
Main category: cs.AI
Abstract: Unavailable (arXiv:2507.19067).
[313] Neural Bridge Processes
Jian Xu, Yican Liu, Delu Zeng, John Paisley, Qibin Zhao
Main category: cs.AI
Abstract: Unavailable (arXiv:2508.07220).
[314] Vertex Features for Neural Global Illumination
Rui Su, Honghao Dong, Haojie Jin, Yisong Chen, Guoping Wang, Sheng Li
Main category: cs.AI
Abstract: Unavailable (arXiv:2508.07852).
[315] Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
Emmanouil Kritharakis, Dusan Jakovetic, Antonios Makris, Konstantinos Tserpes
Main category: cs.AI
Abstract: Failed to fetch summary for 2508.12672: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12672&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[316] The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.16131: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16131&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[317] Hybrid Diffusion for Simultaneous Symbolic and Continuous Planning
Sigmund Hennum Høeg, Aksel Vaaler, Chaoqi Liu, Olav Egeland, Yilun Du
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.21983: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21983&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[318] PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.23410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[319] Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, Jiawei Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.10150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[320] FedPF: Accurate Target Privacy Preserving Federated Learning Balancing Fairness and Utility
Kangkang Sun, Jun Wu, Minyi Guo, Jianhua Li, Jianwei Huang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.26841: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.26841&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[321] EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
Junwei Liu, Chen Xu, Chong Wang, Tong Bai, Weitong Chen, Kaseng Wong, Yiling Lou, Xin Peng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.02399: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.02399&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[322] PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis
Shengkun Cui, Rahul Krishna, Saurabh Jha, Ravishankar K. Iyer
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.22113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[323] HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.21459: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21459&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[324] ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
Junbo Jacob Lian, Yujun Sun, Huiling Chen, Chaoyu Zhang, Hanzhang Qin, Chung-Piaw Teo
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.15983: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15983&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[325] CoFL: Continuous Flow Fields for Language-Conditioned Navigation
Haokun Liu, Zhaoqi Ma, Yicheng Chen, Masaki Kitagawa, Wentao Zhang, Zicen Xiong, Jinjie Li, Moju Zhao
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.02854: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02854&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[326] Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning
Zhen Zhang, Jielei Chu, Jiangtao Hu, Bin Liu, Jie Wang, Ya Liu, Tianrui Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.09145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[327] Rethinking the Harmonic Loss via Non-Euclidean Distance Layers
Maxwell Miller-Golub, Collin Coil, Kamil Faber, Marcin Pietron, Panpan Zheng, Pasquale Minervini, Roberto Corizzo
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.10225: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10225&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[328] Integrating Weather Foundation Model and Satellite to Enable Fine-Grained Solar Irradiance Forecasting
Ziqing Ma, Kai Ying, Xinyue Gu, Tian Zhou, Tianyu Zhu, Haifan Zhang, Peisong Niu, Zheng Wang, Cong Bai, Liang Sun
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.14845: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14845&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[329] Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.19470: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19470&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[330] Unilateral Relationship Revision Power in Human-AI Companion Interaction
Benjamin Lange
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.23315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[331] Generative models on phase space
Zachary Bogorad, Ibrahim Elsharkawy, Yonatan Kahn, Andrew J. Larkoski, Noam Levi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02415: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02415&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[332] A Self-Calibrating Framework for Analog Circuit Sizing Using LLM-Derived Analytical Equations
Antonio J. Bujana, Aydin I. Karsilayan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.07387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[333] SkillForge: Forging Domain-Specific, Self-Evolving Agent Skills in Cloud Technical Support
Xingyan Liu, Xiyue Luo, Linyu Li, Ganghong Huang, Jianfeng Liu, Honglin Qiao
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08618: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08618&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[334] Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
Wentao Hu, Yanbo Zhai, Xiaohui Hu, Mingkuan Zhao, Shanhong yu, Xue Liu, Kaidong Yu, Shuangyong Song, Xuelong Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14246: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14246&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[335] Provable Coordination for LLM Agents via Message Sequence Charts
Benedikt Bollig, Matthias Függer, Thomas Nowak
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17612: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17612&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[336] IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis, Ana I. Bento, Shaun Truelove, Geoffrey Fox, Madhav Marathe, Harry Hochheiser, Srini Venkatramanan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18521: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18521&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[337] Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Vin Bhaskara, Haicheng Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18701: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18701&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[338] Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Open-H-Embodiment Consortium, Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Zhaoyang Jacopo Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.21017: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.21017&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[339] Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Shevya Panda, Shinjini Bose, Ananya Joshi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.22063: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.22063&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[340] A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
Somyajit Chakraborty
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.22227: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.22227&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[341] Semantic Error Correction and Decoding for Short Block Codes
Jiafu Hao, Chentao Yue, Wanchun Liu, Branka Vucetic, Yonghui Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.22269: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.22269&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[342] Inverting Foundation Models of Brain Function with Simulation-Based Inference
Niels Bracher, Xavier Intes, Stefan T. Radev
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.23865: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.23865&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[343] A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks
Grigorios Papanikolaou, Ioannis Kontopoulos, Konstantinos Tserpes
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.24810: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.24810&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[344] Towards Unified Multi-task EEG Analysis with Low-Rank Adaptation
Sicheng Dai, Kai Chen, Hongwang Xiao, Shan Yu, Qiwei Ye
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.25131: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.25131&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[345] The Role of Symmetry in Optimizing Overparameterized Networks
Kusha Sareen, Mohammad Pedramfar, Sékou-Oumar Kaba, Mehran Shakerinava, Siamak Ravanbakhsh
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.25150: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.25150&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[346] DiRe-RAPIDS: Topology-faithful dimensionality reduction at scale
Alexander Kolpakov, Igor Rivin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.25209: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.25209&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[347] AHASD: Asynchronous Heterogeneous Architecture for LLM Adaptive Drafting Speculative Decoding on Mobile Devices
Ma Zirui, Fan Zhihua, Li Wenxing, Wu Haibin, Zhang Fulin, Ye Xiaochun, Li Wenming
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.25326: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.25326&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[348] Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model
Adelekun Oluwademilade, Ademola Adedamola, Abiola Abdulhakeem, Akinpelu Azeezat, Eraiyetan Israel, Omotosho Oluwadunsin, Ibenye Ikechukwu, Ayuba Muhammad, Olusanya Olamide, Kamorudeen Amuda
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Speech Emotion Recognition (SER) is the use of machines to detect the emotional state of humans from their speech, and it is gaining importance in natural human-computer interaction. Speech is a valuable source of information because emotions modify speech patterns such as pitch, energy, and even timing. Nonetheless, SER is not an easy task: speakers vary, recording conditions differ, and some emotions sound alike. In this work, the authors introduce a speech emotion recognition system that uses Mel-Frequency Cepstral Coefficients (MFCCs) for feature extraction and a Long Short-Term Memory (LSTM) neural network for classification. Speech signals from the Toronto Emotional Speech Set (TESS) were pre-processed and transformed into MFCC features to capture their important temporal characteristics. The resulting features were then fed to an LSTM model, which can learn long-term patterns in sequential audio data. The trained model was evaluated on several emotion classes occurring in the dataset. The experimental results show that the proposed MFCC-LSTM approach captures emotional patterns in speech and classifies all the chosen emotion classes reliably. A Support Vector Machine (SVM) with an RBF kernel served as a classical baseline, achieving 98% accuracy, against which the proposed LSTM model, achieving 99% accuracy, was validated. Overall, the results confirm that LSTM-based architectures can address the task of speech emotion recognition. Potential applications of the proposed system include virtual assistants and mental health surveillance.
[349] Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech
Himadri S Samanta
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Digital biomarkers for depression have largely relied on static acoustic descriptors, pooled summary statistics, or conventional machine learning representations. Such approaches may miss nonlinear temporal organization embedded in conversational vocal dynamics. We hypothesized that depression is associated with altered recurrence structure in vocal state trajectories, reflecting changes in how the vocal system revisits acoustic states over time. Using the depression subset of the DAIC-WOZ corpus with 142 labeled participants, we modeled frame-level COVAREP trajectories as nonlinear dynamical systems and derived recurrence-based biomarkers from 74 vocal channels. Logistic regression with feature selection and stratified cross-validation evaluated classification performance. Recurrence-based biomarkers achieved a mean cross-validated AUC of 0.689, exceeding static acoustic baselines, entropy-dynamics features, Hurst exponent features, determinism features, and Lyapunov-like instability proxies. Permutation testing indicated statistical significance with $p=0.004$. Pooled cross-validated predictions yielded AUC 0.665 with a 95% bootstrap confidence interval of [0.568, 0.758]. These findings suggest that depression may be characterized by altered recurrence structure in conversational vocal dynamics and support nonlinear state-space analysis as a promising direction for digital psychiatric biomarkers.
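As a rough illustration of what "recurrence structure" means here, the sketch below builds a thresholded recurrence matrix for a short trajectory and reports its recurrence rate, the fraction of off-diagonal state pairs that lie within a distance `eps` of each other. This is a generic recurrence-quantification example, not the paper's pipeline: the function names, the Euclidean metric, and the threshold `eps` are assumptions made for illustration, and the abstract does not say which recurrence measures were used.

```python
import math


def recurrence_matrix(trajectory, eps):
    """Return R with R[i][j] = 1 when states i and j are within eps (Euclidean)."""
    n = len(trajectory)
    return [
        [1 if math.dist(trajectory[i], trajectory[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


def recurrence_rate(matrix):
    """Fraction of off-diagonal entries that are recurrent (diagonal excluded)."""
    n = len(matrix)
    if n < 2:
        return 0.0
    recurrent = sum(matrix[i][j] for i in range(n) for j in range(n) if i != j)
    return recurrent / (n * (n - 1))


# Hypothetical one-channel trajectory that alternates between two states.
toy_trajectory = [(0.0,), (1.0,), (0.0,), (1.0,)]
toy_rate = recurrence_rate(recurrence_matrix(toy_trajectory, eps=0.1))
```

In this toy case each state is revisited exactly once, so 4 of the 12 off-diagonal pairs are recurrent. A real analysis would apply the same idea per vocal channel to frame-level features.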
[350] Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
Bo Cheng, Songjun Cao, Xiaoming Zhang, Jie Chen, Long Ma, Fei Chen
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
[351] Full band denoising of room impulse response in the wavelet domain with dictionary learning
Théophile Dupré, Romain Couderc, Miguel Moleron, Axel Coulon, Rémy Bruno, Arnaud Laborie
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Conventional wavelet-domain methods for room impulse response denoising rely on thresholding detail coefficients, which is unsuited for low frequencies. In this work, we introduce a wavelet-based post-processing algorithm that extends denoising to approximation coefficients by means of sparse dictionary learning with a time-varying error tolerance. The proposed method leverages an exponential decay envelope model to adapt reconstruction accuracy according to the local signal-to-noise ratio. This approach significantly improves low-frequency denoising of synthetic and measured room impulse responses compared to the baseline method, leading to more accurate estimation of acoustic parameters such as decay time.
[352] A Toolkit for Detecting Spurious Correlations in Speech Datasets
Lara Gauder, Pablo Riera, Andrea Slachevsky, Gonzalo Forno, Adolfo M. García, Luciana Ferrer
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present in both the training and test data, these correlations result in an overestimation of system performance. This is a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio. Better-than-chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
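The decision rule described in this abstract can be sketched as follows, assuming predictions have already been obtained from a classifier that saw only non-speech regions. This is not the toolkit's API: the function names, the use of balanced accuracy as the score, and the `margin` above chance are all hypothetical choices for illustration.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes that appear in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def flags_spurious_correlation(y_true, y_pred, margin=0.1):
    """Flag the dataset when non-speech predictions beat chance by more than margin.

    Chance level for balanced accuracy is 1 / (number of classes). The margin
    is an illustrative stand-in for a proper significance test.
    """
    chance = 1.0 / len(set(y_true))
    return balanced_accuracy(y_true, y_pred) > chance + margin


# Hypothetical labels and non-speech-only predictions for a two-class dataset.
labels = [0, 0, 1, 1]
informative_predictions = [0, 0, 1, 0]  # balanced accuracy 0.75
chance_predictions = [0, 1, 0, 1]       # balanced accuracy 0.50
```

A fixed margin is only a placeholder; with realistic sample sizes a permutation test or confidence interval would be a more defensible way to decide whether performance is above chance.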
[353] Omni2Sound: Towards Unified Video-Text-to-Audio Generation
Yusheng Dai, Zehua Chen, Yuxuan Jiang, Baolong Gao, Qiuhong Ke, Jianfei Cai, Jun Zhu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5$\times$ cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.
[354] Explainable Detection of Machine Generated Music and Early Systematic Evaluation
Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions. Consequently, MGM detection (MGMD) is crucial for preserving the integrity of these fields. Despite its significance, the MGMD domain lacks the comprehensive systematic evaluation results necessary to drive meaningful progress. To address this gap, we conduct experiments on existing large-scale datasets using a range of foundational models for audio processing, establishing systematic evaluation results tailored to the MGMD task. Our selection includes traditional machine learning models, deep neural networks, Transformer-based architectures, and state space models (SSMs). Recognising the inherently multimodal nature of music, which integrates both melody and lyrics, we also explore fundamental multimodal models in our experiments. Beyond providing basic binary classification outcomes, we delve deeper into model behaviour using multiple explainable Artificial Intelligence (XAI) tools, offering insights into their decision-making processes. Our analysis reveals that ResNet18 performs best in both in-domain and out-of-domain tests. By providing a comprehensive comparison of systematic evaluation results and their interpretability, we propose several directions to inspire future research to develop more robust and effective detection methods for MGM. We provide our code and some samples in a GitHub repository.
[355] A Dataset for Automatic Vocal Mode Classification
Reemt Hinrichs, Sonja Stephan, Alexander Lange, Jörn Ostermann
Main category: cs.SD
Abstract: The Complete Vocal Technique (CVT) is a school of singing developed in the past decades by Cathrin Sadolin et al. CVT groups the use of the voice into so-called vocal modes, namely Neutral, Curbing, Overdrive and Edge. Knowledge of the desired vocal mode can be helpful for singing students. Automatic classification of vocal modes can thus be important for technology-assisted singing teaching. Previously, automatic classification of vocal modes has been attempted without major success, potentially due to a lack of data. Therefore, we recorded a novel vocal mode dataset consisting of sustained vowels recorded from four singers, three of whom are professional singers with more than five years of CVT experience. The dataset covers the entire vocal range of the subjects, totaling 3,752 unique samples. By using four microphones, thereby offering a natural data augmentation, the dataset consists of more than 13,000 samples combined. An annotation was created using three CVT-experienced annotators, each providing an individual annotation. The merged annotation as well as the three individual annotations come with the published dataset. Additionally, we provide some baseline classification results. The best balanced accuracy across a 5-fold cross-validation, 81.3%, was achieved with a ResNet18. The dataset can be downloaded from https://zenodo.org/records/14276415.
[356] Woosh: A Sound Effects Foundation Model
Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrichi, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
Main category: cs.SD
Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
cs.LG
[357] A Multimodal and Explainable Machine Learning Approach to Diagnosing Multi-Class Ejection Fraction from Electrocardiograms
Catherine Ning, Yu Ma, Cindy Beini Wang, Sean McMahon, Joseph Radojevic, Steven Zweibel, Dimitris Bertsimas
Main category: cs.LG
Abstract: Left ventricular ejection fraction (LVEF) assessment depends on echocardiography, limiting access in primary care and resource-constrained settings. We developed a multimodal machine-learning framework that combines engineered 12-lead ECG timeseries features with structured EHR variables to classify LVEF into four clinically used strata: normal (>50%), mildly reduced (40-50%), moderately reduced (30-40%), and severely reduced (<30%). To support model explainability, we identified the most influential ECG and EHR features via SHAP attributions. Using retrospective data from Hartford HealthCare, we trained XGBoost models on 36,784 ECG-echocardiogram pairs from 30,952 outpatients and evaluated temporal generalizability on 19,966 ECGs from a subsequent period. The multimodal model achieved one-vs-rest AUROCs of 0.95 (severe), 0.92 (moderate), 0.82 (mild), and 0.91 (normal), outperforming ECG-only and EHR-only baselines, and maintained performance under temporal validation. This work supports ECG-based, multimodal LVEF stratification as a practical screening and triage aid to prioritize confirmatory imaging where resources are limited.
[358] A Randomized PDE Energy driven Iterative Framework for Efficient and Stable PDE Solutions
Yi Bing, Zheng Ran, Fu Jinyang, Liu Long, Peng Xiang
Main category: cs.LG
Abstract: Efficient and stable solution of partial differential equations (PDEs) is central to scientific and engineering applications, yet existing numerical solvers rely heavily on matrix-based discretizations, while learning-based methods require costly training and often suffer from limited generalization. In this work, we propose a PDE energy-driven framework that solves PDEs through physically constrained diffusion iterations, without relying on classical matrix-based finite element assembly or data-driven neural network training. The proposed method evolves arbitrary random initial fields through PDE energy-driven implicit iterations combined with Gaussian smoothing, while strictly enforcing boundary conditions at each iteration. The proposed formulation is applied to representative one-dimensional Poisson, Heat, and viscous Burgers equations, covering both steady-state and transient problems. Numerical results demonstrate stable convergence to the unique physical solution from random initializations, with accurate resolution of sharp gradients and controlled Mean Squared Error (MSE) across a wide range of discretization parameters. Detailed comparisons with analytical solutions indicate that the framework achieves competitive accuracy and stability. Overall, the proposed framework provides a fast, flexible, and physically consistent alternative to traditional numerical solvers, offering a potential pathway for scalable PDE solutions in both research and engineering applications.
[359] A Survey of Multi-Agent Deep Reinforcement Learning with Graph Neural Network-Based Communication
Valentin Cuzin-Rambaud, Laetitia Matignon, Maxime Morge
Main category: cs.LG
Abstract: In multi-agent reinforcement learning (MARL), the integration of a communication mechanism allows agents to better learn to coordinate their actions and converge on their objectives by sharing information. Based on an interaction graph, a subclass of methods employs graph neural networks (GNNs) to learn the communication, enabling agents to improve their internal representations by enriching them with the information exchanged. With growing research, we note a lack of explicit structure and framework to distinguish and classify MARL approaches with communication based on GNNs. Thus, this paper surveys recent works in this field. We propose a generalized GNN-based communication process with the goal of making the underlying concepts behind the methods more obvious and accessible.
[360] Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
Jiaming Yang, Chenwei Tang, Liangli Zhen, Jiancheng Lv
Main category: cs.LG
Abstract: Key-value (KV) caching is essential for large language model inference, yet its memory overhead poses a critical bottleneck for long-context generation. Existing eviction policies predominantly rely on empirical heuristics, lacking a rigorous theoretical foundation. This work rethinks KV cache eviction through the lens of the Information Bottleneck principle. Under a linear-Gaussian surrogate of attention, we derive a closed-form mutual information objective that characterizes the effective information capacity of a retained KV cache subset. This formulation reveals that a wide range of existing eviction strategies can be interpreted as different approximations of the same capacity-maximization principle. Guided by this insight, we introduce CapKV, a capacity-aware eviction method that directly targets information preservation via a log-determinant approximation using statistical leverage scores. This approach replaces heuristic selection with a theoretically grounded mechanism that preserves the maximum predictive signal. Extensive experiments across multiple models and long-context benchmarks show that CapKV consistently outperforms prior methods, achieving a better trade-off between memory efficiency and generational fidelity.
[361] Mini-Batch Class Composition Bias in Link Prediction
Kieran Maguire, Srinandan Dasmahapatra
Main category: cs.LG
Abstract: Prior work on node classification has shown that Graph Neural Networks (GNNs) can learn representations that transfer across graphs, when underlying graph properties are shared. For a fixed graph, one would then expect GNNs trained for link prediction to learn a representation consistent with that learnt for node classification. We show this intuition does not hold in the general case. Instead, we find popular link prediction models can learn a trivial mini-batch dependent heuristic, enabled by batch-normalisation layers, to solve the edge classification task. When correcting for this, we observe increased alignment of the network representation with node-class relevant features, suggesting the network has learnt a graph representation that better aligns with the underlying graph’s properties. Our findings suggest that standard link prediction training may be leading us to overestimate link predictors’ ability to learn a generalised representation of a graph that is consistent across tasks.
[362] Open Problems in Frontier AI Risk Management
Marta Ziosi, Miro Plueckebaum, Stephen Casper, Henry Papadatos, Ze Shen Chin, Peter Slattery, James Gealy, Tim G. J. Rudner, Brian Tse, Ariel Gil, Patricia Paskov, Maximilian Negele, Rokas Gipiškis, Nada Madkour, Vera Lummis, Rupal Jain, Luise Eder, Kristina Fort, Malou C. van Draanen Glismann, Inès Belhadj, Amin Oueslati, Anna K. Wisakanto, Richard Mallah, Koen Holtman, Ranj Zuhdi, Daniel S. Schiff, Jessica Newman, Malcolm Murray, Robert Trager
Main category: cs.LG
Abstract: Frontier AI both amplifies existing risks and introduces qualitatively novel challenges. Not only is there a notable lack of stable scientific consensus resulting from the rapid pace of technological change, but emerging frontier AI safety practices are often misaligned with, or may undermine, established risk management frameworks. To address these challenges, we systematically surface open problems in frontier AI risk management. Adopting a problem-oriented approach, we examine each stage of the risk management process - risk planning, identification, analysis, evaluation, and mitigation - through a structured review of the literature, identifying unresolved challenges and the actors best positioned to address them. Recognising that different types of open problems call for different responses, we classify open problems according to whether they reflect (a) a lack of scientific or technical consensus, (b) misalignment with, or challenges to, established risk management frameworks, or (c) shortcomings in implementation despite apparent consensus and alignment. By mapping these open problems and identifying the actors best positioned to address them - including developers, deployers, regulators, standards bodies, researchers, and third-party evaluators - this work aims to clarify where progress is needed to enable robust and meaningful consensus on frontier AI risk management. The paper does not propose specific solutions; instead, it provides a problem-oriented, agenda-setting reference document, complemented by a living online repository, intended to support coordination, reduce duplication, and guide future research and governance efforts.
[363] Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts
Taylor Maxson, Roberto Corizzo, Yaning Wu, Nathalie Japkowicz, Colin Bellinger
Main category: cs.LG
Abstract: Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels can mitigate this bias; however, such labels are rarely available at test time. We introduce a practical utility-weighted evaluation that replaces unavailable subconcept labels with predicted posterior probabilities from a multiclass subconcept model. Evaluation weights are defined as the expected utility under this posterior, yielding a soft, uncertainty-aware metric we call predicted-weighted balanced accuracy (pBA). Experiments on tabular benchmarks as well as medical-imaging and text datasets show that unweighted scores can be misleading under within-class heterogeneity, while pBA provides more stable and interpretable assessments when subconcept distributions are uneven but not pathological. Our code is available at: https://anonymous.4open.science/r/correcting-bias-imbalance-9C6C/.
[364] RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
Vyom Sharma, Debajyoti Datta
Main category: cs.LG
Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
[365] Observable Neural ODEs for Identifiable Causal Forecasting in Continuous Time
Jennifer Wendland, Nicolas Freitag, Maik Kschischo
Main category: cs.LG
Abstract: Causal inference in continuous-time sequential decision problems is challenged by hidden confounders. We show that, in latent state-space models with time-varying interventions, observability of the latent dynamics from observed data is necessary for identifying dynamic treatment effects, linking control-theoretic observability to causal identifiability, even when hidden confounders affect both treatments and outcomes. We derive a continuous-time adjustment formula expressing potential outcome distributions under treatment trajectories via the measurement model, latent dynamics, and the filtering distribution over latent states given observed histories. We propose Observable Neural ODEs (ObsNODEs), Neural ODE models in observable normal form for causal forecasting. ObsNODEs learn continuous-time dynamics with states reconstructible from observations, enabling outcome prediction under alternative treatment paths. Experiments on synthetic cancer data, semi-synthetic data based on MIMIC-IV, and real-world sepsis data show strong performance over recent sequence models.
[366] Privacy-Preserving Federated Learning Framework for Distributed Chemical Process Optimization
Teetat Pipattaratonchai, Aueaphum Aueawatthanaphisut
Main category: cs.LG
Abstract: Industrial chemical plants often operate under strict data confidentiality constraints, making centralized data-driven process modeling difficult. Federated learning (FL) provides a promising solution by enabling collaborative model training across distributed facilities without sharing raw operational data. This paper proposes a privacy-preserving federated learning framework for distributed chemical process optimization using data collected from multiple geographically separated plants. Each plant locally trains a neural-network-based process model using its own time-series sensor data, while only model parameters are transmitted to a central aggregation server through secure aggregation mechanisms. This design allows cross-plant knowledge sharing while maintaining strict data locality and industrial confidentiality. Experimental evaluation was conducted using process datasets from three independent chemical plants operating under heterogeneous conditions. The results demonstrate rapid convergence of the federated model, with the global mean squared error decreasing from approximately 2369 to below 50 within the first five communication rounds and stabilizing around 35 after 40 rounds. In comparison with local-only training, the proposed federated framework significantly improves prediction accuracy across all plants, while achieving performance comparable to centralized training. The findings indicate that federated learning provides an effective and scalable solution for collaborative industrial analytics, enabling privacy-preserving predictive modeling and process optimization across distributed chemical production facilities.
[367] PPG-Based Affect Recognition with Long-Range Deep Models: A Measurement-Driven Comparison of CNN, Transformer, and Mamba Architectures
Karim Alghoul, Hussein Al Osman, Abdulmotaleb El Saddik
Main category: cs.LG
Abstract: Photoplethysmography (PPG) is increasingly used in wearable affective computing due to its low cost and ease of integration into consumer devices. Recent advances in deep learning have introduced long-range sequence models, such as Transformers, and state-space models, like Mamba, which have demonstrated strong performance on natural language and general time-series tasks. However, it remains unclear whether these architectures offer tangible benefits over widely used Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTMs) for PPG-based affect recognition, given that datasets are typically small and noisy. This work presents a measurement-driven comparison of four deep learning architectures, CNN, CNN-LSTM hybrid, Transformers, and Mamba, for classifying arousal, valence, and relaxation states from wrist-based PPG signals. All models are evaluated under a subject-independent 5-fold cross-validation protocol using identical preprocessing, segmentation, and training pipelines. Our results show that the Transformer and Mamba models achieve performance comparable to that of a CNN baseline, but do not consistently outperform it across all tasks. CNNs remain the most effective overall, providing the highest accuracy with the smallest model size, whereas Transformers have a better balance of F1 scores for Arousal and Relaxation. The study provides the first evaluation of Transformer and Mamba models for PPG-based affect recognition, offering practical guidance on model selection for wearable affective monitoring systems.
[368] Momentum-Conserving Graph Neural Networks for Deformable Objects
Jiahong Wang, Logan Numerow, Stelian Coros, Christian Theobalt, Vahid Babaei, Bernhard Thomaszewski
Main category: cs.LG
Abstract: Graph neural networks (GNNs) have emerged as a versatile and efficient option for modeling the dynamic behavior of deformable materials. While GNNs generalize readily to arbitrary shapes, mesh topologies, and material parameters, existing architectures struggle to correctly predict the temporal evolution of key physical quantities such as linear and angular momentum. In this work, we propose MomentumGNN – a novel architecture designed to accurately track momentum by construction. Unlike existing GNNs that output unconstrained nodal accelerations, our model predicts per-edge stretching and bending impulses which guarantee the preservation of linear and angular momentum. We train our network in an unsupervised fashion using a physics-based loss, and we show that our method outperforms baselines in a number of common scenarios where momentum plays a pivotal role.
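The conservation argument behind per-edge impulses can be checked numerically. The sketch below is not the paper's architecture: it assumes only central (stretching-like) impulses, applied with equal magnitude and opposite sign to the two endpoints of each edge along the edge direction, and omits bending impulses. The helper names (`edge_impulses`, `total_linear_and_angular`) are hypothetical.

```python
import math

def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def scale(a, s):
    return (a[0] * s, a[1] * s, a[2] * s)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def edge_impulses(positions, edges, magnitudes):
    """Turn one scalar per edge into per-node impulses.

    Assumption: node i receives +m*d and node j receives -m*d, where d is
    the unit vector from i to j. In a learned model the scalars would come
    from the network; here they are plain inputs.
    """
    impulses = [(0.0, 0.0, 0.0) for _ in positions]
    for (i, j), m in zip(edges, magnitudes):
        d = sub(positions[j], positions[i])
        d = scale(d, 1.0 / math.sqrt(sum(c * c for c in d)))
        impulses[i] = add(impulses[i], scale(d, m))
        impulses[j] = add(impulses[j], scale(d, -m))
    return impulses

def total_linear_and_angular(positions, impulses):
    """Net change in linear and angular momentum (about the origin)."""
    lin, ang = (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)
    for x, p in zip(positions, impulses):
        lin = add(lin, p)
        ang = add(ang, cross(x, p))
    return lin, ang
```

Whatever scalars are supplied, the per-edge pairs cancel in the linear sum, and their torques cancel because each impulse is parallel to the edge, so both totals vanish up to floating-point error. This is the sense in which a constraint can hold "by construction" rather than being learned.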
[369] reward-lens: A Mechanistic Interpretability Library for Reward Models
Mohammed Suhail B Nadaf
Main category: cs.LG
Abstract: Every RLHF-trained language model is shaped by a reward model, yet the mechanistic interpretability toolkit – logit lens, direct logit attribution, activation patching, sparse autoencoders – was built for generative LLMs whose primitives all project onto a vocabulary unembedding. Reward models replace that with a scalar regression head, breaking each tool. We present reward-lens, an open-source library that ports this toolkit to reward models, organised around one observation: the reward head’s weight vector $w_r$ is the natural axis for every interpretability question. The library provides a Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions (distortion index, divergence-aware patching, misalignment cascade detection, reward-term conflict analysis, concept-vector analysis). A ten-method adapter protocol covers Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads, with a generic adapter for any HuggingFace sequence classification model. We validate on two production reward models across ~695 RewardBench pairs. The central empirical finding is negative: linear attribution does not predict causal patching effects (mean Spearman $\rho = -0.256$ on Skywork, $-0.027$ on ArmoRM). The framework treats this disagreement as a property to expose, not a bug – motivating a design that keeps observational and causal views first-class and directly comparable.
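The idea of using $w_r$ as the projection axis can be illustrated without the library itself. The following sketch does not use the reward-lens API (which is not described in the abstract); the function names are hypothetical, hidden states are plain lists of floats, and any final normalisation layer before the reward head is ignored for simplicity.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reward_lens(hidden_states, w_r, bias=0.0):
    """Project each layer's residual-stream vector onto the reward direction.

    This mirrors the logit lens, with the vocabulary unembedding replaced
    by the scalar reward head (an assumption about how the port works).
    """
    return [dot(h, w_r) + bias for h in hidden_states]

def component_attribution(component_outputs, w_r):
    """Linear attribution: each component's additive contribution w_r . c."""
    return {name: dot(vec, w_r) for name, vec in component_outputs.items()}

# Toy example with made-up numbers: three components writing to the residual stream.
w_r = [1.0, -2.0, 0.5]
components = {
    "embed": [1.0, 0.0, 0.0],
    "attn0": [0.0, 1.0, 2.0],
    "mlp0": [2.0, 0.0, -2.0],
}
final_residual = [sum(vals) for vals in zip(*components.values())]
per_component = component_attribution(components, w_r)
final_reward = reward_lens([final_residual], w_r)[0]
```

Because the head is linear, the per-component values sum to the final projected reward. That additivity is purely observational, which is consistent with the abstract's point that such attributions need not predict the causal effect of patching a component.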
[370] Spatially-constrained clustering of geospatial features for heat vulnerability assessment of favelas in Rio de Janeiro
Baptiste Clemence, Thomas Hallopeau, Vanderlei Pascoal De Matos, Laurent Demagistri, Joris Guerin
Main category: cs.LG
Abstract: Informal settlements face disproportionate exposure to climate-related health hazards. However, existing methodologies lack systematic approaches to link diverse settlement characteristics with environmental health outcomes. We develop a data-driven framework to assess heat vulnerability in Rio de Janeiro’s favelas by combining spatially-constrained clustering with land surface temperature (LST) analysis. Using remote sensing and geospatial features, we identify two distinct favela typologies: recent, well-connected settlements on flat terrain (Cluster 0) and historical, poorly-connected communities on vegetated slopes (Cluster 1). Analysis of 16 extreme heat events reveals systematic temperature differences of 2–3$^\circ$C between clusters, with flat-terrain favelas experiencing significantly higher heat exposure. Our findings demonstrate that settlement morphology critically influences heat vulnerability, providing a replicable framework for targeted urban planning and public health interventions in informal settlements globally.
[371] Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making
Abhirami Pillai
Main category: cs.LG
Abstract: Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline - first collect historical data to estimate heterogeneous treatment effects (HTE), then solve a constrained optimization to allocate the budget. This works well with abundant data, but fails in cold-start settings such as new campaigns, new markets, or new customer segments where little historical data exists. We propose Budget-Constrained Causal Bandits (BCCB), an online framework that learns which users respond to ads while simultaneously spending the budget, making treatment decisions one user at a time. BCCB unifies three components into a single sequential process: learning individual-level ad effectiveness, exploring users whose response is uncertain, and pacing the budget over time. We evaluated on the Criteo Uplift dataset, a large-scale advertising dataset from a real randomized controlled trial. Our key finding is a data-efficiency crossover: offline methods require approximately 10,000 historical observations to produce reliable results, while BCCB operates effectively from the very first user. Furthermore, BCCB exhibits 3-5x lower performance variance between runs, making it more practical for real campaign planning. Among purely online methods, BCCB consistently outperforms standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation across all budget levels tested.
[372] Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
Wenshuo Zhao, Qi Zhu, Xingshan Zeng, Fei Mi, Lifeng Shang, Yiren Feng
Main category: cs.LG
Abstract: An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computation overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy with naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.
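The definitions in this abstract can be made concrete with a small sketch. It is not the authors' implementation (see their repository for that); several details are assumptions chosen for illustration: the entropy threshold `tau`, the number `k` of consecutive low-entropy tokens that closes a phase, weighting each HEP by its summed entropy, and measuring position as the phase midpoint normalised to [0, 1].

```python
def find_heps(entropies, tau=1.0, k=2):
    """Return (start, end) spans, end exclusive, of High Entropy Phases.

    A phase opens at a token with entropy >= tau and closes once k
    consecutive tokens fall below tau (both values are assumptions).
    """
    heps, start, low_run = [], None, 0
    for i, h in enumerate(entropies):
        if start is None:
            if h >= tau:
                start, low_run = i, 0
        elif h < tau:
            low_run += 1
            if low_run == k:
                heps.append((start, i - k + 1))  # drop the closing low run
                start = None
        else:
            low_run = 0
    if start is not None:  # phase still open at the end of the sequence
        heps.append((start, len(entropies) - low_run))
    return heps

def entropy_centroid(entropies, tau=1.0, k=2):
    """Entropy-weighted mean position of all HEPs, in [0, 1]."""
    heps = find_heps(entropies, tau, k)
    if not heps:
        return 0.0  # assumption: no uncertain phase counts as the lowest centroid
    span = max(len(entropies) - 1, 1)
    mass = [sum(entropies[s:e]) for s, e in heps]
    mids = [(s + e - 1) / 2 / span for s, e in heps]
    return sum(m * p for m, p in zip(mass, mids)) / sum(mass)

def select_lowest_centroid(candidates, tau=1.0, k=2):
    """Index of the candidate whose per-token entropies give the lowest centroid."""
    scores = [entropy_centroid(c, tau, k) for c in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)
```

With per-token entropies for several sampled responses, `select_lowest_centroid` prefers the response whose uncertainty is concentrated early, matching the intuition of early exploration followed by confident generation.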
[373] SWAN: World-Aware Adaptive Multimodal Networks for Runtime Variations
Jason Wu, Shir-Kang Scott Jinn, Yuyang Yuan, Maggie Wigness, Lance M. Kaplan, Hang Qiu, Mani Srivastava
Main category: cs.LG
Abstract: Multimodal deep neural networks deployed in realistic environments must contend with runtime variations: changes in modality quality, overall input complexity, and available platform resources. Current networks struggle with such fluctuations – adaptive networks cannot adhere to a strict compute budget, controller-based networks neglect to consider input complexity, and statically provisioned networks fail at all the above. Consequently, they do not extract maximum utility from the expended computational resources. We present SWAN (Sample and World-Aware Multimodal Network), the first adaptive multimodal network that accomplishes all three goals. SWAN employs a quality-aware controller to assign resources among modalities according to a variable user-specified maximum budget. Within this budget, an adaptive gating module further optimizes efficiency by scaling layer utilization according to sample complexity. For further gains, SWAN also employs a token dropping module that masks semantically irrelevant multimodal features before performing detections. We evaluate SWAN in the domain of autonomous driving with complex multi-object 3D detection, reducing FLOPs by up to 49% with minimal degradation.
[374] Efficient and Interpretable Transformer for Counterfactual Fairness
Panyi Dong, Zhiyu Quan
Main category: cs.LG
Abstract: The growing reliance on machine learning models in high-stakes, highly regulated domains such as finance and insurance has created a tension between predictive performance, interpretability, and regulatory fairness requirements. In these settings, models are expected not only to deliver reliable predictions but also to provide transparent decision rationales and comply with strict fairness requirements. Attention-based transformers offer powerful mechanisms for modeling complex data relationships as demonstrated in various language tasks, yet their attention mechanisms alone do not ensure counterfactually fair predictions, even when combined with fairness-aware techniques. To address these limitations, we propose the Feature Correlation Transformer (FCorrTransformer), an attention-light architecture tailored for tabular data. In this design, the attention matrix admits a direct statistical interpretation as pairwise feature dependencies, enhancing both interpretability and efficiency. Leveraging this structure, we introduce Counterfactual Attention Regularization (CAR), a framework that enforces group-invariant fair representations of sensitive features at the attention level, promoting counterfactually fair predictions without relying on explicit causal assumptions. Empirical evaluations on imbalanced classification and regression benchmarks demonstrate that FCorrTransformer combined with CAR achieves strong counterfactual fairness while maintaining competitive predictive performance and substantially reducing model complexity compared with standard transformer-based baselines. Overall, this work bridges a critical gap between fairness theory and machine learning models, offering a practical framework for responsible AI in regulatory-sensitive domains.
[375] Unsupervised Graph Modeling for Anomaly Detection in Accounting Subject Relationships
Yuhan Wang, Ruobing Yan, Zhe Su, Hejing Chen, Ningjing Sang, Yunfei Nie
Main category: cs.LG
Abstract: This paper addresses the problem of anomaly detection in accounting subject association structures, proposing a structured modeling and unsupervised discriminant framework based on graph neural networks. This framework is used to mine stable correspondences between subjects and identify structural deviations from general ledger details and voucher entries. The method first abstracts accounting subjects as graph nodes, and the co-occurrence and debit/credit correspondence of subjects in the same business record are abstracted as weighted edges. The edge weights are characterized by statistical measures such as co-occurrence frequency or amount aggregation, thus forming a period-level accounting subject association graph. In the representation learning stage, a message passing mechanism is used to fuse the node’s own attributes and neighborhood context to obtain node embeddings containing structural information. In the anomaly detection stage, the rationality of subject pair connections is estimated through a relation reconstruction decoder, and edge-level anomaly scores are defined based on the degree of deviation in reconstruction probabilities. These scores are then aggregated to obtain node-level risk ranking and local anomaly localization. This framework can simultaneously capture local substructure anomalies and cross-community anomaly connections without relying on anomaly labeling, outputting traceable subject pair risk clues. Comparative experiments demonstrate more stable comprehensive discriminant capabilities and higher top-ranking accuracy.
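The graph construction step in the abstract above (accounts as nodes, debit/credit co-occurrence as weighted edges, edge scores aggregated to node-level risk) can be illustrated with a small sketch. The voucher layout (`debit` and `credit` lists) is a hypothetical input format, and the frequency-based score is only a stand-in for the paper's GNN reconstruction decoder, which is not reproduced here.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (debit account, credit account)


def build_account_graph(vouchers: List[dict]) -> Counter:
    """Count debit/credit co-occurrences per voucher as edge weights."""
    weights: Counter = Counter()
    for voucher in vouchers:
        for debit, credit in product(voucher["debit"], voucher["credit"]):
            weights[(debit, credit)] += 1
    return weights


def edge_anomaly_scores(weights: Counter) -> Dict[Edge, float]:
    """Score each edge by how rare it is relative to the most frequent edge.

    Stand-in for a learned reconstruction probability: 0.0 for the most
    common pairing, approaching 1.0 for rarely seen pairings.
    """
    max_weight = max(weights.values())
    return {edge: 1.0 - w / max_weight for edge, w in weights.items()}


def node_risk(scores: Dict[Edge, float]) -> Dict[str, float]:
    """Aggregate edge scores to nodes by taking the maximum incident score."""
    risk: Dict[str, float] = {}
    for (debit, credit), score in scores.items():
        for node in (debit, credit):
            risk[node] = max(risk.get(node, 0.0), score)
    return risk
```

Sorting the output of `node_risk` in descending order gives a simple risk ranking, with the highest-scoring incident edge serving as the traceable account-pair clue.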
[376] DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
Tianhao Hu, Xiangcheng Liu, Youshao Xiao, Yang Zheng, Xuan Huang, Jinrui Ding, Yufei Zhang, Tao Liang, Hongyu Zang, Quan Chen, Yueqing Sun, Wenjie Shi, Chao Zhang, Wei Wang, Qi Gu, Yerui Sun, Yucheng Xie, Xunliang Cai
Main category: cs.LG
Abstract: Reinforcement learning (RL) has become a critical paradigm for LLM post-training, yet the rollout phase – accounting for 50–80% of total step time – is bottlenecked by skewed generation: long-tailed trajectories indispensable for model performance block the entire training pipeline. Asynchronous training offers a natural remedy by overlapping generation with training, but introduces a fundamental tension between efficiency and algorithmic correctness. We identify three constraints in asynchronous training to preserve convergence: intra-trajectory policy consistency, data integrity, and bounded staleness. Existing approaches fail to intrinsically address the long-tailed trajectory problem, which is further exacerbated by the imbalance characteristic of Mixture-of-Experts models, or deviate from the standard RL training formulation, thereby hindering model convergence. Therefore, we propose DORA (Dynamic ORchestration for Asynchronous Rollout), which addresses this challenge through algorithm-system co-design. DORA introduces multi-version streaming rollout, a novel asynchronous paradigm that maintains multiple policy versions concurrently – simultaneously achieving full bubble elimination without compromising algorithmic constraints. Experimental results demonstrate that our DORA system achieves substantial improvements in throughput – up to 2–3 times higher than state-of-the-art systems on open-source benchmarks – without compromising convergence. Furthermore, in large-scale industrial applications with tens of thousands of accelerators, DORA accelerates RL training by 2–4 times compared to synchronous training across various scenarios. The resultant open-source models, LongCat-Flash-Thinking, exhibit competitive performance on complex reasoning benchmarks, matching the capability of most advanced LLMs.
[377] NeuroPlastic: A Plasticity-Modulated Optimizer for Biologically Inspired Learning Dynamics
Douglas Jiang, Yuechen Wang, Jiayi Wang, Jiaying Geng, Qinglong Wang, Feng Tian
Main category: cs.LG
Abstract: Optimization algorithms are fundamental to modern deep learning, yet most widely used methods rely on update rules based primarily on local gradient statistics. We introduce NeuroPlastic, a plasticity-modulated optimizer that augments gradient-based updates with an adaptive multi-signal modulation mechanism inspired by multi-factor synaptic plasticity, a concept from neurobiology. NeuroPlastic dynamically scales gradient updates using interacting components that capture gradient, activity-like, and memory-like statistics, forming a lightweight modulation layer compatible with standard deep learning training pipelines. Across image classification benchmarks, NeuroPlastic consistently improves over a controlled gradient-only ablation, with more pronounced gains on the Fashion-MNIST benchmark and in reduced-data regimes. In transfer experiments on CIFAR-10 with ResNet-18, the method remains stable and competitive without retuning. These results suggest that multi-signal plasticity-inspired modulation can provide a useful extension to conventional gradient-driven optimization, particularly when learning signals are limited or noisy, and offer a promising direction for gradient-based methods in deep learning.
[378] Cheeger–Hodge Contrastive Learning for Structurally Robust Graph Representation Learning
Mengyang Zhao, Longlong Li, Cunquan Qu
Main category: cs.LG
Abstract: Graph Contrastive Learning (GCL) has emerged as a prominent framework for unsupervised graph representation learning. However, relying on augmentation design alone to define the invariances learned by GCL can be brittle under structural perturbations. To address this issue, we propose Cheeger–Hodge Contrastive Learning (CHCL), a framework that aligns a perturbation-stable Cheeger–Hodge joint signature across augmented views for robust graph representation learning. The proposed signature combines a Cheeger-inspired connectivity signature derived from the algebraic connectivity (λ_2) with the low-frequency spectrum of the 1-Hodge Laplacian, thereby capturing both global connectivity and higher-order structural information. By aligning encoder representations with the proposed Cheeger–Hodge joint signature across augmented views, CHCL learns graph embeddings that are robust to local structural perturbations. Extensive experiments on standard benchmarks and transfer settings demonstrate that CHCL consistently improves performance, robustness, and generalization.
[379] Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Bolian Li, Yifan Wang, Yi Ding, Anamika Lochab, Ananth Grama, Ruqi Zhang
Main category: cs.LG
Abstract: Reinforcement learning (RL) has unlocked complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing further gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts have tried to prevent entropy collapse through regularization or clipping, but their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes any user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions, which explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, where we find that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
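As a small illustration of the linear annealing schedule mentioned in the abstract above, the target entropy at each training step could be computed as below. The function name, the start and end values, and the clamping behaviour are assumptions for illustration only; the rejection-sampling mechanism that Entrocraft uses to track such a target is not described in enough detail here to sketch.

```python
def linear_entropy_target(step: int, total_steps: int,
                          start: float, end: float) -> float:
    """Linearly anneal the target entropy from `start` to `end`.

    Steps outside [0, total_steps] are clamped to the nearest endpoint.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction
```

For example, `linear_entropy_target(step, 1000, start=0.6, end=0.4)` would start at 0.6 and decay to a slightly lower target of 0.4 over 1000 steps; both values are placeholders.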
[380] AlphaJet: Automated Conceptual Aircraft Synthesis via Disentangled Generative Priors and Topology-Preserving Evolutionary Search
Boris Kriuk
Main category: cs.LG
Abstract: Conceptual aircraft design is traditionally an expert-mediated iterative process in which a human designer proposes a configuration, runs low-order physics, inspects the result, and re-proposes. We present AlphaJet, an end-to-end automated synthesis pipeline that closes this loop. From a textual mission specification (mass, range, cruise speed, hard size envelope, engine count, areal density) AlphaJet evolves a feasible 3D aircraft in real time, scored by a transparent multi-disciplinary fitness function covering aerodynamics, structures, weights, stability, packaging, and geometric mount consistency. Three contributions distinguish our approach: (i) an Anatomically-Disentangled Variational Autoencoder (AD-VAE) whose first 25 latent dimensions are supervised to align with named anatomical parameters, providing an interpretable shape prior; (ii) a topology-elitist genetic algorithm that protects the best individual from each of five tail topologies and triggers stagnation restarts, preventing premature collapse to a single configuration; and (iii) mount-aware geometric scoring that computes signed penetration between engines and other structural parts, eliminating the redundant artifacts common in generative aircraft models. The full loop runs interactively on a CPU and streams every generation to a browser viewer, making it a practical real-world automation tool for early-phase design-space exploration.
[381] Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning
Weihang Li, Jianchun Liu, Hongli Xu
Main category: cs.LG
Abstract: LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE). However, existing LoRA-MoE frameworks typically adopt a fixed and uniform expert configuration across heterogeneous Transformer modules (e.g., attention query/key projections and MLP gating networks), ignoring their distinct functional roles and capacity requirements. This design leads to localized over-provisioning, redundant trainable parameters, and unnecessary optimizer-state overhead. Moreover, prior methods enforce load balancing among experts throughout training. Although beneficial in the early stage, this constraint becomes restrictive once routing patterns stabilize, limiting expert specialization on downstream tasks. In this paper, we propose DMEP, a novel LoRA-MoE fine-tuning framework based on Dynamic Module-wise Expert Pruning. DMEP tracks expert utilization during training and physically removes low-utility experts on a per-module basis, yielding a more compact expert structure tailored to different modules. The pruned model then continues training without the load-balancing constraint, freeing the remaining experts to focus entirely on the downstream task and develop specialized expertise. By jointly adapting module-wise expert capacity and eliminating unnecessary balancing, DMEP improves both parameter efficiency and training efficiency. Extensive experiments on multiple reasoning benchmarks show that DMEP reduces trainable parameters by 35%–43% and improves training throughput by about 10%, while maintaining or surpassing the downstream reasoning accuracy of uniform LoRA-MoE baselines.
[382] Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
Disha Singha
Main category: cs.LG
Abstract: Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives – especially those derived from human preferences – are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and caution. Empirical results across multiple discrete grid configurations (6x6, 8x8, 10x10) and high-dimensional continuous control environments (Hopper-v4, Walker2d-v4) demonstrate that our approach yields more stable training dynamics and reduces exploitative behaviors under reward ambiguity, achieving a 93.7% reduction in reward-hacking behavior as measured by trap visitation frequency. We demonstrate statistical significance of these improvements and robustness under up to 30% supervisory noise, albeit with a trade-off in peak observed reward compared to unconstrained baselines. By treating uncertainty as a first-class component of the reward signal, this work offers a principled approach toward more reliable and aligned reinforcement learning systems.
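The two uncertainty sources named in the abstract above (ensemble disagreement over value predictions and variability in reward annotations) can be combined in many ways. The sketch below uses a simple subtractive penalty, which is an assumption made here and not the paper's Reliability Filter; the coefficients `alpha` and `beta` and the candidate layout are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def cautious_value(value_preds: List[float], reward_labels: List[float],
                   alpha: float = 1.0, beta: float = 1.0) -> float:
    """Penalise the mean ensemble value by both uncertainty estimates."""
    epistemic = pstdev(value_preds)     # disagreement across the value ensemble
    preference = pstdev(reward_labels)  # spread of human reward annotations
    return mean(value_preds) - alpha * epistemic - beta * preference


def select_action(candidates: Dict[str, Tuple[List[float], List[float]]]) -> str:
    """Pick the action with the highest uncertainty-penalised value."""
    return max(candidates, key=lambda a: cautious_value(*candidates[a]))
```

With this rule, an action whose high mean value rests on disagreeing ensemble members or conflicting annotations can lose to a lower-valued but well-supported action, which matches the trade-off in peak reward that the abstract reports.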
[383] CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
Zhe Ding, Su Pan, Duowei Pan
Main category: cs.LG
Abstract: Post-training quantization (PTQ) has become an important technique for reducing the inference cost of Large Language Models (LLMs). While recent mixed-precision methods improve ultra-low bit quantization by preserving critical subspaces in high precision, they typically construct these subspaces relying solely on activation statistics. This ignores the fundamental nature of linear operations, where the output perturbation is jointly driven by both activation and weight quantization noise. In this paper, we propose CoQuant, a joint weight-activation subspace projection method. By theoretically modeling the expected output error, CoQuant formulates a closed-form weighted PCA solution that balances activation and weight covariances to select the optimal high-precision subspace. Extensive experiments on Llama-3.2 and Qwen2.5 models show that CoQuant consistently outperforms strong PTQ baselines in both WikiText perplexity and zero-shot common-sense reasoning accuracy. These results demonstrate that joint weight-activation subspace modeling provides a principled and effective direction for low-bit LLM quantization. The source code is available at https://github.com/Zachary5895/CoQuant.
[384] Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing
Mathieu Dario, Florent Chenevier, Kévin Delmas, Joris Guerin, Jérémie Guiochet
Main category: cs.LG
Abstract: Runtime monitoring is essential to ensure the safety of ML applications in safety-critical domains. However, current research is fragmented, with independent methods emerging from different communities. In this paper, we propose a unified framework categorising runtime monitoring approaches into three distinct types: Operational Design Domain (ODD) monitoring, which ensures compliance with expected operating conditions; Out-of-Distribution (OOD) monitoring, which rejects inputs that deviate from the training data; and Out-of-Model-Scope (OMS) monitoring, which detects anomalous model behaviour based on its internal states or outputs. We demonstrate the benefits of this categorisation with a dedicated experiment on an aeronautical safety-critical application: runway detection during landing. This framework facilitates the design of monitoring activities, with complementary categories of monitors, and enables evaluation and comparison of different monitors using common, safety-oriented metrics.
[385] STLGT: A Scalable Trace-Based Linear Graph Transformer for Tail Latency Prediction in Microservices
Yongliang Ding, Qigong Bi, Peng Pu
Main category: cs.LG
Abstract: Accurate end-to-end tail-latency forecasting is critical for proactive SLO management in microservice systems. However, modeling long-range dependency propagation and non-stationary, bursty workloads while maintaining inference efficiency at scale remains challenging. We present STLGT (Scalable Trace-based Linear Graph Transformer), a per-API predictor that encodes traces as span graphs for multi-step p95 tail-latency forecasting. STLGT uses a structure-aware linear graph Transformer to propagate cross-service dependencies with inference time linear in span graph size, and a decoupled temporal module to capture workload dynamics. Across a personalized education microservice application, DeathStarBench, and Alibaba traces, STLGT improves forecasting accuracy over PERT-GNN by 8.5% MAPE on average and achieves up to 12x faster CPU inference at N=32, matching the maximum span graph size after preprocessing the Alibaba traces. Ablation studies further demonstrate the effectiveness of each component, especially under bursty traffic.
[386] Layer-wise Lipschitz-Product Control for Deep Kolmogorov–Arnold Network Representations of Compositionally Structured Functions
Aleksander Tankman
Main category: cs.LG
Abstract: We prove that any continuous function f from [0,1]^n to R representable by a finite computation tree with N internal nodes and compositional sparsity s = O(1) admits a deep Kolmogorov-Arnold Network (KAN) representation. Each internal node is realised by a primitive KAN block with controlled block depth and Lipschitz product. The layer-wise Lipschitz product satisfies the primary domain-sensitive bound independent of the input dimension n. It simplifies to P(KAN_f) <= max(C*,1)^(L_f) with L_f <= c_max * N. For the standard operations {+, -, ×, sin, cos} with × nodes on [0,1]-bounded inputs we obtain P(KAN) <= 1. Layer widths satisfy n_l <= n + 2 w_max * N. The uniform approximation error is bounded by N * max(C*,1)^(d(f)) * epsilon_Op (simplifies when C* <= 1). For f in C^m we obtain optimal B-spline rates. Range bounds are also derived (B_f <= N+1 for additive trees). This addresses the gap on Lipschitz control in deep KAN stacks noted by Liu et al. (2024). Experiments confirm P(KAN)=1.0 for several compositionally structured functions.
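For readability, the bounds quoted in the abstract above can be typeset as follows. The symbols are taken from the abstract as written; the sup-norm notation for the uniform approximation error and the subscript $\ell$ for the layer index are notational assumptions made here.

```latex
\begin{align}
P(\mathrm{KAN}_f) &\le \max(C^*, 1)^{L_f}, & L_f &\le c_{\max} N, \\
n_\ell &\le n + 2\, w_{\max} N, \\
\lVert f - \mathrm{KAN}_f \rVert_\infty &\le N \max(C^*, 1)^{d(f)}\, \varepsilon_{\mathrm{Op}}.
\end{align}
```

When $C^* \le 1$ the factor $\max(C^*, 1)$ equals 1, so the first bound reduces to $P(\mathrm{KAN}_f) \le 1$ and the error bound to $N \varepsilon_{\mathrm{Op}}$.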
[387] Near-Optimal Cryptographic Hardness of Learning With Homogeneous Halfspaces Under Gaussian Marginals
Jizhou Huang, Brendan Juba
Main category: cs.LG
Abstract: We study three problems that involve identifying homogeneous halfspaces under Gaussian distributions: agnostic learning, one-sided reliable learning, and fairness auditing. In each of these problems, we are given labeled examples $(\mathbf{x}, \mathrm{y})$ drawn from an unknown distribution on $\mathbb{R}^d\times\{-1, +1\}$, whose marginal distribution on $\mathbf{x}$ is standard Gaussian and on $\mathrm{y}$ is arbitrary. The goal of each problem is to output a homogeneous halfspace that approaches the best-fitting homogeneous halfspace in terms of its corresponding loss measure. We prove near-optimal computational hardness results for these problems under the widely believed hardness assumption of the Learning With Errors (LWE) problem. Prior hardness results for these problems were mostly established for general halfspaces; our findings extend some of these hardness results to homogeneous halfspaces. Remarkably, our lower bound strictly generalizes over prior works and narrows the gap between the upper and lower bounds for agnostically learning homogeneous halfspaces under Gaussian marginals.
[388] Hierarchical adaptive control for real-time dynamic inference at the edge
Francesco Daghero, Mahyar Tourchi Moghaddam, Mikkel Baun Kjærgaard
Main category: cs.LG
Abstract: Industrial systems increasingly depend on Machine Learning (ML), and operate on heterogeneous nodes that must satisfy tight latency, energy, and memory constraints. Dynamic ML models, which reconfigure their computational footprint at runtime, promise high energy efficiency and lower average latency for modest accuracy tradeoffs; however, their deployment is complex due to the additional hyperparameters they rely on. These hyperparameters, which control the tradeoff between accuracy and average latency, are often tuned on a calibration dataset that must match the test-time distribution. This assumption rarely holds in real-world scenarios, leading to suboptimal operating conditions and possibly to performance below that of static models. We propose a two-tier adaptive architecture that co-optimizes model and system decisions. At the global level, a scheduler configures and deploys, for each edge node, a cascade of classifiers composed of lightweight specialized models and a generalist fallback, satisfying latency and memory constraints. At the node level, a local controller tracks data drifts and hardware resources, enabling or disabling specialized predictors (SP) to preserve high energy efficiency and avoid latency-constraint violations under varying conditions. This design allows longer operating times without forcing a global redeployment step, and enables efficient execution in case of an unreachable remote global controller. We evaluate the approach on two datasets under controlled distribution mismatch scenarios, showing average per-inference reductions of latency up to 2.45x and energy up to 2.86x, with less than 4% accuracy drop compared to static baselines. Our contributions are: (1) a budgeted SP-cascade formulation that preserves worst-case latency constraints; (2) a hierarchical controller that maintains efficiency under data and resource changes; and (3) an experimental evaluation on embedded hardware.
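The cascade described in the abstract above (specialized predictors tried first, a generalist fallback, and a worst-case latency budget) can be sketched as follows. All names, the confidence threshold, and the slowest-first disabling rule are hypothetical choices for illustration; the paper's scheduler and local controller are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Specialist:
    name: str
    latency_ms: float
    # Returns (label, confidence) for an input sample.
    predict: Callable[[object], Tuple[Optional[str], float]]
    enabled: bool = True


def worst_case_latency(specialists: List[Specialist],
                       fallback_latency_ms: float) -> float:
    """Latency if every enabled specialist runs and the fallback is still needed."""
    enabled = sum(s.latency_ms for s in specialists if s.enabled)
    return enabled + fallback_latency_ms


def fit_to_budget(specialists: List[Specialist], fallback_latency_ms: float,
                  budget_ms: float) -> None:
    """Disable the slowest specialists until the worst case fits the budget."""
    for s in sorted(specialists, key=lambda s: s.latency_ms, reverse=True):
        if worst_case_latency(specialists, fallback_latency_ms) <= budget_ms:
            break
        s.enabled = False


def cascade_predict(x: object, specialists: List[Specialist],
                    fallback: Callable[[object], str],
                    threshold: float = 0.9) -> Tuple[str, str]:
    """Return (label, model name) from the first confident enabled specialist."""
    for s in specialists:
        if not s.enabled:
            continue
        label, confidence = s.predict(x)
        if label is not None and confidence >= threshold:
            return label, s.name
    return fallback(x), "generalist"
```

A node-level controller in this sketch would call `fit_to_budget` whenever the latency budget or available resources change, and could flip `enabled` on individual specialists when drift makes them unreliable.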
[389] Understanding DNNs in Feature Interaction Models: A Dimensional Collapse Perspective
Jiancheng Wang, Mingjia Yin, Hao Wang, Enhong Chen
Main category: cs.LG
Abstract: DNNs have gained widespread adoption in feature interaction recommendation models. However, there has been a longstanding debate on their roles. On one hand, some works claim that DNNs possess the ability to implicitly capture high-order feature interactions. On the other hand, recent studies have highlighted the limitations of DNNs in effectively learning dot products, specifically second-order interactions, let alone higher-order interactions. In this paper, we present a novel perspective to understand the effectiveness of DNNs: their impact on the dimensional robustness of the representations. In particular, we conduct extensive experiments involving both parallel DNNs and stacked DNNs. Our evaluation encompasses an overall study of the complete DNN on two feature interaction models, alongside a fine-grained ablation analysis of components within DNNs. Experimental results demonstrate that both parallel and stacked DNNs can effectively mitigate the dimensional collapse of embeddings. Furthermore, a gradient-based theoretical analysis, supported by empirical evidence, uncovers the underlying mechanisms of dimensional collapse.
[390] Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
Jinjiang Guo
Main category: cs.LG
Abstract: The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task–molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.
[391] Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models
Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni
Main category: cs.LG
Abstract: Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces substantial latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.
[392] Lyapunov-Guided Self-Alignment: Test-Time Adaptation for Offline Safe Reinforcement Learning
Seungyub Han, Hyungjin Kim, Jungwoo Lee
Main category: cs.LG
Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation in offline safe RL without retraining. In SAS, the main mechanism is self-alignment: at test time, the pretrained agent generates several imagined trajectories and selects those satisfying the Lyapunov condition. These feasible segments are then recycled as in-context prompts, allowing the agent to realign its behavior toward safety while avoiding parameter updates. In effect, SAS turns Lyapunov-guided imagination into control-invariant prompts, and its transformer architecture admits a hierarchical RL interpretation where prompting functions as Bayesian inference over latent skills. Across Safety Gymnasium and MuJoCo benchmarks, SAS consistently reduces cost and failure while maintaining or improving return.
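A minimal sketch of the selection step described above, assuming the Lyapunov condition is the usual non-increase requirement V(s_{t+1}) - V(s_t) <= 0 along a trajectory. The function names, the quadratic candidate V, and the toy trajectories are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence

State = Sequence[float]
Trajectory = List[State]


def satisfies_lyapunov(traj: Trajectory, V: Callable[[State], float], tol: float = 0.0) -> bool:
    """Return True if V never increases by more than tol between consecutive states."""
    return all(V(nxt) - V(cur) <= tol for cur, nxt in zip(traj, traj[1:]))


def select_feasible(trajs: List[Trajectory], V: Callable[[State], float]) -> List[Trajectory]:
    """Keep only the imagined trajectories that pass the Lyapunov check."""
    return [t for t in trajs if satisfies_lyapunov(t, V)]


def quadratic_V(state: State) -> float:
    """Hypothetical Lyapunov candidate: squared distance to the origin."""
    return sum(x * x for x in state)


# Toy imagined rollouts (illustrative values only).
imagined = [
    [(2.0, 0.0), (1.0, 0.0), (0.5, 0.0)],  # V decreases at every step
    [(1.0, 0.0), (1.5, 0.0), (0.5, 0.0)],  # V increases at the first step
]

# The surviving segments would be reused as in-context prompts.
prompts = select_feasible(imagined, quadratic_V)
```

Only the first rollout survives here. How SAS actually defines V and formats the prompts is not specified in the abstract.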
[393] Large-scale semi-supervised learning with online spectral graph sparsification
Daniele Calandriello, Alessandro Lazaric, Michal Valko
Main category: cs.LG
Abstract: We introduce Sparse-HFS, a scalable algorithm that can compute solutions to SSL problems using only O(n polylog(n)) space and O(m polylog(n)) time.
[394] Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy
Yuxuan Ying, Hanqing Yang, Kaige Wang, Yu Hu, Zhiming Zheng, Yunliang Jiang, Xiaoqing Lin, Xiaodong Li, Jun Chen
Main category: cs.LG
Abstract: Municipal solid waste incineration is increasingly central to urban waste management, yet its sustainability benefit depends on controlling carbon emissions and multiple air pollutants under highly heterogeneous operating conditions. Current data-driven models are often accurate within individual plants but are difficult to transfer across facilities, limiting their value for scalable emission-control strategies. Here we show that multi-site emission behaviour can be represented through transferable system-level structures when physical constraints, operating-regime heterogeneity and carbon–pollutant coupling are jointly considered. We develop a physics-informed transfer learning framework built on a carbon–pollutant mixture-of-experts model, which combines regime-dependent expert routing with conservation-based regularization and a carbon–pollutant synergistic index for integrated risk evaluation. Across 13 municipal solid waste incineration plants, the model captured both pollutant-specific emissions and system-level risk, achieving source-domain average pollutant $R^2$ values of 0.668–0.904 and CPSI $R^2$ values of 0.666–0.970. After transfer from a reference facility to 12 target plants, average pollutant $R^2$ remained between 0.661 and 0.842, while CPSI retained comparable transferability ($R^2$ = 0.610–0.841). Expert-utilization patterns further indicate that adaptation occurs through structured re-weighting of operating regimes rather than complete model re-learning. By extending the learned representation into an interpretable digital twin, this framework provides a route from emission prediction to regime-aware operational navigation, supporting scalable carbon–pollutant synergistic control across heterogeneous waste-to-energy systems.
[395] PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners
Zhiquan Tan, Yinrong Hong
Main category: cs.LG
Abstract: Improving large language model (LLM) reasoning requires supervision that is both aligned with the model’s own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers. Recent privileged on-policy self-distillation explores a middle ground by scoring student rollouts with the same model under verified solution context. We revisit this setting through a contextual re-scoring lens: for reasoning, the important choices are not only whether privileged context is available, but how much of it should be revealed and where its distribution should shape the student. We propose PAINT (Partial-solution Adaptive INterpolated Training), which masks the verified solution according to rollout-reference overlap and applies a small energy-space interpolation on a sparse set of entropy-mismatch token positions. Across competition-level math benchmarks, PAINT consistently improves over a strong prior on-policy self-distillation baseline at all three Qwen3 scales. On Qwen3-8B, it raises macro Avg@12 by 2.1 points over this prior baseline and 2.9 points over GRPO.
[396] PiGGO: Physics-Guided Learnable Graph Kalman Filters for Virtual Sensing of Nonlinear Dynamic Structures under Uncertainty
Marcus Haywood-Alexander, Gregory Duthé, Eleni Chatzi
Main category: cs.LG
Abstract: Digital twins provide a powerful paradigm for diagnostic and prognostic tasks in the monitoring and control of engineered systems; however, their deployment for complex structures remains challenged by model-form uncertainty, arising from unknown nonlinear dynamics, and by sparse sensing. These limitations hinder reliable online state estimation using either purely physics-based or purely data-driven approaches. This work introduces the Physics-Guided Graph Neural ODE (PiGGO) framework, a physics-informed, graph-based Bayesian state estimation approach in which a learned graph neural ordinary differential equation (GNODE) serves as the continuous-time state-transition model within an extended Kalman filter. The graph representation explicitly defines the system state-space, while physics-guided inductive biases encode known structural relationships and constrain the learning of nonlinear dynamics. By integrating graph-native learned dynamics with recursive Bayesian filtering, the proposed PiGGO framework enables online virtual sensing and uncertainty-aware state estimation for nonlinear systems with unknown model form, while maintaining generalisation across topologically similar structures. Numerical case studies demonstrate improved robustness to model uncertainty and measurement noise, outperforming both open-loop graph neural models and conventional filtering approaches in online prediction tasks.
[397] Who Trains Matters: Federated Learning under Enrollment and Participation Selection Biases
Gota Morishita
Main category: cs.LG
Abstract: Federated learning (FL) trains a shared model from updates contributed by distributed clients, often implicitly assuming that contributing clients are representative of the target population. In practice, this representativeness assumption can fail at two distinct stages, inducing selection bias. First, eligibility rules such as device constraints, software requirements, or user consent determine which clients are ever enrolled and reachable for training, inducing enrollment bias. Second, among enrolled clients, user and system factors such as battery state, network status, and local time determine which clients participate in each communication round, inducing participation bias. Although existing work has largely addressed round-level participation bias, it has paid far less attention to population-level enrollment bias, which can induce a persistent mismatch between the training objective and the target-population objective. We formalize FL under a two-stage selection model and derive FedIPW, an inverse-probability-weighted aggregation scheme that recovers the target-population mean update under standard ignorability and positivity assumptions. Because client-level covariates are often unavailable for non-enrolled clients, we also introduce a limited-information aggregate-calibration extension that uses known target-population summaries to reweight the enrolled sample, partially correcting enrollment bias. We further provide an algorithm-agnostic optimization analysis under residual weighting error and show that incomplete selection correction can induce a non-vanishing bias floor. Finally, experiments on synthetic federated logistic regression validate the predicted objective mismatch and show that enrollment correction reduces target-population error under two-stage selection.
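To make the inverse-probability-weighting idea concrete, here is a small sketch under stated assumptions: each participating client has known enrollment and participation probabilities, and the server uses a self-normalized weighted average. The function name `ipw_aggregate` and the normalization choice are illustrative and may differ from the paper's FedIPW estimator.

```python
from typing import List, Sequence


def ipw_aggregate(
    updates: Sequence[Sequence[float]],
    p_enroll: Sequence[float],
    p_participate: Sequence[float],
) -> List[float]:
    """Average client updates with weights 1 / (P(enrolled) * P(participates | enrolled))."""
    if not updates:
        raise ValueError("need at least one client update")
    if any(p <= 0.0 for p in list(p_enroll) + list(p_participate)):
        # Positivity assumption: every client must have a nonzero chance of selection.
        raise ValueError("selection probabilities must be strictly positive")

    weights = [1.0 / (pe * pp) for pe, pp in zip(p_enroll, p_participate)]
    total = sum(weights)
    dim = len(updates[0])
    # Self-normalized (weights sum to one) weighted mean, coordinate by coordinate.
    return [sum(w * u[k] for w, u in zip(weights, updates)) / total for k in range(dim)]


# Toy example: the second client is half as likely to be enrolled,
# so its update counts twice as much in the aggregate.
aggregate = ipw_aggregate(
    updates=[[1.0], [3.0]],
    p_enroll=[0.5, 0.25],
    p_participate=[1.0, 1.0],
)
```

With equal selection probabilities the result reduces to the plain mean, which is the implicit choice when selection bias is ignored.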
[398] Electricity price forecasting across Norway’s five bidding zones in the post-crisis era
My Thi Diem Phan, Trung Tuyen Truong, Hoai Phuong Ha, Dat Thanh Nguyen
Main category: cs.LG
Abstract: Norway’s electricity market is heavily dominated by hydropower, but the 2021–2022 energy crisis and stronger integration with Continental Europe have fundamentally altered price formation, reducing the reliability of forecasting models calibrated on historical data. Despite the critical need for updated models, a unified benchmark evaluating feature contributions across all structurally diverse Norwegian bidding zones remains lacking. Here we present a comprehensive evaluation of electricity price forecasting across all five Norwegian Nord Pool bidding zones. We constructed a multimodal hourly dataset spanning 2019–2025 and evaluated eight forecasting model families including LightGBM, ARX, and advanced deep learning architectures using a strictly causal test set. We implemented robust rolling-origin backtesting, leave-one-group-out feature ablation, and conditional regime analysis to dissect model performance and feature utility. Our results show that LightGBM achieves the best performance in every zone with MAE ranging from 1.64 to 5.74 EUR/MWh, while the ridge ARX model remains a highly competitive linear benchmark in northern zones. Feature ablation reveals that models relying solely on lagged prices and calendar variables achieve high accuracy and often match or exceed full multimodal integration. However, conditional regime analysis demonstrates that external features like reservoir levels and gas prices remain crucial to stratify forecast errors, which consistently increase under stressed market regimes. This highlights the practical value of model interpretability and regime awareness for decision makers facing structural changes in market dynamics.
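Rolling-origin backtesting, as mentioned above, can be sketched as follows. This is a generic expanding-window version with a last-value (persistence) forecast standing in for a real model. The window sizes and the toy price series are made up for illustration and are not the paper's setup or data.

```python
from typing import Iterator, List, Tuple


def rolling_origin_splits(
    n_obs: int, initial_train: int, horizon: int, step: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs; training data always precedes the test window."""
    origin = initial_train
    while origin + horizon <= n_obs:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


def persistence_mae(series: List[float], initial_train: int, horizon: int, step: int) -> float:
    """Mean absolute error of a last-observed-value forecast, averaged over all folds."""
    errors: List[float] = []
    for train_idx, test_idx in rolling_origin_splits(len(series), initial_train, horizon, step):
        forecast = series[train_idx[-1]]  # only information available at the origin
        errors.extend(abs(series[t] - forecast) for t in test_idx)
    return sum(errors) / len(errors)


# Toy hourly prices in EUR/MWh (invented values).
prices = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 14.0, 16.0]
splits = list(rolling_origin_splits(len(prices), initial_train=4, horizon=2, step=2))
mae = persistence_mae(prices, initial_train=4, horizon=2, step=2)
```

Each fold only sees observations before its origin, which is what keeps the evaluation causal.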
[399] Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
Zhangzhi Xiong, Haoyi Wu, You Wu, Shuqi Gu, Kan Ren, Kewei Tu
Main category: cs.LG
Abstract: The Probabilistic Transformer (PT) establishes that the Transformer’s self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT) to repair PT’s missing channel axis and weak per-step semantics, and adopt ST-PT as a shared cornerstone backbone. We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions, one per property, that probe how each property can be exploited in time series: RQ1. The graph topology and potentials are direct programmable primitives. Can this be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise? RQ2. The CRF’s factor matrices are the operator’s potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes structural rather than feature-level modulation of a fixed one? RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition of latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error? We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.
[400] Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Hayate Iso, Tiyasa Mitra, Sudipta Mondal, Rasoul Shafipour, Venmugil Elango, Terry Kong, Yuki Huang, Seonjin Na, Izzy Putterman, Benjamin Chislett, Maor Ashkenazi, Joseph Guman, Gerald Shen, Tugrul Konuk, Ashwath Aithal, Ritika Borkar, Ran Zilberstein, Bita Rouhani
Main category: cs.LG
Abstract: RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model’s output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models, or even techniques such as Eagle3, which are traditionally applied after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to 2.5x end-to-end training speedup at 235B scale.
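The "lossless" property comes from the standard speculative sampling acceptance rule, sketched below for a single token position. This is the textbook rule, not the NeMo-RL or vLLM implementation. The distributions `p` (target) and `q` (draft) and all function names are illustrative.

```python
import random
from typing import List, Tuple


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q), used to resample after a rejected draft token."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:  # p and q are identical, so any fallback must be p itself
        return list(p)
    return [r / total for r in raw]


def verify_draft_token(token: int, p: List[float], q: List[float], rng: random.Random) -> Tuple[int, bool]:
    """Accept the draft token with probability min(1, p/q); otherwise resample."""
    accept_prob = min(1.0, p[token] / q[token])  # q[token] > 0 because the draft sampled it
    if rng.random() < accept_prob:
        return token, True
    resampled = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return resampled, False


def speculative_sample(p: List[float], q: List[float], rng: random.Random) -> int:
    """Draw one token: propose from the draft q, then verify against the target p."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    token, _ = verify_draft_token(draft, p, q, rng)
    return token


target = [0.5, 0.3, 0.2]  # toy target-model distribution
draft = [0.2, 0.5, 0.3]   # toy draft-model distribution
rng = random.Random(0)
samples = [speculative_sample(target, draft, rng) for _ in range(20000)]
freqs = [samples.count(i) / len(samples) for i in range(3)]
```

The empirical frequencies track `target` rather than `draft`, which is why the rollout distribution seen by the RL algorithm is unchanged.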
[401] Hankel and Toeplitz Rank-1 Decomposition of Arbitrary Matrices with Applications to Signal Direction-of-Arrival Estimation
Georgios I. Orfanidis, Dimitris A. Pados, George Sklivanitis, Elizabeth Serena Bentley
Main category: cs.LG
Abstract: We consider the problems of computing the optimal rank-$1$ Hankel and Toeplitz-structured approximation of arbitrary matrices under $L_2$ and $L_1$-norm error. Such problems arise naturally in engineered systems, including the basic few-shot signal Direction-of-Arrival (DoA) estimation problem that is of importance to modern autonomous systems applications. We develop accurate and computationally efficient structured matrix decomposition algorithms for both formulations and then derive analytically grounded small-sample-support DoA estimators for practical sensing system deployments. The resulting estimators under the $L_2$ and $L_1$ norms are formally shown to be maximum-likelihood optimal under white Gaussian and Laplace noise, respectively. The estimators are further validated through extensive simulation studies and real-world data experiments in few-shot DoA inference.
[402] Super-resolution Multi-signal Direction-of-Arrival Estimation by Hankel-structured Sensing and Decomposition
Georgios I. Orfanidis, Dimitris A. Pados, George Sklivanitis, Elizabeth Serena Bentley
Main category: cs.LG
Abstract: Motivated by sensing modalities in modern autonomous systems that involve hardware-constrained spatial sampling over large arrays with limited coherence time, we develop a novel framework for rapid super-resolution multi-signal direction-of-arrival (DoA) estimation based on Hankel-structured sensing and data matrix decomposition of arbitrary rank, under both the $L_2$ and $L_1$-norm formulation. The resulting $L_2$-norm estimator is shown to be maximum-likelihood optimal in white Gaussian noise. The $L_1$-norm estimator is shown to be maximum-likelihood optimal in independent, identically distributed (i.i.d.) isotropic Laplace noise, offering broad robustness to impulsive interference and corrupted measurements commonly encountered in practice. Extensive simulations demonstrate that the proposed methods exhibit powerful super-resolution capabilities, requiring significantly lower SNR and achieving substantially higher resolution probability than recent competing approaches.
[403] A Multi-Dataset Benchmark of Multiple Instance Learning for 3D Neuroimage Classification
Ethan Harvey, Dennis Johan Loevlie, Amir Ali Satani, Wansu Chen, David M. Kent, Michael C. Hughes
Main category: cs.LG
Abstract: Despite being resource-intensive to train, 3D convolutional neural networks (CNNs) have been the standard approach to classify CT and MRI scans. Recent work suggests that deep multiple instance learning (MIL) may be a more efficient alternative for 3D brain scans, especially when the pre-trained image encoder used to embed each 2D slice is frozen and only the pooling operation and classifier are trained. In this paper, we provide a systematic comparison of simple MIL, attention-based MIL, 3D CNNs, and 3D ViTs across three CT and four MRI datasets, including two large datasets of at least 10,000 scans. Our goal is to help resource-constrained practitioners understand which neural networks work well for 3D neuroimages and why. We further compare design choices for attention-based MIL, including different encoders, pooling operations, and architectural orderings. We find that simple mean pooling MIL, without any learnable attention, matches or outperforms recent MIL or 3D CNN alternatives on 4 of 6 moderate-sized tasks. This baseline remains competitive on two large datasets while being 25x faster to train. To explain mean pooling’s success, we examine per-slice attention quality and a semi-synthetic dataset where we can derive the best possible classifier via a Bayes estimator. This analysis reveals the limits of existing MIL approaches and suggests routes for future improvements.
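The mean-pooling MIL baseline can be illustrated with a short sketch: per-slice embeddings (assumed to come from a frozen 2D encoder, not shown) are averaged into one scan-level vector and passed to a linear classifier. The embedding size, weights, and function names are placeholders, not the paper's configuration.

```python
import math
from typing import List, Sequence


def mean_pool(slice_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-slice embeddings into a single scan-level embedding."""
    n = len(slice_embeddings)
    dim = len(slice_embeddings[0])
    return [sum(e[k] for e in slice_embeddings) / n for k in range(dim)]


def predict_proba(
    slice_embeddings: Sequence[Sequence[float]], weights: Sequence[float], bias: float
) -> float:
    """Binary scan-level probability from a linear classifier on the pooled embedding."""
    pooled = mean_pool(slice_embeddings)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy scan with three slices and 2-dimensional embeddings (invented values).
scan = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]
pooled = mean_pool(scan)
prob = predict_proba(scan, weights=[0.5, -0.5], bias=0.0)
```

Only `weights` and `bias` would be trained in this setup. The pooling itself has no learnable parameters and is invariant to slice order.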
[404] Asynchronous Federated Unlearning with Invariance Calibration for Medical Imaging
Zhaoyuan Cai, Xinglin Zhang
Main category: cs.LG
Abstract: Federated Unlearning (FU) is an emerging paradigm in Federated Learning (FL) that enables participating clients to fully remove their contributions from a trained global model, driven by data protection regulations that mandate the right to be forgotten. However, existing FU methods mostly rely on synchronous coordination. This requirement forces the entire federation to halt and wait for stragglers to complete erasure, creating significant delays due to device heterogeneity. Furthermore, these methods often face the problem that the influence of erased data is merely suppressed temporarily and resurfaces during subsequent training, rather than being genuinely removed. To overcome these limitations, this paper proposes Asynchronous Federated Unlearning with Invariance Calibration (AFU-IC), a novel framework for medical imaging that decouples the erasure process from the global training workflow. This enables the target client to perform unlearning asynchronously without interrupting global training. Meanwhile, a server-side invariance calibration mechanism prevents the model from relearning the erased data. Extensive experiments on three medical benchmarks demonstrate that AFU-IC achieves unlearning efficacy and model fidelity comparable to gold-standard retraining while significantly reducing wall-clock latency compared to synchronous baselines. AFU-IC ensures efficient, compliant and reliable FL in cross-silo medical environments.
[405] Semi-supervised learning with max-margin graph cuts
Branislav Kveton, Michal Valko, Ali Rahimi, Ling Huang
Main category: cs.LG
Abstract: This paper proposes a novel algorithm for semi-supervised learning. This algorithm learns graph cuts that maximize the margin with respect to the labels induced by the harmonic function solution. We motivate the approach, compare it to existing work, and prove a bound on its generalization error. The quality of our solutions is evaluated on a synthetic problem and three UCI ML repository datasets. In most cases, we outperform manifold regularization of support vector machines, which is a state-of-the-art approach to semi-supervised max-margin learning.
[406] Random Cloud: Finding Minimal Neural Architectures Without Training
Javier Gil Blázquez
Main category: cs.LG
Abstract: I propose the Random Cloud method, a training-free approach to neural architecture search that discovers minimal feedforward network topologies through stochastic exploration and progressive structural reduction. Unlike post-training pruning methods that require a full train-prune-retrain cycle, this method evaluates randomly initialized networks without backpropagation, progressively reduces their topology, and only trains the best minimal candidate at the end. I evaluate on 7 classification benchmarks against magnitude pruning and random pruning baselines. The Random Cloud matches or outperforms both baselines in 6 of 7 datasets, achieving statistically significant improvements on Sonar (+4.9 pp accuracy, $p = 0.017$ vs magnitude pruning) with 87% parameter reduction. Crucially, the method is faster than both pruning baselines in 4 of 5 datasets (0.67–0.94× the cost of full training), since it avoids training the full-size network entirely.
[407] Uncertainty-Aware Predictive Safety Filters for Probabilistic Neural Network Dynamics
Bernd Frauenknecht, Lukas Kesper, Daniel Mayfrank, Henrik Hose, Sebastian Trimpe
Main category: cs.LG
Abstract: Predictive safety filters (PSFs) leverage model predictive control to enforce constraint satisfaction during deep reinforcement learning (RL) exploration, yet their reliance on first-principles models or Gaussian processes limits scalability and broader applicability. Meanwhile, model-based RL (MBRL) methods routinely employ probabilistic ensemble (PE) neural networks to capture complex, high-dimensional dynamics from data with minimal prior knowledge. However, existing attempts to integrate PEs into PSFs lack rigorous uncertainty quantification. We introduce the Uncertainty-Aware Predictive Safety Filter (UPSi), a PSF that provides rigorous safety predictions using PE dynamics models by formulating future outcomes as reachable sets. UPSi introduces an explicit certainty constraint that prevents model exploitation and integrates seamlessly into common MBRL frameworks. We evaluate UPSi within Dyna-style MBRL on standard safe RL benchmarks and report substantial improvements in exploration safety over prior neural network PSFs while maintaining performance on par with standard MBRL. UPSi bridges the gap between the scalability and generality of modern MBRL and the safety guarantees of predictive safety filters.
[408] Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
Zihan Zhao, Baotong Lu, Shengjie Lin, Yizou Chen, Jing Liu, Yanqi Zhang, Ziming Miao, Ming-Chang Yang, Haiying Shen, Qi Chen, Fan Yang
Main category: cs.LG
Abstract: Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.
[409] Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data
Bao Pham, Mohammed J. Zaki, Luca Ambrogioni, Dmitry Krotov, Matteo Negri
Main category: cs.LG
Abstract: When do language diffusion models memorize their training data, and how can their true generative regime be quantitatively assessed? We address these questions by showing that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as Associative Memories (AMs) with emergent creative capabilities. The core idea of an AM is to reliably recover stored data points as memories by establishing distinct basins of attraction around them. Historically, models like Hopfield networks use an explicit energy function to guarantee these stable attractors. We broaden this perspective by leveraging the observation that energy is not strictly necessary, as basins of attraction can also be formed via conditional likelihood maximization. By evaluating token recovery of training and test examples, we identify in UDDMs a sharp memorization-to-generalization transition governed by the size of the training dataset: as it increases, basins around training examples shrink and basins around unseen test examples expand, until both eventually converge to the same level. Crucially, we can detect this transition using only the conditional entropy of predicted token sequences: memorization is characterized by vanishing conditional entropy, while in the generalization regime the conditional entropy of most tokens remains finite. Thus, conditional entropy offers a practical probe for the memorization-to-generalization transition in deployed models.
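As a rough illustration of the entropy probe, the sketch below computes the Shannon entropy of each predicted token distribution and averages it over a sequence. The per-token distributions are assumed to be already available from a model. The toy values and function names are hypothetical, and the paper's exact estimator may differ.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_conditional_entropy(per_token_probs: Sequence[Sequence[float]]) -> float:
    """Average entropy over all token positions of a sequence."""
    entropies: List[float] = [token_entropy(p) for p in per_token_probs]
    return sum(entropies) / len(entropies)


# Toy predictions over a 4-token vocabulary (invented values).
memorized_like = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # one-hot: entropy vanishes
generalizing_like = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0]]

h_mem = mean_conditional_entropy(memorized_like)
h_gen = mean_conditional_entropy(generalizing_like)
```

Near-zero values correspond to the memorization regime described above, while clearly positive values are consistent with generalization.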
[410] KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment
Attila Pintér, Javier Rico, Attila Répai, Jalal Al-Afandi, Adrienn Éva Borsy, András Kozma, Hajnalka Andrikovics, György Cserey
Main category: cs.LG
Abstract: We present KAYRA, an end-to-end karyotyping system that operates inside the operational constraints of a clinical cytogenetic laboratory. KAYRA is architected as a containerized microservice pipeline whose ML stack combines an EfficientNet-B5 + U-Net semantic segmenter, a Mask R-CNN (ResNet-50 + FPN) instance detector, and a ResNet-18 classifier, orchestrated through a cascaded ROI-narrowing strategy that focuses each downstream model on the chromosome-bearing region. The same container images are deployed both as a cloud service and as an on-premise installation, supporting clinical environments where patient-data egress is not permitted as well as those where it is. A pilot clinical evaluation against two commercial reference karyotyping systems on 459 chromosomes from 10 metaphase spreads shows segmentation accuracy of 98.91 % (vs. 78.21 % / 40.52 %), classification accuracy of 89.1 % (vs. 86.9 % / 54.5 %), and rotation accuracy of 89.76 % (vs. 94.55 % / 78.43 %). KAYRA improves over the older density-thresholding reference on all three axes (p < 0.0001 for segmentation and classification by Fisher’s exact test on chromosome-level counts), and on segmentation also against the modern AI-supported reference (p < 0.0001); on classification the difference vs. the modern AI reference is not statistically significant at the present test-set size (p = 0.34). The system reaches TRL 6 maturity and integrates the human-in-the-loop expert-review workflow that diagnostic cytogenetic practice requires. The thesis of this paper is that a multi-model cytogenetic AI service can be packaged as a microservice architecture supporting flexible deployment - cloud-hosted or on-premise - while delivering strong empirical performance on a pilot clinical evaluation.
[411] Multiple Additive Neural Networks for Structured and Unstructured Data
Janis Mohr, Jörg Frochte
Main category: cs.LG
Abstract: This paper extends and explains the Multiple Additive Neural Networks (MANN) methodology, an enhancement to the traditional Gradient Boosting framework, utilizing nearly shallow neural networks instead of decision trees as base learners. This innovative approach leverages neural network architectures, notably Convolutional Neural Networks (CNNs) and Capsule Neural Networks, to extend its application to both structured data and unstructured data such as images and audio. For structured data the advantages of capsule neural networks as feature extractors are used and combined with MANN as a classifier. MANN’s unique architecture promotes continuous learning and integrates advanced heuristics to combat overfitting, ensuring robustness and reducing sensitivity to hyperparameter settings like learning rate and iterations. Our empirical studies reveal that MANN surpasses traditional methods such as Extreme Gradient Boosting (XGB) in accuracy across well-known datasets. This research demonstrates MANN’s superior precision and generalizability, making it a versatile tool for diverse data types and complex learning environments.
[412] Causal Learning with Neural Assemblies
Evangelia Kopadi, Dimitris Kalles
Main category: cs.LG
Abstract: Can Neural Assemblies – groups of neurons that fire together and strengthen through co-activation – learn the direction of causal influence between variables? While established as a computationally general substrate for classification, parsing, and planning, neural assemblies have not yet been shown to internalize causal directionality. We demonstrate that the inherent operations of neural assemblies – projection, local plasticity control, and sparse winner selection – are sufficient for directional learning. We introduce DIRECT (DIRectional Edge Coupling/Training), a mechanism that co-activates source and target assemblies under an adaptive gain schedule to internalize directed relations. Unlike backpropagation-based methods, DIRECT relies solely on local plasticity, making the resulting causal claims auditable at the mechanism level. Our findings are verified through a dual-readout validation strategy: (i) synaptic-strength asymmetry, measuring the emergent weight gap between forward and reverse links, and (ii) functional propagation overlap, quantifying the reliability of directional signal flow. Across multiple domains, the framework achieves perfect structural recovery under a supervised, known-structure setting. These results establish neural assemblies as an auditable bridge between biologically plausible dynamics and formal causal models, offering an “explainable by design” framework where causal claims are traceable to specific neural winners and synaptic asymmetries.
[413] On the Learning Curves of Revenue Maximization
Steve Hanneke, Alkis Kalavasis, Shay Moran, Grigoris Velegkas
Main category: cs.LG
Abstract: Learning curves are a fundamental primitive in supervised learning, describing how an algorithm’s performance improves with more data and providing a quantitative measure of its generalization ability. Formally, a learning curve plots the decay of an algorithm’s error for a fixed underlying distribution as a function of the number of training samples. Prior work on revenue-maximizing learning algorithms, starting with the seminal work of Cole and Roughgarden [STOC, 2014], adopts a distribution-free perspective, which parallels the PAC learning framework in learning theory. This approach evaluates performance against the hardest possible sequence of valuation distributions, one for each sample size, effectively defining the upper envelope of learning curves over all possible distributions, thus leading to error bounds that do not capture the shape of the learning curves. In this work we initiate the study of learning curves for revenue maximization and provide a near-complete characterization of their rate of decay in the basic setting of a single item and a single buyer. In the absence of any restriction on the valuation distribution, we show that there exists a Bayes-consistent algorithm, meaning that its learning curve converges to zero for any arbitrary valuation distribution as the number of samples $n \to \infty$. However, this convergence must be arbitrarily slow, even if the optimal revenue is finite. In contrast, if the optimal revenue is achieved by a finite price, then the optimal rate of decay is roughly $1/\sqrt{n}$. Finally, for distributions supported on discrete sets of values, we show that learning curves decay almost exponentially fast, a rate unattainable under the PAC framework.
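For orientation, the single-item, single-buyer setting can be illustrated with empirical revenue maximization: post the price that would have earned the most on the observed valuation samples. This is a minimal sketch of that standard baseline, not necessarily the algorithm analyzed in the paper; the function names `empirical_revenue` and `best_empirical_price` are hypothetical.

```python
from typing import Sequence


def empirical_revenue(price: float, samples: Sequence[float]) -> float:
    """Average revenue from posting `price` to buyers with the sampled valuations."""
    # A buyer purchases exactly when their valuation is at least the posted price.
    sold = sum(1 for v in samples if v >= price)
    return price * sold / len(samples)


def best_empirical_price(samples: Sequence[float]) -> float:
    """Return a sampled value that maximizes empirical revenue.

    An optimal posted price for the empirical distribution is always one of
    the sampled valuations, so only those candidates are checked.
    Ties are broken toward the lower price (an arbitrary illustrative choice).
    """
    if not samples:
        raise ValueError("need at least one valuation sample")
    return max(set(samples), key=lambda p: (empirical_revenue(p, samples), -p))


if __name__ == "__main__":
    valuations = [1.0, 1.0, 1.0, 10.0]
    price = best_empirical_price(valuations)
    print(price, empirical_revenue(price, valuations))
```

In this picture, a learning curve tracks how the gap between the optimal revenue and the revenue of the learned price shrinks as the number of samples grows, for one fixed valuation distribution.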
[414] A Note on How to Remove the $\ln\ln T$ Term from the Squint Bound
Francesco Orabona
Main category: cs.LG
Abstract: In Orabona and Pál [2016], we introduced the shifted KT potentials, to remove the $\ln \ln T$ factor in the parameter-free learning with expert bound. In this short technical note, I show that this is equivalent to changing the prior in the Krichevsky–Trofimov algorithm. Then, I show how to use the same idea to remove the $\ln \ln T$ factor in the data-independent bound for the Squint algorithm.
[415] Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport
Shayan Hundrieser, Insung Kong, Johannes Schmidt-Hieber
Main category: cs.LG
Abstract: We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, theoretically capable of leveraging depth, and performs reliably when trained at scale compared to ICNNs. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Throughout a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they oftentimes outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.
[416] Time series classification with random convolution kernels: pooling operators and input representations matter
Mouhamadou Mansour Lo, Gildas Morvan, Mathieu Rossi, Fabrice Morganti, David Mercier
Main category: cs.LG
Abstract: not retrieved (arXiv:2409.01115)
[417] Quantifying Climate Change Impacts on Renewable Energy Generation: A Super-Resolution Recurrent Diffusion Model
Xiaochong Dong, Jun Dan, Yingyun Sun, Yang Liu, Xuemin Zhang, Shengwei Mei
Main category: cs.LG
Abstract: not retrieved (arXiv:2412.11399)
[418] Improving Bayesian Optimization for Portfolio Management with an Adaptive Scheduling
Zinuo You, John Cartlidge, Karen Elliott, Menghan Ge, Daniel Gold
Main category: cs.LG
Abstract: not retrieved (arXiv:2504.13529)
[419] Compton Form Factor Extraction using Quantum Deep Neural Networks
Brandon B. Le, Dustin Keller
Main category: cs.LG
Abstract: not retrieved (arXiv:2504.15458)
[420] A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety
Ankita Kushwaha, Kiran Ravish, Preeti Lamba, Pawan Kumar
Main category: cs.LG
Abstract: not retrieved (arXiv:2505.17342)
[421] DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces
Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer
Main category: cs.LG
Abstract: not retrieved (arXiv:2505.18441)
[422] A projection-based framework for gradient-free and parallel learning
Andreas Bergmeister, Manish Krishan Lal, Stefanie Jegelka, Suvrit Sra
Main category: cs.LG
Abstract: not retrieved (arXiv:2506.05878)
[423] MARVIS: Modality Adaptive Reasoning over VISualizations
Benjamin Feuer, Lennart Purucker, Oussama Elachqar, Chinmay Hegde
Main category: cs.LG
Abstract: not retrieved (arXiv:2507.01544)
[424] The Serial Scaling Hypothesis
Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai
Main category: cs.LG
Abstract: not retrieved (arXiv:2507.12549)
[425] Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning
Tan Jing, Xiaorui Li, Chao Yao, Xiaojuan Ban, Yuetong Fang, Renjing Xu, Zhaolin Yuan
Main category: cs.LG
Abstract: not retrieved (arXiv:2508.19900)
[426] RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare event and Anomaly Detection
Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau
Main category: cs.LG
Abstract: not retrieved (arXiv:2509.07523)
[427] Comparing Data Assimilation and Likelihood-Based Inference on Latent State Estimation in Agent-Based Models
Blas Kolic, Corrado Monti, Gianmarco De Francisci Morales, Marco Pangallo
Main category: cs.LG
Abstract: not retrieved (arXiv:2509.17625)
[428] Incorporating Expert Knowledge into Bayesian Causal Discovery of Mixtures of Directed Acyclic Graphs
Zachris Björkman, Jorge Loría, Sophie Wharrie, Samuel Kaski
Main category: cs.LG
Abstract: not retrieved (arXiv:2510.06735)
[429] SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models
Pengcheng Li, Qiang Fang, Tong Zhao, Yixing Lan, Xin Xu
Main category: cs.LG
Abstract: not retrieved (arXiv:2512.18583)
[430] Hybrid Quantum-Classical Ridgelet Neural Networks for Portfolio Optimization
Bahadur Yadav, Sanjay Kumar Mohanty
Main category: cs.LG
Abstract: not retrieved (arXiv:2601.03654)
[431] Factorizable joint shift revisited
Dirk Tasche
Main category: cs.LG
Abstract: not retrieved (arXiv:2601.15036)
[432] The hidden risks of temporal resampling in clinical reinforcement learning
Thomas Frost, Hrisheekesh Vaidya, Steve Harris
Main category: cs.LG
Abstract: not retrieved (arXiv:2602.06603)
[433] Graph Property Inference in Small Language Models: Effects of Representation and Reasoning Strategy
Michal Podstawski
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.06635)
[434] Latent Autoencoder Ensemble Kalman Filter for Nonlinear Data assimilation
Xin T. Tong, Yanyan Wang, Liang Yan
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.06752)
[435] Obliviator Reveals the Cost of Nonlinear Guardedness in Concept Erasure
Ramin Akbari, Milad Afshari, Vishnu Naresh Boddeti
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.07529)
[436] From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents
Shuoling Liu, Zhiquan Tan, Kun Yi, Hui Wu, Yihan Li, Jiangpeng Yan, Liyuan Chen, Kai Chen, Qiang Yang
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.25342)
[437] PromptEvolver: Prompt Inversion through Evolutionary Optimization in Natural-Language Space
Asaf Buchnick, Aviv Shamsian, Aviv Navon, Ethan Fetaya
Main category: cs.LG
Abstract: not retrieved (arXiv:2604.06061)
[438] Tree-of-Evidence: Efficient “System 2” Search for Faithful Multimodal Grounding
Micky C. Nnamdi, Benoit L. Marteau, Yishan Zhong, J. Ben Tamo, May D. Wang
Main category: cs.LG
Abstract: not retrieved (arXiv:2604.07692)
[439] Untrained CNNs Match Backpropagation at V1: A Systematic RSA Comparison of Four Learning Rules Against Human fMRI
Nils Leutenegger
Main category: cs.LG
Abstract: not retrieved (arXiv:2604.16875)
[440] Agentic Fusion of Large Atomic and Language Models to Accelerate Superconductors Discovery
Mingze Li, Yu Rong, Songyou Li, Lihong Wang, Jiacheng Cen, Liming Wu, Anyi Li, Zongzhao Li, Qiuliang Liu, Rui Jiao, Tian Bian, Pengju Wang, Hao Sun, Jianfeng Zhang, Ji-Rong Wen, Deli Zhao, Shifeng Jin, Tingyang Xu, Wenbing Huang
Main category: cs.LG
Abstract: not retrieved (arXiv:2604.23758)
[441] TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng
Main category: cs.LG
Abstract: On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher’s effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher’s performance and generalize to tasks on which the teacher fails. Our code is available at https://github.com/kokolerk/TCOD.
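To make the "short to long" idea concrete, here is a minimal sketch of a depth curriculum. It assumes a linear schedule and hypothetical names (`max_depth_at`, `truncate`, `start_depth`, `final_depth`); the abstract does not specify the schedule TCOD actually uses, so treat this as an illustration only.

```python
def max_depth_at(step: int, total_steps: int, start_depth: int, final_depth: int) -> int:
    """Allowed trajectory depth (number of turns) at a given training step.

    Assumption: depth grows linearly from start_depth to final_depth over
    training; the schedule used in the paper may differ.
    """
    if total_steps <= 1:
        return final_depth
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start_depth + round(frac * (final_depth - start_depth))


def truncate(trajectory: list, step: int, total_steps: int,
             start_depth: int = 2, final_depth: int = 30) -> list:
    """Keep only the turns the student is exposed to at this training step."""
    return trajectory[: max_depth_at(step, total_steps, start_depth, final_depth)]


if __name__ == "__main__":
    turns = list(range(50))  # placeholder for a 50-turn rollout
    for s in (0, 50, 99):
        print(s, len(truncate(turns, s, total_steps=100)))
```

Early in training the student only sees the first few turns, where it is more likely to remain within the teacher's effective support; later steps expose progressively longer trajectories.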
[442] FedSLoP: Memory-Efficient Federated Learning with Low-Rank Gradient Projection
Yutong He, Zhengyang Huang, Jiahe Geng
Main category: cs.LG
Abstract: not retrieved (arXiv:2604.24012)
[443] Distribution-Free Stochastic Analysis and Robust Multilevel Vector Field Anomaly Detection
Julio E Castrillon-Candas, Michael Rosenbaum, Mark Kon
Main category: cs.LG
Abstract: not retrieved (arXiv:2207.06229)
[444] Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for Kolmogorov partial differential equations with Lipschitz nonlinearities in the $L^p$-sense
Julia Ackermann, Arnulf Jentzen, Thomas Kruse, Benno Kuckuck, Joshua Lee Padgett
Main category: cs.LG
Abstract: not retrieved (arXiv:2309.13722)
[445] DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing
Utsab Saha, Tanvir Muntakim Tonoy, Hafiz Imtiaz
Main category: cs.LG
Abstract: not retrieved (arXiv:2411.16121)
[446] L2RU: a Structured State Space Model with prescribed L2-bound
Leonardo Massai, Muhammad Zakwan, Giancarlo Ferrari-Trecate
Main category: cs.LG
Abstract: not retrieved (arXiv:2503.23818)
[447] Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
Soo Min Kwon, Alec S. Xu, Can Yaras, Laura Balzano, Qing Qu
Main category: cs.LG
Abstract: not retrieved (arXiv:2505.14808)
[448] Optimal differentially private kernel learning with random projection
Bonwoo Lee, Cheolwoo Park, Jeongyoun Ahn
Main category: cs.LG
Abstract: not retrieved (arXiv:2507.17544)
[449] Generative Bid Shading in Real-Time Bidding Advertising
Yinqiu Huang, Hao Ma, Wenshuai Chen, Zongwei Wang, Shuli Wang, Yongqiang Zhang, Xue Wei, Yinhua Zhu, Haitao Wang, Xingxing Wang
Main category: cs.LG
Abstract: not retrieved (arXiv:2508.06550)
[450] NeuralFLoC: Neural Flow-Based Joint Registration and Clustering of Functional Data
Xinyang Xiong, Siyuan Jiang, Pengcheng Zeng
Main category: cs.LG
Abstract: not retrieved (arXiv:2602.03169)
[451] PACIFIER: Pacing Opinion Depolarization via a Unified Graph Learning Framework
Mingkai Liao
Main category: cs.LG
Abstract: not retrieved (arXiv:2602.23390)
[452] VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Zihao Zheng, Zhihao Mao, Xingyue Zhou, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.07080)
[453] RL unknotter, hard unknots and unknotting number
Anne Dranowski, Yura Kabkov, Daniel Tubbenhauer
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.07955)
[454] A Tutorial Review of Bayesian Optimization with Gaussian Processes to Accelerate Stationary Point Searches
Rohit Goswami
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.10992)
[455] A Priori Sampling of Transition States with Guided Diffusion
Hyukjun Lim, Soojung Yang, Lucas Pinède, Miguel Steiner, Yuanqi Du, Rafael Gómez-Bombarelli
Main category: cs.LG
Abstract: not retrieved (arXiv:2603.25980)
[456] A Nonlinear Separation Principle via Contraction Theory: Applications to Neural Networks, Control, and Learning
Anand Gokhale, Anton V. Proskurnikov, Yu Kawano, Francesco Bullo
Main category: cs.LG
Abstract: not retrieved (arXiv:2604.15238)
[457] Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
Matías Carrasco, Alejandro Cholaquidis
Main category: cs.LG
Abstract: Not available (summary fetch failed for arXiv:2604.22140).
[458] Probabilistic Graphical Model using Graph Neural Networks for Bayesian Inversion of Discrete Structural Component States
Teng Li, Stephen Wu, Yong Huang, James L. Beck, Hui Li
Main category: cs.LG
Abstract: Not available (summary fetch failed for arXiv:2604.23514).
[459] Deep Learning of Solver-Aware Turbulence Closures from Nudged LES Dynamics
Ashwin Suriyanarayanan, Dibyajyoti Chakraborty, Romit Maulik
Main category: cs.LG
Abstract: Not available (summary fetch failed for arXiv:2604.23874).
cs.MA
[460] When Agents Shop for You: Role Coherence in AI-Mediated Markets
Soogand Alavi, Salar Nozari
Main category: cs.MA
Abstract: Consumers are increasingly delegating purchase decisions to AI agents, providing natural-language descriptions of their preferences and identity. We argue that these representations constitute an information channel, role coherence, through which sellers can infer willingness to pay without explicit disclosure by the buyer agent, leading to preference leakage. In an experiment where a language-model buyer agent shops on behalf of a verbal consumer profile, we show that seller-side inference from dialogue alone recovers willingness to pay nearly one-for-one. Comparing this setting to a numeric-budget condition with confidentiality instructions cleanly isolates role coherence as distinct from instruction-following failure. Because this leakage arises from delegation itself, it cannot be mitigated at the prompt level. Instead, we propose architectural interventions that trade off personalization against preference privacy.
[461] Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
Ariel Sela
Main category: cs.MA
Abstract: Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assigned value perspectives. We present the AI Council, a three-phase deliberation framework, and conduct 120 deliberations across two policy scenarios to test two interventions. First, architectural heterogeneity (assigning a different 7-9B parameter model to each value perspective) significantly reduces first-choice concentration compared to a homogeneous baseline (child welfare: 70.9% to 46.1%, p < 0.001, r = 0.58; housing: 46.0% to 22.9%, p < 0.001, r = 0.50). This contrasts with accuracy-oriented multi-agent debate, where heterogeneity does not reduce convergence, suggesting model diversity operates differently when no objectively correct answer exists. Second, coherence validation (using a frontier model to assess whether each evaluator’s reasoning is grounded in its assigned values) reveals a fidelity-diversity tradeoff: on a scenario with a dominant option, it further reduces concentration (46.1% to 40.8%, p = 0.004), but on a scenario with genuinely competitive options, it increases concentration (22.9% to 26.6%, p = 0.96) by amplifying high-coherence evaluators who cluster on one option. This tradeoff may be a general property of multi-agent systems employing quality weighting. We report negative results from three failed Delphi designs, demonstrate that 8B models exhibit binary rather than graded responses to counter-arguments, and propose the trustworthy tension rate as a diagnostic measure of small-model deliberation capabilities.
[462] Emergent Coordination in Multi-Agent Language Models
Christoph Riedl
Main category: cs.MA
Abstract: When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test – in a purely data-driven way – whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent LLM systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and minimal group-level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to “think about what other agents might do” shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent LLM systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.
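The time-delayed mutual information (TDMI) named in this abstract can be illustrated with a plug-in estimator on discrete sequences. This is a minimal sketch only: it assumes symbol sequences and a simple frequency-count estimator, the function name `tdmi` and the toy series are hypothetical, and the paper's partial information decomposition and entropy estimators are not reproduced here.

```python
import math
from collections import Counter

def tdmi(x, y, lag=1):
    """Plug-in estimate, in bits, of I(x_t ; y_{t+lag}) for discrete sequences."""
    # Pair each symbol of x with the symbol of y observed `lag` steps later.
    pairs = list(zip(x[:-lag], y[lag:])) if lag > 0 else list(zip(x, y))
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # c/n is the joint frequency; px[a]/n and py[b]/n are the marginals.
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

# Toy data (hypothetical): an alternating series predicts its own next value,
# while a constant series carries no information about anything.
alternating = [0, 1, 0, 1, 0, 1, 0, 1, 0]
constant = [0] * 9
print(tdmi(alternating, alternating, lag=1))
print(tdmi(constant, alternating, lag=1))
```

Plug-in estimates are biased upward on short series, which is one reason a study like this one would check robustness across entropy estimators.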
[463] The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety
Elias Malomgré, Pieter Simoens
Main category: cs.MA
Abstract: Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.
[464] Impacts of Electric Vehicle Charging Regimes and Infrastructure Deployments on System Performance: An Agent-Based Study
Jiahua Hu, Hai L. Vu, Wynita Griggs, Hao Wang
Main category: cs.MA
Abstract: The rapid growth of electric vehicles (EVs) requires more effective charging infrastructure planning. Infrastructure layout not only determines deployment cost, but also reshapes charging behavior and influences overall system performance. In addition, destination charging and en-route charging represent distinct charging regimes associated with different power requirements, which may lead to substantially different infrastructure deployment outcomes. This study applies an agent-based modeling framework to generate trajectory-level latent public charging demand under three charging regimes based on a synthetic representation of the Melbourne (Australia) metropolitan area. Two deployment strategies, an optimization-based approach and a utilization-refined approach, are evaluated across different infrastructure layouts. Results show that utilization-refined deployments reduce total system cost, accounting for both infrastructure deployment cost and user generalized charging cost, with the most significant improvement observed under the combined charging regime. In particular, a more effective allocation of AC slow chargers reshapes destination charging behavior, which in turn reduces unnecessary reliance on en-route charging and lowers detour costs associated with en-route charging. This interaction highlights the behavioral linkage between destination and en-route charging regimes and demonstrates the importance of accounting for user response and multiple charging regimes in charging infrastructure planning.
[465] Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver
Joshua Sherwood, Ben Aybar, Benjamin Kaplan
Main category: cs.MA
Abstract: Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide ample early warning signals for recursive self-improvement. We propose measuring AI’s capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first-mover against Pons in seven of eight trials, statistically significantly better than the other agents tested, none of which exceeded two of eight. The task, which no frontier agent could reliably complete when we began development in January of 2026, is now near-saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4’s time-budget usage, consistent with but not diagnostic of sandbagging; Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.
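For readers unfamiliar with the Bradley-Terry ratings mentioned above, the model assigns each player a positive strength p_i such that P(i beats j) = p_i / (p_i + p_j). The sketch below fits it with the classic minorization-maximization update. It is an illustration under stated assumptions, not the authors' procedure: the function name `bradley_terry`, the win matrix, and the iteration count are hypothetical, and every player is assumed to have at least one win.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from wins[i][j] = times i beat j.

    Uses the standard MM update and normalizes strengths to sum to 1.
    Assumption: every player has at least one win, so no strength hits zero.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played against j) / (p_i + p_j).
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [value / total for value in new_p]
    return p

# Hypothetical round-robin results among three agents (not data from the paper).
toy_wins = [
    [0, 6, 7],
    [2, 0, 5],
    [1, 3, 0],
]
ratings = bradley_terry(toy_wins)
print([round(r, 3) for r in ratings])
```

With only two players the fitted strength ratio reduces to the observed win ratio, which gives a quick sanity check on any implementation.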
cs.MM
[466] OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset
Quang-Linh Tran, Hoang-Bao Le, Tuong-Nghiem Diep, Binh Nguyen, Gareth J. F. Jones, Cathal Gurrin
Main category: cs.MM
Abstract: We introduce OpenLifelogQA, a large-scale open-ended lifelog QA dataset constructed from 18 months of multimodal lifelog data. Lifelogging is the passive collection and analysis of personal daily activities using wearable devices, producing rich multimodal data such as images, locations, and biometrics. Question answering (QA) over lifelog data enables users to interactively query their own experiences, supporting applications in memory support, lifestyle analysis, and personal assistance. OpenLifelogQA contains 14,187 Q&A pairs spanning multiple question types and difficulty levels, designed to support robust evaluation in realistic settings. Compared with prior resources, OpenLifelogQA offers greater diversity and practicality for real-world applications. To establish baselines, we evaluate the LLaVA-NeXT-Interleave 7B model, achieving 89.7% BERTScore, 25.87% ROUGE-L, and an average LLM Score of 3.97. By releasing OpenLifelogQA, we aim to promote future research on lifelog technologies, paving the way for personal lifelog assistants capable of memory augmentation, healthcare support, and lifestyle coaching.
eess.AS
[467] SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment
Dapeng Wu, Shun Lei, Wei Tan, Guangzheng Li, Yunzhe Wang, Huaicheng Zhang, Lishi Zuo, Zhiyong Wu
Main category: eess.AS
Abstract: Recent advancements in Text-to-Song generation have enabled realistic musical content production, yet existing evaluation benchmarks lack the professional granularity to capture multi-dimensional aesthetic nuances. In this paper, we propose SongBench, a specialized framework for fine-grained song assessment across seven key dimensions: Vocal, Instrument, Melody, Structure, Arrangement, Mixing, and Musicality. Utilizing this framework, we construct an expert-annotated database comprising 11,717 samples from state-of-the-art models, labeled by music professionals. Extensive experimental results demonstrate that SongBench achieves high correlation with expert ratings. By revealing fine-grained performance gaps in current state-of-the-art models, SongBench serves as a diagnostic benchmark to steer the development toward more professional and musically coherent song generation.
[468] Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
Jaskirat Sudan, Hashim Ali, Surya Subramani, Hafiz Malik
Main category: eess.AS
Abstract: Supervised contrastive learning (SupCon) is widely used to shape representations, but has seen limited targeted study for audio deepfake detection. Existing work typically combines contrastive terms with broader pipelines; however, the focus on SupCon itself is missing. In this work, we run a controlled study on wav2vec2 XLS-R (300M) that varies (i) similarity in SupCon (cosine vs angular similarity derived from the hyperspherical angle) and (ii) negative scaling using a warm-started global cross-batch queue. Stage 1 fine-tunes the encoder and projection head with SupCon; Stage 2 freezes them and trains a linear classifier with BCE. Trained on ASVspoof 2019 LA and evaluated on ASV19 eval plus ITW and ASVspoof 2021 DF/LA, Cosine SupCon with a delayed queue achieves the best ITW EER (8.29%) and pooled EER (4.44), while angular similarity performs strongly without queued negatives (ITW 8.70), indicating reduced reliance on large negative sets.
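The two similarity choices contrasted in this abstract can be compared on plain vectors. This is a sketch under an assumption: angular similarity is taken here as 1 - θ/π, where θ is the angle between the embeddings, which is one common definition. The abstract does not give the paper's exact formula, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_similarity(u, v):
    """1 - theta/pi, with theta the angle between u and v (assumed definition)."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return 1.0 - math.acos(cos) / math.pi

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine_similarity(a, b), angular_similarity(a, b))
```

Under this definition angular similarity is linear in the angle, whereas cosine flattens out for nearly aligned pairs, so the two can weight near-positive and hard-negative pairs differently in a contrastive loss.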
[469] One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech
Amanuel Gizachew Abebe, Yasmin Moslem
Main category: eess.AS
Abstract: Preserving a speaker’s voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
[470] DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
Ismail Rasim Ulgen, Zexin Cai, Nicholas Andrews, Philipp Koehn, Berrak Sisman
Main category: eess.AS
Abstract: To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
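Classifier-free guidance, which this abstract uses as the control knob, blends a conditional and an unconditional model prediction with a single scale. The sketch below shows one common formulation on plain lists. How DiffAnon conditions on prosody is not described in the abstract, so the function name `cfg_combine` and the toy values are hypothetical.

```python
def cfg_combine(uncond, cond, scale):
    """One common CFG rule: uncond + scale * (cond - uncond), elementwise.

    scale = 0 ignores the condition, scale = 1 reproduces the conditional
    prediction, and scale > 1 extrapolates past it.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# Hypothetical predictions for two output dimensions.
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]
for scale in (0.0, 0.5, 1.0, 2.0):
    print(scale, cfg_combine(uncond_pred, cond_pred, scale))
```

Because the scale is applied at inference time, a single trained model can be swept across operating points, which matches the continuous control described above.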
[471] SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
Mingyu Zhao, Zijian Lin, Kun Wei, Zhiyong Wu
Main category: eess.AS
Abstract: Conventional neural speech codecs suffer from severe intelligibility degradation at ultra-low bitrates, where the bottleneck transitions from acoustic distortion to semantic loss. To address this issue, this paper conducts a systematic investigation into the role and fundamental limits of integrating frozen semantic priors – specifically HuBERT and Whisper – into neural speech coding. We introduce and quantitatively validate a novel Semantic Retirement phenomenon: while semantic constraints reduce the Word Error Rate (WER) by up to ~10% (relative) at 1.5 kbps, their benefits rapidly diminish beyond 6 kbps, indicating a practical capacity boundary. We further uncover a clear trade-off between different prior types: acoustic-rich priors (HuBERT) better preserve prosodic and timbral details, whereas high-level linguistic priors (Whisper) effectively suppress phonetic hallucinations in noisy environments (reducing hallucination rates by 26%) and substantially narrow the generalization gap for unseen speakers. Building on these findings, we propose a bitrate-aware regulation strategy that dynamically adjusts prior strength to optimize the trade-off between semantic consistency and perceptual naturalness. Extensive experimental evaluations confirm that our approach achieves competitive intelligibility and noise robustness compared to existing baselines, offering a principled pathway toward ultra-low-bitrate generative speech coding.
[472] Dual-LoRA: Parameter-Efficient Adversarial Disentanglement for Cross-Lingual Speaker Verification
Qituan Shangguan, Junhao Du, Kunyang Peng, Feng Xue, Hui Zhang, Xinsheng Wang, Kai Yu, Shuai Wang
Main category: eess.AS
Abstract: Cross-lingual speaker verification suffers from severe language-speaker entanglement. This causes systematic degradation in the hardest scenario: correctly accepting utterances from the same speaker across different languages while rejecting those from different speakers sharing the same language. Standard adversarial disentanglement degrades speaker discriminability; blind discriminators inadvertently penalize speaker-discriminative traits that merely correlate with language. To address this, we propose Dual-LoRA, injecting trainable task-factorized LoRA adapters into a frozen pre-trained backbone. Our core innovation is a Language-Anchored Adversary: by grounding the discriminator with an explicit language branch, adversarial gradients target true linguistic cues rather than arbitrary correlations, preserving essential speaker characteristics. Evaluated on the TidyVoice benchmark, our system achieves a 0.91% validation EER and places 3rd in the official challenge.
[473] The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Tzu-Wen Hsu, Yun-Man Hsu, Chun Wei Chen, Shrikanth Narayanan, Hung-yi Lee
Main category: eess.AS
Abstract: Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals it rewards acoustic mimicry over genuine emotional synthesis.
[474] Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion
Farnaz Jazaeri, Homayoun Kamkar-Parsi, François Grondin, Martin Bouchard
Main category: eess.AS
Abstract: For extracting a target speaker voice, direction-of-arrival (DOA) estimation is crucial for binaural hearing aids operating in noisy, multi-speaker environments. Among the solutions developed for this task, a deep learning convolutional recurrent neural network (CRNN) model leveraging spectral phase differences and magnitude ratios between microphone signals is a popular option. In this paper, we explore adding source-count information for multi-source DOA estimation. The use of dual-task training with joint multi-source DOA estimation and source counting is first considered. We then consider using the source count as an auxiliary feature in a standalone DOA estimation system, where the number of active sources (0, 1, or 2+) is integrated into the CRNN architecture through early, mid, and late fusion strategies. Experiments using real binaural recordings are performed. Results show that the dual-task training does not improve DOA estimation performance, although it benefits source-count prediction. However, a ground-truth (oracle) source count used as an auxiliary feature significantly enhances standalone DOA estimation performance, with late fusion yielding up to 14% higher average F1-scores over the baseline CRNN. This highlights the potential of using source-count estimation for robust DOA estimation in binaural hearing aids.
[475] Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Jianbo Ma, Richard Cartwright
Main category: eess.AS
Abstract: Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.
eess.IV
[476] Adaptive Transform Coding for Semantic Compression
Andriy Enttsel, Vincent Corlay
Main category: eess.IV
Abstract: Visual data compression is shifting from human-centered reconstruction to machine-oriented representation coding. In this setting, an image is often mapped to a compact semantic embedding, which is then compressed and transmitted for downstream inference. We propose an adaptive transform-coding method for semantic-feature compression motivated by the conditional rate-distortion function of a Gaussian mixture model. The scheme uses mode-dependent transforms and quantizers selected according to the inferred source component, enabling more efficient coding of heterogeneous feature distributions. Evaluations on features from widely used vision backbones and foundation models show that the proposed method outperforms or is competitive with state-of-the-art neural compression methods while preserving flexibility and interpretability.
[477] Circular Phase Representation and Geometry-Aware Optimization for Ptychographic Image Reconstruction
Carson Yu Liu, Jun Cheng, Chien-Chun Chen, Steve F. Shu
Main category: eess.IV
Abstract: Traditional iterative reconstruction methods are accurate but computationally expensive, limiting their use in high-throughput and real-time ptychography. Recent deep learning approaches improve speed, but often predict phase as a Euclidean scalar despite its $2\pi$ periodicity, which can introduce wrapping artifacts, discontinuities at $\pm\pi$, and a mismatch between the loss and the underlying signal geometry. We present a deep learning framework for ptychographic reconstruction that models phase on the unit circle using cosine and sine components. Phase error is optimized with a differentiable geodesic loss, which avoids branch-cut discontinuities and provides bounded gradients. The network further incorporates saturation-aware dual-gain input scaling, parallel encoder branches, and three decoders for amplitude, cosine, and sine prediction, together with a composite loss that promotes circular consistency and structural fidelity. Experiments on synthetic and experimental datasets show consistent improvements in both amplitude and phase reconstruction over existing deep learning methods. Frequency-domain analysis further shows better preservation of mid- and high-frequency phase content. The proposed method also provides substantial speedup over iterative solvers while maintaining physically consistent reconstructions.
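The mismatch between a scalar phase error and the circular geometry described in this abstract is easy to see numerically. The sketch below encodes phase as (cos, sin), decodes it with `atan2`, and compares a naive scalar error with the wrapped geodesic distance. All function names are hypothetical, and `smooth_circular_loss` (1 - cos of the difference) is a common smooth surrogate used here as an assumption, since the abstract does not state the paper's exact loss.

```python
import math

def to_circle(phase):
    """Represent a phase angle as a point (cos, sin) on the unit circle."""
    return math.cos(phase), math.sin(phase)

def from_circle(c, s):
    """Recover the phase in (-pi, pi] from cosine and sine components."""
    return math.atan2(s, c)

def geodesic_distance(a, b):
    """Shortest arc length between two phases, always in [0, pi]."""
    d = a - b
    return abs(math.atan2(math.sin(d), math.cos(d)))

def smooth_circular_loss(a, b):
    """Assumed smooth surrogate: 1 - cos(a - b), with gradient bounded by 1."""
    return 1.0 - math.cos(a - b)

# Two phases on either side of the branch cut at +/- pi.
pred, target = 3.1, -3.1
print("scalar error:  ", abs(pred - target))
print("geodesic error:", geodesic_distance(pred, target))
print("smooth loss:   ", smooth_circular_loss(pred, target))
```

The scalar error treats these two nearly identical phases as far apart, while both circular measures stay small across the branch cut.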
[478] COMMA: Coordinate-aware Modulated Mamba Network for 3D Dispersed Vessel Segmentation
Gen Shi, Hui Zhang, Jie Tian
Main category: eess.IV
Abstract: Accurate segmentation of 3D vascular structures is essential for various medical imaging applications. The dispersed nature of vascular structures leads to inherent spatial uncertainty and necessitates location awareness, yet most current 3D medical segmentation models rely on the patch-wise training strategy that usually loses this spatial context. In this study, we introduce the Coordinate-aware Modulated Mamba Network (COMMA) and contribute a manually labeled dataset of 570 cases, the largest publicly available 3D vessel dataset to date. COMMA leverages both entire and cropped patch data through global and local branches, ensuring robust and efficient spatial location awareness. Specifically, COMMA employs a channel-compressed Mamba (ccMamba) block to encode entire image data, capturing long-range dependencies while optimizing computational costs. Additionally, we propose a coordinate-aware modulated (CaM) block to enhance interactions between the global and local branches, allowing the local branch to better perceive spatial information. We evaluate COMMA on six datasets, covering two imaging modalities and five types of vascular tissues. The results demonstrate COMMA’s superior performance compared to state-of-the-art methods with computational efficiency, especially in segmenting small vessels. Ablation studies further highlight the importance of our proposed modules and spatial information. The code and data will be open source at https://github.com/shigen-StoneRoot/COMMA.
[479] Triple-Phase Sequential Fusion Network for Hepatobiliary Phase Liver MRI Synthesis
Qiuli Wang, Xinhuan Sun, Fengxi Chen, Yongxu Liu, Jie Cheng, Lin Chen, Jiafei Chen, Yue Zhang, Xiaoming Li, Wei Chen
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Gadoxetate disodium-enhanced MRI is essential for the detection and characterization of hepatocellular carcinoma. However, acquisition of the hepatobiliary phase (HBP) requires a prolonged post-contrast delay, which reduces workflow efficiency and increases the risk of motion artifacts. In this study, we propose a Triple-Phase Sequential Fusion Network (TriPF-Net) to synthesize HBP images by leveraging the sequential information from pre-HBP sequences: while T1-weighted imaging serves as the indispensable baseline, the model adaptively integrates arterial-phase (AP) and venous-phase (VP) features when available. By modeling the tissue-specific contrast uptake and excretion dynamics across these three phases, TriPF-Net ensures robust HBP synthesis even under the stochastic absence of one or both dynamic contrast-enhanced sequences. The framework comprises an Enhanced Region-Guided Encoder and a Dynamic Feature Unification Module, optimized with a Region-Guided Sequential Fusion Loss to maintain physiological consistency. In addition, clinical variables, including age, sex, total bilirubin, and albumin, are incorporated to enhance physiological consistency. Compared with conventional methods, TriPF-Net achieved superior performance on datasets from two centers. On the internal dataset, the model achieved an MAE of 10.65, a PSNR of 23.27, and an SSIM of 0.76. On the external validation dataset, the corresponding values were 12.41, 23.11, and 0.78, respectively. This flexible solution enhances clinical workflow and lesion depiction, potentially eliminating the need for delayed HBP acquisition in HCC imaging.
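Illustrative note: the abstract above reports MAE and PSNR without stating the intensity range or the exact evaluation protocol. As a rough guide to how such numbers are usually obtained, here is a minimal standard-library sketch of the two metrics over flattened pixel lists. The function names and the default `data_range` of 255 are assumptions for illustration, not details taken from the paper.

```python
import math


def mae(reference, synthesized):
    """Mean absolute error between two equal-length pixel sequences."""
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - s) for r, s in zip(reference, synthesized)) / len(reference)


def psnr(reference, synthesized, data_range=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE).
    data_range is the assumed maximum possible intensity span."""
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

Because PSNR depends on the chosen `data_range`, values are only comparable between studies that normalise intensities the same way.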